NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Okapi at TREC-2
S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford
National Institute of Standards and Technology, D. K. Harman

Extracting new terms

I tried to get at least six relevant documents for the extraction phase, and usually managed a few more. As already noted, sets generated by term extraction contain only single words, so before looking at the new records I sometimes added in a few phrases to this set, either important ones from the original query or others which had occurred in relevant documents.

The extracted sets of terms tended to be larger than the original query and certainly included items which a human searcher (at least one unfamiliar with this genre of literature) would not have thought of. It was amusing, for instance, to see "topdrawer" and "topnotch" (epithets for companies) extracted from documents about investment in biotechnology, and "leftist" (an invariable collocate for Sandinista) pulled out of documents about Nicaraguan peace talks. Some material for socio-linguistic analysis here!

My impression ... is that where the original document set from which terms were extracted was fairly coherent, the derived set [from query expansion] also had a high proportion of relevant documents. Not surprisingly, where I had scraped the barrel and tried several different routes to a few relevant documents, extraction produced equally miscellaneous and disappointing results.

Normally I went through two or three cycles of selection/extraction, but looking at fewer records each time. The set of extracted terms did not seem to change materially from one cycle to the next, and I would have expected the final result file to reflect the query quite well even though the phrases had been lost.
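The account above does not say how extracted terms were scored, so the following is only a minimal sketch of one plausible extraction step. It assumes the Robertson/Sparck Jones relevance weight with the usual 0.5 correction, and ranks single-word terms by r times that weight. The function name `extract_terms`, its parameters, and the toy figures in the usage note are hypothetical and are not taken from the Okapi system.

```python
import math
from collections import Counter


def extract_terms(relevant_docs, doc_freq, n_docs, top_k=20):
    """Rank single-word terms drawn from documents judged relevant.

    relevant_docs: list of token lists, one per relevant document
    doc_freq:      dict mapping term -> number of collection documents containing it
    n_docs:        total number of documents in the collection
    (All names here are illustrative assumptions.)
    """
    big_r = len(relevant_docs)

    # r[t] = number of relevant documents that contain term t
    r = Counter()
    for tokens in relevant_docs:
        r.update(set(tokens))

    scored = []
    for term, r_t in r.items():
        # A term seen in r_t relevant documents occurs in at least r_t documents.
        n_t = max(doc_freq.get(term, r_t), r_t)
        # Robertson/Sparck Jones relevance weight with 0.5 correction (assumed).
        weight = math.log(
            ((r_t + 0.5) * (n_docs - n_t - big_r + r_t + 0.5))
            / ((n_t - r_t + 0.5) * (big_r - r_t + 0.5))
        )
        # Selection value r * w (assumed): favours terms that are both
        # discriminating and frequent among the relevant documents.
        scored.append((r_t * weight, term))

    # Highest selection value first; ties broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_k]]
```

With three toy relevant documents about biotechnology investment and invented document frequencies, a rare word such as "topnotch" that recurs in the relevant set ranks well above a common word such as "company". That matches the kind of unexpected single-word terms described above. Phrases would have to be added back by hand, since this step only sees single tokens.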
Conclusion

In spite of the frustrations of this exercise, I found it a more interesting retrieval task than normal bibliographic searching, mainly because it was possible to see the full documents to gauge the success of the query, and use a broader range of natural-language skills to dream up potentially useful search terms.

Acknowledgments

We are most grateful to the British Library Research & Development Department and to DARPA/NIST for their financial support of this work. Our advisers have been unstintingly helpful. We blame the system, not our panel of searchers, for the poor results of the interactive trial. Above all, we wish to thank Donna Harman of NIST for her outstandingly efficient and courteous organisation and management of the TREC projects.