SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Okapi at TREC-2
chapter
S. Robertson
S. Walker
S. Jones
M. Hancock-Beaulieu
M. Gatford
National Institute of Standards and Technology
D. K. Harman
Extracting new terms
I tried to get at least six relevant documents for the
extraction phase, and usually managed a few more. As already
noted, sets generated by term extraction contain only single
words, so before looking at the new records I sometimes
added in a few phrases to this set, either important ones from
the original query or others which had occurred in relevant
documents. The extracted sets of terms tended to be larger
than the original query and certainly included items which
a human searcher (at least one unfamiliar with this genre of
literature) would not have thought of. It was amusing, for
instance, to see "topdrawer" and "topnotch" (epithets for
companies) extracted from documents about investment in
biotechnology, and "leftist" (an invariable collocate for
Sandinista) pulled out of documents about Nicaraguan peace
talks. Some material for socio-linguistic analysis here!
My impression ... is that where the original document set
from which terms were extracted was fairly coherent, the
derived set [from query expansion] also had a high proportion
of relevant documents. Not surprisingly, where I had scraped
the barrel and tried several different routes to a few relevant
documents, extraction produced equally miscellaneous and
disappointing results.
Normally I went through two or three cycles of
selection/extraction, but looking at fewer records each time.
The set of extracted terms did not seem to change materially
from one cycle to the next, and I would have expected the
final result file to reflect the query quite well even though
the phrases had been lost.
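The extraction step described above can be illustrated with a minimal sketch. It assumes, as one common approach in probabilistic retrieval, that candidate single-word terms from the relevant documents are scored with the Robertson/Sparck Jones relevance weight and ranked by r times that weight; this section does not state the system's exact procedure, so the ranking rule is an assumption. The function names, the whitespace tokenisation, and the toy document frequencies are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def rsj_weight(r, n, R, N):
    """Robertson/Sparck Jones relevance weight with 0.5 corrections.

    r: relevant documents containing the term
    n: documents in the collection containing the term
    R: relevant documents judged so far
    N: documents in the collection
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))


def extract_terms(relevant_docs, doc_freq, num_docs, top_k=10):
    """Rank single-word terms from the relevant documents (hypothetical helper)."""
    R = len(relevant_docs)
    r_counts = Counter()
    for doc in relevant_docs:
        # Count each term once per document; naive whitespace tokenisation.
        r_counts.update(set(doc.lower().split()))
    scored = []
    for term, r in r_counts.items():
        # A term seen in r relevant documents occurs in at least r documents.
        n = max(doc_freq.get(term, r), r)
        # Assumed selection value: r multiplied by the relevance weight.
        scored.append((r * rsj_weight(r, n, R, num_docs), term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:top_k]]


# Toy data (invented for illustration only).
relevant = [
    "sandinista leftist peace talks",
    "leftist sandinista government the",
    "the peace talks leftist",
]
frequencies = {"the": 900, "leftist": 20, "sandinista": 15,
               "peace": 50, "talks": 60, "government": 200}
print(extract_terms(relevant, frequencies, num_docs=1000, top_k=3))
```

Under this scoring, a word that occurs in every relevant document but rarely elsewhere in the collection rises to the top, which is consistent with collocates such as "leftist" being pulled out, while very common words receive low or negative scores. Only single words are produced, so phrases would still have to be added by hand.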
Conclusion
In spite of the frustrations of this exercise, I found it a more
interesting retrieval task than normal bibliographic searching,
mainly because it was possible to see the full documents
to gauge the success of the query, and use a broader range of
natural-language skills to dream up potentially useful search
terms.
Acknowledgments
We are most grateful to the British Library Research
& Development Department and to DARPA/NIST for
their financial support of this work. Our advisers have
been unstintingly helpful. We blame the system, not our
panel of searchers, for the poor results of the interactive
trial. Above all, we wish to thank Donna Harman of
NIST for her outstandingly efficient and courteous
organisation and management of the TREC projects.