NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Recent Developments in Natural Language Text Retrieval
T. Strzalkowski, J. Carballo
National Institute of Standards and Technology, D. K. Harman

access methods. Efficiency considerations have led us to investigate an alternative approach to hot spot retrieval which would not require re-indexing of the existing database or any changes in document access. In our approach, the maximum number of terms on which a query is permitted to match a document is limited to the N highest-weight terms, where N can be the same for all queries or may vary from one query to another. Note that this is not the same as simply taking the N top terms from each query. Rather, for each document for which there are M matching terms with the query, only min(M,N) of them, namely those which have the highest weights, will be considered when computing the document score. Moreover, only the global importance weights for terms are considered (such as idf), while local in-document frequency (e.g., tf) is suppressed by either taking a log or replacing it with a constant.
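The scoring rule described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `idf_weights` and `hot_spot_score` are hypothetical, the idf formula log(number of documents / document frequency) is an assumed common variant, and the 1 + log(tf) damping is one assumed way of "taking a log". Documents and queries are assumed to be plain lists of terms.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Assumed global weight per term: log(number of docs / document frequency)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def hot_spot_score(query_terms, doc_terms, idf, n, use_log_tf=False):
    """Score one document using only its min(M, n) highest-weight matching terms.

    M is the number of distinct query terms that occur in the document.
    The local frequency is replaced by a constant (1.0) or, if use_log_tf
    is set, damped as 1 + log(tf); both choices are assumptions.
    """
    tf = Counter(doc_terms)
    matched = [t for t in set(query_terms) if t in tf and t in idf]
    # Keep the n matching terms with the highest global (idf) weight.
    # Slicing past the end of the list is safe, so this keeps min(M, n) terms.
    top = sorted(matched, key=lambda t: idf[t], reverse=True)[:n]
    score = 0.0
    for t in top:
        local = 1.0 + math.log(tf[t]) if use_log_tf else 1.0
        score += idf[t] * local
    return score


if __name__ == "__main__":
    # Hand-picked illustrative weights, not values from the paper.
    idf = {"a": 1.0, "b": 2.0, "c": 3.0}
    doc = ["a", "b", "b", "c"]
    print(hot_spot_score(["a", "b", "c"], doc, idf, n=2))
```

Note that the cut-off is applied per document, after matching, so two documents may be scored on different subsets of the same query's terms. This is what distinguishes the rule from truncating the query to its N top terms.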
The effect of this `hot spot' retrieval is shown below in the ranking of relevant documents within the top 1000 retrieved documents for topic 65:

Full tf.idf retrieval

DOCUMENT ID       RANK   SCORE
WSJ870304-0091       4   12228
WSJ891017-0156       7    9771
WSJ920226-0034      14    8921
WSJ870429-0078      26    7570
WSJ870205-0078      33    6972
WSJ880712-0033      34    6834
WSJ920116-0002      37    6580
WSJ910328-0013      74    4872
WSJ910830-0140      80    4701
WSJ890804-0138     102    4134
WSJ911212-0022     104    4065
WSJ870825-0026     113    3922
WSJ880712-0023     135    3654
WSJ871202-0145     153    3519

Hot-spot idf-dominated with N=20

DOCUMENT ID       RANK   SCORE
WSJ920226-0034       1   11955
WSJ870304-0091       3   11565
WSJ870429-0078       5    9997
WSJ920116-0002       7    9997
WSJ910830-0140      11    8792
WSJ870205-0078      20    8402
WSJ910328-0013      29    8402
WSJ880712-0033      71    6834
WSJ880712-0023      72    6834
WSJ891017-0156      87    6834
WSJ890804-0138      92    6834
WSJ911212-0022     111    6834
WSJ871202-0145     124    6834

The final ranking is obtained by merging the two rankings by score. While some of the recall may be sacrificed (`hot spot' retrieval has, understandably, lower recall than full query retrieval, and this becomes the lower bound on recall for the combined ranking), the combined ranking precision has been consistently better than in either of the original rankings: the average improvement is 10-12% above the tf.idf run precision (which is often the stronger of the two).

CONCLUSIONS

We presented in some detail our natural language information retrieval system consisting of an advanced NLP module and a `pure' statistical core engine. While many problems remain to be resolved, including the question of adequacy of term-based representation of document content, we attempted to demonstrate that the architecture described here is nonetheless viable.
In particular, we demonstrated that natural language processing can now be done on a fairly large scale and that its speed and robustness can match those of traditional statistical programs such as key-word indexing or statistical phrase extraction. We suggest, with some caution until more experiments are run, that natural language processing can be very effective in creating appropriate search queries out of a user's initial specifications, which can frequently be imprecise or vague.

On the other hand, we must be aware of the limits of the NLP technologies at our disposal. While part-of-speech tagging, lexicon-based stemming, and parsing can be done on large amounts of text (hundreds of millions of words and more), other, more advanced processing involving conceptual structuring, logical forms, etc., is still beyond reach, computationally. It may be assumed that these super-advanced techniques will prove even more effective, since they address the problem of representation-level limits; however, the experimental evidence is sparse and necessarily limited to rather small scale tests (e.g., Mauldin, 1991).

ACKNOWLEDGEMENTS

We would like to thank Donna Harman of NIST for making her PRISE system available to us. We would also like to thank Ralph Weischedel and Heidi Fox of BBN for providing and assisting in the use of the part of speech tagger. This paper is based upon work supported by the Advanced Research Project Agency under Contract N00014-90-J-1851 from the Office of Naval Research, under Contract N00600-88-D-3717 from PRC