SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Recent Developments in Natural Language Text Retrieval
chapter
T. Strzalkowski
J. Carballo
National Institute of Standards and Technology
D. K. Harman
access methods.
Efficiency considerations have led us to investigate
an alternative approach to hot spot retrieval which
would not require re-indexing of the existing database or
any changes in document access. In our approach, the
maximum number of terms on which a query is permitted
to match a document is limited to the N highest-weight
terms, where N can be the same for all queries or may
vary from one query to another. Note that this is not the
same as simply taking the N top terms from each query.
Rather, for each document for which there are M
matching terms with the query, only min(M,N) of them,
namely those which have the highest weights, will be
considered when computing the document score. Moreover,
only the global importance weights for terms are
considered (such as idf), while local in-document frequency
(e.g., tf) is suppressed by either taking a log or replacing
it with a constant. The effect of this `hot spot' retrieval is
shown below in the ranking of relevant documents within
the top 1000 retrieved documents for topic 65:
Full tf.idf retrieval
DOCUMENT ID      RANK  SCORE
WSJ870304-0091      4  12228
WSJ891017-0156      7   9771
WSJ920226-0034     14   8921
WSJ870429-0078     26   7570
WSJ870205-0078     33   6972
WSJ880712-0033     34   6834
WSJ920116-0002     37   6580
WSJ910328-0013     74   4872
WSJ910830-0140     80   4701
WSJ890804-0138    102   4134
WSJ911212-0022    104   4065
WSJ870825-0026    113   3922
WSJ880712-0023    135   3654
WSJ871202-0145    153   3519
Hot-spot idf-dominated with N=20
DOCUMENT ID      RANK  SCORE
WSJ920226-0034      1  11955
WSJ870304-0091      3  11565
WSJ870429-0078      5   9997
WSJ920116-0002      7   9997
WSJ910830-0140     11   8792
WSJ870205-0078     20   8402
WSJ910328-0013     29   8402
WSJ880712-0033     71   6834
WSJ880712-0023     72   6834
WSJ891017-0156     87   6834
WSJ890804-0138     92   6834
WSJ911212-0022    111   6834
WSJ871202-0145    124   6834
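The hot-spot scoring rule described before the tables can be sketched roughly as follows. This is only an illustration under stated assumptions, not the system's actual code: the function name `hot_spot_score`, the dictionary-based inputs, and the particular dampening `1 + log(tf)` are all hypothetical choices. The text says only that the in-document frequency is suppressed by taking a log or by replacing it with a constant.

```python
import math
from typing import Dict


def hot_spot_score(
    query_weights: Dict[str, float],  # term -> global weight (e.g., idf)
    doc_tf: Dict[str, int],           # term -> in-document frequency
    n: int,                           # cap N on the number of matching terms
    use_log_tf: bool = False,         # False: tf is replaced by a constant
) -> float:
    """Score a document using only its min(M, N) highest-weight matches."""
    # Collect the M query terms that actually occur in the document.
    matches = [
        (weight, doc_tf[term])
        for term, weight in query_weights.items()
        if doc_tf.get(term, 0) > 0
    ]
    # Keep the N matches with the highest global weights.
    matches.sort(key=lambda pair: pair[0], reverse=True)
    top = matches[:n]

    score = 0.0
    for weight, tf in top:
        # Assumed dampening: 1 + log(tf), or the constant 1.0.
        local = 1.0 + math.log(tf) if use_log_tf else 1.0
        score += weight * local
    return score
```

Note that the cap is applied per document, after matching, so two documents may be scored on different subsets of the query terms. Truncating the query to its N top terms beforehand would not have this property.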
The final ranking is obtained by merging the two
rankings by score. While some of the recall may be
sacrificed (`hot spot' retrieval has, understandably, lower
recall than full query retrieval, and this becomes the
lower bound on recall for the combined ranking), the
combined ranking precision has been consistently better
than in either of the original rankings: the average
improvement is 10-12% above the tf.idf run precision
(which is often the stronger of the two).
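A minimal sketch of merging two rankings by score is given below. The text does not say how a document retrieved by both runs is handled; the sketch assumes it keeps the higher of its two scores, and this rule, like the function name `merge_by_score`, is a hypothetical choice made for illustration. The sample scores are taken from the topic 65 rankings above.

```python
from typing import Dict, List, Tuple


def merge_by_score(
    run_a: Dict[str, float], run_b: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Merge two runs (doc id -> score) into one ranking sorted by score."""
    merged = dict(run_a)
    for doc_id, score in run_b.items():
        # Assumption: a document found by both runs keeps its higher score.
        merged[doc_id] = max(score, merged.get(doc_id, score))
    # Highest score first.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


full_run = {
    "WSJ870304-0091": 12228,
    "WSJ870205-0078": 6972,
    "WSJ870825-0026": 3922,
}
hot_spot_run = {
    "WSJ870304-0091": 11565,
    "WSJ870205-0078": 8402,
}
combined = merge_by_score(full_run, hot_spot_run)
```

Here WSJ870205-0078 moves up on the strength of its hot-spot score, while WSJ870825-0026, which appears only in the full-query ranking, is still retained in the combined list.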
CONCLUSIONS
We presented in some detail our natural language
information retrieval system consisting of an advanced
NLP module and a `pure' statistical core engine. While
many problems remain to be resolved, including the
question of adequacy of term-based representation of
document content, we attempted to demonstrate that the
architecture described here is nonetheless viable. In par-
ticular, we demonstrated that natural language processing
can now be done on a fairly large scale and that its speed
and robustness can match those of traditional statistical
programs such as key-word indexing or statistical phrase
extraction. We suggest, with some caution until more
experiments are run, that natural language processing can
be very effective in creating appropriate search queries
out of a user's initial specifications, which can be
frequently imprecise or vague.
On the other hand, we must be aware of the limits
of NLP technologies at our disposal. While part-of-
speech tagging, lexicon-based stemming, and parsing can
be done on large amounts of text (hundreds of millions of
words and more), other, more advanced processing
involving conceptual structuring, logical forms, etc., is
still beyond reach, computationally. It may be assumed
that these super-advanced techniques will prove even
more effective, since they address the problem of
representation-level limits; however, the experimental
evidence is sparse and necessarily limited to rather small
scale tests (e.g., Mauldin, 1991).
ACKNOWLEDGEMENTS
We would like to thank Donna Harman of NIST
for making her PRISE system available to us. We would
also like to thank Ralph Weischedel and Heidi Fox of
BBN for providing and assisting in the use of the part of
speech tagger. This paper is based upon work supported
by the Advanced Research Projects Agency under
Contract N00014-90-J-1851 from the Office of Naval
Research, under Contract N00600-88-D-3717 from PRC