NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

National Institute of Standards and Technology, D. K. Harman (ed.)

Automatic Routing and Ad-hoc Retrieval Using SMART: TREC 2

Chris Buckley*, James Allan, and Gerard Salton

*Department of Computer Science, Cornell University, Ithaca, NY 14853-7501. This study was supported in part by the National Science Foundation under grant IRI 89-15847.

Abstract

The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in the TREC 2 environment, performing both routing and ad-hoc experiments. The ad-hoc work extends our investigations into combining global similarities, giving an overall indication of how a document matches a query, with local similarities identifying a smaller part of the document which matches the query. The performance of the ad-hoc runs is good, but it is clear we are not yet taking full advantage of the available local information.

Our routing experiments use conventional relevance feedback approaches to routing, but with a much greater degree of query expansion than was done in TREC 1. The length of a query vector is increased by a factor of 5 to 10 by adding terms found in previously seen relevant documents. This approach improves effectiveness by 30-40% over the original query.

Introduction

For over 30 years, the Smart project at Cornell University has been interested in the analysis, search, and retrieval of heterogeneous text databases, where the vocabulary is allowed to vary widely, and the subject matter is unrestricted. Such databases may include newspaper articles, newswire dispatches, textbooks, dictionaries, encyclopedias, manuals, magazine articles, and so on. The usual text analysis and text indexing approaches that are based on the use of thesauruses and other vocabulary control devices are difficult to apply in unrestricted text environments, because the word meanings are not stable in such circumstances and the interpretation varies depending on context. The applicability of more complex text analysis systems that are based on the construction of knowledge bases covering the detailed structure of particular subject areas, together with inference rules designed to derive relationships between the relevant concepts, is even more questionable in such cases. Complete theories of knowledge representation do not exist, and it is unclear what concepts, concept relationships, and inference rules may be needed to understand particular texts.[11]

Accordingly, a text analysis and retrieval component must necessarily be based primarily on a study of the available texts themselves. Fortunately very large text databases are now available in machine-readable form, and a substantial amount of information is automatically derivable about the occurrence properties of words and expressions in natural-language texts, and about the contexts in which the words are used. This information can help in determining whether a query and a text are semantically homogeneous, that is, whether they cover similar subject areas. When that is the case, the text can be retrieved in response to the query.

Automatic Indexing

In the Smart system, the vector-processing model of retrieval is used to transform both the available information requests as well as the stored documents into vectors of the form:

$D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$

where $D_i$ represents a document (or query) text and $w_{ik}$ is the weight of term $T_k$ in document $D_i$. A weight of zero is used for terms that are absent from a particular document, and positive weights characterize terms actu-