term vector as follows:

    S(D_i, Q_j) = \sum_{k=1}^{t} d_{ik} \cdot q_{jk}        (2)
Thus, the similarity between two texts (whether query or document) depends on the weights of coinciding terms in the two vectors.
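As a concrete illustration of expression (2), the following C sketch computes the inner-product similarity of two sparse term vectors by merging their sorted concept lists. The TermWt layout and the function name are assumptions for the example, not the actual SMART data structures.

    #include <stddef.h>

    /* A term-vector entry: a concept (term) number and its weight.
     * Entries are assumed sorted by ascending concept number.
     * (Illustrative layout only, not SMART's internal structures.) */
    typedef struct {
        long  con;     /* concept (term) id */
        float weight;  /* importance of the concept in this text */
    } TermWt;

    /* Inner-product similarity of equation (2): sum the products of
     * weights for concepts common to both vectors.  A single merge
     * pass over the two sorted lists visits each entry once. */
    float
    inner_sim(const TermWt *doc, size_t d_len,
              const TermWt *query, size_t q_len)
    {
        float  sim = 0.0f;
        size_t i = 0, j = 0;

        while (i < d_len && j < q_len) {
            if (doc[i].con < query[j].con)
                i++;
            else if (doc[i].con > query[j].con)
                j++;
            else {
                sim += doc[i].weight * query[j].weight;
                i++, j++;
            }
        }
        return sim;
    }

Only coinciding concepts contribute, so a zero similarity means the two texts share no indexed terms.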
Information retrieval and text linking systems based on the use of global text similarity measures such as that of expression (2) will be successful when the common terms in the two vectors are in fact used in semantically similar ways. In many cases, however, it may happen that highly weighted terms that contribute substantially to the text similarity are semantically distinct. For example, a "sound" may be an audible phenomenon or a body of water.
TREC 1 ([1]) demonstrated that local contexts could be used to disambiguate word senses, for example rejecting documents about "industrial salts" when given a query about the "SALT peace treaty". Overall, however, the improvement in effectiveness due to local matching was minimal in TREC 1. One reason for this is the richness of the TREC queries: global text matching on such long queries is almost invariably sufficient for disambiguation. Another reason is the homogeneity of the queries, which deal primarily with two subjects: finance, and science and technology. Within a single subject area, vocabulary is more standardized and ambiguity is therefore minimized.
One other potential reason for the unexpectedly slight improvement is that most of the information from local matches is simply thrown away. Local matches are used only as a filter to reject documents that do not satisfy a local criterion; the overall global similarity used for ranking is changed only by the addition of a constant indicating that the criterion was satisfied. The positive information that a long document might contain a single paragraph which very closely matches the query is ignored.
For TREC 2, we look at combining global and local similarities into a single final similarity to be used for ranking purposes.
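A minimal sketch of one such combination follows, assuming hypothetical names (combined_sim, ALPHA, BETA): the best paragraph-level similarity is folded into the ranking score rather than serving only as a pass/fail filter. How the two components should actually be weighted is an empirical question; the crnlL2 run described below fits such coefficients from training data.

    #include <stddef.h>

    /* Hypothetical combination weights; in practice these would be
     * fit from training data rather than fixed by hand. */
    #define ALPHA 1.0f
    #define BETA  0.5f

    /* Combine the global document-query similarity with local
     * evidence (the single best-matching paragraph) into one final
     * similarity used for ranking. */
    float
    combined_sim(float global_sim, const float *para_sims, size_t n_paras)
    {
        float  best = 0.0f;
        size_t i;

        for (i = 0; i < n_paras; i++)
            if (para_sims[i] > best)
                best = para_sims[i];

        return ALPHA * global_sim + BETA * best;
    }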
The other focus of our TREC 2 work is taking advantage of the vast quantity of relevance judgements available for the routing experiments. In TREC 1, the relevance information was fragmentary and even occasionally incorrect, which made it hard to use in a reasonable fashion. Happily, the results of the TREC 1 experiments furnished a large number of very good relevance judgements to be used for TREC 2. Conventional vector-space feedback methods of query expansion and reweighting are tuned for the TREC environment in the routing portion of TREC 2.
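The conventional feedback method in the vector-space model is Rocchio-style query reweighting; a minimal sketch follows. The dense weight-array layout, the function name, and the coefficients a, b, c are illustrative assumptions (SMART's vectors are sparse, and the coefficient values actually tuned for TREC are not shown here).

    #include <stddef.h>

    /* Rocchio-style feedback: move the query vector toward the
     * centroid of judged relevant documents and away from the
     * centroid of judged nonrelevant ones:
     *     q' = a*q + b*avg(relevant) - c*avg(nonrelevant)
     * Terms whose weight goes negative are dropped, expanding and
     * reweighting the query in one pass. */
    void
    rocchio(float *query, size_t n_terms,
            float **rel, size_t n_rel,
            float **nonrel, size_t n_nonrel,
            float a, float b, float c)
    {
        size_t t, d;

        for (t = 0; t < n_terms; t++) {
            float rel_avg = 0.0f, non_avg = 0.0f;

            for (d = 0; d < n_rel; d++)
                rel_avg += rel[d][t];
            if (n_rel > 0)
                rel_avg /= (float)n_rel;

            for (d = 0; d < n_nonrel; d++)
                non_avg += nonrel[d][t];
            if (n_nonrel > 0)
                non_avg /= (float)n_nonrel;

            query[t] = a * query[t] + b * rel_avg - c * non_avg;
            if (query[t] < 0.0f)
                query[t] = 0.0f;
        }
    }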
System Description
The Cornell TREC experiments use the SMART Information Retrieval System, Version 11, and are run on a dedicated Sun Sparc 2 with 64 Mbytes of memory and 5 Gbytes of local disk.
SMART Version 11 is the latest in a long line of experimental information retrieval systems, dating back over 30 years, developed under the guidance of G. Salton. Version 11 is a reasonably complete rewrite of earlier versions, and was designed and implemented primarily by C. Buckley. The new version is approximately 44,000 lines of C code and documentation.
SMART Version 11 offers a basic framework for investigations into the vector space and related models of information retrieval. Documents are fully automatically indexed, with each document representation being a weighted vector of concepts, the weight indicating the importance of a concept to that particular document (as described above). The document representatives are stored on disk as an inverted file. Natural-language queries undergo the same indexing process. The query representative vector is then compared with the indexed document representatives to arrive at a similarity (equation (2)), and the documents are then fully ranked by similarity.
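To avoid comparing the query with every document vector directly, such a computation is typically organized around the inverted file: each query term's posting list is walked once, accumulating partial inner products per document. The Posting layout and function signature below are assumptions for illustration, not SMART's actual structures.

    #include <stddef.h>

    /* One posting in the inverted file: a document number and the
     * weight of the term in that document (illustrative layout). */
    typedef struct {
        long  did;     /* document id, assumed < number of documents */
        float weight;
    } Posting;

    /* Accumulate equation (2) for all documents at once.  inv_lists[t]
     * is the posting list for the t-th query term; sims[] must hold
     * one slot per document and be zeroed by the caller.  Ranking is
     * then a sort of the documents by descending sims[]. */
    void
    score_query(const float *q_wts, size_t n_q,
                const Posting *const *inv_lists, const size_t *list_lens,
                float *sims)
    {
        size_t t, p;

        for (t = 0; t < n_q; t++) {
            const Posting *list = inv_lists[t];

            for (p = 0; p < list_lens[t]; p++)
                sims[list[p].did] += q_wts[t] * list[p].weight;
        }
    }

Only documents containing at least one query term ever receive a nonzero score, which is what makes the inverted organization efficient.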
Ad-hoc Results
Cornell submitted two runs in the ad-hoc category. The first, crnlV2, is a very simple vector comparison. The second, crnlL2, makes use of simplified least squares analysis and a training set to combine global similarity and part-wise similarities in a meaningful ratio. Both systems performed at or above the median on almost all queries, as can be seen in Table 1. The crnlV2-b run is the same as the official crnlV2 run, but with an error in the experimental procedure corrected (discussed below).
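As a rough illustration of such a least-squares combination (the exact formulation used for crnlL2 may differ), the sketch below fits coefficients a and b so that a*global + b*local approximates binary relevance over a training set of judged query-document pairs, solving the 2x2 normal equations directly.

    #include <stddef.h>

    /* Fit a, b minimizing the sum over the training set of
     *     (a*global[i] + b*local[i] - rel[i])^2,
     * where rel[i] is 1 for judged-relevant pairs and 0 otherwise.
     * Hypothetical helper: one plausible reading of the "simplified
     * least squares analysis" used for crnlL2. */
    int
    fit_combination(const float *global, const float *local,
                    const float *rel, size_t n,
                    float *a, float *b)
    {
        double sgg = 0, sgl = 0, sll = 0, sgy = 0, sly = 0, det;
        size_t i;

        for (i = 0; i < n; i++) {   /* accumulate the normal equations */
            sgg += (double)global[i] * global[i];
            sgl += (double)global[i] * local[i];
            sll += (double)local[i]  * local[i];
            sgy += (double)global[i] * rel[i];
            sly += (double)local[i]  * rel[i];
        }
        det = sgg * sll - sgl * sgl;
        if (det == 0.0)
            return -1;              /* degenerate training data */
        *a = (float)((sgy * sll - sgl * sly) / det);  /* Cramer's rule */
        *b = (float)((sgg * sly - sgl * sgy) / det);
        return 0;
    }

The fitted coefficients then give the "meaningful ratio" in which global and part-wise similarities are combined at retrieval time.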