NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
S. Dumais
National Institute of Standards and Technology
Donna K. Harman
2.3 Retrieval
2.3.1 Query processing
Queries were automatically processed in the same way as documents. For queries derived from
the topic statement, we began with the full text of each topic (all topic fields), and stripped out
the SGML field identifiers. For feedback queries, we used the full text of relevant documents.
We did not use: stemming, phrases, syntactic or semantic parsing, word sense disambiguation,
heuristic association, spelling checking or correction, proper noun identification, complex
tokenizers, a controlled vocabulary, a thesaurus, or any manual indexing. Note also that we did
not use any Boolean connectors or proximity operators in query formulation. The implicit
connectives, as in ordinary vector methods, fall somewhere between ORs and ANDs, but with an
additional kind of "fuzziness" introduced by the dimension-reduced association matrix
representation of terms and documents.
2.3.2 Adhoc queries
The topic statements were automatically processed as described above to generate a list of query
terms and their frequencies. This histogram of query terms was used to form a "query vector".
A query vector was the weighted vector sum of its constituent term vectors. A separate query
vector was created for matching against each of 9 databases (DOE1, WSJ1, AP1, FR1, ZIFF1,
WSJ2, AP2, FR2, ZIFF2). For each subcollection, all query terms occurring in that database and
their term weights were used. For example, a DOE1 query vector was created using the term
weights and term vectors from the DOE1 database, and it was compared against the 226k
documents in DOE1. This procedure was repeated for the remaining 8 collections, resulting in a
total of 746k similarities between each adhoc query and the 746k documents in the full TREC
collection. (Note that we always started with the same query terms. However, these terms
usually had somewhat different weights in the different collections, and some terms were not
present in some subcollections.) We used 235 dimensions and a cosine similarity measure for all
collections.
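The query-vector construction and cosine matching above can be sketched in reduced LSI space. The function names, array shapes, and the toy data are illustrative assumptions; the only claims taken from the text are that a query vector is the weighted sum of its constituent term vectors, that terms absent from a subcollection are skipped, and that documents are ranked by cosine similarity.

```python
import numpy as np

def query_vector(term_freqs, term_index, term_vectors, term_weights):
    """Form a query vector as the weighted sum of its term vectors.

    term_vectors: (n_terms, k) matrix of reduced-dimension term vectors
    term_weights: per-term weights for this subcollection
    Terms not present in the subcollection's vocabulary are skipped.
    """
    q = np.zeros(term_vectors.shape[1])
    for term, freq in term_freqs.items():
        i = term_index.get(term)
        if i is not None:
            q += freq * term_weights[i] * term_vectors[i]
    return q

def cosine_similarities(q, doc_vectors):
    """Cosine similarity between the query and every document vector."""
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
    return doc_vectors @ q / np.where(norms == 0, 1.0, norms)

# Toy 2-dimensional example (TREC runs used 235 dimensions):
term_index = {"energy": 0, "reactor": 1}
term_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
term_weights = np.array([1.0, 1.0])
q = query_vector({"energy": 2, "reactor": 1}, term_index,
                 term_vectors, term_weights)
sims = cosine_similarities(q, np.array([[2.0, 1.0], [1.0, 0.0]]))
```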
We submitted results from two sets of adhoc queries. The two sets of adhoc results differed only
in how the information from the 9 subcollections was combined to arrive at a single ranking. In
one case, we combined the information from the 9 databases by simply taking the raw cosines
from the different collections and ranking them from largest to smallest. We call these the
adhoc_topic cosine results. In another case, we normalized the cosines within each
subcollection before combining. That is, within each subcollection, we transformed the cosines
to z-scores so that they had a mean of 0 and a standard deviation of 1. We then combined across
collections using these normalized z-scores rather than the raw cosines, again ranking from
largest to smallest. We call these the adhoc_topic normalized_cosine results. This method of
normalizing scores offers somewhat more flexibility in combining information from many
subcollections. It means, for example, that different numbers of dimensions or similarity
measures could be used in the different subcollections, but combined on the basis of a
comparable score. In the TREC experiments, all comparisons used the same number of
dimensions, so this normalization adjusts for real differences in the data rather than for
statistical artifacts of the analysis.
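The normalized_cosine combination above can be sketched as follows. The function name and input layout are illustrative assumptions; the substance taken from the text is the within-collection z-score transform (mean 0, standard deviation 1) followed by a single merged ranking from largest to smallest.

```python
import numpy as np

def combine_by_zscore(cosines_by_collection):
    """Merge per-subcollection cosines into one ranking via z-scores.

    cosines_by_collection: dict mapping collection name -> array of
    raw cosines for one query. Returns ((collection, doc_index), z)
    pairs ranked from largest to smallest z-score.
    """
    merged = []
    for name, cosines in cosines_by_collection.items():
        c = np.asarray(cosines, dtype=float)
        std = c.std() or 1.0  # guard against a degenerate collection
        z = (c - c.mean()) / std  # mean 0, std 1 within the collection
        merged.extend(((name, i), s) for i, s in enumerate(z))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

# Toy example with two subcollections:
ranking = combine_by_zscore({"A": [0.9, 0.1, 0.5], "B": [0.2, 0.4]})
```

Ranking raw cosines directly (the adhoc_topic cosine runs) would instead sort the concatenated cosines without the per-collection transform.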