NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
S. Dumais
National Institute of Standards and Technology
Donna K. Harman
2.3 Retrieval
2.3.1 Query processing
Queries were automatically processed in the same way as documents. For queries derived from
the topic statement, we began with the full text of each topic (all topic fields), and stripped out
the SGML field identifiers. For feedback queries, we used the full text of relevant documents.
We did not use: stemming, phrases, syntactic or semantic parsing, word sense disambiguation,
heuristic association, spelling checking or correction, proper noun identification, complex
tokenizers, a controlled vocabulary, a thesaurus, or any manual indexing. Note also that we did
not use any Boolean connectors or proximity operators in query formulation. The implicit
connectives, as in ordinary vector methods, fall somewhere between ORs and ANDs, but with an
additional kind of "fuzziness" introduced by the dimension-reduced association matrix
representation of terms and documents.
2.3.2 Adhoc queries
The topic statements were automatically processed as described above to generate a list of query
terms and their frequencies. This histogram of query terms was used to form a "query vector".
A query vector was the weighted vector sum of its constituent term vectors. A separate query
vector was created for matching against each of 9 databases (DOE1, WSJ1, AP1, FR1, ZIFF1,
WSJ2, AP2, FR2, ZIFF2). For each subcollection, all query terms occurring in that database and
their term weights were used. For example, a DOE1 query vector was created using the term
weights and term vectors from the DOE1 database, and it was compared against the 226k
documents in DOE1. This procedure was repeated for the remaining 8 collections, resulting in a
total of 746k similarities between each adhoc query and the 746k documents in the full TREC
collection. (Note that we always started with the same query terms. However, these terms
usually had somewhat different weights in the different collections, and some terms were not
present in some subcollections.) We used 235 dimensions and a cosine similarity measure for all
collections.
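The query-vector construction and cosine matching above can be sketched in reduced LSI space. The function names, array shapes, and the toy data are illustrative assumptions; the only claims taken from the text are that a query vector is the weighted sum of its constituent term vectors, that terms absent from a subcollection are skipped, and that documents are ranked by cosine similarity.

```python
import numpy as np

def query_vector(term_freqs, term_index, term_vectors, term_weights):
    """Form a query vector as the weighted sum of its term vectors.

    term_vectors: (n_terms, k) matrix of reduced-dimension term vectors
    term_weights: per-term weights for this subcollection
    Terms not present in the subcollection's vocabulary are skipped.
    """
    q = np.zeros(term_vectors.shape[1])
    for term, freq in term_freqs.items():
        i = term_index.get(term)
        if i is not None:
            q += freq * term_weights[i] * term_vectors[i]
    return q

def cosine_similarities(q, doc_vectors):
    """Cosine similarity between the query and every document vector."""
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
    return doc_vectors @ q / np.where(norms == 0, 1.0, norms)

# Toy 2-dimensional example (TREC runs used 235 dimensions):
term_index = {"energy": 0, "reactor": 1}
term_vectors = np.array([[1.0, 0.0], [0.0, 1.0]])
term_weights = np.array([1.0, 1.0])
q = query_vector({"energy": 2, "reactor": 1}, term_index,
                 term_vectors, term_weights)
sims = cosine_similarities(q, np.array([[2.0, 1.0], [1.0, 0.0]]))
```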
We submitted results from two sets of adhoc queries. The two sets of adhoc results differed only
in how the information from the 9 subcollections was combined to arrive at a single ranking. In
one case, we combined the information from the 9 databases by simply taking the raw cosines
from the different collections and ranking them from largest to smallest. We call these the
adhoc_topic cosine results. In another case, we normalized the cosines within each
subcollection before combining. That is, within each subcollection, we transformed the cosines
to z-scores so that they had a mean of 0 and a standard deviation of 1. We then combined across
collections using these normalized z-scores rather than the raw cosines, again ranking from
largest to smallest. We call these the adhoc_topic normalized_cosine results. This method of
normalizing scores offers somewhat more flexibility in combining information from many
subcollections. It means, for example, that different numbers of dimensions or similarity
measures could be used in the different subcollections, but combined on the basis of a
comparable score. In the TREC experiments, all comparisons used the same number of
dimensions, so this normalization adjusts for real differences in the data rather than for
statistical artifacts of the analysis.
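The normalized_cosine combination above can be sketched as follows. The function name and input layout are illustrative assumptions; the substance taken from the text is the within-collection z-score transform (mean 0, standard deviation 1) followed by a single merged ranking from largest to smallest.

```python
import numpy as np

def combine_by_zscore(cosines_by_collection):
    """Merge per-subcollection cosines into one ranking via z-scores.

    cosines_by_collection: dict mapping collection name -> array of
    raw cosines for one query. Returns ((collection, doc_index), z)
    pairs ranked from largest to smallest z-score.
    """
    merged = []
    for name, cosines in cosines_by_collection.items():
        c = np.asarray(cosines, dtype=float)
        std = c.std() or 1.0  # guard against a degenerate collection
        z = (c - c.mean()) / std  # mean 0, std 1 within the collection
        merged.extend(((name, i), s) for i, s in enumerate(z))
    return sorted(merged, key=lambda pair: pair[1], reverse=True)

# Toy example with two subcollections:
ranking = combine_by_zscore({"A": [0.9, 0.1, 0.5], "B": [0.2, 0.4]})
```

Ranking raw cosines directly (the adhoc_topic cosine runs) would instead sort the concatenated cosines without the per-collection transform.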