Feedback and Mixing Experiments with MatchPlus

S. Gallant, W. Caid, J. Carleton, T. Gutschow, R. Hecht-Nielsen, K. Qing, D. Sudbeck

In: The Second Text REtrieval Conference (TREC-2), NIST Special Publication 500-215, D. K. Harman (ed.), National Institute of Standards and Technology

4. Find the closest document to a given document d by treating V_d as a query vector.

5. Perform relevance feedback. If d is a relevant document for query Q, form a new query vector V'_Q = V_Q + a V_d, where a is some suitable positive number (e.g., 3). (See also [8].) Note that search with V'_Q takes the same amount of time as search with V_Q.

2.1 Context Vector Representations

Context vector representations (or feature space representations) have a long history in cognitive science. Work by Waltz & Pollack [10] had an especially strong influence on the work reported here. They described a neural network model for word sense disambiguation and developed context vector representations (which they termed microfeature representations). See Gallant [2] for more background on context vector representations and word sense disambiguation.

We use context vector representations for document retrieval, with all of the representations being learned from an unlabeled corpus. A main constraint for all of this work is to keep computation and storage reasonable, even for very large corpora.

2.2 Bootstrap Learning

Bootstrapping is a machine learning technique that begins with vectors having randomly generated positive and negative components, and then uses an unlabeled training corpus to modify the vectors so that similarly used terms have similar representations. Previously we had used partially hand-entered components, as described in [2], but we have dispensed with all hand entry in current implementations.

Although there are important proprietary details, the basic idea of bootstrapping is to make a stem's vector more like those of its neighbors by adding a fraction of their vectors to the vector of the stem in question. We make use of a key property of high-dimensional vectors: the ability to be "similar to" a multitude of vectors. This is the same property that allows the vector sum that represents a document to be similar to the individual term vector summands. (Similarity between normalized vectors is measured by their inner product.)

Note that bootstrapping takes local word positioning into account when assigning context vector representations to stems. Moreover, it is nearly invariant with respect to document divisions within the training corpus. This contrasts with methods where stem representations are determined solely by the documents in which the stem lies.
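Since the actual training procedure is proprietary, the following Python sketch only illustrates the stated idea under assumed parameters: each stem's vector is repeatedly made more like its window neighbors' vectors by adding in a fraction of them, and all vectors are normalized at the end. The dimensionality, window size, update fraction alpha, and number of passes are our own placeholders, not MatchPlus settings.

    import numpy as np

    def bootstrap_stem_vectors(corpus, dim=300, window=3, alpha=0.1,
                               passes=1, seed=0):
        """Illustrative bootstrap pass: start from random vectors with
        positive and negative components, then nudge each stem's vector
        toward the vectors of stems occurring nearby.  `corpus` is a
        list of documents, each a list of stems."""
        rng = np.random.default_rng(seed)
        stems = {s for doc in corpus for s in doc}
        vec = {s: rng.uniform(-1.0, 1.0, dim) for s in stems}

        for _ in range(passes):
            for doc in corpus:
                for i, stem in enumerate(doc):
                    # Neighbors within a window, so the update reflects
                    # local word positioning rather than whole-document
                    # co-occurrence.
                    lo, hi = max(0, i - window), min(len(doc), i + window + 1)
                    for neighbor in doc[lo:i] + doc[i + 1:hi]:
                        vec[stem] += alpha * vec[neighbor]

        for s in vec:                         # normalize so that inner
            vec[s] /= np.linalg.norm(vec[s])  # product measures similarity
        return vec

Because each update is local to a window, splitting or concatenating training documents changes the result only near the affected boundaries, which is one way to see the near-invariance to document divisions noted above.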
2.3 Context Vectors for Documents

Once we have generated context vectors for stems, it is easy to compute the context vector for a document: we simply take a weighted sum of the context vectors for all stems appearing in the document (stopwords are discarded) and then normalize the sum. This procedure applies to documents in the training corpus as well as to new documents. When adding up stem context vectors, we can use term frequency weights similar to those of conventional IR systems.

2.4 Context Vectors for Queries; Relevance Feedback

Query context vectors are formed similarly to document context vectors. For each stem in the query we can apply a user-specified weight (default 1.0). Then we sum the corresponding context vectors and normalize the result.

Note that it is easy to implement traditional relevance feedback. The user can specify documents (with weights), and the document context vectors are simply added in with the context vectors from the other terms. We can also find documents close to a given document by using the document context vector as a query context vector.

2.5 Retrieval

The basic retrieval operation is simple: we find the document context vector closest to the query context vector and return it. There are several important points to note.

1. As many documents as desired may be retrieved, and the distances from the query context vector give some measure of retrieval quality.

2. Because document context vectors are normalized, we may simply find the document d that maximizes the dot product with the query context vector: d* = argmax_d V_d · V_Q.

3. It is easy to combine keyword match with context vectors. We first use the match as a filter for documents and return matching documents in order of closeness to the query vector. If all matching documents have been returned and more are requested, the remaining documents can then follow, again ordered by closeness.
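To make Sections 2.3 and 2.4 concrete, here is a Python sketch of document and query vector formation. The function names and the tf_weight hook are our own illustrative choices, not a MatchPlus interface; stem_vecs is assumed to map stems to normalized vectors such as those produced by bootstrapping.

    import numpy as np
    from collections import Counter

    def document_vector(doc_stems, stem_vecs, tf_weight=lambda tf: tf):
        """Context vector for a document: a term-frequency-weighted sum
        of stem context vectors, normalized.  `tf_weight` is a hook for
        the kind of term frequency weighting conventional IR systems
        use (identity by default)."""
        dim = len(next(iter(stem_vecs.values())))
        v = np.zeros(dim)
        for stem, tf in Counter(doc_stems).items():
            if stem in stem_vecs:   # stopwords and unknown stems skipped
                v += tf_weight(tf) * stem_vecs[stem]
        return v / np.linalg.norm(v)

    def query_vector(weighted_stems, stem_vecs, feedback_docs=()):
        """Query context vector from (stem, weight) pairs, default
        weight 1.0.  `feedback_docs` holds (V_d, a) pairs for
        user-specified relevance-feedback documents, which are simply
        added in: V'_Q = V_Q + a * V_d, then normalized."""
        dim = len(next(iter(stem_vecs.values())))
        v = np.zeros(dim)
        for stem, w in weighted_stems:
            if stem in stem_vecs:
                v += w * stem_vecs[stem]
        for dvec, a in feedback_docs:
            v += a * dvec           # relevance feedback
        return v / np.linalg.norm(v)

Because feedback documents live in the same vector space as query terms, adding them changes nothing about the search itself, which matches the observation in item 5 above that search with V'_Q costs the same as search with V_Q.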
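And a corresponding sketch of the retrieval step from Section 2.5: rank documents by dot product with the normalized query vector, so the top document is d* = argmax_d V_d · V_Q. The must_match/doc_terms arguments are a hypothetical interface for the keyword-filter combination, not the actual MatchPlus one.

    def retrieve(query_vec, doc_vecs, k=10, must_match=None, doc_terms=None):
        """Return the top-k (score, doc_id) pairs by dot product with
        the query vector.  If `must_match` is a set of required terms,
        only documents whose term set (in `doc_terms`) contains all of
        them are ranked: keyword match acts as a filter."""
        scored = []
        for doc_id, dvec in doc_vecs.items():
            if must_match is not None and not must_match <= doc_terms[doc_id]:
                continue                  # filtered out by keyword match
            scored.append((float(query_vec @ dvec), doc_id))
        scored.sort(reverse=True)         # highest dot product = closest
        return scored[:k]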