NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- HNC's MatchPlus System
S. Gallant, R. Hecht-Nielsen, W. Caid, K. Qing, J. Carleton, D. Sudbeck
National Institute of Standards and Technology, Donna K. Harman

[Figure 2: Generating the context vector for a document. Stages shown: Document; Stoplist; Context Vector for Individual Words; Inverse Frequency Weighting; Vector Sum; Normalize (V = V / ||V||); Context Vector for Document.]

whether the fully-automated method can perform as well as standard bootstrapping.

2.3 Context Vectors for Documents

Once we have generated context vectors for stems, it is easy to compute the context vector for a document. We simply take a weighted sum of the context vectors for all stems appearing in the document (footnote 2) and then normalize the sum. See Figure 2. This procedure applies to documents in the training corpus as well as to new documents. When adding up stem context vectors, we can use term frequency weights similar to those of conventional IR systems.

Footnote 2: Stopwords are discarded.

2.4 Context Vectors for Queries; Relevance Feedback

Query context vectors are formed similarly to document context vectors. For each stem in the query we can apply a user-specified weight (default 1.0). Then we can sum the corresponding context vectors and normalize the result.

Note that it is easy to implement relevance feedback. The user can specify documents (with weights), and the document context vectors are merely added in with the context vectors from the other terms. We can also find documents close to a given document by using the document context vector as a query context vector.

2.5 Retrieval

The basic retrieval operation is simple; we find the document context vector closest to the query context vector and return it. There are several important points to note.

1. As many documents as desired may be retrieved, and the distances from the query context vector give some measure of retrieval quality.
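As a rough illustration of Sections 2.3 through 2.5, the following minimal sketch builds document and query context vectors and ranks documents by closeness. It is not the MatchPlus implementation. The function names, the two-dimensional toy stem vectors, and the stoplist are all hypothetical, and the use of the dot product as the closeness measure is an assumption (for unit-length vectors it gives the same ranking as Euclidean distance).

```python
import math


def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def weighted_sum(pairs, dim):
    """Sum (vector, weight) pairs component by component."""
    total = [0.0] * dim
    for vec, w in pairs:
        for i, x in enumerate(vec):
            total[i] += w * x
    return total


def document_vector(stems, stem_vectors, stopwords, dim):
    """Section 2.3: discard stopwords, sum stem vectors, normalize.

    Each occurrence of a stem is added once, so repeated stems receive
    a raw term-frequency weight (one possible weighting; an assumption).
    """
    pairs = [(stem_vectors[s], 1.0) for s in stems
             if s not in stopwords and s in stem_vectors]
    return normalize(weighted_sum(pairs, dim))


def query_vector(weighted_stems, stem_vectors, dim, feedback=()):
    """Section 2.4: weighted sum of stem vectors, plus optional
    relevance-feedback (document_vector, weight) pairs, normalized."""
    pairs = [(stem_vectors[s], w) for s, w in weighted_stems.items()
             if s in stem_vectors]
    pairs.extend(feedback)
    return normalize(weighted_sum(pairs, dim))


def retrieve(query, doc_vectors, k):
    """Section 2.5: return the k closest documents with their scores.

    Closeness is the dot product (an assumption, see the lead-in).
    """
    scored = [(doc_id, sum(q * d for q, d in zip(query, vec)))
              for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy data: 2-dimensional context vectors for three stems.
DIM = 2
stem_vectors = {"stock": [1.0, 0.0], "market": [0.8, 0.6], "river": [0.0, 1.0]}
stopwords = {"the"}
docs = {"d1": ["the", "stock", "market"], "d2": ["the", "river"]}

doc_vectors = {doc_id: document_vector(stems, stem_vectors, stopwords, DIM)
               for doc_id, stems in docs.items()}
query = query_vector({"stock": 1.0}, stem_vectors, DIM)
results = retrieve(query, doc_vectors, k=2)

# Relevance feedback: add document d2's vector to the query with weight 0.5.
feedback_query = query_vector({"stock": 1.0}, stem_vectors, DIM,
                              feedback=[(doc_vectors["d2"], 0.5)])
```

Because every vector is normalized, the scores returned by `retrieve` are comparable across documents, which is what allows them to serve as a rough measure of retrieval quality when more than one document is returned.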