SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- HNC's MatchPlus System
chapter
S. Gallant
R. Hecht-Nielsen
W. Caid
K. Qing
J. Carleton
D. Sudbeck
National Institute of Standards and Technology
Donna K. Harman
Figure 2: Generating the context vector for a document.
[Diagram labels: Document; Stoplist; Context Vector for Individual
Words; Inverse Frequency Weighting; Vector Sum; Normalize; Context
Vector for Document.]
whether the fully-automated method can perform as
well as standard bootstrapping.
2.3 Context Vectors for Documents
Once we have generated context vectors for stems, it
is easy to compute the context vector for a document.
We simply take a weighted sum of the context vectors
for all stems appearing in the document[2] and then
normalize the sum. See Figure 2. This procedure applies
to documents in the training corpus as well as to new
documents. When adding up stem context vectors
we can use term frequency weights similar to those
used in conventional IR systems.
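The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the MatchPlus implementation: the names `document_context_vector`, `stem_vectors`, and `weight` are hypothetical, the stem vectors are toy two-dimensional values, and raw term frequency is assumed as the default weight because the text only says that term frequency weights "similar to" conventional IR systems can be used.

```python
import math
from collections import Counter


def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def document_context_vector(stems, stem_vectors, stopwords=frozenset(), weight=None):
    """Weighted sum of stem context vectors, then normalization.

    stems        -- list of stems in the document, in order of appearance
    stem_vectors -- dict mapping stem -> context vector (assumed non-empty)
    stopwords    -- stems to discard before summing
    weight       -- optional function (stem, term_frequency) -> float;
                    defaults to raw term frequency (an assumption)
    """
    counts = Counter(s for s in stems if s not in stopwords and s in stem_vectors)
    dim = len(next(iter(stem_vectors.values())))
    total = [0.0] * dim
    for stem, tf in counts.items():
        w = weight(stem, tf) if weight else float(tf)
        for i, x in enumerate(stem_vectors[stem]):
            total[i] += w * x
    return normalize(total)


# Toy example: two stems with orthogonal context vectors.
stem_vectors = {"market": [1.0, 0.0], "stock": [0.0, 1.0]}
doc = ["the", "stock", "market", "stock"]
vec = document_context_vector(doc, stem_vectors, stopwords={"the"})
```

Because the same function works on any list of stems, it applies equally to training documents and to new documents, as the text notes. A different weighting scheme can be tried by passing another `weight` function.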
2.4 Context Vectors for Queries; Relevance Feedback
Query context vectors are formed similarly to document
context vectors. For each stem in the query we
can apply a user-specified weight (default 1.0). Then
we can sum the corresponding context vectors and
normalize the result.
[2] Stopwords are discarded.
Relevance feedback is easy to implement. The user
can specify documents (with weights), and their
document context vectors are simply added in with the
weighted context vectors of the query terms. We
can also find documents close to a given document by
using its document context vector as the query context
vector.
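Query formation and relevance feedback can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the system's actual code: the function name `query_context_vector`, its parameters, and the toy two-dimensional vectors are all hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def query_context_vector(weighted_stems, stem_vectors, feedback=()):
    """Build a query context vector.

    weighted_stems -- list of (stem, weight) pairs; a weight of 1.0 is the default
    stem_vectors   -- dict mapping stem -> context vector (assumed non-empty)
    feedback       -- optional list of (document_context_vector, weight) pairs
                      chosen by the user for relevance feedback
    """
    dim = len(next(iter(stem_vectors.values())))
    total = [0.0] * dim
    # Weighted sum over the query stems.
    for stem, w in weighted_stems:
        if stem in stem_vectors:
            for i, x in enumerate(stem_vectors[stem]):
                total[i] += w * x
    # Relevance feedback: add the weighted document context vectors.
    for doc_vec, w in feedback:
        for i, x in enumerate(doc_vec):
            total[i] += w * x
    return normalize(total)


stem_vectors = {"market": [1.0, 0.0], "stock": [0.0, 1.0]}
q = query_context_vector([("market", 1.0)], stem_vectors)
q_fb = query_context_vector(
    [("market", 1.0)], stem_vectors, feedback=[([0.0, 1.0], 1.0)]
)
```

In this toy case the feedback document pulls the query vector halfway toward the second axis. Passing only `feedback` with no query stems corresponds to using a document context vector as the query.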
2.5 Retrieval
The basic retrieval operation is simple: we find the
document context vector closest to the query context
vector and return the corresponding document. There
are several important points to note.
1. As many documents as desired may be retrieved,
and the distances from the query context vector
give some measure of retrieval quality.
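Retrieving the k closest documents together with their scores might look like the following Python sketch. The closeness measure is an assumption: this passage does not specify one, so the dot product of unit-length vectors (cosine similarity) is used here, with higher values meaning closer. The names `retrieve` and `doc_vectors` are hypothetical, and the exhaustive scan is only for illustration.

```python
import heapq


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec, doc_vectors, k=1):
    """Return the k documents closest to the query as (score, doc_id) pairs.

    Assumes all vectors are already normalized, so the dot product is the
    cosine of the angle between query and document (an assumed measure).
    """
    scored = ((dot(query_vec, v), doc_id) for doc_id, v in doc_vectors.items())
    return heapq.nlargest(k, scored)


# Toy collection of three unit-length document context vectors.
doc_vectors = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
hits = retrieve([1.0, 0.0], doc_vectors, k=2)
```

Here `hits` lists `d1` first and `d2` second, and the accompanying scores play the role of the distances mentioned above as a rough indication of retrieval quality.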