NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Feedback and Mixing Experiments with MatchPlus
S. Gallant
W. Caid
J. Carleton
T. Gutschow
R. Hecht-Nielsen
K. Qing
D. Sudbeck
4. Find the closest document to a given document d by treating V_d as a query vector.

5. Perform relevance feedback. If d is a relevant document for query Q, form a new query vector

   V'_Q = V_Q + a·V_d

   where a is some suitable positive number (e.g., 3). (See also [8].) Note that search with V'_Q takes the same amount of time as search with V_Q.
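A minimal sketch of this feedback update, assuming unit-normalized numpy vectors; the function and variable names are ours, and the paper does not specify an implementation:

```python
import numpy as np

def feedback_query(query_vec: np.ndarray, doc_vec: np.ndarray,
                   a: float = 3.0) -> np.ndarray:
    """Form V'_Q = V_Q + a * V_d for a relevant document d.

    V'_Q has the same dimensionality as V_Q, so searching with it
    costs no more than searching with the original query vector.
    """
    new_q = query_vec + a * doc_vec
    # Renormalize so inner products stay comparable across queries
    # (queries are normalized when first formed; see Section 2.4).
    return new_q / np.linalg.norm(new_q)
```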
2.1 Context Vector Representations
Context vector representations (or feature space rep-
resentations) have a long history in cognitive science.
Work by Waltz & Pollack [10] had an especially strong
influence on the work reported here. They described a
neural network model for word sense disambiguation
and developed context vector representations (which
they termed microfeature representations). See Gal-
lant [2] for more background on context vector repre-
sentations and word sense disambiguation.
We use context vector representations for docu-
ment retrieval, with all of the representation being
learned from an unlabeled corpus. A main constraint
for all of this work is to keep computation and storage
reasonable, even for very large corpora.
2.2 Bootstrap Learning
Bootstrapping is a machine learning technique that
begins with vectors having randomly generated pos-
itive and negative components, and then uses an
unlabeled training corpus to modify the vectors so
that similarly used terms have similar representa-
tions. Previously we had used partially hand-entered
components as described in [2], but we have dispensed
with all hand entry in current implementations.
Although there are important proprietary details,
the basic idea for bootstrapping is to make a stem's
vector more like its neighbors by adding a fraction
of their vectors to the stem in question. We make
use of a key property of high-dimensional vectors:
the ability to be "similar to" a multitude of vectors.
This is the same property that allows the vector sum
that represents a document to be similar to individual
term vector summands. (Similarity between normal-
ized vectors is measured by their inner product.)
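Since the precise bootstrapping procedure is proprietary, the following Python sketch illustrates only the neighbor-averaging idea stated above; the window size, update rate, renormalization step, and all names are our assumptions, not details from the paper:

```python
import numpy as np

def bootstrap_pass(stem_vectors: dict[str, np.ndarray],
                   corpus: list[list[str]],
                   window: int = 3,
                   rate: float = 0.1) -> None:
    """One pass: nudge each stem's vector toward its local neighbors.

    `corpus` is a list of stemmed, stopword-free token sequences;
    every stem is assumed to start with a random vector. Only a
    window of nearby stems influences each update, so the result
    is nearly invariant to where document boundaries fall.
    """
    for tokens in corpus:
        for i, stem in enumerate(tokens):
            # Collect stems within +/- `window` positions of this one.
            neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            if not neighbors:
                continue
            delta = np.sum([stem_vectors[t] for t in neighbors], axis=0)
            # Add a fraction of the averaged neighbor vectors, then
            # renormalize, so similarly used stems drift together.
            v = stem_vectors[stem] + rate * delta / len(neighbors)
            stem_vectors[stem] = v / np.linalg.norm(v)
```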
Note that bootstrapping takes into account local
word positioning when assigning the context vector
representation for stems. Moreover, it is nearly in-
variant with respect to document divisions within the
training corpus. This contrasts with those methods
where stem representations are determined solely by
those documents in which the stem lies.
2.3 Context Vectors for Documents
Once we have generated context vectors for stems it is
easy to compute the context vector for a document.
We simply take a weighted sum of context vectors
for all stems appearing in the document (stopwords are discarded) and then
normalize the sum. This procedure applies to docu-
ments in the training corpus as well as to new docu-
ments. When adding up stem context vectors, we can
use term frequency weights similar to conventional IR
systems.
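A minimal sketch of this construction in Python, assuming raw term frequencies as the weights (the paper says only that the weights are similar to those of conventional IR systems; all names here are illustrative):

```python
import numpy as np
from collections import Counter

def document_vector(stems: list[str],
                    stem_vectors: dict[str, np.ndarray]) -> np.ndarray:
    """Weighted sum of stem context vectors, normalized to unit length.

    `stems` is the document's stem sequence with stopwords removed;
    here the weight of each stem is its raw term frequency.
    """
    counts = Counter(stems)
    v = np.sum([tf * stem_vectors[s] for s, tf in counts.items()], axis=0)
    return v / np.linalg.norm(v)
```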
2.4 Context Vectors for Queries; Relevance Feedback
Query context vectors are formed similarly to docu-
ment context vectors. For each stem in the query we
can apply a user-specified weight (default 1.0). Then
we can sum the corresponding context vectors and
normalize the result.
Note that it is easy to implement traditional rel-
evance feedback. The user can specify documents
(with weights) and the document context vectors are
merely added in with the context vectors from the
other terms. We can also find documents close to a
given document by using the document context vec-
tor as a query context vector.
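A sketch combining query formation, user-specified term weights (default 1.0), and relevance feedback as described above; the second function ranks documents by inner product, anticipating the basic retrieval operation of the next subsection. The data layout (a matrix of normalized document vectors) and all names are our assumptions:

```python
import numpy as np

def query_vector(term_weights: dict[str, float],
                 stem_vectors: dict[str, np.ndarray],
                 feedback_docs: dict[int, float] | None = None,
                 doc_matrix: np.ndarray | None = None) -> np.ndarray:
    """Weighted sum of stem vectors, plus optional weighted document
    vectors supplied as relevance feedback, normalized at the end."""
    v = np.sum([w * stem_vectors[s] for s, w in term_weights.items()], axis=0)
    for doc_id, w in (feedback_docs or {}).items():
        v = v + w * doc_matrix[doc_id]   # document vectors are simply added in
    return v / np.linalg.norm(v)

def rank_documents(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 10):
    """Return the k documents whose normalized context vectors are
    closest to the query, i.e. those maximizing the dot product."""
    scores = doc_matrix @ query_vec
    top = np.argsort(-scores)[:k]
    return [(int(d), float(scores[d])) for d in top]
```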
2.5 Retrieval
The basic retrieval operation is simple; we find the
document context vector closest to the query context
vector and return it. There are several important
points to note.
1. As many documents as desired may be retrieved,
and the distances from the query context vector
give some measure of retrieval quality.
2. Because document context vectors are normalized, we may simply find the document d that maximizes the dot product with the query context vector, V_Q:

   max_d V_Q · V_d
3. It is easy to combine keyword match with con-
text vectors. We first use the match as a filter
for documents and return documents in order by
closeness to the query vectors. If all matching