NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Feedback and Mixing Experiments with MatchPlus

Stephen I. Gallant*
William R. Caid†, Joel Carleton†, Todd W. Gutschow†,
Robert Hecht-Nielsen†, Kent Pu Qing†, David Sudbeck†

HNC Inc.

* 124 Mt Auburn St, Suite 200, Cambridge, MA 02138.
† 5501 Oberlin Drive, San Diego, CA 92121.
¹ Patents pending.
Abstract
We briefly review the MatchPlus system and describe recent developments with learning word representations, experiments with relevance feedback using neural network learning algorithms, and methods for combining different output lists.
1 Introduction
HNC is developing a neural network related approach to document retrieval called MatchPlus¹. Goals of this approach include high precision/recall performance, ease of use, incorporation of machine learning algorithms, and sensitivity to similarity of use.
To understand our notion of sensitivity to similarity of use, consider the four words: `car', `automobile', `driving', and `hippopotamus'. `Car' and `automobile' are synonyms and they very often occur together in documents; `car' and `driving' are related words (but not synonyms) that sometimes occur together in documents; and `car' and `hippopotamus' are essentially unrelated words that seldom occur within the same document. We want the system to be sensitive to such similarity of use, much like a built-in thesaurus, yet without the drawbacks of a thesaurus, such as domain dependence or the need for hand-entry of synonyms. In particular we want a query on `car' to prefer a document containing `drive' to one containing `hippopotamus', and we want the system itself to be able to figure this out from the corpus.
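As a concrete illustration of the desired behavior (this is our sketch, not the authors' code, and the vector values are toy numbers chosen for illustration), similarity of use falls out of vector proximity: stem vectors learned from co-occurrence should place `driving' much closer to `car' than `hippopotamus':

```python
import numpy as np

# Toy stem context vectors (illustrative values only; real MatchPlus
# vectors are learned from the corpus and have roughly 300 dimensions).
vectors = {
    "car":          np.array([0.9, 0.4, 0.1]),
    "automobile":   np.array([0.8, 0.5, 0.1]),  # near-synonym of "car"
    "driving":      np.array([0.6, 0.7, 0.2]),  # related, not a synonym
    "hippopotamus": np.array([0.1, 0.1, 0.9]),  # essentially unrelated
}

def cosine(u, v):
    """Cosine similarity: dot product divided by the vector norms."""
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Desired ordering relative to "car": automobile > driving > hippopotamus.
for word in ("automobile", "driving", "hippopotamus"):
    print(word, round(cosine(vectors["car"], vectors[word]), 3))
```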
The implementation of MatchPlus is motivated by neural networks, and designed to interface with neural network learning algorithms. High-dimensional (≈ 300) vectors, called context vectors, represent word stems, documents, and queries in the same vector space. This representation permits one type of neural network learning algorithm to generate stem context vectors that are sensitive to similarity of use, and a more standard neural network algorithm to perform routing and automatic query modification based upon user feedback, as described below.
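The update rule is not spelled out at this point in the paper; as a minimal sketch of what feedback-driven query modification can look like, one standard choice would be an LMS-style (Widrow-Hoff) update that nudges the query context vector toward documents judged relevant and away from those judged non-relevant. The function name and the learning rate below are our assumptions:

```python
import numpy as np

def update_query(v_q, judged_docs, lr=0.1):
    """LMS-style query modification from user feedback (a sketch, not
    necessarily the authors' exact algorithm).

    v_q         -- query context vector (1-D numpy array)
    judged_docs -- iterable of (v_d, relevant) pairs, v_d unit-normalized
    lr          -- learning rate; a hypothetical hyperparameter
    """
    for v_d, relevant in judged_docs:
        score = float(v_q @ v_d)            # current match score
        target = 1.0 if relevant else 0.0   # desired score for this document
        v_q = v_q + lr * (target - score) * v_d
    return v_q / np.linalg.norm(v_q)        # keep the query vector normalized
```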
Queries can take the form of terms, full documents, parts of documents, and/or conventional Boolean expressions. Optional weights may also be included.
The following sections give a brief overview of our implementation, and look at some recent improvements and experiments. For a previous description of the approach and comments on complexity considerations see [1]; a longer journal article is in preparation.
2 The Context Vector Approach
One of the most important aspects of MatchPlus is its representation of words (stems), documents, and queries by high (≈ 300) dimensional vectors called context vectors. By representing all objects in the same high dimensional space we can easily do the following (a short code sketch after this list ties the three operations together):
1. Form a document context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the document.
2. Form a query context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the query.
3. Compute the distance of a query Q to any document. Moreover, if document context vectors are normalized, the closest document d (in Euclidean distance) has the context vector $V_d$ that gives the highest dot product with the query context vector $V_Q$:

   $$\langle \text{closest } d \rangle = \{\, d \mid V_d \cdot V_Q \text{ is maximized over } d \in D \,\}$$

   (proof: $\|V_d - V_Q\|^2 = \|V_d\|^2 + \|V_Q\|^2 - 2(V_d \cdot V_Q) = \text{const} - 2(V_d \cdot V_Q)$, since $\|V_d\| = 1$ for normalized document vectors and $\|V_Q\|$ is fixed for a given query; minimizing the distance therefore maximizes the dot product).
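Tying the three operations together, here is a minimal sketch (the function names, the uniform default weight of 1.0, and the handling of unknown stems are our assumptions, not details from the paper):

```python
import numpy as np

def context_vector(stem_vectors, stems, weights=None):
    """Points 1 and 2: (weighted) vector sum of the stem context vectors,
    normalized to unit length. Works for documents and queries alike."""
    dim = len(next(iter(stem_vectors.values())))
    v = np.zeros(dim)
    weights = weights or {}
    for s in stems:
        if s in stem_vectors:             # unknown stems are simply skipped
            v += weights.get(s, 1.0) * stem_vectors[s]
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def rank_documents(v_q, doc_vectors):
    """Point 3: with unit-length document vectors, sorting by dot product
    with the query vector is the same as sorting by Euclidean distance."""
    return sorted(doc_vectors,
                  key=lambda d: float(doc_vectors[d] @ v_q),
                  reverse=True)
```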