NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- HNC's MatchPlus System
S. Gallant
R. Hecht-Nielsen
W. Caid
K. Qing
J. Carleton
D. Sudbeck
National Institute of Standards and Technology
Donna K. Harman
HNC's MatchPlus System
Stephen I. Gallant* William R. Caid† Joel Carleton†
Robert Hecht-Nielsen† Kent Pu Qing† David Sudbeck†
HNC Inc
1 Introduction
HNC is developing a neural network related approach
to document retrieval called MatchPlus¹. Goals of
this approach include high precision/recall performance,
ease of use, incorporation of machine learning
algorithms, and sensitivity to similarity of use.
To understand our notion of sensitivity to similar-
ity of use, consider the four words: `car', `automobile',
`driving', and `hippopotamus'. `Car' and `automo-
bile' are synonyms and they very often occur together
in documents; `car' and `driving' are related words
(but not synonyms) that sometimes occur together
in documents; and `car' and `hippopotamus' are es-
sentially unrelated words that seldom occur within
the same document. We want the system to be sen-
sitive to such similarity of use, much like a built-in
thesaurus, yet without the drawbacks of a thesaurus,
such as domain dependence or the need for hand-
entry of synonyms. In particular we want a query on
`car' to prefer a document containing `drive' to one
containing `hippopotamus', and we want the system
itself to be able to figure this out from the corpus.
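The preference described above can be sketched numerically. The following toy example is our own illustration, not MatchPlus code: the 4-dimensional vectors are made up, standing in for learned ~300-dimensional context vectors, but they show how dot products between normalized vectors rank `automobile` and `driving` above `hippopotamus` for a `car` query.

```python
# Illustrative sketch (not HNC's actual vectors): if learned context
# vectors place words with similar use nearby, a query on "car" scores
# "driving" above "hippopotamus" via the dot product.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

# Hypothetical 4-dimensional stand-ins for ~300-dim context vectors.
vectors = {
    "car":          normalize([0.9, 0.8, 0.1, 0.0]),
    "automobile":   normalize([0.9, 0.7, 0.2, 0.0]),
    "driving":      normalize([0.6, 0.9, 0.1, 0.1]),
    "hippopotamus": normalize([0.0, 0.1, 0.1, 0.9]),
}

q = vectors["car"]
scores = {w: dot(q, v) for w, v in vectors.items() if w != "car"}
# similarity of use: automobile > driving > hippopotamus
```

With normalized vectors the dot product acts as a similarity score, which is exactly the behavior the built-in-thesaurus analogy calls for.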
The implementation of MatchPlus is motivated by
neural networks, and designed to interface with neu-
ral network learning algorithms. High-dimensional
(≈300) vectors, called context vectors, represent
word stems, documents, and queries in the same vec-
tor space. This representation permits one type of
neural network learning algorithm to generate stem
context vectors that are sensitive to similarity of use,
and a more standard neural network algorithm to per-
form routing and automatic query modification based
upon user feedback.
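As a rough illustration of how a corpus alone can induce such vectors, the sketch below uses a generic co-occurrence-driven update. This is an assumption on our part: the paper does not specify MatchPlus's actual (patent-pending) learning rule, and the learning rate and dimensionality here are arbitrary. Stems that appear in the same document simply have their context vectors nudged toward each other.

```python
# Generic co-occurrence sketch (NOT HNC's proprietary learning rule):
# nudge the context vectors of stems that co-occur in a document toward
# each other, so words with similar use end up close in vector space.
import math
import random

DIM = 8      # stand-in for the ~300 dimensions used by MatchPlus
RATE = 0.1   # learning rate (assumed)

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def train(docs, epochs=50, seed=0):
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    vecs = {w: normalize([rng.uniform(-1, 1) for _ in range(DIM)])
            for w in vocab}
    for _ in range(epochs):
        for doc in docs:
            for a in doc:
                for b in doc:
                    if a != b:
                        # move a's vector a small step toward b's
                        vecs[a] = normalize([
                            x + RATE * y for x, y in zip(vecs[a], vecs[b])
                        ])
    return vecs

docs = [["car", "automobile"], ["car", "driving"],
        ["hippopotamus", "zoo"]]
vecs = train(docs)
dot = lambda u, v: sum(a * b for a, b in zip(u, v))
# after training, "car" sits closer to "driving" than to "hippopotamus"
```

The point is only that co-occurrence statistics, without any hand-entered thesaurus, are enough signal to separate related from unrelated words.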
Queries can take the form of terms, full documents,
parts of documents, and/or conventional Boolean
expressions. Optional weights may also be included.
The following sections give a brief overview of our
implementation, and look at some preliminary test
results. For a previous description of the approach
and comments on complexity considerations, see [OCRerr];
a longer journal article is in preparation.
2 The Context Vector Approach
One of the most important aspects of MatchPlus is
its representation of words (stems), documents, and
queries by high (≈300) dimensional vectors called
context vectors. By representing all objects in the
same high dimensional space we can easily:
1. Form a document context vector as the
(weighted) vector sum of the context vectors for
those words (stems) contained in the document.
2. Form a query context vector as the (weighted)
vector sum of the context vectors for those words
(stems) contained in the query.
3. Compute the distance of a query Q to any document.
Moreover, if document context vectors are
normalized, the closest document d (in Euclidean
distance) has the context vector V_d that gives the
highest dot product with the query context vector V_Q:

   <closest d> = { d | V_d · V_Q is maximized for d ∈ D }

   (proof: ||V_d - V_Q||^2 = ||V_d||^2 + ||V_Q||^2 - 2(V_d · V_Q)
   = const - 2(V_d · V_Q).)
4. Find the closest document to a given document
d by treating Vd as a query vector.
5. Perform relevance feedback. If d is a relevant
document for query Q, form a new query vector
V_Q' = V_Q + αV_d.

*124 Mt Auburn St, Suite 200, Cambridge, MA 02138.
†5501 Oberlin Drive, San Diego, CA 92121.
¹Patents pending.
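The five operations above can be sketched in a few lines. This is an illustrative toy, not HNC's implementation: the 3-dimensional stem vectors are assumed, and helper names such as `combine` and `feedback` are ours, not part of MatchPlus.

```python
# Minimal sketch of the context vector operations: stems, documents,
# and queries share one vector space, so weighted sums and dot
# products do all the work.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]

def combine(stem_vecs, stems, weights=None):
    """(1)/(2): document or query vector = weighted sum of stem vectors."""
    weights = weights or [1.0] * len(stems)
    dim = len(next(iter(stem_vecs.values())))
    out = [0.0] * dim
    for w, s in zip(weights, stems):
        out = [o + w * x for o, x in zip(out, stem_vecs[s])]
    return normalize(out)

def closest(doc_vecs, q_vec):
    """(3)/(4): with normalized vectors, max dot product = min distance."""
    return max(doc_vecs, key=lambda d: dot(doc_vecs[d], q_vec))

def feedback(q_vec, d_vec, alpha=0.5):
    """(5): relevance feedback, V_Q' = V_Q + alpha * V_d."""
    return normalize([q + alpha * d for q, d in zip(q_vec, d_vec)])

# Toy 3-dim stem vectors (assumed; MatchPlus uses ~300 dims).
stems = {"car": [1.0, 0.2, 0.0],
         "driving": [0.8, 0.5, 0.1],
         "hippopotamus": [0.0, 0.1, 1.0]}

docs = {"d1": combine(stems, ["car", "driving"]),
        "d2": combine(stems, ["hippopotamus"])}

q = combine(stems, ["car"])
best = closest(docs, q)        # the "car" query retrieves d1
q2 = feedback(q, docs[best])   # new query pulled toward d1
```

Item (4) falls out for free: to find the document nearest a document d, pass V_d to `closest` in place of a query vector.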