NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1), Donna K. Harman, editor, National Institute of Standards and Technology.

TIPSTER Panel -- HNC's MatchPlus System

Stephen I. Gallant,* William R. Caid,† Joel Carleton,† Robert Hecht-Nielsen,† Kent Pu Qing,† David Sudbeck†
HNC Inc.

1 Introduction

HNC is developing a neural network related approach to document retrieval called MatchPlus.[1] Goals of this approach include high precision/recall performance, ease of use, incorporation of machine learning algorithms, and sensitivity to similarity of use.

To understand our notion of sensitivity to similarity of use, consider the four words `car', `automobile', `driving', and `hippopotamus'. `Car' and `automobile' are synonyms, and they very often occur together in documents; `car' and `driving' are related words (but not synonyms) that sometimes occur together in documents; and `car' and `hippopotamus' are essentially unrelated words that seldom occur within the same document. We want the system to be sensitive to such similarity of use, much like a built-in thesaurus, yet without the drawbacks of a thesaurus, such as domain dependence or the need for hand-entry of synonyms. In particular, we want a query on `car' to prefer a document containing `drive' to one containing `hippopotamus', and we want the system itself to be able to figure this out from the corpus.

The implementation of MatchPlus is motivated by neural networks, and designed to interface with neural network learning algorithms. High-dimensional (≈300) vectors, called context vectors, represent word stems, documents, and queries in the same vector space.
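The notion of similarity of use can be illustrated with a simple normalized co-occurrence statistic. The sketch below is only an illustrative measure, not the neural network learning algorithm MatchPlus actually uses to train stem context vectors; the toy corpus and the function name `cooccurrence_similarity` are invented for the example.

```python
from collections import defaultdict
from itertools import combinations
import math

def cooccurrence_similarity(docs):
    """Return sim(a, b) = (# docs containing both a and b) / sqrt(df(a) * df(b))."""
    df = defaultdict(int)          # document frequency per word
    pair_df = defaultdict(int)     # document frequency per word pair
    for doc in docs:
        words = set(doc.split())
        for w in words:
            df[w] += 1
        for a, b in combinations(sorted(words), 2):
            pair_df[(a, b)] += 1

    def sim(a, b):
        a, b = sorted((a, b))
        denom = math.sqrt(df[a] * df[b])
        return pair_df[(a, b)] / denom if denom else 0.0

    return sim

# Hypothetical toy corpus, chosen so that the paper's example ordering holds.
docs = [
    "car automobile engine",
    "car automobile road",
    "driving car trip",
    "driving lesson",
    "hippopotamus river animal",
]
sim = cooccurrence_similarity(docs)
```

On this corpus, `sim('car', 'automobile')` exceeds `sim('car', 'driving')`, which in turn exceeds `sim('car', 'hippopotamus')` (which is zero), mirroring the synonym / related-word / unrelated-word distinction above.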
This representation permits one type of neural network learning algorithm to generate stem context vectors that are sensitive to similarity of use, and a more standard neural network algorithm to perform routing and automatic query modification based upon user feedback.

Queries can take the form of terms, full documents, parts of documents, and/or conventional Boolean expressions. Optional weights may also be included.

The following sections give a brief overview of our implementation and look at some preliminary test results. A longer journal article with a fuller description of the approach and comments on complexity considerations is in preparation.

2 The Context Vector Approach

One of the most important aspects of MatchPlus is its representation of words (stems), documents, and queries by high-dimensional (≈300) vectors called context vectors. By representing all objects in the same high-dimensional space we can easily:

1. Form a document context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the document.

2. Form a query context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the query.

3. Compute the distance of a query Q to any document. Moreover, if document context vectors are normalized, the closest document d (in Euclidean distance) has the context vector V_d that gives the highest dot product with the query context vector V_Q:

   closest d = { d | V_d · V_Q is maximized over d ∈ D }

   (Proof: ||V_d − V_Q||² = ||V_d||² + ||V_Q||² − 2(V_d · V_Q) = const − 2(V_d · V_Q).)

4. Find the closest document to a given document d by treating V_d as a query vector.

5. Perform relevance feedback. If d is a relevant document for query Q, form a new query vector

   V_Q' = V_Q + aV_d

   where a is a weighting parameter.

* 124 Mt Auburn St, Suite 200, Cambridge, MA 02138.
† 5501 Oberlin Drive, San Diego, CA 92121.
[1] Patents pending.
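The five operations above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the stem vectors here are hand-picked 3-dimensional stand-ins (MatchPlus learns roughly 300-dimensional vectors from the corpus), the function names are invented for the example, and `closest` scans all documents linearly.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def context_vector(stem_vectors, stems, weights=None):
    """Items 1-2: (weighted) vector sum of stem context vectors, normalized."""
    dim = len(next(iter(stem_vectors.values())))
    v = [0.0] * dim
    for i, s in enumerate(stems):
        w = weights[i] if weights else 1.0
        for j, x in enumerate(stem_vectors.get(s, [0.0] * dim)):
            v[j] += w * x
    return normalize(v)

def closest(v_q, doc_vectors):
    """Item 3: with normalized vectors, max dot product = min Euclidean distance."""
    return max(doc_vectors,
               key=lambda d: sum(a * b for a, b in zip(v_q, doc_vectors[d])))

def feedback(v_q, v_d, a=0.5):
    """Item 5: relevance feedback V_Q' = V_Q + a * V_d, renormalized."""
    return normalize([q + a * d for q, d in zip(v_q, v_d)])

# Hypothetical 3-d stem context vectors for the example.
stem_vectors = {
    'car': [1.0, 0.0, 0.0],
    'automobile': [0.9, 0.1, 0.0],
    'hippopotamus': [0.0, 0.0, 1.0],
}
doc_vectors = {
    'd1': context_vector(stem_vectors, ['car', 'automobile']),
    'd2': context_vector(stem_vectors, ['hippopotamus']),
}
v_q = context_vector(stem_vectors, ['car'])
```

With these vectors, a query on `car` retrieves `d1` rather than `d2`, and item 4 falls out for free: passing a document's vector as `v_q` finds its nearest neighbor.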