NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Feedback and Mixing Experiments with MatchPlus

Stephen I. Gallant*  William R. Caid†  Joel Carleton†  Todd W. Gutschow†
Robert Hecht-Nielsen†  Kent Pu Qing†  David Sudbeck†

HNC Inc.

* 124 Mt Auburn St, Suite 200, Cambridge, MA 02138
† 5501 Oberlin Drive, San Diego, CA 92121
¹ Patents pending.

Abstract

We briefly review the MatchPlus system and describe recent developments with learning word representations, experiments with relevance feedback using neural network learning algorithms, and methods for combining different output lists.

1 Introduction

HNC is developing a neural network related approach to document retrieval called MatchPlus¹. Goals of this approach include high precision/recall performance, ease of use, incorporation of machine learning algorithms, and sensitivity to similarity of use.

To understand our notion of sensitivity to similarity of use, consider the four words: 'car', 'automobile', 'driving', and 'hippopotamus'. 'Car' and 'automobile' are synonyms and they very often occur together in documents; 'car' and 'driving' are related words (but not synonyms) that sometimes occur together in documents; and 'car' and 'hippopotamus' are essentially unrelated words that seldom occur within the same document. We want the system to be sensitive to such similarity of use, much like a built-in thesaurus, yet without the drawbacks of a thesaurus, such as domain dependence or the need for hand-entry of synonyms. In particular, we want a query on 'car' to prefer a document containing 'drive' to one containing 'hippopotamus', and we want the system itself to be able to figure this out from the corpus.

The implementation of MatchPlus is motivated by neural networks and designed to interface with neural network learning algorithms. High-dimensional (≈ 300) vectors, called context vectors, represent word stems, documents, and queries in the same vector space. This representation permits one type of neural network learning algorithm to generate stem context vectors that are sensitive to similarity of use, and a more standard neural network algorithm to perform routing and automatic query modification based upon user feedback, as described below.

Queries can take the form of terms, full documents, parts of documents, and/or conventional Boolean expressions. Optional weights may also be included.

The following sections give a brief overview of our implementation and look at some recent improvements and experiments. For a previous description of the approach and comments on complexity considerations see [1]; a longer journal article is in preparation.

2 The Context Vector Approach

One of the most important aspects of MatchPlus is its representation of words (stems), documents, and queries by high-dimensional (≈ 300) vectors called context vectors. By representing all objects in the same high-dimensional space we can easily:

1. Form a document context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the document.

2. Form a query context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the query.

3. Compute the distance of a query Q to any document.
Moreover, if document context vectors are normalized, the closest document d (in Euclidean distance) is the one whose context vector $V_d$ gives the highest dot product with the query context vector $V_Q$:

$$\langle \text{closest } d \rangle = \{\, d \mid V_d \cdot V_Q \text{ is maximized for } d \in D \,\}$$

(proof: $\|V_d - V_Q\|^2 = \|V_d\|^2 + \|V_Q\|^2 - 2\,(V_d \cdot V_Q) = \text{const} - 2\,(V_d \cdot V_Q)$, since $\|V_d\| = 1$ for normalized document vectors and $\|V_Q\|$ is fixed for a given query, so minimizing Euclidean distance is equivalent to maximizing the dot product).
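To make the three operations above concrete, the following is a minimal sketch of ranking by dot product in a shared context-vector space. It assumes stem context vectors have already been learned; the random stand-in vectors and the names stem_vectors, context_vector, and normalize are illustrative choices for this sketch, not MatchPlus internals.

```python
import numpy as np

DIM = 300  # dimensionality of the shared context-vector space

# Hypothetical learned stem context vectors. Random stand-ins here;
# in MatchPlus these are learned from the corpus so that stems with
# similar use (e.g. 'car' and 'automobile') get similar vectors.
rng = np.random.default_rng(0)
stem_vectors = {s: rng.standard_normal(DIM)
                for s in ["car", "automobile", "drive", "hippopotamus"]}

def context_vector(stems, weights=None):
    """Weighted vector sum of stem context vectors (items 1 and 2)."""
    if weights is None:
        weights = [1.0] * len(stems)
    v = np.zeros(DIM)
    for s, w in zip(stems, weights):
        v += w * stem_vectors[s]
    return v

def normalize(v):
    return v / np.linalg.norm(v)

# Documents and the query are represented in the same space.
docs = {
    "d1": normalize(context_vector(["car", "drive"])),
    "d2": normalize(context_vector(["hippopotamus"])),
}
q = context_vector(["car"])

# With normalized document vectors, ranking by dot product is
# equivalent to ranking by Euclidean distance (item 3 and the proof).
ranked = sorted(docs, key=lambda d: docs[d] @ q, reverse=True)
print(ranked)  # 'd1' ranks first: it shares the stem 'car' with the query
```

Because the document vectors are normalized, the dot-product ordering produced here coincides with nearest-neighbor ordering in Euclidean distance, exactly as the proof above shows.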