NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1). Edited by Donna K. Harman, National Institute of Standards and Technology.

Vector Expansion in a Large Collection

Ellen M. Voorhees and Yuan-Wang Hou
Siemens Corporate Research, Inc.
755 College Road East
Princeton, New Jersey 08540

Abstract

This paper investigates whether a completely automatic, statistical expansion technique that uses a general-purpose thesaurus as a source of related concepts is viable for large collections. The retrieval results indicate that the particular expansion technique used here improves the performance of some queries but degrades the performance of others. The overall effectiveness of the method is competitive with other systems. The variability of the method is attributable to two main factors: the choice of concepts that are expanded and the confounding effects expansion has on concept weights. Addressing these problems will require both a better method for determining the important concepts of a text and a better method for determining the correct sense of an ambiguous word.

1 Introduction

In many retrieval systems the similarity between two texts is a function of the number of word stems that appear in both texts. While these systems are often efficient and robust, their effectiveness is depressed by the presence of homographs (words that are spelled the same but mean different things) and synonyms (different words that mean the same thing) in the texts. Homographs depress precision by causing false matches. Synonyms depress recall by causing conceptual matches to be missed. That is, if a query and a document are about the same topic but use different words to express the idea, the document will not be retrieved in response to the query.
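The two failure modes above can be illustrated with a minimal sketch of stem-overlap similarity. The texts, the crude stemmer, and the specific examples below are hypothetical illustrations, not the paper's actual retrieval system.

```python
# Minimal sketch (hypothetical data): similarity as the count of shared
# word stems, showing why synonyms depress recall and homographs
# depress precision.

def stems(text):
    """Crude stemmer stand-in: lowercase each word and strip a trailing 's'."""
    return {w.rstrip("s") for w in text.lower().split()}

def overlap_similarity(a, b):
    """Number of word stems the two texts share."""
    return len(stems(a) & stems(b))

query = "river bank erosion"
doc_synonym = "stream shore wearing away"    # same topic, different words
doc_homograph = "bank loan interest rates"   # different topic, shares 'bank'

print(overlap_similarity(query, doc_synonym))    # 0: conceptual match missed
print(overlap_similarity(query, doc_homograph))  # 1: false match on 'bank'
```

The synonymous document scores zero (recall failure), while the unrelated document scores above zero on the homograph "bank" (precision failure).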
We are investigating how concept spaces, data structures that define semantic relationships among ideas, can be used to mitigate the effects of synonymy and homography in retrieval systems designed to satisfy large-scale information needs. We impose two constraints on our research with the goal of making the resulting methods more applicable to retrieving documents from large corpora. First, we want to keep human intervention in the indexing and retrieval processes to a minimum; therefore, we use strictly automatic procedures. Second, even automatic procedures need to be relatively efficient. We believe this efficiency requirement precludes the use of deep analyses of document content for the foreseeable future, and we restrict ourselves to statistical processing of the text and concept space.

There are effectiveness concerns when dealing with large corpora as well as efficiency concerns. Large corpora usually imply a diverse vocabulary, and thus the synonym and homograph problems are exacerbated.

In this paper we investigate vector expansion as a solution to the synonymy problem for the large TREC collection. As the name "vector expansion" implies, we are working within the vector space model of information retrieval [5]: both documents and topics are represented as weighted vectors, and the similarity between two texts is computed as the inner product of their respective vectors. The vectors are expanded by terms related to the original text words in our concept space. In particular, since we are using the WordNet lexical database as our concept space, vectors are expanded by adding selected synonyms of original text words. Although using thesauri to expand vectors has been done before (see, for example, [7], [2], [4]), it has always been done on small collections. We are interested in investigating whether comparatively simple
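The vector-space machinery just described can be sketched as follows. The toy thesaurus stands in for WordNet, and the weights (1.0 for original terms, 0.5 for added synonyms) are illustrative assumptions, not the weighting scheme the paper actually uses.

```python
# Sketch of vector expansion under the vector space model.
# TOY_THESAURUS is a hypothetical stand-in for a WordNet-style synonym
# source; the expansion weight of 0.5 is an illustrative assumption.

TOY_THESAURUS = {
    "car": ["automobile"],
    "automobile": ["car"],
    "buy": ["purchase"],
    "purchase": ["buy"],
}

def to_vector(text):
    """Represent a text as a term -> weight mapping (weight 1.0 per occurrence)."""
    vec = {}
    for term in text.lower().split():
        vec[term] = vec.get(term, 0.0) + 1.0
    return vec

def expand(vec, weight=0.5):
    """Add synonyms of each original term, at a reduced weight."""
    expanded = dict(vec)
    for term in vec:
        for syn in TOY_THESAURUS.get(term, []):
            expanded[syn] = expanded.get(syn, 0.0) + weight * vec[term]
    return expanded

def inner_product(u, v):
    """Similarity of two weighted vectors: sum of products over shared terms."""
    return sum(w * v[t] for t, w in u.items() if t in v)

query = to_vector("buy car")
doc = to_vector("automobile purchase guide")

print(inner_product(query, doc))          # 0.0: no shared terms
print(inner_product(expand(query), doc))  # 1.0: synonyms bridge the vocabulary gap
```

The unexpanded query shares no terms with the document and scores zero; after expansion, the added synonyms "purchase" and "automobile" match, so the conceptually relevant document is retrieved.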