NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1), edited by Donna K. Harman, National Institute of Standards and Technology
Vector Expansion in a Large Collection
Ellen M. Voorhees and Yuan-Wang Hou
Siemens Corporate Research, Inc.
755 College Road East
Princeton, New Jersey 08540
Abstract
This paper investigates whether a completely automatic, statistical expansion technique that uses a
general-purpose thesaurus as a source of related concepts is viable for large collections. The retrieval
results indicate that the particular expansion technique used here improves the performance of some
queries, but degrades the performance of other queries. The overall effectiveness of the method is
competitive with other systems. The variability of the method is attributable to two main factors: the choice
of concepts that are expanded and the confounding effects expansion has on concept weights. Addressing
these problems will require both a better method for determining the important concepts of a text and
a better method for determining the correct sense of an ambiguous word.
1 Introduction
In many retrieval systems the similarity between two texts is a function of the number of word stems that
appear in both texts. While these systems are often efficient and robust, their effectiveness is depressed
by the presence of homographs (words that are spelled the same but mean different things) and synonyms
(different words that mean the same thing) in the texts. Homographs depress precision by causing false
matches. Synonyms depress recall by causing conceptual matches to be missed. That is, if a query and
a document are about the same topic, but use different words to express the idea, the document will not
be retrieved in response to the query. We are investigating how concept spaces, data structures that define
semantic relationships among ideas, can be used to mitigate the effects of synonymy and homography in
retrieval systems designed to satisfy large-scale information needs.
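The synonym problem described above can be made concrete with a small sketch. This is an illustrative example, not code from the paper; the texts and the bag-of-words stemming shortcut are assumptions.

```python
def stem_overlap(text_a, text_b):
    """Similarity as the number of word stems shared by two texts.
    For simplicity, whitespace tokens stand in for true stems."""
    stems_a = set(text_a.lower().split())
    stems_b = set(text_b.lower().split())
    return len(stems_a & stems_b)

query = "automobile repair cost"
doc = "car repair cost"

# "automobile" and "car" express the same concept, but stem overlap
# counts only "repair" and "cost" -- the conceptual match is missed.
print(stem_overlap(query, doc))  # 2, not 3
```

A homograph would cause the opposite failure: a spurious overlap on a shared spelling with a different meaning, depressing precision rather than recall.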
We impose two constraints on our research with the goal of making the resulting methods more applicable
to retrieving documents from large corpora. First, we want to keep human intervention in the indexing and
retrieval processes at a minimum; therefore, we use strictly automatic procedures. Second, even automatic
procedures need to be relatively efficient. We believe this efficiency requirement precludes the use of deep
analyses of document content for the foreseeable future, and we restrict ourselves to statistical processing of
the text and concept space.
There are effectiveness concerns when dealing with large corpora as well as efficiency concerns. Large
corpora usually imply a diverse vocabulary, and thus the synonym and homograph problems are exacerbated.
In this paper we investigate vector expansion as a solution to the synonymy problem for the large TREC
collection. As the name "vector expansion" implies, we are working within the vector space model of
information retrieval [5]: both documents and topics are represented as weighted vectors and the similarity
between two texts is computed as the inner product of their respective vectors. The vectors are expanded
by terms related to original text words in our concept space. In particular, since we are using the WordNet
lexical database as our concept space, vectors are expanded by adding selected synonyms of original text
words.
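The two operations just described, inner-product similarity and synonym expansion, can be sketched as follows. The term weights, the toy thesaurus, and the reduced expansion weight of 0.5 are illustrative assumptions; the paper's actual concept space is the WordNet lexical database.

```python
def inner_product(vec_a, vec_b):
    """Similarity of two weighted term vectors (terms absent
    from a vector have weight zero)."""
    return sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())

# Toy thesaurus standing in for WordNet synonym sets (an assumption).
SYNONYMS = {"car": ["automobile"], "automobile": ["car"]}

def expand(vec, weight=0.5):
    """Add selected synonyms of original terms at a reduced weight,
    never overwriting a higher weight already present."""
    expanded = dict(vec)
    for term, w in vec.items():
        for syn in SYNONYMS.get(term, []):
            expanded[syn] = max(expanded.get(syn, 0.0), w * weight)
    return expanded

query = {"automobile": 1.0, "repair": 1.0}
doc = {"car": 1.0, "repair": 1.0}

print(inner_product(query, doc))          # 1.0: only "repair" matches
print(inner_product(expand(query), doc))  # 1.5: expansion adds "car"
```

Expanding the query vector lets the synonym pair contribute to the inner product, which is exactly the recall improvement the technique targets; the risk, examined later in the paper, is that added terms also perturb the relative weighting of the original concepts.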
Although using thesauri to expand vectors has been done before (see, for example, [7], [2], [4]), it has
always been done on small collections. We are interested in investigating whether comparatively simple