SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Vector Expansion in a Large Collection
chapter
E. Voorhees
Y-W. Hou
National Institute of Standards and Technology
Donna K. Harman
statistical expansion techniques are viable for large collections. The results so far indicate that our expansion
technique can improve the performance of some queries, but is equally likely to degrade the performance
of others. The sources of this variability are described in detail below. A description of WordNet and the
expansion algorithm is given first to provide the appropriate context.
2 WordNet
WordNet is a manually-constructed lexical system developed by George Miller and his colleagues at the
Cognitive Science Laboratory at Princeton University [3]. Originating from a project whose goal was to
produce a dictionary that could be searched conceptually instead of only alphabetically, WordNet evolved into
a system that reflects current psycholinguistic theories about how humans organize their lexical memories.
The basic object in WordNet is a set of strict synonyms, called a synset. By definition, each synset a word
appears in is a different sense of that word.
There are three main divisions in WordNet, one each for nouns, verbs, and adjectives. Within a division,
synsets are organized by the lexical relationships defined on them. For nouns, the only division used in this
study, the lexical relationships include antonymy, hypernymy/hyponymy (IS-A relation) and three different
meronym/holonym (PART-OF) relations. The IS-A relation is the dominant relationship, and organizes
the synsets into a set of approximately ten hierarchies^1. Examples of synsets that are the heads of hierarchies
are {entity, thing}, {psychological_feature}, {abstraction}, and {possession}.
The developers of WordNet specifically avoided including specialized vocabularies within WordNet; the
coverage of "standard" English is quite good. The April, 1992 version of WordNet (the version used in this
study) contains 35,155 synonym sets and 67,293 senses in the noun division. The majority of synonym sets
are quite small (one or two members), but the more common nouns (i.e., those nouns that actually get used
in documents and topics) tend to belong to the larger synsets. Example synsets from the noun division are
shown in Figure 1. The lexical relationships that the synsets participate in, especially their parents in the
IS-A hierarchy, differentiate among the senses.
We developed our own routine to access the WordNet information that differs somewhat from the access
code distributed with WordNet. In our version, the access routine takes a word (a string of characters),
converts it to lower case, and checks if the converted string occurs in the noun portion of WordNet. If the
string is found, the routine returns either the number of synsets in which the string appears, the fact that
the string is a known irregular morphological variant of a member of a synset (e.g., `women' is an inflection
of `woman'), or both (e.g., `media' is both a member of {media, mass_media} and an inflection of `medium').
If the string is not found, several simple (regular) morphological variants of the word are tried. If none
are found, the routine reports the string as not found. Otherwise, the routine returns the base form. A
consequence of this simple strategy is that regular plural forms that are members of their own synsets do not
return the synsets of the base word. For example, `arms' returns the synsets {coat_of_arms, arms, blazon,
blazonry} and {weaponry, arms, implements_of_war}, but not the four synsets for `arm'.
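The lookup strategy described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the routine used in the study: the tables NOUN_SYNSET_COUNT and IRREGULAR_NOUNS are hypothetical toy stand-ins for the WordNet noun data, the names Lookup, regular_variants, and lookup_noun are invented for the example, and the suffix rules are only one plausible reading of "simple (regular) morphological variants". The counts for `arms' and `arm' and the two irregular forms come from the text; the remaining entries are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional

# Hypothetical toy tables standing in for the WordNet noun division.
# 'arms' (2 synsets) and 'arm' (4 synsets) follow the text; other counts are placeholders.
NOUN_SYNSET_COUNT: Dict[str, int] = {
    "arms": 2, "arm": 4, "media": 1, "medium": 1, "woman": 1, "dog": 1,
}
# Known irregular inflections mapped to their base forms (examples from the text).
IRREGULAR_NOUNS: Dict[str, str] = {"women": "woman", "media": "medium"}


@dataclass
class Lookup:
    found: bool
    synset_count: int = 0                 # synsets containing the string itself
    irregular_base: Optional[str] = None  # set when the string is a known irregular inflection
    base_form: Optional[str] = None       # set when only a regular variant matched


def regular_variants(word: str) -> Iterator[str]:
    """Yield candidate base forms using assumed plural-stripping rules."""
    if word.endswith("ies") and len(word) > 3:
        yield word[:-3] + "y"
    if word.endswith("es") and len(word) > 2:
        yield word[:-2]
    if word.endswith("s") and len(word) > 1:
        yield word[:-1]


def lookup_noun(word: str) -> Lookup:
    s = word.lower()
    count = NOUN_SYNSET_COUNT.get(s, 0)
    irregular = IRREGULAR_NOUNS.get(s)
    # A direct hit (synset member, irregular inflection, or both) ends the search,
    # so a plural that has its own synsets never falls through to its base word.
    if count or irregular:
        return Lookup(True, count, irregular)
    # Otherwise try the regular variants and report the first base form found.
    for candidate in regular_variants(s):
        if candidate in NOUN_SYNSET_COUNT:
            return Lookup(True, base_form=candidate)
    return Lookup(False)
```

With these toy tables, lookup_noun("Arms") reports two synsets and no base form, which mirrors the consequence noted above: because `arms' is found directly, the synsets of `arm' are never consulted.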
3 Vector Expansion Procedure
For the retrieval results reported in this paper, both document and query vectors were expanded using
synonyms of original text words. The particular expansion method we used is one of the most effective vector
expansion methods among a wide variety of expansion schemes we tried on smaller collections. However, the
TREC collection is much more diverse than those collections and some other scheme may be more effective
on it. We intend to test some of those methods on the TREC collection in the near future.
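As a rough illustration of what expanding a vector with synonyms involves, the sketch below adds the other members of each matching synset to a sparse term-weight vector. It is a generic example under assumptions, not the expansion method evaluated in this paper: the function name expand_vector, the dict-based vector representation, the synonym_weight parameter, and the rule of keeping the larger weight on a collision are all hypothetical choices made for the example.

```python
from typing import Dict, List, Set


def expand_vector(
    vector: Dict[str, float],
    synsets_of: Dict[str, List[Set[str]]],
    synonym_weight: float = 1.0,
) -> Dict[str, float]:
    """Return a copy of `vector` with synonyms of its terms added.

    `synsets_of` maps a term to the synsets it belongs to (assumed input).
    Each added synonym gets the original term's weight scaled by
    `synonym_weight`; this weighting scheme is an assumption.
    """
    expanded = dict(vector)
    for term, weight in vector.items():
        for synset in synsets_of.get(term, []):
            for synonym in synset:
                if synonym == term:
                    continue
                # On a collision, keep whichever weight is larger.
                candidate = weight * synonym_weight
                expanded[synonym] = max(expanded.get(synonym, 0.0), candidate)
    return expanded
```

Note that this sketch adds synonyms from every synset a term belongs to, so it makes no attempt to pick the intended sense of an ambiguous word.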
We use the SMART retrieval system developed at Cornell as the basis for our retrieval system [1].
The SMART system is designed to facilitate information retrieval research by making it easy to substitute
^1 The actual structure is not quite a hierarchy since a few synsets have more than one parent.