SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Vector Expansion in a Large Collection chapter E. Voorhees Y-W. Hou National Institute of Standards and Technology Donna K. Harman

statistical expansion techniques are viable for large collections. The results so far indicate that our expansion technique can improve the performance of some queries, but is equally likely to degrade the performance of others. The sources of this variability are described in detail below. A description of WordNet and the expansion algorithm is given first to provide the appropriate context.

2 WordNet

WordNet is a manually constructed lexical system developed by George Miller and his colleagues at the Cognitive Science Laboratory at Princeton University [3]. Originating from a project whose goal was to produce a dictionary that could be searched conceptually instead of only alphabetically, WordNet evolved into a system that reflects current psycholinguistic theories about how humans organize their lexical memories.

The basic object in WordNet is a set of strict synonyms, called a synset. By definition, each synset a word appears in is a different sense of that word. There are three main divisions in WordNet, one each for nouns, verbs, and adjectives. Within a division, synsets are organized by the lexical relationships defined on them. For nouns, the only division used in this study, the lexical relationships include antonymy, hypernymy/hyponymy (the IS-A relation), and three different meronym/holonym (PART-OF) relations. The IS-A relation is the dominant relationship, and organizes the synsets into a set of approximately ten hierarchies¹. Examples of synsets that are the heads of hierarchies are {entity, thing}, {psychological_feature}, {abstraction}, and {possession}. The developers of WordNet specifically avoided including specialized vocabularies within WordNet; the coverage of "standard" English is quite good.
The April 1992 version of WordNet (the version used in this study) contains 35,155 synonym sets and 67,293 senses in the noun division. The majority of synonym sets are quite small (one or two members), but the more common nouns (i.e., those nouns that actually get used in documents and topics) tend to belong to the larger synsets. Example synsets from the noun division are shown in Figure 1. The lexical relationships that the synsets participate in, especially their parents in the IS-A hierarchy, differentiate among the senses.

We developed our own routine to access the WordNet information; it differs somewhat from the access code distributed with WordNet. In our version, the access routine takes a word (a string of characters), converts it to lower case, and checks whether the converted string occurs in the noun portion of WordNet. If the string is found, the routine returns either the number of synsets in which the string appears, the fact that the string is a known irregular morphological variant of a member of a synset (e.g., 'women' is an inflection of 'woman'), or both (e.g., 'media' is both a member of {media, mass_media} and an inflection of 'medium'). If the string is not found, several simple (regular) morphological variants of the word are tried. If none are found, the routine reports the string as not found. Otherwise, the routine returns the base form. A consequence of this simple strategy is that regular plural forms that are members of their own synsets do not return the synsets of the base word. For example, 'arms' returns the synsets {coat_of_arms, arms, blazon, blazonry} and {weaponry, arms, implements_of_war}, but not the four synsets for 'arm'.

3 Vector Expansion Procedure

For the retrieval results reported in this paper, both document and query vectors were expanded using synonyms of original text words.
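Because expansion depends on locating the synsets of each text word, the access routine described in Section 2 can be sketched in a few lines. The following is a minimal illustration in Python, not the routine used in the study. The names `NOUN_SYNSETS`, `IRREGULAR_NOUNS`, `LookupResult`, `regular_variants`, and `lookup_noun` are hypothetical; the table is a toy stand-in for the noun division; the entries for 'woman', 'medium', and 'dog' are placeholders; and the suffix-stripping rules are an assumption, since the text does not say which regular variants are tried.

```python
from dataclasses import dataclass
from typing import Optional

# Toy stand-in for the noun division: word -> list of synsets.
# The 'arms' and 'media' synsets come from the text; the rest are placeholders.
NOUN_SYNSETS = {
    "arms": [
        frozenset({"coat_of_arms", "arms", "blazon", "blazonry"}),
        frozenset({"weaponry", "arms", "implements_of_war"}),
    ],
    "media": [frozenset({"media", "mass_media"})],
    "woman": [frozenset({"woman"})],    # placeholder synset
    "medium": [frozenset({"medium"})],  # placeholder synset
    "dog": [frozenset({"dog"})],        # placeholder synset
}

# Known irregular inflections: inflected form -> base form (examples from the text).
IRREGULAR_NOUNS = {"women": "woman", "media": "medium"}


@dataclass(frozen=True)
class LookupResult:
    found: bool
    synset_count: int = 0                 # synsets containing the string itself
    irregular_base: Optional[str] = None  # set if the string is an irregular inflection
    regular_base: Optional[str] = None    # set if only a regular variant matched


def regular_variants(word: str) -> list:
    """Candidate base forms for regular plurals (assumed rules, not from the paper)."""
    candidates = []
    if word.endswith("ies") and len(word) > 3:
        candidates.append(word[:-3] + "y")
    if word.endswith("es") and len(word) > 2:
        candidates.append(word[:-2])
    if word.endswith("s") and len(word) > 1:
        candidates.append(word[:-1])
    return candidates


def lookup_noun(word: str) -> LookupResult:
    key = word.lower()
    count = len(NOUN_SYNSETS.get(key, []))
    irregular = IRREGULAR_NOUNS.get(key)
    # Direct hit: report the synset count, the irregular base form, or both.
    if count or irregular:
        return LookupResult(True, synset_count=count, irregular_base=irregular)
    # Otherwise try regular morphological variants and return the base form.
    for base in regular_variants(key):
        if base in NOUN_SYNSETS:
            return LookupResult(True, regular_base=base)
    return LookupResult(False)
```

Under these assumptions, `lookup_noun("arms")` reports two synsets and never consults 'arm', which mirrors the consequence noted in Section 2: a regular plural that is a member of its own synsets hides the synsets of its base word.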
The particular expansion method we used is one of the most effective vector expansion methods among a wide variety of expansion schemes we tried on smaller collections. However, the TREC collection is much more diverse than those collections, and some other scheme may be more effective on it. We intend to test some of those methods on the TREC collection in the near future.

We use the SMART retrieval system developed at Cornell as the basis for our retrieval system [1]. The SMART system is designed to facilitate information retrieval research by making it easy to substitute

¹ The actual structure is not quite a hierarchy, since a few synsets have more than one parent.