SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Vector Expansion in a Large Collection chapter E. Voorhees Y-W. Hou National Institute of Standards and Technology Donna K. Harman

% Increase    # Docs
0-10           15656
10-20          52568
20-30         116766
30-40         162862
40-50         163235
50-60         123943
60-70          64029
70-80          24564
80-90           8872
90-100          3333
> 100           6926
> 200            207

Mean increase: 42%

Table 1: Histogram of Percentage Increase in Document Vector Length

Expansion affects both the inverse document frequency (IDF) and the term frequency (TF) components of the concept weights. A concept that is frequently added to documents is downweighted by its IDF factor relative to its weight in an unexpanded collection. Such a concept is often a very general concept, and the downweighting is likely to be beneficial. Similarly, a concept that is occasionally added to documents, and otherwise occurs infrequently in the collection, is emphasized by its IDF component. This may or may not be beneficial, depending on the quality of the term.

The aggregate TF component of a concept can be relatively larger in an expanded collection if the concept has many synonyms. This effect is common because the same word will frequently cause the same synonyms to be added in both the document and query vectors. Unfortunately, this effect is usually detrimental because the words occurring in large synsets are common words that contain little content. For example, if either both `couple' and `pair', or the Roman numeral `II', appear in a text, then the entire synset {two, [OCRerr] ii, twain, couple, pair, twosome, duo, duet, brace, span, yoke, couplet, distich, dyad, duad, deuce, doubleton, craps, snake eyes} is added.

The effects of the changes in weights are illustrated by the performance of topics 95 and 70, the texts of which are given in Figures 2 and 3. Portions of the corresponding query vectors for both expanded and unexpanded collections are given in Figure 4.
Topic 95 retrieved 28 relevant documents in the expanded collection; in the corresponding unexpanded collection, 17 relevant documents were retrieved. The improvement is due to increasing the weights of central themes of the topic, both by adding additional concepts (outlaw, constabl) and by emphasizing existing concepts (law, sleuth). On the other hand, topic 70 retrieved only 32 relevant documents in the top 200 in the expanded collection, while in the unexpanded collection 41 relevant documents were retrieved. The degradation is due to two changes: the downweighting of `surrogate', which was added to many documents and thus has a smaller IDF weight in the expanded collection, and the increased weight for `mother' (compounded by the addition of `matern'). Together these result in a marked preference for documents that contain mother, whether or not they also contain surrogate.

The major difficulty of the expansion process is controlling which original terms get expanded and which terms they are expanded by. In our algorithm, any word can be expanded if it occurs only once in WordNet or if there is another word that has a common synonym. Although the agreement criterion is imposed to prevent synonyms of the wrong senses of words from being added, it is not sufficient for the task. Furthermore, to save processing time we do not tag a word with its part of speech prior to looking it up in WordNet, so many words that are used as verbs and adjectives in the text are nonetheless found in the noun division of WordNet (frequently in only one sense!) and add spurious relatives. The consequence of these factors is that, in addition to the concepts that are added for marginally useful words, concepts that have no bearing on the content of the text may also be added to its vector.
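
The expansion criterion and its effect on IDF can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system used in the experiments: `TOY_SYNSETS` is a hypothetical four-synset stand-in for the WordNet noun division, the second senses of `couple' and `pair' are invented for the example, the agreement test is simplified to "the sense contains another word of the same text", and the IDF is assumed to be log(N/df), which the text does not specify.

```python
import math

# Hypothetical toy thesaurus standing in for WordNet noun synsets.
# The second senses of "couple" and "pair" are invented for illustration.
TOY_SYNSETS = [
    frozenset({"two", "ii", "couple", "pair", "twosome"}),
    frozenset({"couple", "mates"}),
    frozenset({"pair", "match"}),
    frozenset({"mother", "matern"}),
]

# Index: word -> list of synsets (senses) that contain it.
SENSES = {}
for synset in TOY_SYNSETS:
    for word in synset:
        SENSES.setdefault(word, []).append(synset)


def expand(words):
    """Return the set of words plus the synonyms selected for them.

    A word is expanded if it has exactly one sense, or, for a word with
    several senses, by each sense that also contains another word of the
    same text (a simplified form of the agreement criterion).
    """
    text = set(words)
    expanded = set(text)
    for word in text:
        senses = SENSES.get(word, [])
        if len(senses) == 1:
            chosen = senses
        else:
            others = text - {word}
            chosen = [s for s in senses if s & others]
        for synset in chosen:
            expanded |= synset
    return expanded


def idf(term, docs):
    """Assumed IDF form log(N / df); returns 0.0 for an unseen term."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0


docs = [{"couple", "pair"}, {"ii"}, {"mother"}, {"two"}]
expanded_docs = [expand(doc) for doc in docs]

print(sorted(expand(["couple", "pair"])))
print(idf("two", docs), idf("two", expanded_docs))
```

In this toy collection `two' occurs in one document before expansion and in three afterwards, so its assumed IDF drops, which mirrors the downweighting described for `surrogate'. Note also that `couple' alone is not expanded, since it has two senses and nothing in the text agrees with either, whereas `ii' alone pulls in the entire synset because it has only one sense.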