SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Vector Expansion in a Large Collection
chapter
E. Voorhees
Y-W. Hou
National Institute of Standards and Technology
Donna K. Harman
% Increase # Docs
0-10 15656
10-20 52568
20-30 116766
30-40 162862
40-50 163235
50-60 123943
60-70 64029
70-80 24564
80-90 8872
90-100 3333
> 100 6926
> 200 207
Mean increase: 42%
Table 1: Histogram of Percentage Increase in Document Vector Length
Expansion affects both the inverse document frequency (IDF) and the term frequency (TF) components
of the concept weights. A concept that is frequently added to documents is downweighted by its IDF factor
relative to its weight in an unexpanded collection. Such a concept is often a very general concept and the
downweighting is likely to be beneficial. Similarly, a concept that is occasionally added to documents, and
occurs infrequently in the collection otherwise, is emphasized by its IDF component. This may or may
not be beneficial, depending on the quality of the term. The aggregate TF component of a concept can
be relatively larger in an expanded collection if the concept has many synonyms. This effect is common
because the same word will frequently cause the same synonyms to be added in both the document and
query vectors. Unfortunately, this effect is usually detrimental because the words occurring in large synsets
are common words that contain little content. For example, if both `couple' and `pair' appear in a text,
or if the Roman numeral `II' appears, then the entire synset {two, 2, ii, twain, couple, pair, twosome, duo, duet,
brace, span, yoke, couplet, distich, dyad, duad, deuce, doubleton, craps, snake_eyes} is added.
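The IDF effect described above can be illustrated with a minimal sketch. It assumes a plain log(N/df) form for IDF, which may differ from the weighting actually used in these runs, and the four-document collection and the helper names `idf` and `doc_freqs` are hypothetical.

```python
import math

def idf(num_docs, doc_freq):
    """Plain log(N / df) inverse document frequency (an assumed form)."""
    return math.log(num_docs / doc_freq)

def doc_freqs(collection):
    """Count, for each concept, how many documents contain it."""
    counts = {}
    for doc in collection:
        for concept in set(doc):
            counts[concept] = counts.get(concept, 0) + 1
    return counts

# Hypothetical four-document collection before expansion.
unexpanded = [
    ["pair", "shoes"],
    ["couple", "married"],
    ["two", "senators"],
    ["surrogate", "mother"],
]

# Toy expansion: the concept "two" is added to the first two documents.
expanded = [
    ["pair", "shoes", "two"],
    ["couple", "married", "two"],
    ["two", "senators"],
    ["surrogate", "mother"],
]

n = len(unexpanded)
before = doc_freqs(unexpanded)
after = doc_freqs(expanded)

# "two" now occurs in three documents instead of one, so its IDF drops.
idf_two_before = idf(n, before["two"])
idf_two_after = idf(n, after["two"])

# "surrogate" was never added anywhere, so its IDF is unchanged.
idf_surrogate_before = idf(n, before["surrogate"])
idf_surrogate_after = idf(n, after["surrogate"])

print(f"two:       {idf_two_before:.3f} -> {idf_two_after:.3f}")
print(f"surrogate: {idf_surrogate_before:.3f} -> {idf_surrogate_after:.3f}")
```

In this toy setting the frequently added concept loses weight while the untouched concept keeps its weight, so the rare concept is emphasized relative to the general one.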
The effects of the changes in weights are illustrated by the performance of topics 95 and 70, the texts of
which are given in Figures 2 and 3. Portions of the corresponding query vectors for both expanded and
unexpanded collections are given in Figure 4. Topic 95 retrieved 28 relevant documents in the expanded
collection; in the corresponding unexpanded collection, 17 relevant documents were retrieved. The improvement
is due to increasing the weights of central themes of the topic, both by adding additional concepts (outlaw,
constabl) and emphasizing existing concepts (law, sleuth). On the other hand, topic 70 retrieved only 32
relevant documents in the top 200 in the expanded collection while in the unexpanded collection 41 relevant
documents were retrieved. The degradation is due to the downweighting of `surrogate', which was added to
many documents and thus has a smaller IDF weight in the expanded collection, and the increased weight
for `mother' (compounded by the addition of `matern'), resulting in a marked preference for documents that
contain mother, whether or not they also contain surrogate.
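The kind of ranking reversal described for topic 70 can be sketched with an inner-product similarity. All weights below are invented for illustration; they are not the values from Figure 4, and the function name `inner_product` and the two toy documents are assumptions.

```python
def inner_product(query, doc):
    """Similarity as the sum of products of matching concept weights."""
    return sum(weight * doc.get(concept, 0.0) for concept, weight in query.items())

# Hypothetical weights in the unexpanded collection.
query_unexpanded = {"surrogate": 0.8, "mother": 0.4}
doc_a_unexpanded = {"surrogate": 0.6, "mother": 0.5}   # mentions both concepts
doc_b_unexpanded = {"mother": 0.9}                     # mentions only mother

# Hypothetical weights after expansion: 'surrogate' is downweighted,
# 'mother' is upweighted, and the added concept 'matern' reinforces it.
query_expanded = {"surrogate": 0.4, "mother": 0.6, "matern": 0.5}
doc_a_expanded = {"surrogate": 0.3, "mother": 0.5, "matern": 0.4}
doc_b_expanded = {"mother": 0.9, "matern": 0.7}

score_a_before = inner_product(query_unexpanded, doc_a_unexpanded)
score_b_before = inner_product(query_unexpanded, doc_b_unexpanded)
score_a_after = inner_product(query_expanded, doc_a_expanded)
score_b_after = inner_product(query_expanded, doc_b_expanded)

print("unexpanded: A ranks above B:", score_a_before > score_b_before)
print("expanded:   A ranks above B:", score_a_after > score_b_after)
```

With these made-up numbers, the document containing both concepts ranks first before expansion and second afterwards, which mirrors the preference for documents that contain only `mother'.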
The major difficulty of the expansion process is controlling which original terms get expanded and which
terms they are expanded by. In our algorithm, a word can be expanded if it occurs only once in WordNet or
if another word in the text shares a synonym with it. Although this agreement criterion is imposed to prevent
synonyms of the wrong senses of words from being added, it is not sufficient for the task. Furthermore, to
save processing time we do not tag a word with its part of speech prior to looking it up in WordNet, so
many words that are used as verbs and adjectives in the text are nonetheless found in the noun division of
WordNet (frequently in only one sense!) and add spurious relatives. The consequence of these factors is that
in addition to the concepts that are added for marginally useful words, concepts that have no bearing on
the content of the text may also be added to its vector.
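The selection rule described above can be sketched as follows. This is one reading of the rule, not the authors' implementation: the dictionary `TOY_SYNSETS` is a hypothetical stand-in for WordNet, the function name `expand` is an assumption, and "shares a synonym" is interpreted here as two words of the text belonging to the same synset.

```python
# Hypothetical miniature thesaurus: word -> list of synsets (one per sense).
TOY_SYNSETS = {
    "couple": [frozenset({"two", "couple", "pair"}), frozenset({"couple", "mates"})],
    "pair": [frozenset({"two", "couple", "pair"}), frozenset({"pair", "brace"})],
    "ii": [frozenset({"two", "ii"})],
    "bank": [frozenset({"bank", "riverbank"}), frozenset({"bank", "depository"})],
}

def expand(words, synsets=TOY_SYNSETS):
    """Return the concepts that the rule would add to a text's vector."""
    text = set(words)
    added = set()
    for word in text:
        senses = synsets.get(word, [])
        if len(senses) == 1:
            # The word occurs only once in the thesaurus: expand unconditionally.
            added |= senses[0]
            continue
        for sense in senses:
            # Agreement criterion: another word of the text is in this synset.
            if any(other in sense for other in text if other != word):
                added |= sense
    # Report only concepts that were not already present in the text.
    return added - text

print(expand(["couple", "pair"]))   # agreement between two ambiguous words
print(expand(["ii"]))               # single-sense word
print(expand(["bank"]))             # ambiguous word with no agreement
```

Because no part-of-speech check is made, a single-sense entry is expanded even when the word is used in a different sense in the text, which is how spurious concepts can enter the vector.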