NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Vector Expansion in a Large Collection
E. Voorhees
Y-W. Hou
National Institute of Standards and Technology
Donna K. Harman
5. After all the words from a text have been processed, relatives that are flagged as being from a single-
sense text word and relatives that have been added to the relative list at least twice are added to the
vector list. The requirement to appear in the list at least twice if the relative is from a text word that
has multiple senses is a poor-man's attempt at sense disambiguation. The idea is that is that if two
original text terms agree on a relative, the relative is probably related to the correct sense of those text
terms.
6. To produce the final weighted vector, the term frequency of each of the concepts in the vector list
produced in step five is computed. Concepts that were added as relatives have their term frequency
weight multiplied by .8 to emphasize the original terms. The term frequency weight of each concept is
then multiplied by an inverse document frequency factor, and those weights are further normalized by
the square root of the sum of the squares of the weights (cosine normalization). This weighting scheme
is the "tfc" weights described by Salton and Buckley in [6].
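As a concrete illustration of step six, the sketch below computes "tfc" weights for an expanded vector. The exact inverse document frequency formulation is not specified above, so the log(N/df) form, along with all function and parameter names, is an assumption for illustration only:

```python
import math

def tfc_weight(vector_list, n_docs, doc_freq, relative_discount=0.8):
    """Sketch of 'tfc' weighting for an expanded concept vector.

    vector_list       -- list of (concept, added_as_relative) occurrences
    n_docs            -- number of documents in the collection
    doc_freq          -- dict mapping each concept to its document frequency
    relative_discount -- factor applied to concepts added as relatives
                         (0.8 in the procedure described above)
    """
    # Term frequency of each concept in the vector list.
    tf = {}
    is_rel = {}
    for concept, added_as_relative in vector_list:
        tf[concept] = tf.get(concept, 0) + 1
        # A concept that occurs at least once as an original text term
        # is treated as an original term, not a relative.
        is_rel[concept] = is_rel.get(concept, True) and added_as_relative

    weights = {}
    for concept, freq in tf.items():
        w = float(freq)
        if is_rel[concept]:
            # De-emphasize expansion concepts relative to original terms.
            w *= relative_discount
        # Inverse document frequency factor (log(N/df) is an assumed form).
        w *= math.log(n_docs / doc_freq[concept])
        weights[concept] = w

    # Cosine normalization: divide by the Euclidean norm of the vector.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {c: w / norm for c, w in weights.items()}
```

With the 0.8 discount, an expansion concept that appears once contributes less weight than an original term of equal document frequency, which is the intended emphasis on original text words.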
As an example, take the text court opinions and decisions on surrogate motherhood (a paraphrase of
topic 70). The vector produced for this text using the synsets shown in Figure 1 would contain the stems of
court, opinion, decision, surrogate, motherhood (original text words), maternity (synonym of `motherhood',
which has only one sense), judgment, and judgement (synonyms of both `decision' and `opinion').
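The selection rule of step five, applied to this example, can be sketched as follows. Since Figure 1 is not reproduced here, the synset data used below is illustrative stand-in data, and the function name is invented for this sketch:

```python
def select_relatives(text_words, synsets):
    """Sketch of step five: choose which relatives join the vector list.

    text_words -- words from the text (assumed already tokenized)
    synsets    -- dict mapping a word to a list of its synsets, each
                  synset given as a set of synonyms of that word

    A relative is kept if it comes from a word with a single sense, or
    if at least two different text words contribute it.
    """
    relative_list = []   # (relative, source_word, single_sense) triples
    for word in text_words:
        senses = synsets.get(word, [])
        single = len(senses) == 1
        for synset in senses:
            for rel in synset:
                if rel != word:
                    relative_list.append((rel, word, single))

    kept = set()
    for rel, word, single in relative_list:
        if single:
            # Flagged: comes from a single-sense text word.
            kept.add(rel)
        else:
            # Keep only if two original text words agree on this relative.
            sources = {w for r, w, _ in relative_list if r == rel}
            if len(sources) >= 2:
                kept.add(rel)
    return list(text_words) + sorted(kept)
```

Run on the words of the example text, with stand-in synsets in which `motherhood' has one sense while `decision' and `opinion' each have two, the sketch keeps maternity (single-sense source) and judgment and judgement (contributed by two text words), while a relative contributed by only one multi-sense word is dropped.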
Both documents and topic statements were indexed by the procedure described above; no special manual
processing of the topic statements was performed. The manually assigned keywords associated with some
documents were not used in the indexing. For the topic statements only the Concepts, Description, Factors,
Narrative, Nationality, and Title sections were indexed. No distinction was made regarding which of
those sections a term appeared in.
The decision to expand both documents and query vectors, as opposed to only query vectors, is based on
several factors. First, the WordNet synsets contain collocations such as `judicial decision', but the tokenizer
used recognizes only single words. For the collocations to participate in matches, both document and query vectors
need to be expanded. Second, documents are frequently longer than topic statements. Since we require
agreement on a relative before it is added to the vector, the longer documents provide more opportunities
for a concept to be added to the vector. Third, in the experiments on smaller collections, expanding both
documents and queries was consistently more effective than expanding only queries (although usually less
effective than expanding neither). Document expansion has its costs, however, even beyond the obvious
additional expense at indexing time: the longer vectors increase storage costs and processing time at
retrieval as well. Table 1 gives a histogram of the percentage increase in vector length as compared to an
unexpanded collection for the TREC documents.
4 Experimental Results
We performed one retrieval run on the entire TREC database, retrieving documents for the 50 ad hoc queries.
The official evaluation table for this run is given in Table 2. Using a Sun IPX with 64 megabytes of RAM,
it took approximately 42 hours of processing to produce the inverted index of the document collection.
The resulting inverted index takes 947 megabytes of disk storage. It took approximately one CPU second
on average to index a topic statement and produce a query vector. The average retrieval time per query
was 15 CPU seconds.
An analysis of the retrieval results shows that the expanded collection is more effective than a correspond-
ing unexpanded collection for some queries². However, the effectiveness of the expansion procedure
varies considerably, and the performance of other queries was degraded by the expansion. This variability is
attributable to two main factors: the process of selecting which new concepts to add, and the confounding
effects expansion has on concept weights.
²Evaluation results for the unexpanded collection were made available through the courtesy of the SMART group at
Cornell University.