NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Vector Expansion in a Large Collection
E. Voorhees, Y-W. Hou
National Institute of Standards and Technology, Donna K. Harman

5. After all the words from a text have been processed, relatives that are flagged as being from a single-sense text word, and relatives that have been added to the relative list at least twice, are added to the vector list. The requirement that a relative of a text word with multiple senses appear in the list at least twice is a poor man's attempt at sense disambiguation: if two original text terms agree on a relative, the relative is probably related to the correct sense of those text terms.

6. To produce the final weighted vector, the term frequency of each concept in the vector list produced in step five is computed. Concepts that were added as relatives have their term frequency weight multiplied by .8 to emphasize the original terms. The term frequency weight of each concept is then multiplied by an inverse document frequency factor, and the weights are normalized by the square root of the sum of the squares of the weights (cosine normalization). This weighting scheme is the "tfc" weighting described by Salton and Buckley in [6].

As an example, take the text court opinions and decisions on surrogate motherhood (a paraphrase of topic 70). The vector produced for this text using the synsets shown in Figure 1 would contain the stems of court, opinion, decision, surrogate, motherhood (original text words), maternity (synonym of `motherhood', which has only one sense), judgment, and judgement (synonyms of both `decision' and `opinion').

Both documents and topic statements were indexed by the procedure described above; no special manual processing of the topic statements was performed.
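Steps five and six can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data structures, function names, and the simple dictionary-based idf lookup are assumptions, but the selection rule (single-sense source or agreement by at least two text words) and the tfc weighting (tf times idf with a 0.8 factor for relatives, then cosine normalization) follow the description above.

```python
import math
from collections import Counter

def select_relatives(word_relatives):
    """word_relatives: (relative, single_sense) pairs gathered from all
    text words. Keep a relative if it came from a text word with only one
    sense, or if at least two text words proposed it (the poor man's
    sense disambiguation described in step 5)."""
    counts = Counter(rel for rel, _ in word_relatives)
    kept = set()
    for rel, single_sense in word_relatives:
        if single_sense or counts[rel] >= 2:
            kept.add(rel)
    return kept

def tfc_weights(term_freqs, relatives, idf):
    """Step 6: 'tfc' weighting. tf * idf, cosine-normalized; relatives
    are down-weighted by 0.8 to emphasize the original text terms.
    idf is a hypothetical {term: idf_factor} mapping."""
    weights = {}
    for term, tf in term_freqs.items():
        w = tf * (0.8 if term in relatives else 1.0)
        weights[term] = w * idf.get(term, 1.0)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights
```

On the example text, `maternity` would be kept because `motherhood` has a single sense, and `judgment` would be kept because both `decision` and `opinion` propose it.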
The manually assigned keywords associated with some documents were not used in the indexing. For the topic statements, only the Concepts, Description, Factors, Narrative, Nationality, and Title sections were indexed; no distinctions were made regarding which section a term appeared in.

The decision to expand both document and query vectors, as opposed to only query vectors, is based on several factors. First, the WordNet synsets contain collocations such as `judicial decision', but the tokenizer used recognizes only single words. For the collocations to participate in matches, both documents and queries need to be expanded. Second, documents are frequently longer than topic statements. Since we require agreement on a relative before it is added to the vector, the longer documents provide more opportunities for a concept to be added to the vector. Third, in the experiments on smaller collections, expanding both documents and queries was consistently more effective than expanding only queries (although usually less effective than expanding neither).

Document expansion has its costs, however, even beyond the obvious additional expense at indexing time: longer vectors increase both storage costs and processing time at retrieval. Table 1 gives a histogram of the percentage increase in vector length for the TREC documents as compared to an unexpanded collection.

4 Experimental Results

We performed one retrieval run on the entire TREC database, retrieving documents for the 50 ad hoc queries. The official evaluation table for this run is given in Table 2. Using a Sun IPX with 64 megabytes of RAM, it took approximately 42 hours of processing to produce the inverted index of the document collection. The resulting inverted index takes 947 megabytes of disk storage. It took approximately one CPU second on average to index a topic statement and produce a query vector. The average retrieval time per query was 15 CPU seconds.
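The first factor above, that collocations can only match if both sides are expanded, can be illustrated with a toy sketch. The mini-thesaurus below stands in for the WordNet synsets and is entirely hypothetical; the point is that a single-word tokenizer can never emit a multi-word concept like `judicial_decision` directly, so such concepts (and shared synonyms generally) enter the vectors only through expansion of both documents and queries.

```python
# Hypothetical stand-in for WordNet synsets: each single-word token maps
# to related concepts, including a multi-word collocation that the
# single-word tokenizer could never produce on its own.
SYNSETS = {
    "decision": {"judicial_decision", "judgment"},
    "opinion": {"judgment"},
}

def expand(tokens):
    """Return the token set augmented with each token's relatives."""
    vector = set(tokens)
    for tok in tokens:
        vector |= SYNSETS.get(tok, set())
    return vector

doc_tokens = ["court", "issued", "a", "decision"]
query_tokens = ["opinion"]

# Unexpanded, the document and query share no term; after expanding
# both sides, they match on the shared relative 'judgment'.
doc_vector = expand(doc_tokens)
query_vector = expand(query_tokens)
```

Expanding only the query would not help here: `judgment` would appear in the query vector but in no unexpanded document vector.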
An analysis of the retrieval results shows that the expanded collection is more effective than a corresponding unexpanded collection for some queries.[2] However, the effectiveness of the expansion procedure is highly variable, and the performance of other queries was degraded by the expansion. This variability is attributable to two main factors: the process of selecting which new concepts to add, and the confounding effects expansion has on concept weights.

[2] Evaluation results for the unexpanded collection were made available through the courtesy of the SMART group at Cornell University.