SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Vector Expansion in a Large Collection
chapter
E. Voorhees
Y-W. Hou
National Institute of Standards and Technology
Donna K. Harman
As an example of these effects, consider document FR89512-0147, President Bush's 1989 Mother's Day
Proclamation. This document was retrieved in response to topic 70 because it mentioned `mother' or
`motherhood' 16 times. Figure 5 contains an excerpt of the document and a sampling of the concepts
that were added. (Words that are in boldface in the excerpt are words that caused additional concepts to be
added.) Approximately 60 of the 250 concepts in the vector were added by the expansion process. About
half of the added concepts result from a wrong sense or a wrong part of speech being used to support
the addition, and another 20 of the added concepts are correct but unimportant.
Unfortunately, the ratio of unimportant and mistaken additions to reasonable additions exhibited by
document FR89512-0147 is not unusual. WordNet, like English itself, is rich enough that two words in a
text are likely to be synonyms of (different senses of) a third word. Using additional lexical relations
compounds this problem: the experiments we conducted on smaller collections show a marked degradation
in effectiveness if any of the other relations represented in WordNet are used in addition to synonymy
to expand a concept.
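The failure mode described above can be illustrated with a minimal sketch. It assumes an expansion rule in which a candidate concept is added once at least two distinct text words list it as a synonym; this section implies such a rule but does not spell it out. The sense inventory `SENSES`, the threshold `min_support`, and the functions `expand` and `is_sense_consistent` are hypothetical names chosen for illustration and contain no actual WordNet data.

```python
from collections import defaultdict

# Hypothetical toy sense inventory: sense label -> words sharing that sense.
# These entries are invented for illustration and are not WordNet data.
SENSES = {
    "bear.give_birth": {"bear", "deliver"},
    "bear.carry": {"bear", "carry"},
    "honor.respect": {"honor", "respect"},
}


def expand(text_words, senses=SENSES, min_support=2):
    """Return {candidate: {(text_word, sense), ...}} for each candidate
    that is a synonym of at least `min_support` distinct text words."""
    text = set(text_words)
    support = defaultdict(set)
    for sense_id, members in senses.items():
        for word in members & text:            # text words having this sense
            for candidate in members - text:   # synonyms not yet in the text
                support[candidate].add((word, sense_id))
    return {
        cand: pairs
        for cand, pairs in support.items()
        if len({word for word, _ in pairs}) >= min_support
    }


def is_sense_consistent(pairs):
    """True only if every supporting text word reached the candidate
    through the same sense."""
    return len({sense for _, sense in pairs}) == 1


added = expand(["deliver", "carry", "honor"])
for candidate, pairs in added.items():
    print(candidate, sorted(pairs), is_sense_consistent(pairs))
```

In this toy case `bear` is added because `deliver` and `carry` both support it, yet each does so through a different sense, so the addition is spurious. A check such as `is_sense_consistent` only flags the conflict; it does not decide which sense, if any, the text actually intends.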
5 Conclusion
We have demonstrated a fully automatic, statistical expansion technique that, on a large, full-text
collection, is capable of improving the effectiveness of some queries relative to the corresponding
unexpanded collection.
The overall effectiveness of the technique is competitive with other retrieval methods. However, the technique
is hampered by its unpredictability, which has at least three sources:
* errors in selecting the correct sense, and therefore the correct relatives, of a text word,
* no determination of the relative importance of a word to the text before deciding to expand it, and
* the complex interaction between expansion and term weighting.
Since we believe the disambiguation of word senses to be the most fundamental of these three problems, and
also useful in its own right, our current research lies in this direction.
References
[1] Chris Buckley. Implementation of the SMART information retrieval system. Technical Report 85-686,
Computer Science Department, Cornell University, Ithaca, New York, May 1985.
[2] Edward A. Fox. Lexical relations: Enhancing effectiveness of information retrieval systems. SIGIR
Newsletter, 15(3), 1981.
[3] George Miller. Special Issue, WordNet: An on-line lexical database. International Journal of
Lexicography, 3(4), 1990.
[4] G. Salton and M. E. Lesk. Computer evaluation of indexing and text processing. In Gerard Salton,
editor, The SMART Retrieval System: Experiments in Automatic Document Processing, pages 143-180.
Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1971.
[5] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of
the ACM, 18(11):613-620, November 1975.
[6] Gerard Salton and Chris Buckley. Term weighting approaches in automatic text retrieval. Information
Processing and Management, 24:513-523, 1988.
[7] Yih-Chen Wang, James Vandendorpe, and Martha Evens. Relational thesauri in information retrieval.
Journal of the American Society for Information Science, 36(1):15-27, January 1985.