SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
chapter
S. Dumais
National Institute of Standards and Technology
Donna K. Harman
Choosing the appropriate number of dimensions for the LSI representation is an open research
question. Ideally, we want a value of k that is large enough to fit all the real structure in the data,
but small enough so that we do not also fit the sampling error or unimportant details. If too
many dimensions are used, the method begins to approximate standard vector methods and loses
its power to represent the similarity between words. If too few dimensions are used, there is not
enough discrimination among similar words and documents. We typically find that performance
improves as k increases for a while, and then decreases (Dumais, 1991). That LSI typically
works well with a relatively small (compared to the number of unique terms) number of
dimensions shows that these dimensions are, in fact, capturing a major portion of the meaningful
structure.
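
The effect of k on term similarity can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the TREC experiments: the five-term, four-document matrix is hypothetical toy data, the helper names term_vectors and term_cosine are invented here, NumPy is used for the SVD, and terms are compared using the rows of U_k S_k, which is one common convention.

```python
import numpy as np

# Hypothetical toy term-by-document matrix (rows: terms, columns: documents).
# "car" and "automobile" never occur in the same document.
terms = ["car", "automobile", "engine", "flower", "petal"]
X = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
], dtype=float)

# Reduced SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)


def term_vectors(k):
    """Term coordinates in the k-dimensional space (rows of U_k scaled by s_k)."""
    return U[:, :k] * s[:k]


def term_cosine(a, b, k):
    """Cosine similarity between two terms using only the first k dimensions."""
    T = term_vectors(k)
    u, v = T[terms.index(a)], T[terms.index(b)]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Compare a reduced space (k=2) with the full-rank space (k=4).
for k in (2, 4):
    print(k,
          round(term_cosine("car", "automobile", k), 3),
          round(term_cosine("car", "engine", k), 3))
```

On this toy matrix, with k = 2 the cosine between "car" and "automobile" is close to 1 even though the two words never co-occur, but "car" and "engine" also become indistinguishable, which is the loss of discrimination that comes with too few dimensions. With k = 4 (full rank) the term vectors reproduce the raw co-occurrence geometry, so the "car"/"automobile" cosine falls to about 0, as it would with a standard vector method.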