NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
National Institute of Standards and Technology
Donna K. Harman
Susan T. Dumais
Bellcore
445 South St.
Morristown, NJ 07960-1910
std@bellcore.com
1. Overview of Latent Semantic Indexing
Latent Semantic Indexing (LSI) is an extension of the vector retrieval method (e.g., Salton &
McGill, 1983) in which the dependencies between terms and between documents, in addition to
the associations between terms and documents, are explicitly taken into account. This is done by
simultaneously modeling all of the associations among terms and documents. We assume that there
is some underlying or "latent" structure in the pattern of word usage across documents, and use
statistical techniques to estimate this latent structure. A description of terms, documents, and
user queries based on this underlying "latent semantic" structure (rather than surface-level word
choice) is used for representing and retrieving information.
Latent Semantic Indexing (LSI) uses singular-value decomposition (SVD), a technique closely
related to eigenvector decomposition and factor analysis (Cullum and Willoughby, 1985). A
large term-document matrix is decomposed into a set of k, typically 100 to 300, orthogonal
factors from which the original matrix can be approximated by linear combination. Instead of
representing documents and queries directly as sets of independent words, LSI represents them
as continuous values on each of the k orthogonal indexing dimensions. Since the number of
factors or dimensions is much smaller than the number of unique terms, words will not be
independent. For example, if two terms are used in similar contexts (documents), they will have
similar vectors in the reduced-dimension LSI representation. The SVD technique can capture
such structure better than simple term-term or document-document correlations and clusters.
LSI partially overcomes some of the deficiencies of assuming independence of words, and
provides a way of dealing with synonymy automatically without the need for a manually
constructed thesaurus. LSI is a completely automatic method. (The Appendix provides a brief
overview of the mathematics underlying the LSI/SVD method. Deerwester et al., 1990, and
Furnas et al., 1988, present additional mathematical details and examples.)
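The decomposition described above can be sketched with a small numerical example. The term-document matrix, vocabulary, and choice of k = 2 below are invented for illustration (they are not TREC data), and the sketch uses raw term frequencies rather than the weighted entries a production LSI system would use.

```python
# Toy illustration of LSI indexing via a truncated SVD.
# The matrix, vocabulary, and k = 2 are invented for illustration; real LSI
# systems use large, weighted term-document matrices and k of 100 to 300.
import numpy as np

# Term-document matrix: rows are terms, columns are documents.
X = np.array([
    [1, 1, 0, 0, 0],   # "car"
    [0, 1, 1, 0, 0],   # "auto"
    [0, 0, 0, 1, 1],   # "fruit"
    [0, 0, 0, 0, 1],   # "apple"
], dtype=float)

k = 2  # number of orthogonal factors retained

# Full SVD: X = U diag(s) Vt, singular values in s sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest factors; Xk is the best rank-k approximation of X.
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
Xk = Uk @ np.diag(sk) @ Vtk

# Reduced-dimension locations of terms and documents in the same k-space.
term_vecs = Uk * sk        # one row per term
doc_vecs = Vtk.T * sk      # one row per document

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Terms used in similar contexts ("car"/"auto") end up close together in the
# reduced space, while unrelated terms ("car"/"fruit") stay far apart.
print(cosine(term_vecs[0], term_vecs[1]))   # high (near 1)
print(cosine(term_vecs[0], term_vecs[2]))   # low (near 0)
```

Because both term and document rows live in the same k-dimensional space, the same cosine comparison can score a query vector against document vectors.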
One can also interpret the analysis performed by SVD geometrically. The result of the SVD is a
vector representing the location of each term and document in the k-dimensional LSI
representation. The location of term vectors reflects the correlations in their usage across
documents. In this space the cosine or dot product between vectors corresponds to their
estimated similarity. Since both term and document vectors are represented in the same space,