NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Latent Semantic Indexing (LSI) and TREC-2
Susan T. Dumais
Bellcore
445 South St.
Morristown, NJ 07960
std@bellcore.com
1. Overview of Latent Semantic Indexing
Latent Semantic Indexing (LSI) is an extension of the
vector retrieval method (e.g., Salton & McGill, 1983)
in which the dependencies between terms are
explicitly taken into account in the representation and
exploited in retrieval. This is done by simultaneously
modeling all the interrelationships among terms and
documents. We assume that there is some underlying
or "latent" structure in the pattern of word usage
across documents, and use statistical techniques to
estimate this latent structure. A description of terms,
documents and user queries based on the underlying,
"latent semantic11, structure (rather than surface level
word choice) is used for representing and retrieving
information. One advantage of the LSI representation
is that a query can be very similar to a document even
when they share no words.
Latent Semantic Indexing (LSI) uses singular-value
decomposition (SVD), a technique closely related to
eigenvector decomposition and factor analysis
(Cullum and Willoughby, 1985), to model the
associative relationships. A large term-document
matrix is decomposed into a set of k, typically 100
to 300, orthogonal factors from which the original
matrix can be approximated by linear combination.
Instead of representing documents and queries directly
as sets of independent words, LSI represents them as
continuous values on each of the k orthogonal
indexing dimensions. Since the number of factors or
dimensions is much smaller than the number of unique
terms, words will not be independent. For example, if
two terms are used in similar contexts (documents),
they will have similar vectors in the reduced-
dimension LSI representation. The SVD technique
can capture such structure better than simple term-
term or document-document correlations and clusters.
LSI partially overcomes some of the deficiencies of
assuming independence of words, and provides a way
of dealing with synonymy automatically without the
need for a manually constructed thesaurus. LSI is a
completely automatic method. The Appendix
provides a brief overview of the mathematics
underlying the LSI/SVD method. (Deerwester et al.,
1990, and Furnas et al., 1988, present additional
mathematical details and examples.)
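As a concrete illustration, the following sketch (not part of the original paper, and using NumPy rather than the authors' software) applies a truncated SVD to a tiny hand-made term-document matrix; the 4x4 matrix and k=2 are illustrative assumptions, whereas the paper retains roughly 100 to 300 dimensions.

    import numpy as np

    # Rows are terms, columns are documents (raw or weighted frequencies).
    A = np.array([
        [1, 0, 1, 0],   # "car"
        [1, 1, 0, 0],   # "auto"
        [0, 1, 0, 1],   # "engine"
        [0, 0, 1, 1],   # "road"
    ], dtype=float)

    k = 2                                        # number of retained factors
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

    # Rank-k approximation of A: a linear combination of the k factors.
    A_k = Uk @ Sk @ Vtk

    # k-dimensional vectors for every term and every document.
    term_vectors = Uk @ Sk            # one row per term
    doc_vectors = (Sk @ Vtk).T        # one row per document
    print(term_vectors.shape, doc_vectors.shape)   # (4, 2) (4, 2)

Because k is smaller than the number of unique terms, the rows of term_vectors are no longer independent: terms that occur in similar sets of documents tend to receive similar coordinates.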
One can also interpret the analysis performed by SVD
geometrically. The result of the SVD is a vector
representing the location of each term and document
in the k-dimensional LSI representation. The location
of term vectors reflects the correlations in their usage
across documents. In this space the cosine or dot
product between vectors corresponds to their
estimated similarity. Retrieval typically proceeds by
using the terms in a query to identify a point in the
space, and all documents are then ranked by their
similarity to the query. However, since both term and
document vectors are represented in the same space,
similarities between any combination of terms and
documents can be easily obtained.
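A short continuation of the sketch above shows one common way to carry out such a retrieval step; projecting the query with Uk (so that it lands in the same coordinates as doc_vectors) and the specific variable names are assumptions of this illustration, not a prescription from the paper.

    # Project a query into the k-dimensional space and rank all documents
    # by cosine similarity to it.  Uk and doc_vectors come from the
    # previous snippet; the query is mapped with Uk^T, the same
    # projection that takes a document column to its row in doc_vectors.
    def fold_in_query(term_indices, Uk):
        q = np.zeros(Uk.shape[0])
        q[term_indices] = 1.0          # simple 0/1 query-term vector
        return q @ Uk

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    q_k = fold_in_query([0], Uk)       # a one-word query: term 0 ("car")
    scores = [cosine(q_k, d) for d in doc_vectors]
    ranking = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    print(ranking)                     # documents ordered by estimated similarity

Because the comparison is made in the reduced space, a document can receive a high score even if it contains none of the query's words, provided its terms occur in similar contexts.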
The LSI method has been applied to many of the
standard IR collections with favorable results. Using
the same tokenization and term weightings, the LSI
method has equaled or outperformed standard vector
methods and other variants in almost every case, and
was as much as 30% better in some cases (Deerwester
et al., 1990). As with the standard vector method,
differential term weighting and relevance feedback
both improve LSI performance substantially (Dumais,
1991). LSI has also been applied in experiments on
relevance feedback (Dumais and Schmitt, 1991), and
in filtering applications (Foltz and Dumais, 1992).
The recent MatchPlus system described by Gallant et
al. (1992) is related to LSI. Both systems model the
relationships between terms by looking at the
similarity of the contexts in which words are used, and
exploit these associations to improve retrieval. Both
systems use a reduced dimension vector