NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
S. Dumais
National Institute of Standards and Technology
Donna K. Harman
Appendix
Latent Semantic Indexing (LSI) uses singular-value decomposition (SVD), a technique closely
related to eigenvector decomposition and factor analysis (Cullum and Willoughby, 1985). We
take a large term-document matrix and decompose it into a set of k, typically 100 to 300,
orthogonal factors from which the original matrix can be approximated by linear combination.
More formally, any rectangular matrix X, for example a t × d matrix of terms and documents, can
be decomposed into the product of three other matrices:

    X     =    T0 S0 D0'
  (t×d)      (t×r)(r×r)(r×d)
such that T0 and D0 have orthonormal columns, S0 is diagonal, and r is the rank of X. This is the
so-called singular value decomposition of X.
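The decomposition described above can be sketched numerically. The following is a minimal NumPy illustration, not code from the paper; the tiny 5 × 4 term-document matrix is invented for the example.

```python
import numpy as np

# A hypothetical 5-term x 4-document count matrix (t x d).
X = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# SVD: X = T0 S0 D0', with orthonormal columns in T0 and D0
# and a diagonal S0 of singular values.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)

# The factorization reproduces X (up to floating-point error).
assert np.allclose(T0 @ S0 @ D0t, X)
```

Here `D0t` holds D0' (the transpose) directly, following NumPy's return convention.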
If only the k largest singular values of S0 are kept along with their corresponding columns in the
T0 and D0 matrices, and the rest deleted (yielding matrices S, T and D), the resulting matrix X̂ is
the unique matrix of rank k that is closest in the least-squares sense to X:

    X     ≈     X̂    =    T S D'.
  (t×d)      (t×d)      (t×k)(k×k)(k×d)
The idea is that the X̂ matrix, by containing only the first k independent linear components of X,
captures the major associational structure in the matrix and throws out noise. It is this reduced
model, usually with k ≈ 100, that we use to approximate the term-to-document association data in
X. Since the number of dimensions in the reduced model (k) is much smaller than the number of
unique terms (t), minor differences in terminology are ignored. In this reduced model, the
closeness of documents is determined by the overall pattern of term usage, so documents can be
near each other regardless of the precise words that are used to describe them, and their
description depends on a kind of consensus of their term meanings, thus dampening the effects
of polysemy. In particular, this means that documents which share no words with a user's query
may still be near it if that is consistent with the major patterns of word usage. We use the term
"semantic" indexing to describe our method because the reduced SVD representation captures
the major associative relationships between terms and documents.
One can also interpret the analysis performed by SVD geometrically. The result of the SVD is a
k-dimensional vector representing the location of each term and document in the k-dimensional
representation. The location of term vectors reflects the correlations in their usage across
documents. In this space the cosine or dot product between vectors corresponds to their
estimated similarity. Since both term and document vectors are represented in the same space,
similarities between any combination of terms and documents can be easily obtained. Retrieval
proceeds by using the terms in a query to identify a point in the space, and all documents are
then ranked by their similarity to the query. We make no attempt to interpret the underlying
dimensions or factors, nor to rotate them to some intuitively meaningful orientation. The
analysis does not require us to be able to describe the factors verbally but merely to be able to
represent terms, documents and queries in a way that escapes the unreliability, ambiguity and
redundancy of individual terms as descriptors.
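The retrieval procedure above can be sketched as follows. This is a minimal NumPy illustration under assumed conventions, not the paper's system: a query is treated as a pseudo-document and projected into the reduced space, and all documents are then ranked by cosine similarity to it. The matrix, its dimensions, and the query term indices are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical binary 50-term x 12-document matrix.
X = (rng.random((50, 12)) > 0.7).astype(float)

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 4
T = T0[:, :k]  # reduced term matrix

# Documents in the reduced space: column j of T' X (which equals S D'
# for the retained factors, since X is approximated by T S D').
docs_k = T.T @ X                      # shape (k, d)

# A query is a pseudo-document: project its raw term vector into the
# same k-dimensional space.
q = np.zeros(50)
q[[3, 7, 19]] = 1.0                   # hypothetical query terms
q_k = T.T @ q                         # shape (k,)

# Rank all documents by cosine similarity to the query point.
sims = (docs_k.T @ q_k) / (
    np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k) + 1e-12
)
ranking = np.argsort(-sims)           # best-matching documents first
```

Because query and documents live in the same space, a document sharing no terms with the query can still receive a high similarity if its overall pattern of term usage matches.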