NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Latent Semantic Indexing (LSI) and TREC-2
S. Dumais
National Institute of Standards and Technology
D. K. Harman
7. Appendix
Latent Semantic Indexing (LSI) uses singular-value
decomposition (SVD), a technique closely related to
eigenvector decomposition and factor analysis
(Cullum and Willoughby, 1985). We take a large
term-document matrix and decompose it into a set of
k, typically 100 to 300, orthogonal factors from which
the original matrix can be approximated by linear
combination.
More formally, any rectangular matrix, X, for
example a txd matrix of terms and documents, can be
decomposed into the product of three other matrices:
X = T0 S0 D0',
such that T0 and D0 have orthonormal columns, S0 is
diagonal, and r is the rank of X. This is the so-called
singular value decomposition of X.
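As a concrete illustration (not part of the original analysis), the decomposition can be computed in a few lines of Python with NumPy; the tiny term-document matrix below is invented for the example.

    # A minimal sketch of the decomposition just described. Python and NumPy
    # are illustrative choices; the 5 x 4 matrix is made up for the example.
    import numpy as np

    # Rows are terms, columns are documents; entries are term weights.
    X = np.array([[1., 0., 1., 0.],
                  [1., 1., 0., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 1.],
                  [1., 0., 0., 1.]])

    # Full SVD: X = T0 @ diag(s0) @ D0', with orthonormal columns in T0 and
    # D0 and the singular values in decreasing order on the diagonal of S0.
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

    # The product of the three matrices recovers X up to rounding error.
    assert np.allclose(T0 @ np.diag(s0) @ D0t, X)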
If only the k largest singular values of S0 are kept
along with their corresponding columns in the T0 and
D0 matrices, and the rest deleted (yielding matrices S,
T and D), the resulting matrix, X̂, is the unique matrix
of rank k that is closest in the least squares sense to X:
X ≈ X̂ = T S D'.
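Continuing the illustrative sketch above, the truncation can be written as follows (k = 2 is an arbitrary choice for the toy matrix):

    # Keep the k largest singular values and the matching columns of T0 and D0.
    k = 2
    T = T0[:, :k]              # t x k
    S = np.diag(s0[:k])        # k x k
    Dt = D0t[:k, :]            # k x d

    X_hat = T @ S @ Dt         # best rank-k least-squares approximation of X
    print(np.linalg.norm(X - X_hat))   # Frobenius-norm residual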
The idea is that the X̂ matrix, by containing only the
first k independent linear components of X, captures
the major associational structure in the matrix and
throws out noise. It is this reduced model, usually
with k ≈ 100, that we use to approximate the term to
document association data in X. Since the number of
dimensions in the reduced model (k) is much smaller
than the number of unique terms (t), minor
differences in terminology are ignored. In this
reduced model, the closeness of documents is
determined by the overall pattern of term usage, so
documents can be near each other regardless of the
precise words that are used to describe them, and their
description depends on a kind of consensus of their
term meanings, thus dampening the effects of
polysemy. In particular, this means that documents
which share no words with a user's query may still be
near it if that is consistent with the major patterns of
word usage. We use the term "semantic" indexing to
describe our method because the reduced SVD
representation captures the major associative
relationships between terms and documents.
One can also interpret the analysis performed by SVD
geometrically. The result of the SVD is a k-
dimensional vector representing the location of each
term and document in the k-dimensional
representation. The location of term vectors reflects
the correlations in their usage across documents. In
this space the cosine or dot product between vectors
corresponds to their estimated similarity. Since both
term and document vectors are represented in the
same space, similarities between any combination of
terms and documents can be easily obtained.
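As a rough sketch of this geometric view, continuing the example above (scaling the document coordinates by the singular values, D S, is a common LSI convention assumed here, not a detail fixed by the text):

    # One k-dimensional vector per document; cosine estimates similarity.
    docs = Dt.T @ S            # d x k document coordinates

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Cosine between two document vectors in the reduced space.
    print(cosine(docs[0], docs[2]))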
Retrieval proceeds by using the terms in a query to
identify a point in the space, and all documents are
then ranked by their similarity to the query. We make
no attempt to interpret the underlying dimensions or
factors, nor to rotate them to some intuitively
meaningful orientation. The analysis does not require
us to be able to describe the factors verbally but
merely to be able to represent terms, documents and
queries in a way that escapes the unreliability,
ambiguity and redundancy of individual terms as
descriptors.
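A hedged sketch of this retrieval step, continuing the example (the query vector is invented, and mapping it by T so that it lands in the same scaled space as the document vectors is an assumed convention):

    # Raw term vector for the query: terms 0 and 4 occur in it.
    q = np.array([1., 0., 0., 0., 1.])

    # q' T places the query at a point in the k-dimensional space, scaled
    # consistently with the D S document coordinates used above.
    q_vec = q @ T

    # Rank all documents by similarity to the query point.
    scores = np.array([cosine(q_vec, d) for d in docs])
    ranking = np.argsort(-scores)          # document indices, best match first
    print(ranking, scores[ranking])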
Choosing the appropriate number of dimensions for
the LSI representation is an open research question.
Ideally, we want a value of k that is large enough to fit
all the real structure in the data, but small enough so
that we do not also fit the sampling error or
unimportant details. If too many dimensions are used,
the method begins to approximate standard vector
methods and loses its power to represent the similarity
between words. If too few dimensions are used, there
is not enough discrimination among similar words and
documents. We find that performance improves as k
increases for a while, and then decreases (Dumais,
1991). That LSI typically works well with a relatively
small (compared to the number of unique terms)
number of dimensions shows that these dimensions
are, in fact, capturing a major portion of the
meaningful structure.
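As a toy illustration of this tradeoff, one can sweep k on the example above and watch the rank-k residual shrink; a real evaluation would measure retrieval performance (e.g., precision) at each k rather than fit alone.

    # Reconstruction error of the rank-k model for each possible k.
    for k in range(1, len(s0) + 1):
        X_k = T0[:, :k] @ np.diag(s0[:k]) @ D0t[:k, :]
        print(k, round(float(np.linalg.norm(X - X_k)), 4))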