NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Latent Semantic Indexing (LSI) and TREC-2
S. Dumais
National Institute of Standards and Technology
D. K. Harman
7. Appendix
Latent Semantic Indexing (LSI) uses singular-value
decomposition (SVD), a technique closely related to
eigenvector decomposition and factor analysis
(Cullum and Willoughby, 1985). We take a large
term-document matrix and decompose it into a set of
k, typically 100 to 300, orthogonal factors from which
the original matrix can be approximated by linear
combination.
More formally, any rectangular matrix, X, for
example a txd matrix of terms and documents, can be
decomposed into the product of three other matrices:
X = T0 S0 D0',
such that T0 and D0 have orthonormal columns, S0 is
diagonal, and r is the rank of X. This is the so-called
singular value decomposition of X.
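As a concrete illustration (not part of the original analysis), the decomposition can be computed in a few lines of Python with NumPy; the tiny term-document matrix below is invented for the example.

    # A minimal sketch of the decomposition just described. Python and NumPy
    # are illustrative choices; the 5 x 4 matrix is made up for the example.
    import numpy as np

    # Rows are terms, columns are documents; entries are term weights.
    X = np.array([[1., 0., 1., 0.],
                  [1., 1., 0., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 1.],
                  [1., 0., 0., 1.]])

    # Full SVD: X = T0 @ diag(s0) @ D0', with orthonormal columns in T0 and
    # D0 and the singular values in decreasing order on the diagonal of S0.
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

    # The product of the three matrices recovers X up to rounding error.
    assert np.allclose(T0 @ np.diag(s0) @ D0t, X)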
If only the k largest singular values of S0 are kept
along with their corresponding columns in the T0 and
D0 matrices, and the rest deleted (yielding matrices S,
T and D), the resulting matrix, X̂, is the unique matrix
of rank k that is closest in the least squares sense to X:
X ≈ X̂ = T S D'.
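Continuing the illustrative sketch above, the truncation can be written as follows (k = 2 is an arbitrary choice for the toy matrix):

    # Keep the k largest singular values and the matching columns of T0 and D0.
    k = 2
    T = T0[:, :k]              # t x k
    S = np.diag(s0[:k])        # k x k
    Dt = D0t[:k, :]            # k x d

    X_hat = T @ S @ Dt         # best rank-k least-squares approximation of X
    print(np.linalg.norm(X - X_hat))   # Frobenius-norm residual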
The idea is that the X̂ matrix, by containing only the
first k independent linear components of X, captures
the major associational structure in the matrix and
throws out noise. It is this reduced model, usually
with k ≈ 100, that we use to approximate the term to
document association data in X. Since the number of
dimensions in the reduced model (k) is much smaller
than the number of unique terms (t), minor
differences in terminology are ignored. In this
reduced model, the closeness of documents is
determined by the overall pattern of term usage, so
documents can be near each other regardless of the
precise words that are used to describe them, and their
description depends on a kind of consensus of their
term meanings, thus dampening the effects of
polysemy. In particular, this means that documents
which share no words with a user's query may still be
near it if that is consistent with the major patterns of
word usage. We use the term "semantic" indexing to
describe our method because the reduced SVD
representation captures the major associative
relationships between terms and documents.
One can also interpret the analysis performed by SVD
geometrically. The result of the SVD is a k-
dimensional vector representing the location of each
term and document in the k-dimensional
representation. The location of term vectors reflects
the correlations in their usage across documents. In
this space the cosine or dot product between vectors
corresponds to their estimated similarity. Since both
term and document vectors are represented in the
same space, similarities between any combination of
terms and documents can be easily obtained.
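As a rough sketch of this geometric view, continuing the example above (scaling the document coordinates by the singular values, D S, is a common LSI convention assumed here, not a detail fixed by the text):

    # One k-dimensional vector per document; cosine estimates similarity.
    docs = Dt.T @ S            # d x k document coordinates

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Cosine between two document vectors in the reduced space.
    print(cosine(docs[0], docs[2]))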
Retrieval proceeds by using the terms in a query to
identify a point in the space, and all documents are
then ranked by their similarity to the query. We make
no attempt to interpret the underlying dimensions or
factors, nor to rotate them to some intuitively
meaningful orientation. The analysis does not require
us to be able to describe the factors verbally but
merely to be able to represent terms, documents and
queries in a way that escapes the unreliability,
ambiguity and redundancy of individual terms as
descriptors.
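A hedged sketch of this retrieval step, continuing the example (the query vector is invented, and mapping it by T so that it lands in the same scaled space as the document vectors is an assumed convention):

    # Raw term vector for the query: terms 0 and 4 occur in it.
    q = np.array([1., 0., 0., 0., 1.])

    # q' T places the query at a point in the k-dimensional space, scaled
    # consistently with the D S document coordinates used above.
    q_vec = q @ T

    # Rank all documents by similarity to the query point.
    scores = np.array([cosine(q_vec, d) for d in docs])
    ranking = np.argsort(-scores)          # document indices, best match first
    print(ranking, scores[ranking])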
Choosing the appropriate number of dimensions for
the LSI representation is an open research question.
Ideally, we want a value of k that is large enough to fit
all the real structure in the data, but small enough so
that we do not also fit the sampling error or
unimportant details. If too many dimensions are used,
the method begins to approximate standard vector
methods and loses its power to represent the similarity
between words. If too few dimensions are used, there
is not enough discrimination among similar words and
documents. We find that performance improves as k
increases for a while, and then decreases (Dumais,
1991). That LSI typically works well with a relatively
small (compared to the number of unique terms)
number of dimensions shows that these dimensions
are, in fact, capturing a major portion of the
meaningful structure.
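As a toy illustration of this tradeoff, one can sweep k on the example above and watch the rank-k residual shrink; a real evaluation would measure retrieval performance (e.g., precision) at each k rather than fit alone.

    # Reconstruction error of the rank-k model for each possible k.
    for k in range(1, len(s0) + 1):
        X_k = T0[:, :k] @ np.diag(s0[:k]) @ D0t[:k, :]
        print(k, round(float(np.linalg.norm(X - X_k)), 4))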