Latent Semantic Indexing (LSI) and TREC-2

Susan T. Dumais
Bellcore
445 South St.
Morristown, NJ 07960
std@bellcore.com

1. Overview of Latent Semantic Indexing

Latent Semantic Indexing (LSI) is an extension of the vector retrieval method (e.g., Salton & McGill, 1983) in which the dependencies between terms are explicitly taken into account in the representation and exploited in retrieval. This is done by simultaneously modeling all the interrelationships among terms and documents. We assume that there is some underlying or "latent" structure in the pattern of word usage across documents, and use statistical techniques to estimate this latent structure. A description of terms, documents and user queries based on the underlying, "latent semantic", structure (rather than surface level word choice) is used for representing and retrieving information. One advantage of the LSI representation is that a query can be very similar to a document even when they share no words.

Latent Semantic Indexing (LSI) uses singular-value decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis (Cullum and Willoughby, 1985), to model the associative relationships. A large term-document matrix is decomposed into a set of k, typically 100 to 300, orthogonal factors from which the original matrix can be approximated by linear combination. Instead of representing documents and queries directly as sets of independent words, LSI represents them as continuous values on each of the k orthogonal indexing dimensions. Since the number of factors or dimensions is much smaller than the number of unique terms, words will not be independent. For example, if two terms are used in similar contexts (documents), they will have similar vectors in the reduced-dimension LSI representation. The SVD technique can capture such structure better than simple term-term or document-document correlations and clusters. LSI partially overcomes some of the deficiencies of assuming independence of words, and provides a way of dealing with synonymy automatically without the need for a manually constructed thesaurus. LSI is a completely automatic method. The Appendix provides a brief overview of the mathematics underlying the LSI/SVD method. (Deerwester et al., 1990, and Furnas et al., 1988 present additional mathematical details and examples.)

One can also interpret the analysis performed by SVD geometrically. The result of the SVD is a vector representing the location of each term and document in the k-dimensional LSI representation. The location of term vectors reflects the correlations in their usage across documents. In this space the cosine or dot product between vectors corresponds to their estimated similarity. Retrieval typically proceeds by using the terms in a query to identify a point in the space, and all documents are then ranked by their similarity to the query. However, since both term and document vectors are represented in the same space, similarities between any combination of terms and documents can be easily obtained.
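To make the mechanics concrete, the short Python sketch below (not part of the original paper) illustrates the kind of computation described above: a toy term-document matrix is decomposed with a truncated SVD, a query is folded into the k-dimensional space as q^T U_k S_k^{-1}, and documents are ranked by cosine similarity. The toy corpus, the value of k, and the helper names are illustrative assumptions, not the actual Bellcore TREC system or its weighting scheme.

```python
import numpy as np

# Toy term-document count matrix X (terms x documents). In practice the matrix
# is built from the tokenized collection, with term weighting applied.
terms = ["human", "interface", "computer", "user", "system", "survey"]
docs = {
    "d1": "human interface computer",
    "d2": "user interface system computer",
    "d3": "system human system survey",
    "d4": "survey user",
}
X = np.array([[text.split().count(t) for text in docs.values()] for t in terms],
             dtype=float)

k = 2  # retained LSI dimensions (typically 100-300 for a real collection)

# Truncated SVD: X is approximated by U_k * diag(s_k) * Vt_k
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

doc_vecs = Vt_k.T  # row j: document j as a point in the k-dimensional space

def fold_in_query(words):
    """Project a bag-of-words query into the k-dimensional LSI space."""
    q = np.array([words.count(t) for t in terms], dtype=float)
    return q @ U_k / s_k  # q^T U_k S_k^{-1}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# Rank all documents by similarity to the query point.
query_vec = fold_in_query("human computer interaction".split())
scores = {name: cosine(query_vec, doc_vecs[j]) for j, name in enumerate(docs)}
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, round(scores[name], 3))
```

Because the comparison happens in the reduced space rather than on raw term overlap, a document can score well against this query even if it shares no words with it, which is the advantage noted above.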
The LSI method has been applied to many of the standard IR collections with favorable results. Using the same tokenization and term weightings, the LSI method has equaled or outperformed standard vector methods and other variants in almost every case, and was as much as 30% better in some cases (Deerwester et al., 1990). As with the standard vector method, differential term weighting and relevance feedback both improve LSI performance substantially (Dumais, 1991). LSI has also been applied in experiments on relevance feedback (Dumais and Schmitt, 1991), and in filtering applications (Foltz and Dumais, 1992). The recent MatchPlus system described by Gallant et al. (1992) is related to LSI. Both systems model the relationships between terms by looking at the similarity of the contexts in which words are used, and exploit these associations to improve retrieval. Both systems use a reduced dimension vector