NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Latent Semantic Indexing (LSI) and TREC-2
chapter
S. Dumais
National Institute of Standards and Technology
D. K. Harman
representation, but differ in how the term, document
and query vectors are formed.
2. LSI and TREC-1
We used the TREC-1 conference as an opportunity to
"scale up" our tools, and to explore the LSI
dimension-reduction ideas using a very rich corpus of
word usage. We were pleased that we were able to
use many of the existing LSI/SVD tools on the large
collection. (See Dumais, 1993 for details.) We were
able to compute the SVDs of 50k docs x 75k words
matrices without numerical or convergence problems
on a standard DEC5000 or Sparc10 workstation.
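The dimension reduction at the heart of LSI can be illustrated with a small dense example. (This is a toy sketch using NumPy's dense SVD; matrices of the size reported here would require sparse, iterative Lanczos-style solvers, but the reduced representation is formed the same way.)

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 2.0],
])

# Full SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep only the k largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents are now represented in a k-dimensional "latent" space:
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]   # one column per document
print(doc_vectors.shape)                    # (2, 4)
```

Queries are folded into the same reduced space and compared to the document columns, typically by cosine similarity.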
Because of these limits on the size of the matrices we
could handle, we divided the TREC-1 documents into
9 separate subcollections (AP1, DOE1, FR1, WSJ1,
ZIFF1, AP2, FR2, WSJ2, ZIFF2) and worked with
these. There are also some theoretical reasons why
these. There are also some theoretical reasons why
working with subcollections makes sense. By using
topically coherent subcollections one can get better
discrimination among documents. When the ZIFF
subcollection is analyzed separately, for example, 200
or so dimensions can be devoted to differences among
computer-related topics. When these same documents
are part of a large corpus, many fewer indexing
dimensions will be devoted to discriminating among
them.
In terms of accuracy, LSI performance in TREC-1
was about average. There were some obvious
problems with our initial pre-processing of documents
(e.g., "U.S." and "A.T.&T." were omitted), and there
were some unanticipated problems in combining
results across subcollections. Since many of the top
performing automatic systems used SMART's
preprocessing, we chose to do so as well for TREC-2.
In addition, using the SMART software for this
purpose allows us to compare LSI with the
comparable vector method, so that we can examine
the contribution of LSI per se. We also planned on
comparing an LSI analysis using subcollections (from
TREC-1) with an LSI analysis of the entire collection
for TREC-2.
We had high hopes of being able to build on our
TREC-1 work for TREC-2. In practice, however, the
changes in the pre-processing algorithm, and the
decision to use a single combined LSI analysis,
resulted in our "starting from scratch" in many
respects. We have completed some experiments for
TREC-2, but we did not get as far as we would have
liked, especially for the adhoc topics.
3. LSI and TREC-2
3.1 Pre-processing
We used the SMART system1 for pre-processing the
documents and queries. Some markups (e.g. <>
delimiters) were removed, and all hand-indexed
entries were removed from the WSJ and ZIFF
collections. Upper case characters were translated into
lower case, punctuation was removed, and white
spaces were used to delimit terms. The SMART stop
list of 571 words was used as is. The SMART
stemmer (a modified Lovins algorithm) was used
without modification to strip word endings. We did
not use: phrases, proper noun identification, word
sense disambiguation, a thesaurus, syntactic or
semantic parsing, spelling checking or correction,
complex tokenizers, a controlled vocabulary, or any
manual indexing.
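The tokenization step described above can be sketched as follows. (This is a simplified stand-in: the stop list here is a tiny placeholder for SMART's 571-word list, and the Lovins-style stemming is omitted rather than reimplemented.)

```python
import re

# Tiny placeholder stop list; SMART's actual list has 571 words.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    tokens = text.split()                   # white space delimits terms
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The SVDs of the term-document matrix."))
# ['svds', 'term', 'document', 'matrix']
```

Note that, as in the paper's TREC-1 runs, naive punctuation stripping is exactly what loses tokens like "U.S." — one motivation for adopting SMART's more careful preprocessing.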
The result of this pre-processing can be thought of as a
term-document matrix, in which each cell entry
indicates the frequency with which a term appears in a
document. The entries in the term-document matrix
were then transformed using an "ltc" weighting. The
"ltc" weighting takes the log of individual cell entries,
multiplies each entry for a term (row) by the IDF
weight of the term, and then normalizes the document
(column) length.
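A minimal sketch of this weighting, under one common reading of SMART's "ltc" (1 + log of nonzero term frequencies, IDF as log(N/df), cosine normalization of each document column) — the exact SMART formulas may differ in detail:

```python
import numpy as np

def ltc_weight(A):
    """Apply an 'ltc'-style weighting to a term-document matrix A (terms x docs).

    l: 1 + log(tf) for nonzero cells
    t: idf = log(N / df) per term (row)
    c: cosine-normalize each document (column)
    Assumes every term occurs in at least one document.
    """
    n_terms, n_docs = A.shape
    L = np.zeros_like(A, dtype=float)
    mask = A > 0
    L[mask] = 1.0 + np.log(A[mask])          # log-damped term frequencies
    df = mask.sum(axis=1)                     # document frequency per term
    idf = np.log(n_docs / df)
    W = L * idf[:, None]                      # weight each row by its IDF
    norms = np.linalg.norm(W, axis=0)
    norms[norms == 0] = 1.0                   # guard against all-zero columns
    return W / norms                          # unit-length document vectors
```

After this transform, the inner product of two columns is directly their cosine similarity.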
We began by processing the 742358 documents from
CD-1 and CD-2. Using the minimal pre-processing
described above, there were 960765 unique tokens,
512251 unique stems, and 81901331 non-zero entries
in the term-document matrix. 742331 documents
contained at least one term. To decrease the matrix to
a size we thought we could handle, we removed
tokens occurring in fewer than 5 documents. This
resulted in 781421 unique tokens, 104533 unique
stemmed words, and 81252681 non-zero entries. We
used the resulting 742331 document x 104533 term
matrix as the starting point for results reported in this
paper. The "ltc" weights were computed on this
matrix.
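The vocabulary pruning step — dropping terms that occur in fewer than 5 documents — can be sketched like this (a simple illustration, not the tooling actually used):

```python
from collections import Counter

def prune_rare_terms(docs, min_df=5):
    """Drop terms occurring in fewer than min_df documents.

    docs: list of token lists, one per document.
    Returns the surviving vocabulary and the filtered documents.
    """
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))        # count each term once per document
    vocab = {t for t, n in df.items() if n >= min_df}
    filtered = [[t for t in tokens if t in vocab] for tokens in docs]
    return vocab, filtered
```

Because term frequencies in text are heavily skewed toward rare words, a cutoff like this removes a large fraction of the vocabulary (and hence matrix rows) while discarding relatively few of the matrix's non-zero entries, as the figures above show.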
3.2 SVD analysis
The "ltc" matrix described above was used as input to
the SVD algorithm. The SVD program takes the ite
1. The SMART system (version 11) was made available through
the SMART group at Cornell University. Chris Buckley was
especially generous in consultations about how to get the
software to do somewhat non-standard things.