NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Latent Semantic Indexing (LSI) and TREC-2
S. Dumais
National Institute of Standards and Technology
D. K. Harman
transformed term-document matrix as input, and
calculates the best "reduced-dimension91
approximation to this matrix. The result of the SVD
analysis is a reduced-dimension vector for each term
and each document, and a vector of the singular
values. The number of dimensions, k, was between
200 and 300 in our experiments. This reduced-
dimensional representation is used for retrieval. The
cosine between term-term, document-document, or
term-document vectors is used as the measure of
similarity between them.
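The reduced-dimension representation described above can be sketched with a small example. The matrix values and k below are illustrative only (the paper's matrices are far larger, with k between 200 and 300):

```python
import numpy as np

# Toy term-document matrix A (rows = terms, cols = documents);
# in the paper this would be the weight-transformed matrix.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

k = 2  # number of retained dimensions (200-300 in the paper)

# Full SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k truncation: a k-dim vector per term, per document, plus singular values
term_vecs = U[:, :k]
doc_vecs = Vt[:k, :].T
sing_vals = s[:k]

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Similarity between documents 0 and 2 in the reduced space
# (doc vectors scaled by the singular values before comparison)
d0 = doc_vecs[0] * sing_vals
d2 = doc_vecs[2] * sing_vals
sim = cosine(d0, d2)
print(round(sim, 3))
```

The same cosine can be taken between two term vectors or between a term vector and a document vector, as the text notes.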
We were recently able to compute the SVD analysis
of the full 742k x 104k matrix described above, but
not in time to include the results in this paper. For the
runs submitted, we used a sample of documents from
the above matrix. When appropriate, the documents
that were not sampled were "folded in" to the resulting
reduced-dimension LSI space. In all cases, we used
the weights from the 742k x 104k matrix (and did not
recompute them for our samples).
For the routing experiments, we used the subset of
documents for which we had relevance judgements.
There were 68385 unique documents with relevance
judgements. The SVD analysis was computed on the
relevant 68385 document x 88112 term subset of the
above matrix, containing 14461782 non-zero cells. A
204 reduced-dimensional approximation took 18 hrs
of CPU time to compute on a Sparc10 workstation.
This 204-dimensional representation was used for
matching and retrieval.
For the adhoc experiments, we took a random sample
of 70000 documents. A reduced-dimensional SVD
approximation was computed on a 69997 document x
82968 term matrix (7666044 non-zeros). The
resulting 199 reduced-dimensional representation was
used for retrieval. The 672331 documents not
included in this sample were "folded in" to the 199-
dimension LSI space, and the adhoc queries were
compared against all 742k documents.
These "folded in" documents were located at the
weighted vector sum of their constituent terms. That
is, the vector for a new document was computed using
the term vectors for all terms in the document. For
documents that are actually present in the term-
document matrix, this derived vector corresponds
exactly to the document vector given by the SVD.
New terms can be added in an analogous fashion. The
vector for new terms is computed using the document
vectors of all documents in which the term appears.
When adding documents and terms in this manner, we
assume that the derived "semantic space" is fixed and
that new items can be fit into it. In general, this is not
the same space that one would obtain if a new SVD
were calculated using both the original and new
documents. In previous experiments, we found that
sampling and scaling 50% of the documents, and
"folding in" the remaing documents resulted in
peiformance that was indistinguishable from that
observed when all documents were scaled. Here,
however, the scaling is based on less than 10% of the
total corpus.
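The "folding in" operation can be sketched concretely. Placing a new document at the weighted vector sum of its term vectors amounts to projecting its term-weight vector d through the truncated SVD factors, d_hat = d^T U_k diag(1/s_k); for a document already in the matrix this reproduces its SVD document vector exactly, as the text states. The matrix below is a toy stand-in:

```python
import numpy as np

# Toy setup: truncated SVD of an existing term-document matrix
# (illustrative values, not the paper's data)
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

def fold_in(doc_term_weights):
    """Place a new document at the scaled weighted sum of its
    constituent term vectors: d_hat = d^T U_k diag(1/s_k)."""
    return doc_term_weights @ Uk / sk

# Folding in a column already present in A reproduces its SVD doc vector
d_hat = fold_in(A[:, 2])
assert np.allclose(d_hat, Vk[2])
```

New terms are handled symmetrically, using the document vectors of the documents in which the term appears. Note that folding in leaves U_k and s_k untouched, which is exactly why the resulting space differs from one recomputed over old and new documents together.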
We also had (from TREC-1) LSI analyses of the 9
subcollections in CD-1 and CD-2.
3.3 Queries and retrieval
Queries were automatically processed in the same way
as documents. For queries derived from the topic
statement, we began with the full text of each topic
(all topic fields), and stripped out the SGML field
identifiers. For feedback queries, we used the full text
of relevant documents. A query vector (or new
document in the case of routing) indicating the
frequency with which each term appears in the query
was automatically generated for each topic. The
query was transformed using SMART's "ltc"
weighting.
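One common reading of SMART's "ltc" scheme is: logarithmic term frequency (l), idf (t), and cosine normalization (c). The sketch below follows that reading; exact SMART variants differ in details, and the term counts and collection statistics are hypothetical:

```python
import math

def ltc_weights(tf_counts, doc_freq, n_docs):
    """SMART 'ltc' weighting as commonly defined:
    l: 1 + ln(tf);  t: ln(N/df) idf;  c: cosine-normalize the vector.
    (A sketch; exact SMART variants differ slightly.)"""
    raw = {t: (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[t])
           for t, tf in tf_counts.items() if tf > 0}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

# Hypothetical query term counts and collection statistics
q = ltc_weights({"latent": 1, "semantic": 2, "indexing": 1},
                doc_freq={"latent": 50, "semantic": 200, "indexing": 100},
                n_docs=742_000)
print({t: round(w, 3) for t, w in q.items()})
```

After this transformation the query vector has unit length, so the dot product with a unit-length document vector is directly the cosine.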
Note that we did not use any Boolean connectors or
proximity operators in query formulation. (The
implicit connectives, as in ordinary vector methods,
fall somewhere between ORs and ANDs, but with an
additional kind of "fuzziness" introduced by the
dimension-reduced association matrix representation
of terms and documents.)
The terms in the query are used to identify a vector in
the LSI space; recall that each term has a vector
representation in the space. A query is simply located
at the weighted vector sum of its constituent term
vectors. The cosine between the query vector and
every document vector is computed, and documents
are ranked in decreasing order of similarity to the
query. (Although there are many fewer dimensions
than in standard vector retrieval, the entries are almost
all non-zero so inverted indices are not useful. This
means that each query must be compared to every
document.)
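The retrieval step above reduces to one dense matrix-vector product followed by a sort. The sketch below uses random unit vectors as stand-ins for real LSI vectors (k = 200 as in the paper; 1000 documents for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_docs = 200, 1000   # 200 dims as in the paper; small doc count for the sketch

# Pre-normalized dense document vectors in the reduced space
docs = rng.standard_normal((n_docs, k))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# Query = weighted sum of its terms' vectors (a random stand-in here)
query = rng.standard_normal(k)
query /= np.linalg.norm(query)

# Dense vectors rule out inverted indices: score every document
scores = docs @ query           # cosines, since everything is unit-length
ranking = np.argsort(-scores)   # document indices in decreasing similarity
print(ranking[:5])
```

Because the reduced-dimension vectors are almost entirely non-zero, this exhaustive comparison is unavoidable, which is what makes the per-query timings reported next relevant.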
For 200-dimensional vectors, about 60,000 cosines
can be computed per minute on a Sparc2. This time
includes both comparison time and ranking time, and
assumes that all document vectors are pre-loaded into
memory. For the adhoc queries, the time to compare a
query to the 742k documents was about 10 minutes if