SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
LSI meets TREC: A Status Report
chapter
S. Dumais
National Institute of Standards and Technology
Donna K. Harman
2.2.1 Pre-processing and indexing
This includes the time for sampling documents (if necessary), processing the raw ASCII text,
creating the raw term-document matrix, calculating the log*entropy term weights (which
requires two passes), and transforming the matrix entries using the log*entropy weights. A
combination of C code, awk, and shell scripts was used. The time required depends on
the amount of raw text, the number of terms, and the number of documents.
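The two-pass weighting step can be sketched as follows. This is a minimal illustration, not the C/awk pipeline used for the experiments. It assumes the usual log*entropy form, which this section does not spell out: a local weight of log(tf + 1) and a global weight of 1 + sum_j (p_ij log p_ij) / log(ndocs), with p_ij = tf_ij / gf_i. The function name and the list-of-tokens input format are hypothetical.

```python
import math
from collections import defaultdict


def log_entropy_weights(docs):
    """Weight a small collection with log*entropy (assumed standard form).

    docs: list of token lists, one per document.
    Returns (weighted, global_w): per-document {term: weight} dicts and
    the per-term global (entropy) weights.
    """
    n_docs = len(docs)

    # Raw term-document counts (the "raw term-document matrix", kept sparse).
    counts = []
    for tokens in docs:
        tf = defaultdict(int)
        for term in tokens:
            tf[term] += 1
        counts.append(dict(tf))

    # Pass 1: global frequency of each term over the whole collection.
    gf = defaultdict(int)
    for tf in counts:
        for term, c in tf.items():
            gf[term] += c

    # Pass 2: entropy sum, which needs the global frequencies from pass 1.
    ent_sum = defaultdict(float)
    for tf in counts:
        for term, c in tf.items():
            p = c / gf[term]
            ent_sum[term] += p * math.log(p)

    # Guard against a one-document collection, where log(n_docs) is zero.
    log_n = math.log(n_docs) if n_docs > 1 else 1.0
    global_w = {term: 1.0 + ent_sum[term] / log_n for term in gf}

    # Transform each matrix entry: local log weight times global weight.
    weighted = [
        {term: math.log(c + 1) * global_w[term] for term, c in tf.items()}
        for tf in counts
    ]
    return weighted, global_w
```

Under this assumed form, a term spread evenly over all documents receives a global weight near 0, while a term concentrated in a single document receives a global weight of 1.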
2.2.2 SVD
The SVD program computes the best "k-dimensional" approximation to the transformed term-
document matrix. We used a sparse, iterative Single-Vector Lanczos SVD code (Berry, 1992).
The code is written in ANSI Fortran-77 using double precision arithmetic, and is available from
Netlib. The number of singular values (dimensions) calculated for the TREC subcollections
ranged from 235 to 310. As it turned out, we used only 235 or 250 dimensions for retrieval, so
fewer dimensions could have been computed. Thus some of the reported SVD times are higher
than necessary. In particular, the SVD times for AP1, ZIFF1, and FR1 would be approximately
20% lower if only 250 dimensions had been computed.
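To make the idea of keeping only the k largest singular triplets concrete, here is a small pure-Python sketch. It is not the Lanczos code referred to above: it substitutes plain power iteration with deflation on a dense matrix, which is only practical for toy inputs. All function names and the iteration count are illustrative assumptions.

```python
import math
import random


def matvec(A, x):
    """Compute A x for a dense row-major matrix A."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Compute A^T y without forming the transpose."""
    out = [0.0] * len(A[0])
    for row, yi in zip(A, y):
        for j, a in enumerate(row):
            out[j] += a * yi
    return out


def norm(x):
    return math.sqrt(sum(v * v for v in x))


def truncated_svd(A, k, iters=500, seed=0):
    """Return the k largest singular triplets as a list of (sigma, u, v).

    Power iteration on A^T A, deflating against the right singular
    vectors already found. A stand-in for Lanczos, for illustration only.
    """
    rng = random.Random(seed)
    n = len(A[0])
    triplets = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            # Remove components along previously found directions.
            for _, _, vp in triplets:
                dot = sum(a * b for a, b in zip(v, vp))
                v = [a - dot * b for a, b in zip(v, vp)]
            w = rmatvec(A, matvec(A, v))
            nw = norm(w)
            if nw == 0.0:
                break
            v = [a / nw for a in w]
        Av = matvec(A, v)
        sigma = norm(Av)
        u = [a / sigma for a in Av] if sigma > 0.0 else Av
        triplets.append((sigma, u, v))
    return triplets
```

The rank-k approximation is then the sum of sigma * u * v^T over the returned triplets. Requesting fewer triplets means fewer passes over the matrix, which is why computing only the dimensions actually used for retrieval would have been cheaper.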
2.2.3 Adding new documents
The time required to add new documents includes the time to pre-process and index the text of
the new documents as well as the time to compute the new document vectors.
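Computing a new document vector without recomputing the SVD is commonly done by "folding in". The sketch below assumes the usual projection d^T U_k S_k^{-1} of a weighted term vector onto the existing k-dimensional space. This section does not state which scaling was used, so the formula, the function name, and the dense list representation are all assumptions.

```python
def fold_in(doc_weights, U_k, sigma_k):
    """Project a weighted term vector into an existing k-dimensional space.

    doc_weights: list of term weights, one entry per indexed term.
    U_k: term vectors, one row per term, k columns (left singular vectors).
    sigma_k: the k singular values.
    Returns the k-dimensional document vector d^T U_k S_k^{-1}.
    """
    k = len(sigma_k)
    return [
        sum(d * row[j] for d, row in zip(doc_weights, U_k)) / sigma_k[j]
        for j in range(k)
    ]
```

Each new document costs one pass over its terms, so the "adding new docs" time includes both the pre-processing of the new text and this projection.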
2.2.4 I/O translation
Because several existing tools were patched together for the TREC experiments, there was
some additional I/O translation involved. This will be removed soon.
collection   index    SVD   adding new docs   I/O   TOTAL (mins)
DOE1            49   1219               591   194           2053
WSJ1           241   1474               404   174           2293
AP1            271   1644               455   214           2584
ZIFF1          241   1359               352   156           2108
FR1            241    939                 0   133           1313
WSJ2           427   1382               461   220           2490
AP2            338   1210               273   218           2039
ZIFF2          260   1452                 0   208           1920
FR2            187    486                 0   105            778
Table 2. Summary of LSI/SVD times (in minutes)
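As a quick consistency check on Table 2, the sketch below confirms that each TOTAL equals the sum of the four component times. It also applies the text's estimate that the SVD times for the three affected subcollections would be roughly 20% lower with only 250 dimensions. The values are copied from the table. The function names are hypothetical, and the 20% figure is the paper's approximation rather than a measurement.

```python
# (index, SVD, adding new docs, I/O, reported TOTAL), all in minutes.
TABLE_2 = {
    "DOE1": (49, 1219, 591, 194, 2053),
    "WSJ1": (241, 1474, 404, 174, 2293),
    "AP1": (271, 1644, 455, 214, 2584),
    "ZIFF1": (241, 1359, 352, 156, 2108),
    "FR1": (241, 939, 0, 133, 1313),
    "WSJ2": (427, 1382, 461, 220, 2490),
    "AP2": (338, 1210, 273, 218, 2039),
    "ZIFF2": (260, 1452, 0, 208, 1920),
    "FR2": (187, 486, 0, 105, 778),
}


def check_totals(table):
    """Map each collection to True if its components sum to the reported total."""
    return {name: sum(row[:4]) == row[4] for name, row in table.items()}


def svd_minutes_at_250_dims(table, names=("AP1", "ZIFF1", "FR1"), saving=0.20):
    """Estimated SVD minutes if the stated ~20% saving applied (rounded)."""
    return {name: round(table[name][1] * (1.0 - saving)) for name in names}


if __name__ == "__main__":
    print(check_totals(TABLE_2))
    print(svd_minutes_at_250_dims(TABLE_2))
```

All nine rows are internally consistent, and the estimate puts the AP1 SVD at roughly 1315 minutes instead of 1644.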