SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) LSI meets TREC: A Status Report chapter S. Dumais National Institute of Standards and Technology Donna K. Harman

2.2.1 Pre-processing and indexing
This includes the time for sampling documents (if necessary), processing the raw ASCII text, creating the raw term-document matrix, calculating the log*entropy term weights (which requires two passes), and transforming the matrix entries using the log*entropy weights. A combination of C code, awk, and shell scripts was used. The time required for this depends on the amount of raw text, the number of terms, and the number of documents.

2.2.2 SVD
The SVD program computes the best "k-dimensional" approximation to the transformed term-document matrix. We used a sparse, iterative Single-Vector Lanczos SVD code (Berry, 1992). The code is written in ANSI Fortran-77 using double precision arithmetic, and is available from Netlib. The number of singular values (dimensions) calculated for the TREC subcollections ranged from 235 to 310. As it turned out, we used only 235 or 250 dimensions for retrieval, so fewer dimensions could have been computed. Thus some of the reported SVD times are higher than necessary; in particular, the SVD times for AP1, ZIFF1, and FR1 would be approximately 20% lower if only 250 dimensions had been computed.

2.2.3 Adding new documents
The time required to add new documents includes the time to pre-process and index the text of the new documents as well as the time to compute the new document vectors.

2.2.4 I/O translation
Because several existing tools were patched together for the TREC experiments, there was some additional I/O translation involved. This will be removed soon.
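
As a rough illustration of the steps in Sections 2.2.1 to 2.2.3, the following Python sketch weights a small term-document matrix, truncates its SVD to k dimensions, and computes vectors for new documents. It is not the C/awk/Fortran code used for the experiments. Several things here are assumptions rather than details given in the text: the function names are hypothetical; the weighting uses the commonly cited log*entropy definition (local weight log(tf + 1), global weight 1 + sum_j p_ij log(p_ij) / log(n_docs), with p_ij = tf_ij / gf_i); a dense NumPy SVD stands in for the sparse Lanczos code; and new documents are "folded in" by projecting their weighted term vectors onto the existing term space.

```python
import numpy as np


def log_entropy_weight(tf):
    """Apply (assumed) log*entropy weighting to a terms x docs count matrix.

    Returns the weighted matrix and the per-term global weights.
    Requires more than one document, since log(n_docs) is a divisor.
    """
    tf = np.asarray(tf, dtype=float)
    n_docs = tf.shape[1]
    # Pass 1: global frequency of each term over the whole collection.
    gf = tf.sum(axis=1, keepdims=True)
    # Pass 2: entropy of each term's distribution across documents.
    p = np.divide(tf, gf, out=np.zeros_like(tf), where=gf > 0)
    plogp = np.zeros_like(p)
    mask = p > 0
    plogp[mask] = p[mask] * np.log(p[mask])
    global_w = 1.0 + plogp.sum(axis=1) / np.log(n_docs)
    return np.log(tf + 1.0) * global_w[:, None], global_w


def truncated_svd(a, k):
    """Return the k largest singular triplets (dense stand-in for Lanczos)."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]


def fold_in(tf_new, global_w, u_k, s_k):
    """Compute k-dimensional vectors for new documents (assumed fold-in).

    Each new column is weighted with the existing global weights and
    projected as d^T U_k S_k^{-1}; s_k must contain no zeros.
    """
    weighted = np.log(np.asarray(tf_new, dtype=float) + 1.0) * global_w[:, None]
    return (weighted.T @ u_k) / s_k


# Toy example: 4 terms x 4 documents, reduced to k = 2 dimensions.
counts = np.array([[1, 1, 1, 1],
                   [2, 0, 0, 0],
                   [0, 3, 1, 0],
                   [0, 0, 2, 2]])
weighted, gw = log_entropy_weight(counts)
u_k, s_k, vt_k = truncated_svd(weighted, 2)
new_doc_vectors = fold_in(np.array([[0], [1], [1], [0]]), gw, u_k, s_k)
```

Under this definition a term spread evenly over all documents receives a global weight of zero, while a term confined to a single document receives a weight of one. Folding in leaves the existing term space unchanged, which is why adding documents is far cheaper than recomputing the SVD.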
Collection   Index    SVD   Adding new docs   I/O   TOTAL (mins)
DOE1            49   1219               591   194           2053
WSJ1           241   1474               404   174           2293
AP1            271   1644               455   214           2584
ZIFF1          241   1359               352   156           2108
FR1            241    939                 0   133           1313
WSJ2           427   1382               461   220           2490
AP2            338   1210               273   218           2039
ZIFF2          260   1452                 0   208           1920
FR2            187    486                 0   105            778

Table 2. Summary of LSI/SVD times (in minutes)
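
The arithmetic behind Table 2 can be checked with a short Python sketch. The figures are copied from the table; the names `TABLE2`, `check_totals`, and `svd_share` are hypothetical and used only for illustration. The sketch confirms that each reported total is the sum of its four components and computes the fraction of each total spent in the SVD step.

```python
# Times in minutes from Table 2:
# (index, SVD, adding new docs, I/O, reported total).
TABLE2 = {
    "DOE1": (49, 1219, 591, 194, 2053),
    "WSJ1": (241, 1474, 404, 174, 2293),
    "AP1": (271, 1644, 455, 214, 2584),
    "ZIFF1": (241, 1359, 352, 156, 2108),
    "FR1": (241, 939, 0, 133, 1313),
    "WSJ2": (427, 1382, 461, 220, 2490),
    "AP2": (338, 1210, 273, 218, 2039),
    "ZIFF2": (260, 1452, 0, 208, 1920),
    "FR2": (187, 486, 0, 105, 778),
}


def check_totals(table):
    """Map each collection to True if its components sum to the total."""
    return {name: sum(row[:4]) == row[4] for name, row in table.items()}


def svd_share(table):
    """Map each collection to the fraction of total time spent in the SVD."""
    return {name: row[1] / row[4] for name, row in table.items()}


for name, share in svd_share(TABLE2).items():
    print(f"{name}: SVD is {share:.0%} of total time")
```

For every collection the components sum to the reported total, and the SVD step accounts for more than half of it.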