NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

LSI meets TREC: A Status Report
S. Dumais

proper noun identification, complex tokenizers, a controlled vocabulary, a thesaurus, or any manual indexing. The entries in the term-document matrix were then transformed using a log(tf_td + 1) x (1 - entropy_t) weighting. The weight assigned to each term was 1 - entropy (or noise), where

    entropy_t = - \sum_d \frac{p_{td} \log p_{td}}{\log(ndocs)}

ndocs is the number of documents in the collection, t is the index for terms, d is the index for documents, p_td = tf_td / gf_t, tf_td is the frequency of term t in document d, and gf_t is the global frequency of occurrence of term t in the collection. (For simplicity, we refer to this as the log*entropy term weight.)

The transformed term-document matrix was used as input to the SVD. The results of the SVD analysis, a k-dimensional real vector for each term and each document and k singular values, were stored in a database. The terms and their log*entropy weights were also stored in a database. Each of the 9 subcollections was processed separately. Because of software constraints, the initial indexing and SVD analysis were done on a random subset of 20,000-57,000 documents. The remaining documents were added into the resulting data structure as described below. (We have recently completed an SVD analysis of the complete 226,000 document by 90,000 term DOE collection, but this was not used in the experiments reported below.) Table 1 summarizes the number of terms and documents in the samples used for scaling, as well as the total number of terms and documents in the databases.
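The weighting-plus-SVD pipeline described above can be sketched as follows. This is a minimal illustration with NumPy on a dense toy matrix, not the authors' implementation (which used sparse representations and much larger collections); the function name `log_entropy_weight` and the toy data are assumptions for the example.

```python
import numpy as np

def log_entropy_weight(tf):
    """Apply the log*entropy transform to a terms x documents count matrix.

    weight_td = log(tf_td + 1) * (1 - entropy_t), where
    entropy_t = -sum_d (p_td * log(p_td)) / log(ndocs) and p_td = tf_td / gf_t.
    """
    ndocs = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)  # global frequency gf_t of each term
    # p_td = tf_td / gf_t (terms with gf_t = 0 get p = 0)
    p = np.divide(tf, gf, out=np.zeros_like(tf), where=gf > 0)
    # treat 0 * log(0) as 0
    plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = -plogp.sum(axis=1, keepdims=True) / np.log(ndocs)
    return np.log(tf + 1.0) * (1.0 - entropy)

# Weight a toy term-document matrix and keep a k-dimensional SVD,
# giving a k-dim real vector per term (rows of U) and per document
# (rows of Vt.T), plus k singular values.
tf = np.random.default_rng(0).integers(0, 5, size=(50, 30)).astype(float)
W = log_entropy_weight(tf)
k = 10
U, s, Vt = np.linalg.svd(W, full_matrices=False)
term_vecs, doc_vecs, sing_vals = U[:, :k], Vt[:k].T, s[:k]
```

Note the behavior at the extremes: a term spread evenly over all documents has entropy 1 and so weight 0 everywhere, while a term concentrated in a single document has entropy 0 and keeps its full log(tf + 1) weight.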
collection   sampled docs   sampled terms   added docs   total docs   total terms   ndim   database (MB)
DOE1         50000          42221           176087       226087       42221         250    262
WSJ1         49555          70019           49556        99111        70019         250    169
AP1          42465          78167           42465        84930        78167         250    163
ZIFF1        37590          60565           37590        75180        60565         250    135
FR1          26207          54713           0            26207        54713         250    80
WSJ2         50000          76080           24520        74520        76080         235    141
AP2          50000          82997           29923        79923        82997         235    153
ZIFF2        56920          72197           0            56920        72197         235    121
FR2          20108          48728           0            20108        48728         235    64
totals       382845         585687          360141       742986       585687

Table 1. Summary of the 9 subcollections

NOTES
1. In the union of the 9 subcollections, there were 585687 word tokens and 200785 word types.
2. In general, database size will be: (ndocs + nterms) * ndim * 4 bytes.
3. The total combined database size was 1288 MB (750000 docs and 585000 terms). If a single combined database had been used, the total database size would have been smaller.
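The sizing rule in note 2 can be checked against the table rows. A small sketch, assuming 4-byte floating-point values and decimal megabytes (the helper name `db_size_mb` is ours, for illustration):

```python
def db_size_mb(ndocs, nterms, ndim, bytes_per_value=4):
    """Estimate database size: one ndim-dimensional vector of
    4-byte values per document and per term."""
    return (ndocs + nterms) * ndim * bytes_per_value / 1e6

# WSJ1 row: (99111 docs + 70019 terms) * 250 dims * 4 bytes ~ 169 MB
wsj1_mb = db_size_mb(99111, 70019, 250)
# AP2 row:  (79923 docs + 82997 terms) * 235 dims * 4 bytes ~ 153 MB
ap2_mb = db_size_mb(79923, 82997, 235)
```

This reproduces the database-size column for most rows (e.g. WSJ1 at 169 MB and AP2 at 153 MB).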