NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

LSI meets TREC: A Status Report
S. Dumais
proper noun identification, complex tokenizers, a controlled vocabulary, a thesaurus, or any
manual indexing. The entries in the term-document matrix were then transformed using a
log(tf_td + 1) x (1 - entropy_t) weighting. The global weight assigned to each term was 1 - entropy_t (or noise), where

    entropy_t = - SUM_d (p_td log p_td) / log(ndocs),

ndocs is the number of documents in the collection, t is the index for terms, d is the
index for documents, p_td = tf_td / gf_t is the proportion of term t's occurrences falling in
document d, tf_td is the frequency of term t in document d, and gf_t is the global frequency of occurrence of term t in the collection. (For
simplicity, we refer to this as the log*entropy term weight.) The transformed term-document
matrix was used as input to the SVD. The results of the SVD analysis, a k-dimensional real
vector for each term and each document and k singular values, were stored in a database. The
terms and their log*entropy weights were also stored in a database.
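As a concrete illustration (our reconstruction in NumPy, not the original indexing code; the function name and the handling of zero counts are our assumptions), the log*entropy transform above can be sketched as:

```python
import numpy as np

def log_entropy_weight(tf):
    """Apply the log*entropy transform to a term-document count matrix.

    tf: array of shape (nterms, ndocs) of raw term frequencies tf_td.
    Returns log(tf_td + 1) * (1 - entropy_t), where entropy_t is the
    entropy of term t's distribution over documents, normalized by
    log(ndocs) so that it lies in [0, 1].
    """
    tf = np.asarray(tf, dtype=float)
    ndocs = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)                # global frequency gf_t
    # p_td = tf_td / gf_t (terms with gf_t = 0 get p = 0)
    p = np.divide(tf, gf, out=np.zeros_like(tf), where=gf > 0)
    # entropy_t = -SUM_d p_td log p_td / log(ndocs), with 0 log 0 taken as 0
    plogp = np.where(p > 0, p * np.log(p), 0.0)
    entropy = -plogp.sum(axis=1, keepdims=True) / np.log(ndocs)
    return np.log(tf + 1.0) * (1.0 - entropy)
```

A term concentrated in a single document has entropy 0 and keeps its full log weight; a term spread uniformly over all documents has entropy 1 and is weighted to 0, which is the "noise" interpretation of the global factor.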
Each of the 9 subcollections was processed separately. Because of software constraints, the
initial indexing and SVD analysis were done on a random subset of 20,000-57,000 documents.
The remaining documents were added into the resulting data structure as described below. (We
have recently completed an SVD analysis of the complete 226,000 document by 90,000 term
DOE collection, but this was not used in the experiments reported below.)
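The scaling step and the subsequent addition of the remaining documents can be sketched as follows. This is our reconstruction of standard LSI practice (truncated SVD plus folding-in), not the authors' code, and all names are assumptions:

```python
import numpy as np

def lsi_scale(X, k):
    """Truncated SVD of a weighted term-document matrix X (nterms x ndocs).

    Returns the k-dimensional term vectors, document vectors, and the
    k singular values that would be stored in the LSI database.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], Vt[:k].T, s[:k]   # term vectors, doc vectors, singular values

def fold_in(doc, U_k, s_k):
    """Project a new (already weighted) document vector into the k-space.

    Standard LSI folding-in: d_hat = d^T U_k diag(1/s_k).  Folded-in
    documents get vectors in the existing space without recomputing the SVD.
    """
    return (doc @ U_k) / s_k
```

When k equals the rank of X, folding in one of the original columns reproduces its document vector exactly; for the sampled collections above, the unsampled documents were added this way.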
Table 1 summarizes the number of terms and documents in the samples used for scaling as well
as the total number of terms and documents in the databases.
                     SVD scaling
collection   sampled docs    terms   added docs   total docs   total terms   ndim   database (MB)
DOE1                50000    42221       176087       226087         42221    250             262
WSJ1                49555    70019        49556        99111         70019    250             169
AP1                 42465    78167        42465        84930         78167    250             163
ZIFF1               37590    60565        37590        75180         60565    250             135
FR1                 26207    54713            0        26207         54713    250              80
WSJ2                50000    76080        24520        74520         76080    235             141
AP2                 50000    82997        29923        79923         82997    235             153
ZIFF2               56920    72197            0        56920         72197    235             121
FR2                 20108    48728            0        20108         48728    235              64
totals             382845   585687       360141       742986        585687

Table 1. Summary of 9 subcollections
NOTES
1. In the union of the 9 subcollections, there were 585687 word tokens and 200785 word
types.
2. In general, database size (in bytes) will be: (ndocs + nterms) * ndim * 4.
3. The total combined database size was 1288 MB (750000 docs and 585000 terms). If a
single combined database had been used, the total database size would have been smaller