SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
Bellcore
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data
structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list yes (though some experiments without stoplist)
a. how many words in list? n=439; standard SMART list, I think
2. is a controlled vocabulary used? no
3. stemming none (except truncation at 20 characters/wd)
4. term weighting yes, log(tf)*(1-entropy)
5. phrase discovery no
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction no (not directly, but the LSI analysis does some of this for free)
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns)
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description)
LSI/SVD analysis of term-by-document matrix. Takes raw term-by-doc matrix;
transforms entries using log entropy term weightings; calculates best
"reduced-dimensional" approximation to transformed matrix using SVD. Number
of dimensions 250-350. Does all query-doc matching in this reduced-dimension
vector space.
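
The log entropy weighting step named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's own code: it assumes the common form in which the local weight is log(tf + 1) (the +1 is an assumption, so that zero counts stay at zero) and the global weight is one minus the term's entropy over documents, normalized by the log of the number of documents. The function name and the toy counts are hypothetical.

```python
import math

def log_entropy_weights(counts):
    """Weight a raw term-by-document count matrix.

    counts: list of rows, one row per term, one column per document.
    Returns a matrix of the same shape with log(tf + 1) * (1 - entropy).
    """
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        total = sum(row)
        # Entropy of the term's distribution over documents, scaled to [0, 1].
        entropy = 0.0
        if total > 0 and n_docs > 1:
            for tf in row:
                if tf > 0:
                    p = tf / total
                    entropy -= p * math.log(p)
            entropy /= math.log(n_docs)
        global_weight = 1.0 - entropy
        weighted.append([math.log(tf + 1.0) * global_weight for tf in row])
    return weighted

# Hypothetical toy counts: 2 terms, 4 documents.
counts = [
    [3, 0, 0, 0],  # term concentrated in a single document
    [1, 1, 1, 1],  # term spread evenly over all documents
]
weights = log_entropy_weights(counts)
```

Under this form, a term that occurs in only one document keeps its full local weight, while a term spread evenly over every document has normalized entropy 1 and is weighted down to zero.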
B. Statistics on data structures built from TREC text (please fill out each applicable section)
5. other data structures built from TREC text (what?)
LSI/SVD uses reduced-dimensional vectors (see below for description of how they are
derived). The number of dims was between 235 and 250. There is one such vector
for each term and for each document. Queries are also represented as vectors and
compared to every document.
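
A minimal sketch of this kind of reduced-dimension matching, using NumPy. The form does not say how queries are mapped into the space or which similarity measure is used; the sketch assumes the standard LSI "fold-in" projection (query term vector times the term vectors, divided by the singular values) and cosine similarity. All function names and the toy matrix are hypothetical, and k=2 stands in for the few hundred dimensions reported.

```python
import numpy as np

def reduce_dimensions(weighted, k):
    """Truncated SVD of a terms x docs matrix.

    Returns (term_vectors, singular_values, doc_vectors), keeping k dimensions.
    """
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :].T

def fold_in_query(query_terms, term_vectors, singular_values):
    """Project a query's term vector into the reduced space (assumed fold-in)."""
    return (query_terms @ term_vectors) / singular_values

def rank_documents(query_vec, doc_vectors):
    """Cosine similarity of the query against every document vector."""
    sims = doc_vectors @ query_vec
    norms = np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vec)
    sims = sims / np.where(norms == 0, 1.0, norms)
    return np.argsort(-sims), sims

# Hypothetical toy matrix (already weighted): rows are terms, columns are docs.
terms = ["car", "auto", "engine", "flower", "petal"]
weighted = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # auto
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 0],  # petal
], dtype=float)

term_vectors, singular_values, doc_vectors = reduce_dimensions(weighted, k=2)

# Query containing only the term "car".
query_terms = np.array([1, 0, 0, 0, 0], dtype=float)
query_vec = fold_in_query(query_terms, term_vectors, singular_values)
order, sims = rank_documents(query_vec, doc_vectors)
```

In this toy case the second document never contains "car", yet it lands next to the first document in the reduced space because both share "engine"; that is the kind of indirect match the reduced representation is meant to capture. The same projection could plausibly be applied to documents that were not part of the SVD, though that is an assumption here.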
a. total amount of storage (megabytes)
All reduced dimensional vectors are stored in a binary database. Database
consists of a vector for every doc and every term occurring in more than
one doc. The vectors currently consist of single precision real values. For
TREC, we built one database for each collection.
Approx. 50000 docs are sampled. Terms that occur in more than one of
these documents are used in the SVD analysis. The remaining docs are
added to the database.
DOE1 - docs: 226087, terms: 42221, ndim: 250 -> 262 meg db
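
As a rough check on the storage figure, the size of the raw vectors can be estimated from the description above: one vector per document and per term, each with ndim single precision (4-byte) values. The sketch below reads the DOE figures as 226087 documents, 42221 terms and 250 dimensions; the function name is hypothetical, and any file overhead in the actual database is not modelled.

```python
def vector_store_bytes(n_docs, n_terms, n_dims, bytes_per_value=4):
    """Bytes needed for one n_dims vector per document and per term."""
    return (n_docs + n_terms) * n_dims * bytes_per_value

size = vector_store_bytes(226087, 42221, 250)
size_millions = size / 10**6   # decimal megabytes
size_mib = size / 2**20        # binary megabytes
```

The estimate is 268,308,000 bytes, which is about 268 decimal megabytes or about 256 binary megabytes, so the reported 262 meg is in the expected range for the vectors alone.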