NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

System Summary and Timing
Bellcore

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

A. Which of the following were used to build your data structures?
   1. stopword list: yes (though some experiments without stoplist)
      a. how many words in list? n=439; standard SMART list, I think
   2. is a controlled vocabulary used? no
   3. stemming: none (except truncation at 20 characters/word)
   4. term weighting: yes, log(tf)*(1-entropy)
   5. phrase discovery: no
   6. syntactic parsing: no
   7. word sense disambiguation: no
   8. heuristic associations: no
   9. spelling checking (with manual correction): no
   10. spelling correction: no (not directly, but the LSI analysis does some of this for free)
   11. proper noun identification algorithm: no
   12. tokenizer (recognizes dates, phone numbers, common patterns)
   13. are the manually-indexed terms used? no
   14. other techniques used to build data structures (brief description): LSI/SVD analysis of term-by-document matrix. Takes raw term-by-doc matrix; transforms entries using log entropy term weightings; calculates best "reduced-dimensional" approximation to transformed matrix using SVD. Number of dimensions 250-350. Does all query-doc matching in this reduced-dimension vector space.

B. Statistics on data structures built from TREC text (please fill out each applicable section)
   5. other data structures built from TREC text (what?): LSI/SVD uses reduced-dimensional vectors (see below for description of how they are derived). The number of dims was between 235 and 250. There is one such vector for each term and for each document. Queries are also represented as vectors and compared to every document.
      a. total amount of storage (megabytes): All reduced dimensional vectors are stored in a binary database. Database consists of a vector for every doc and every term occurring in more than one doc. The vectors currently consist of single precision real values. For TREC, we built one database for each collection. Approx. 50000 docs are sampled. Terms that occur in more than one of these documents are used in the SVD analysis. The remaining docs are added to the database.
         DOE1 - docs: 226087, terms: 42221, ndim: 250 -> 262 meg db 467
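The log-entropy weighting, SVD reduction, query matching, and storage arithmetic described in this section can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the Bellcore implementation. All function names are hypothetical, and several details are assumptions: the local weight is taken as log(tf + 1), entropy is normalised by log(number of docs), query and document vectors are scaled by the singular values before taking cosines, "meg" is read as 1,024,000 bytes, and the DOE document count is taken as 226087 where the scanned figure is unclear.

```python
import numpy as np

def log_entropy_weight(counts):
    """Weight a raw term-by-doc count matrix (rows = terms, cols = docs).

    Assumed formulation: log(tf + 1) * (1 - normalised entropy of the term).
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    totals = counts.sum(axis=1, keepdims=True)
    # p[i, j] = share of term i's occurrences that fall in doc j
    p = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    entropy = -plogp.sum(axis=1) / np.log(n_docs)  # lies in [0, 1]
    global_w = 1.0 - entropy
    return np.log(counts + 1.0) * global_w[:, None], global_w

def lsi_fit(weighted, k):
    """Rank-k truncated SVD of the weighted matrix."""
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    # term vectors, singular values, document vectors
    return u[:, :k], s[:k], vt[:k, :].T

def fold_in_query(query_counts, global_w, u_k, s_k):
    """Project a query's term counts into the reduced space."""
    q = np.log(np.asarray(query_counts, dtype=float) + 1.0) * global_w
    return (q @ u_k) / s_k

def rank_docs(query_vec, doc_vecs, s_k):
    """Cosine of the query against every document, best match first."""
    q = query_vec * s_k
    d = doc_vecs * s_k
    sims = (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims), sims

def storage_meg(n_docs, n_terms, n_dims, bytes_per_value=4):
    """Size of one single-precision vector per doc and per term,
    in units of 1,024,000 bytes (assumed meaning of "meg")."""
    return (n_docs + n_terms) * n_dims * bytes_per_value / (1000 * 1024)

# Toy collection: terms 0-1 occur only in docs 0-1, terms 2-3 only in docs 2-3.
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
])
weighted, gw = log_entropy_weight(counts)
u_k, s_k, doc_vecs = lsi_fit(weighted, k=2)
q_vec = fold_in_query([1, 0, 0, 0], gw, u_k, s_k)  # query on term 0 only
order, sims = rank_docs(q_vec, doc_vecs, s_k)

doe_meg = storage_meg(n_docs=226087, n_terms=42221, n_dims=250)
```

In the toy collection the query on term 0 ranks docs 0 and 1 ahead of docs 2 and 3. Under the stated assumptions, `doe_meg` comes out at roughly 262, which agrees with the database size reported for the DOE collection.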