NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

System Summary and Timing
University of Central Florida

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

  A. Which of the following were used to build your data structures?

    1. stopword list: yes
       a. how many words in list? 166 stop words, 122 abbreviations, 47 hyphenated words, 24 entries for abbreviations and alternate notions for months, 35 entries for legitimate words not to be prefixed, and 6 entries for legitimate prefixes.
    2. is a controlled vocabulary used? no
    3. stemming: yes
       a. standard stemming algorithms; which ones? J.B. Lovins' Stemming Algorithm (modified).
       b. morphological analysis: none
    4. term weighting: yes
    5. phrase discovery: [illegible]
    6. syntactic parsing: no
    7. word sense disambiguation: Yes. The semantic lexicon we used is based on word senses found in Roget's Thesaurus.
    8. heuristic associations: no
    9. spelling checking (with manual correction): no
    10. spelling correction: no
    11. proper noun identification algorithm: no
    12. tokenizer (recognizes dates, phone numbers, common patterns): yes
       a. which patterns are tokenized? The QA System recognizes dates. But we felt it was not useful for the NIST experiment, so we removed this feature to improve text processing speed.
    13. are the manually-indexed terms used? no
    14. other techniques used to build data structures (brief description): The QA System uses B-tree storage structures for inverted index file access and semantic lexicon access. But for the NIST experiments, we used the QA System text scanning ability and coupled it with hash table access (replacing the B-tree access) and the use of 32-bit codes for text strings.

  B. Statistics on data structures built from TREC text (please fill out each applicable section)

    1. inverted index: yes
       a. total amount of storage (megabytes): For Vol. 1 the index storage was 385 megabytes.
       b. total computer time to build (approximate number of hours): 73 hours using nine IBM 50 MHz 486 PCs running in parallel.
       c. is the process completely automatic? yes
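
Item A.14 above mentions hash table access and 32-bit codes for text strings in place of B-tree access. The following is a minimal Python sketch of how such a scheme might look. It is not the QA System's implementation, which the form does not describe in detail. The sequential assignment of codes, the names TermCoder, build_index, and search, and the six-word stopword list are all assumptions made for illustration, and stemming is omitted.

```python
from collections import defaultdict

MAX_CODE = 2**32 - 1  # largest value that fits in an unsigned 32-bit code

# Hypothetical miniature stopword list (assumption); the form reports 166 stop words.
STOPWORDS = {"the", "of", "and", "a", "in", "to"}


class TermCoder:
    """Hash table mapping each distinct term string to an integer code.

    Codes are assigned sequentially (an assumption) and checked to fit in 32 bits.
    """

    def __init__(self):
        self._codes = {}   # term -> code (a Python dict is a hash table)
        self._terms = []   # code -> term, for decoding

    def encode(self, term):
        """Return the code for term, assigning a new one on first sight."""
        code = self._codes.get(term)
        if code is None:
            code = len(self._terms)
            if code > MAX_CODE:
                raise OverflowError("ran out of 32-bit codes")
            self._codes[term] = code
            self._terms.append(term)
        return code

    def lookup(self, term):
        """Return the code for a known term, or None if it was never seen."""
        return self._codes.get(term)

    def decode(self, code):
        return self._terms[code]


def build_index(docs, coder):
    """Scan each document once and map term code -> ascending list of doc ids."""
    index = defaultdict(list)
    for doc_id, text in sorted(docs.items()):
        seen = set()  # codes already recorded for this document
        for token in text.lower().split():
            if token in STOPWORDS:
                continue
            code = coder.encode(token)
            if code not in seen:
                seen.add(code)
                index[code].append(doc_id)
    return index


def search(term, index, coder):
    """Return the doc ids containing term, or an empty list if unknown."""
    code = coder.lookup(term.lower())
    if code is None:
        return []
    return index.get(code, [])


if __name__ == "__main__":
    docs = {
        1: "the retrieval of text",
        2: "text scanning and hash table access",
    }
    coder = TermCoder()
    index = build_index(docs, coder)
    print(search("text", index, coder))
    print(search("hash", index, coder))
```

Storing a fixed-width integer code in the postings instead of the term string keeps the index entries small and makes comparisons cheap, which is one plausible motivation for the approach the form describes.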