NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

System Summary and Timing
Universitaet Dortmund
Phrase automatic ad hoc (Fuhr learning)

General Comments: The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting the total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

A. Which of the following were used to build your data structures?
1. stopword list
   a. how many words in list? 570
2. is a controlled vocabulary used? Not for single terms. A phrase list was automatically constructed from phrases occurring 25 times or more in the first document set (D1); only those phrases were used.
3. stemming yes
   a. standard stemming algorithms - which ones? SMART
   b. morphological analysis
4. term weighting: in documents, a linear combination of several factors; in queries, tf * idf with cosine normalization (ntc)
5. phrase discovery
   a. what kind of phrase? Adjacent non-stopwords, components stemmed, that occurred at least 25 times in the D1 document set.
   b. using statistical methods
   c. using syntactic methods
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description): Coefficients for the linear combinations used in weighting were determined automatically using Q1/D1 (judgments of Q1 on D1). This took 2.4 hours (not including 5.6 hours to index Q1, D1).

B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
   a. total amount of storage (megabytes): 840
   b. total computer time to build (approximate number of hours): 9.7 hours to create document vectors from text
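The phrase-discovery rule reported in section A.5 (adjacent non-stopwords, components stemmed, kept only when the pair occurs at least 25 times in D1) can be sketched as follows. This is a minimal illustration, not the Dortmund implementation: the stopword list and the stemmer here are crude stand-ins for the 570-word list and the SMART stemmer, and the tokenization is simple whitespace splitting.

```python
from collections import Counter

# Stand-in for the 570-word stopword list (illustrative only).
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "for"}

def stem(word):
    # Crude stand-in for the SMART stemmer: lowercase and strip a plural "s".
    return word.lower().rstrip("s")

def candidate_phrases(docs, min_count=25):
    """Return stemmed adjacent non-stopword pairs occurring >= min_count times."""
    counts = Counter()
    for doc in docs:
        tokens = [stem(w) for w in doc.split() if w.lower() not in STOPWORDS]
        # Pairs are adjacent after stopword removal, so they may span a
        # removed stopword in the raw text.
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}
```

With the default threshold of 25, only pairs that clear the frequency cutoff survive, which is what keeps the automatically built phrase list small enough to act as a controlled vocabulary (section A.2).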
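The query weighting named in section A.4, tf * idf with cosine normalization, is the SMART "ntc" scheme, and can be sketched as below. The function name and the shape of the inputs (a document-frequency map and a collection size) are illustrative assumptions, not the system's actual interface.

```python
import math
from collections import Counter

def ntc_weights(query_terms, doc_freq, num_docs):
    """SMART 'ntc' weighting sketch: natural tf, times idf, cosine-normalized.

    doc_freq maps term -> number of documents containing the term;
    num_docs is the collection size. Terms unseen in the collection
    are dropped, since their idf is undefined.
    """
    tf = Counter(query_terms)
    raw = {t: tf[t] * math.log(num_docs / doc_freq[t])
           for t in tf if doc_freq.get(t)}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw
```

Cosine normalization makes the resulting query vector unit-length, so retrieval scores from an inner product are not biased toward longer queries.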