NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

System Summary and Timing
Advanced Decision Systems

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

A. Which of the following were used to build your data structures?
   1. stopword list: yes
      a. how many words in list? 421
   2. is a controlled vocabulary used? no
   3. stemming: no
   4. term weighting: no
   5. phrase discovery: no
   6. syntactic parsing: no
   7. word sense disambiguation: no
   8. heuristic associations: no
   9. spelling checking (with manual correction): no
   10. spelling correction: no
   11. proper noun identification algorithm: no
   12. tokenizer (recognizes dates, phone numbers, common patterns): no
   13. are the manually-indexed terms used? no
   14. other techniques used to build data structures (brief description): yes--binary classification trees built automatically from the original documents and the topic statements

B. Statistics on data structures built from TREC text (please fill out each applicable section)
   5. other data structures built from TREC text (what?): yes--classification vectors; actually integer arrays
      a. total amount of storage (megabytes): Only a few Kbytes for the training sets used for the official scores--vectors generated on the fly for routing the test data.
      b. total computer time to build (approximate number of hours): Feature extraction takes less than 10 seconds per document.
      c. is the process completely automatic? yes
      d. brief description of methods used: Given a specification of a set of features, for example, a list of word tokens; the document is searched for the number of occurrences of each feature.

C. Data built from sources other than the input text: no

II. Query construction (please fill out a section for each query construction method used)

D. Automatically built queries (routing)
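
As an illustration of the feature extraction described under I.B.5.d (a list of word-token features, each counted in the document to give an integer array), a minimal Python sketch follows. It is not the Advanced Decision Systems implementation: the function name extract_feature_vector, the lowercase alphanumeric tokenization, and the sample feature list are all assumptions made for the example.

```python
import re
from collections import Counter


def extract_feature_vector(document: str, features: list[str]) -> list[int]:
    """Return an integer array holding, for each feature token,
    the number of times it occurs in the document.

    Assumption: a "feature" is a single word token, matched
    case-insensitively against runs of letters and digits.
    """
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    counts = Counter(tokens)
    # Counter returns 0 for tokens that never occur.
    return [counts[feature.lower()] for feature in features]


# Hypothetical feature specification and document text.
features = ["retrieval", "index", "query"]
doc = "Query processing uses an index; the index supports retrieval of each query."
print(extract_feature_vector(doc, features))  # prints [1, 2, 2]
```

Because the vector is computed directly from the document text, it could be generated on the fly for each incoming document, which is consistent with the small storage figure reported above.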