NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

System Summary and Timing
Universitaet Dortmund
Single term automatic ad hoc run (Fuhr learning)

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

   A. Which of the following were used to build your data structures?
      1. stopword list
         a. how many words in list? 57[illegible]
      2. is a controlled vocabulary used? no
      3. stemming yes
         a. standard stemming algorithms
            which ones? SMART
         b. morphological analysis
      4. term weighting
         In docs, linear combination of several factors
         In queries, tf * idf; cosine normalization (ntc)
      5. phrase discovery no
      6. syntactic parsing no
      7. word sense disambiguation no
      8. heuristic associations no
      9. spelling checking (with manual correction) no
      10. spelling correction no
      11. proper noun identification algorithm no
      12. tokenizer (recognizes dates, phone numbers, common patterns) no
      13. are the manually-indexed terms used? no
      14. other techniques used to build data structures (brief description)
          Coefficients for linear combinations used in weighting were determined automatically using Q1, D1, judgments of Q1 on D1. This took 1.7 hours (not including 2.6 hours to index Q1, D1).

   B. Statistics on data structures built from TREC text (please fill out each applicable section)
      1. inverted index
         a. total amount of storage (megabytes) 69[illegible]
         b. total computer time to build (approximate number of hours)
            4.7 hours to create doc vectors from text
            1.7 hours to reweight doc vectors and produce inverted file
         c. is the process completely automatic? yes
         d. are term positions within documents stored? no
         e. single terms only? yes
      5. other data structures built from TREC text (what?)
         Map from docid to text location (also gives title for each doc)
         a. total amount of storage (megabytes) 68 Mbytes.
         b. total computer time to build (approximate number of hours) 455
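
As an illustration of the query weighting named under item I.A.4 (tf * idf with cosine normalization, SMART's "ntc" triple), the following is a minimal sketch rather than the system's actual code. The function name `ntc_weights`, the use of the natural logarithm, the decision to skip terms with zero document frequency, and the toy collection statistics are all assumptions made for this example.

```python
import math
from collections import Counter


def ntc_weights(query_terms, doc_freq, num_docs):
    """Hypothetical helper: weight query terms with raw tf * log(N/df),
    then divide by the Euclidean length of the vector (cosine normalization).
    """
    tf = Counter(query_terms)  # "n": natural (raw) term frequency
    raw = {}
    for term, count in tf.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            # Assumption: terms unseen in the collection are dropped,
            # since idf is undefined for them.
            continue
        raw[term] = count * math.log(num_docs / df)  # "t": idf factor
    norm = math.sqrt(sum(w * w for w in raw.values()))  # "c": cosine length
    if norm == 0.0:
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}


# Toy statistics (invented for illustration, not TREC figures).
toy_doc_freq = {"retrieval": 10, "text": 100}
weights = ntc_weights(["text", "retrieval", "retrieval"], toy_doc_freq, 1000)
```

After normalization the squared weights sum to 1, so a query's length does not by itself inflate its inner product with a document vector.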