NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

System Summary and Timing
CITRI, Royal Melbourne Institute of Technology

We are providing 2 reports on the system. This is because we have tried experiments on two very different systems, and tested quite different hypotheses.

Project: retrieval from a compressed database using the cosine measure and approximate representations of document lengths

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

A. Which of the following were used to build your data structures?
1. stopword list: no
2. is a controlled vocabulary used? no
3. stemming: yes, for construction of index
a. standard stemming algorithms, which ones? Lovins' 1968 algorithm
4. term weighting: no
5. phrase discovery: no
6. syntactic parsing: no
7. word sense disambiguation: no
8. heuristic associations: no
9. spelling checking (with manual correction): no
10. spelling correction: no
11. proper noun identification algorithm: no
12. tokenizer (recognizes dates, phone numbers, common patterns): no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description): no, but see discussion of compression below

B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes): 50.7 Mb (37.9 Mb for pointers, 12.8 Mb for frequencies)
b. total computer time to build (approximate number of hours): 4.20 CPU hours, once a vocabulary has been built
c. is the process completely automatic? yes
d. are term positions within documents stored? no, but term frequency within document is stored
e. single terms only? yes
5. other data structures built from TREC text (what?): model for subsequent compression of text
a. total amount of storage (megabytes): 2.4 Mb
b. total computer time to build (approximate number of hours): 2.54 hours
c. is the process completely automatic? yes
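
The answers above describe an inverted index that stores document pointers and within-document term frequencies, ranked with the cosine measure over approximate document lengths. The following is a minimal sketch of how such a ranking could be computed. It is not the CITRI implementation: the weighting formulas (1 + ln f for document terms, ln(1 + N/f_t) for query terms), the geometric quantization of lengths into a few bits, and all function names are assumptions chosen for illustration. Stemming and compression are left out.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Map each term to a postings list of (doc_id, within-document frequency).

    Term positions are not stored, matching the questionnaire answers.
    Assumes every document contains at least one term.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, f_dt in counts.items():
            index[term].append((doc_id, f_dt))
    return dict(index)


def doc_lengths(index, n_docs):
    """Exact cosine lengths, using the assumed weight w_dt = 1 + ln(f_dt)."""
    squares = [0.0] * n_docs
    for postings in index.values():
        for doc_id, f_dt in postings:
            squares[doc_id] += (1.0 + math.log(f_dt)) ** 2
    return [math.sqrt(s) for s in squares]


def quantize(lengths, bits=4):
    """Replace each length by a small integer code on a geometric scale.

    Hypothetical scheme: code c stands for the length lo * base**c.
    """
    lo, hi = min(lengths), max(lengths)
    levels = (1 << bits) - 1
    if hi <= lo:
        return [0] * len(lengths), lo, 1.0
    base = (hi / lo) ** (1.0 / levels)
    codes = [round(math.log(w / lo) / math.log(base)) for w in lengths]
    return codes, lo, base


def approx_length(code, lo, base):
    """Decode a quantized length code back to an approximate length."""
    return lo * base ** code


def rank(query, index, n_docs, codes, lo, base):
    """Return doc ids ordered by cosine score computed with approximate lengths."""
    acc = defaultdict(float)
    for term in set(query.lower().split()):
        postings = index.get(term, [])
        if not postings:
            continue
        w_qt = math.log(1.0 + n_docs / len(postings))  # assumed query weight
        for doc_id, f_dt in postings:
            acc[doc_id] += w_qt * (1.0 + math.log(f_dt))
    scored = [(s / approx_length(codes[d], lo, base), d) for d, s in acc.items()]
    return [d for _, d in sorted(scored, key=lambda p: (-p[0], p[1]))]


if __name__ == "__main__":
    docs = [
        "compressed text retrieval",
        "inverted index compressed index",
        "cosine measure",
    ]
    index = build_index(docs)
    codes, lo, base = quantize(doc_lengths(index, len(docs)))
    print(rank("compressed index", index, len(docs), codes, lo, base))
```

In this sketch each length is reduced to a 4-bit code, so the normalization is only approximate. Because codes are rounded to the nearest step, a decoded length differs from the exact one by a factor of at most the square root of `base`.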