NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

System Summary and Timing

CITRI, Royal Melbourne Institute of Technology

We are providing 2 reports on the system. This is because we have tried experiments on two very different systems, and tested quite different hypotheses.

Project: retrieval from a compressed database using the cosine measure and approximate representations of document lengths

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

A. Which of the following were used to build your data structures?
   1. stopword list
      a. how many words in list? 42(1
   2. is a controlled vocabulary used? no
   3. stemming
      a. standard stemming algorithms
         which ones? Lovins' 1968 algorithm
      b. morphological analysis: no
   4. term weighting: tf.idf
   5. phrase discovery
      a. what kind of phrase? Adjacent pairs
      b. using statistical methods: yes
      c. using syntactic methods: no
   6. syntactic parsing: no
   7. word sense disambiguation: no
   8. heuristic associations: no
   9. spelling checking (with manual correction): queries only
   10. spelling correction: queries only
   11. proper noun identification algorithm: no
   12. tokenizer (recognizes dates, phone numbers, common patterns): no
   13. are the manually-indexed terms used? they were not discarded
   14. other techniques used to build data structures (brief description)

B. Statistics on data structures built from TREC text (please fill out each applicable section)
   2. n-grams, suffix arrays, signature files
      a. total amount of storage (megabytes)
         Data (compressed): 220m
         Index: 313m
      b. total computer time to build (approximate number of hours): 23 hrs
      c. brief description of methods used: multi-organisational signature file
      d. is the process completely automatic? yes

C. Data built from sources other than the input text: no
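
The project description above names tf.idf term weighting, the cosine measure, and approximate representations of document lengths, but the form does not give the formulas. The following is a minimal sketch of how these pieces might fit together, not the CITRI implementation. The idf variant, the geometric rounding of lengths, and all names (`build_index`, `approximate`, `rank`, the `base` parameter) are assumptions made for illustration.

```python
import math
from collections import Counter


def build_index(docs):
    """Build postings, idf values and exact document lengths.

    `docs` is a list of token lists. The idf formula log(1 + N/df)
    is an assumption; the source only says "tf.idf".
    """
    n = len(docs)
    postings = {}  # term -> list of (doc_id, tf)
    for doc_id, tokens in enumerate(docs):
        for term, tf in Counter(tokens).items():
            postings.setdefault(term, []).append((doc_id, tf))
    idf = {t: math.log(1 + n / len(p)) for t, p in postings.items()}
    squares = [0.0] * n
    for term, plist in postings.items():
        for doc_id, tf in plist:
            squares[doc_id] += (tf * idf[term]) ** 2
    lengths = [math.sqrt(s) for s in squares]
    return postings, idf, lengths


def approximate(length, base=1.1):
    """Round a length down to a power of `base`.

    Only the integer exponent would need to be stored, which is one
    hypothetical way to represent a length in a few bits.
    """
    if length <= 0:
        return 0.0
    return base ** math.floor(math.log(length, base))


def rank(query, postings, idf, lengths, approx=True):
    """Return doc ids ordered by cosine score, best first.

    The query norm is omitted because it is the same for every
    document and so does not change the ordering.
    """
    acc = {}
    for term, qtf in Counter(query).items():
        if term not in postings:
            continue
        wq = qtf * idf[term]
        for doc_id, tf in postings[term]:
            acc[doc_id] = acc.get(doc_id, 0.0) + wq * tf * idf[term]
    scored = []
    for doc_id, dot in acc.items():
        length = approximate(lengths[doc_id]) if approx else lengths[doc_id]
        scored.append((dot / length, doc_id))
    # Sort by descending score, breaking ties by document id.
    return [d for _, d in sorted(scored, key=lambda x: (-x[0], x[1]))]


if __name__ == "__main__":
    docs = [
        ["compressed", "database", "retrieval"],
        ["signature", "file", "index"],
        ["compressed", "index", "compressed"],
    ]
    postings, idf, lengths = build_index(docs)
    print(rank(["compressed", "retrieval"], postings, idf, lengths))
```

With a base of 1.1 each stored length is at most about 10% below the exact value, so a ranking changes only when two documents have scores that close together. A smaller base would need more bits per length and would perturb the ranking less.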