SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
CITRI, Royal Melbourne Institute of Technology
We are providing 2 reports on the system. This is because we have tried experiments on two very different systems,
and tested quite different hypotheses.
Project: retrieval from a compressed database using the cosine measure and approximate representations of document
lengths
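The idea named in the project line can be sketched as follows. This is a minimal illustration, not the CITRI implementation: the weighting formulas, the geometric bucketing of lengths, and the names `build_index`, `quantise`, and `rank` are all assumptions introduced here. It shows cosine ranking in which each exact document length is replaced by a small integer code and decoded to an approximate length at query time.

```python
import math
from collections import Counter

def build_index(docs):
    """Build an inverted index, exact document lengths and idf values.

    `docs` is a list of non-empty token lists. The weights (1 + ln tf) * idf
    are an assumed tf.idf variant, chosen only for illustration.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(1 + n / f) for t, f in df.items()}
    index, lengths = {}, []
    for doc_id, doc in enumerate(docs):
        sq = 0.0
        for t, f in Counter(doc).items():
            w = (1 + math.log(f)) * idf[t]
            index.setdefault(t, []).append((doc_id, w))
            sq += w * w
        lengths.append(math.sqrt(sq))
    return index, lengths, idf

def quantise(lengths, bits=4):
    """Replace each length by a `bits`-bit code using geometric buckets (assumed scheme)."""
    levels = 2 ** bits
    lo, hi = min(lengths), max(lengths)
    base = (hi / lo) ** (1.0 / levels) if hi > lo else 1.0

    def encode(x):
        if base == 1.0:
            return 0
        return min(levels - 1, int(math.log(x / lo) / math.log(base)))

    def decode(code):
        # Midpoint (in log space) of the bucket stands in for the true length.
        return lo * base ** (code + 0.5)

    return [encode(x) for x in lengths], decode

def rank(query, index, idf, codes, decode):
    """Rank documents by cosine, normalising by the approximate lengths.

    The query norm is constant across documents and is omitted.
    """
    acc = Counter()
    for t in set(query):
        if t not in index:
            continue
        for doc_id, w in index[t]:
            acc[doc_id] += idf[t] * w
    return sorted(((s / decode(codes[d]), d) for d, s in acc.items()), reverse=True)
```

With 4-bit codes, each decoded length differs from the true length by at most a factor of the square root of the bucket ratio, so the stored lengths take far less space at the cost of a small perturbation in the ranking scores.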
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that
your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list
a. how many words in list? 42(1
2. is a controlled vocabulary used? no
3. stemming
a. standard stemming algorithms
which ones? Lovins' 1968 algorithm
b. morphological analysis no
4. term weighting tf.idf
5. phrase discovery
a. what kind of phrase? Adjacent pairs
b. using statistical methods yes
c. using syntactic methods no
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) queries only
10. spelling correction queries only
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? they were not discarded
14. other techniques used to build data structures (brief description)
B. Statistics on data structures built from TREC text (please fill out each applicable section)
2. n-grams, suffix arrays, signature files
a. total amount of storage (megabytes)
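As an illustration of item 5 above (adjacent-pair phrases found by statistical methods), the following is a minimal sketch. The form does not state which statistic was used, so the simple frequency threshold, the function name `adjacent_pair_phrases`, and the `min_count` parameter are assumptions made here for illustration only.

```python
from collections import Counter

def adjacent_pair_phrases(docs, min_count=2):
    """Return adjacent word pairs that occur at least `min_count` times.

    `docs` is a list of token lists. The frequency cutoff is an assumed
    stand-in for whatever statistical criterion the system applied.
    """
    pairs = Counter()
    for doc in docs:
        # zip(doc, doc[1:]) yields each pair of neighbouring tokens in order.
        pairs.update(zip(doc, doc[1:]))
    return {pair: count for pair, count in pairs.items() if count >= min_count}
```

Pairs that pass the cutoff could then be indexed as additional terms alongside the single words.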
Data (compressed) 220m
Index 313m
b. total computer time to build (approximate number of hours) 23 hrs
c. brief description of methods used multi-organisational signature file
d. is the process completely automatic? yes
C. Data built from sources other than the input text no
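For readers unfamiliar with signature files (item B.2.c above), the following is a minimal sketch of the basic superimposed-coding idea only. It does not reproduce the multi-organisational scheme named in the form; the signature width, the number of bits per term, the use of SHA-256 as the hash, and the function names are all assumptions chosen for illustration.

```python
import hashlib

def term_bits(term, width=64, k=3):
    """Return a bitmask with up to `k` of `width` bits set for `term`."""
    mask = 0
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:4], "big") % width)
    return mask

def signature(terms, width=64, k=3):
    """Superimpose (bitwise OR) the bit patterns of all terms in a record."""
    sig = 0
    for term in terms:
        sig |= term_bits(term, width, k)
    return sig

def candidates(query_terms, signatures, width=64, k=3):
    """Return indices of records whose signature covers every query bit.

    Matches may include false positives, so candidate records still have
    to be checked against the actual text; there are no false negatives.
    """
    q = signature(query_terms, width, k)
    return [i for i, s in enumerate(signatures) if s & q == q]
```

A query is answered by scanning the signatures rather than the text, then verifying each candidate record, which is why the index can be searched without decompressing every document.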