SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
CITRI, Royal Melbourne Institute of Technology
We are providing 2 reports on the system. This is because we have tried experiments on two very different systems,
and tested quite different hypotheses.
Project: retrieval from a compressed database using the cosine measure and approximate representations of document
lengths
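As an illustration of the project statement above, the following Python sketch ranks documents by the cosine measure while storing each document length only as a small bucket code. It is a minimal sketch under stated assumptions, not the CITRI implementation: the term weights (1 + ln f for documents, ln(1 + N/df) for queries), the geometric bucketing, the 4-bit code size, and all function names are hypothetical choices made for the example.

```python
import math
from collections import Counter


def doc_length(tokens):
    """Euclidean length of a document vector, assuming weights 1 + ln(f)."""
    freqs = Counter(tokens)
    return math.sqrt(sum((1.0 + math.log(f)) ** 2 for f in freqs.values()))


def make_quantizer(lengths, bits=4):
    """Build encode/decode functions over 2**bits geometric buckets.

    The bucketing scheme is an assumption for illustration only.
    All lengths must be positive (no empty documents).
    """
    lo, hi = min(lengths), max(lengths)
    buckets = 2 ** bits
    base = (hi / lo) ** (1.0 / buckets) if hi > lo else 1.0

    def encode(length):
        if base == 1.0:
            return 0
        code = int(math.log(length / lo) / math.log(base))
        return min(code, buckets - 1)

    def decode(code):
        # Representative value: geometric midpoint of the bucket.
        return lo * base ** (code + 0.5)

    return encode, decode


def rank(query_tokens, docs, bits=4):
    """Return (score, doc_index) pairs, best first, using approximate lengths."""
    lengths = [doc_length(d) for d in docs]
    encode, decode = make_quantizer(lengths, bits)
    codes = [encode(w) for w in lengths]  # only these small codes need storing
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for i, d in enumerate(docs):
        freqs = Counter(d)
        dot = 0.0
        for t in set(query_tokens):
            if t in freqs:
                dot += (1.0 + math.log(freqs[t])) * math.log(1.0 + n / df[t])
        # The query length is constant across documents, so it is omitted.
        scores.append((dot / decode(codes[i]), i))
    return sorted(scores, reverse=True)
```

With more bits per code the approximate lengths approach the exact ones; with fewer bits the length table shrinks at the cost of small perturbations in the ranking.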
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that
your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list no
2. is a controlled vocabulary used? no
3. stemming yes, for construction of index
a. standard stemming algorithms
which ones? Lovins' 1968 algorithm
4. term weighting no
5. phrase discovery no
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description)
no, but see discussion of compression below
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes)
50.7 Mb (37.9 Mb for pointers, 12.8 Mb for frequencies)
b. total computer time to build (approximate number of hours)
4.20 CPU hours, once a vocabulary has been built
c. is the process completely automatic? yes
d. are term positions within documents stored?
no, but term frequency within document is stored
e. single terms only? yes
5. other data structures built from TREC text (what?)
model for subsequent compression of text
a. total amount of storage (megabytes) 2.4 Mb
b. total computer time to build (approximate number of hours) 2.54 hours
c. is the process completely automatic? yes
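The inverted index described in section B.1 stores document pointers and within-document frequencies, but no term positions, and its storage is reported separately for the two components. The Python sketch below shows one way such a structure might be laid out and sized. It is an assumption-laden illustration: the form does not state the coding method, so the use of document gaps and Elias gamma code lengths, along with the function names, is hypothetical. Tokens are assumed to be already stemmed.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Return {term: (gaps, freqs)}; no term positions are stored.

    gaps[i] is the difference between successive document numbers
    (documents are numbered from 1), and freqs[i] is the frequency
    of the term within that document.
    """
    postings = defaultdict(lambda: ([], []))
    last_doc = {}
    for doc_id, tokens in enumerate(docs, start=1):
        for term, f in sorted(Counter(tokens).items()):
            gaps, freqs = postings[term]
            gaps.append(doc_id - last_doc.get(term, 0))
            freqs.append(f)
            last_doc[term] = doc_id
    return dict(postings)


def gamma_bits(x):
    """Length in bits of the Elias gamma code for an integer x >= 1."""
    return 2 * (x.bit_length() - 1) + 1


def storage_bits(index):
    """Return (pointer_bits, frequency_bits) under the assumed gamma coding."""
    pointer_bits = sum(gamma_bits(g) for gaps, _ in index.values() for g in gaps)
    freq_bits = sum(gamma_bits(f) for _, freqs in index.values() for f in freqs)
    return pointer_bits, freq_bits
```

Keeping the two components in separate lists mirrors the way the form reports pointer storage and frequency storage as separate figures.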