SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
Universitaet Dortmund
Automatic routing (RPI feedback)
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that
your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list
a. how many words in list? 57([OCRerr]
2. is a controlled vocabulary used? no
3. stemming yes
a. standard stemming algorithms
which ones? SMART
b. morphological analysis
4. term weighting
In docs + queries, tf * idf, cosine normalization (ntc) (in docs idf is based on
collection frequency within doc set D1 only)
5. phrase discovery no
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description) no
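The term weighting answer in item 4 (tf * idf with cosine normalization, SMART label "ntc") can be illustrated with a minimal sketch. This is not the Dortmund or SMART code: the function names are hypothetical, the idf is assumed to be the natural logarithm of N/df, and stopword removal and stemming are assumed to have been applied to the tokens already. As in the form's answer, document frequencies may come from a reference subset of the collection rather than from the documents being weighted.

```python
import math
from collections import Counter


def document_frequencies(collection):
    """Count, for each term, how many documents of the reference set contain it."""
    df = Counter()
    for tokens in collection:
        df.update(set(tokens))  # each document counts once per term
    return df


def ntc_weights(tokens, df, num_docs):
    """Return cosine-normalized tf * idf weights for one document or query.

    n: raw term frequency, t: idf = log(num_docs / df), c: cosine normalization.
    Terms missing from the reference set are skipped (an assumption of this sketch).
    """
    tf = Counter(tokens)
    raw = {}
    for term, freq in tf.items():
        if df.get(term, 0) > 0:
            raw[term] = freq * math.log(num_docs / df[term])
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}


def cosine_score(query_weights, doc_weights):
    """Inner product of two normalized vectors, i.e. their cosine similarity."""
    return sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())


if __name__ == "__main__":
    reference = [["cat", "dog"], ["cat", "cat", "bird"], ["fish"]]
    df = document_frequencies(reference)
    doc = ntc_weights(reference[0], df, len(reference))
    query = ntc_weights(["dog"], df, len(reference))
    print(cosine_score(query, doc))
```

Because both vectors are normalized to unit length, the retrieval score reduces to a plain inner product over shared terms.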
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 275
b. total computer time to build (approximate number of hours)
1.9 hours (not including time to index D1 to obtain collection frequency info)
c. is the process completely automatic? yes
d. are term positions within documents stored? no
e. single terms only? yes
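The kind of inverted index described in this section (single terms only, no term positions) can be sketched as a map from each term to the documents containing it. The layout below is an assumption for illustration, not the actual on-disk format used for the run; the function name and the document identifiers are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}; positions are not stored.

    `docs` maps a document identifier to its list of (already stemmed) tokens.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, freq in Counter(tokens).items():
            index[term][doc_id] = freq
    return dict(index)


if __name__ == "__main__":
    docs = {"d1": ["retriev", "text", "retriev"], "d2": ["text"]}
    for term, postings in sorted(build_inverted_index(docs).items()):
        print(term, postings)
```

Storing only a frequency per posting keeps the index small, at the cost of ruling out phrase or proximity matching, which is consistent with the "no" answers for phrase discovery and term positions.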
5. other data structures built from TREC text (what?)
Map from docid to text location (also gives title for each doc)
a. total amount of storage (megabytes) 24 Mbytes.
b. total computer time to build (approximate number of hours)
Time to create included in inverted file creation above.
c. is the process completely automatic? yes
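A map from docid to text location with a title per document, as described above, might look like the following sketch. The record fields (byte offset, length, title), the helper names, and the sample identifiers are all assumptions made for illustration; the form does not say how the actual map was stored.

```python
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class DocLocation:
    offset: int  # byte offset of the document text in the collection file
    length: int  # number of bytes of document text
    title: str   # title kept alongside the location for result display


def write_collection(docs, stream):
    """Append each document's text to `stream` and record where it landed.

    `docs` is an iterable of (doc_id, title, text) tuples.
    """
    locations = {}
    for doc_id, title, text in docs:
        data = text.encode("utf-8")
        locations[doc_id] = DocLocation(stream.tell(), len(data), title)
        stream.write(data)
    return locations


def fetch_text(doc_id, locations, stream):
    """Read one document's text back using its recorded offset and length."""
    loc = locations[doc_id]
    stream.seek(loc.offset)
    return stream.read(loc.length).decode("utf-8")


if __name__ == "__main__":
    store = io.BytesIO()
    locs = write_collection(
        [("DOC-1", "First title", "first text"), ("DOC-2", "Second title", "second text")],
        store,
    )
    print(locs["DOC-2"].title, "->", fetch_text("DOC-2", locs, store))
```

Building the map in the same pass that reads the text matches the note that its creation time is included in the inverted file creation.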
other data structures built from TREC text (what?)