SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
Universitaet Dortmund
Single term automatic ad hoc run (Fuhr learning)
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that
your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list
a. how many words in list? 57[OCRerr]
2. is a controlled vocabulary used? no
3. stemming yes
a. standard stemming algorithms
which ones? SMART
b. morphological analysis
4. term weighting
In docs, linear combination of several factors
In queries, tf * idf; cosine normalization (ntc)
5. phrase discovery no
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description)
Coefficients for linear combinations used in weighting were determined automatically
using Q1, D1, judgments of Q1 on D1. This took 1.7 hours (not including 2.6 hours
to index Q1, D1).
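The query weighting reported under item 4 (tf * idf with cosine normalization, SMART code ntc) can be illustrated with a minimal sketch. This is not the Dortmund or SMART code: the function name ntc_weights, the use of the natural logarithm for idf, and the choice to skip terms that do not occur in the collection are assumptions made for illustration only.

```python
import math
from collections import Counter


def ntc_weights(query_terms, doc_freq, num_docs):
    """Return cosine-normalized tf * idf weights for a list of query terms.

    query_terms: list of (already stemmed) terms, repeats allowed
    doc_freq:    dict mapping term -> number of documents containing it
    num_docs:    total number of documents in the collection
    All names here are hypothetical; idf = ln(N / df) is an assumption.
    """
    tf = Counter(query_terms)
    raw = {}
    for term, count in tf.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term not in the collection: no idf can be computed
        raw[term] = count * math.log(num_docs / df)  # "n" (raw tf) times "t" (idf)

    # "c": divide by the Euclidean length so the vector has unit norm
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {}
    return {term: w / norm for term, w in raw.items()}


if __name__ == "__main__":
    weights = ntc_weights(
        ["text", "retrieval", "retrieval"],
        {"text": 100, "retrieval": 10},
        1000,
    )
    for term, weight in sorted(weights.items()):
        print(term, round(weight, 4))
```

The document-side weights in this run are described only as a linear combination of several factors with automatically determined coefficients, so they are deliberately not sketched here.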
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 69[OCRerr]
b. total computer time to build (approximate number of hours)
4.7 hours to create doc vectors from text
1.7 hours to reweight doc vectors and produce inverted file
c. is the process completely automatic? yes
d. are term positions within documents stored? no
e. single terms only? yes
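The answers above describe an inverted file produced from weighted document vectors, holding single terms only and no term positions. A minimal sketch of such a structure follows, assuming document vectors are available as dictionaries; the names build_inverted_file and score, and the inner-product scoring, are hypothetical and do not describe the actual system.

```python
from collections import defaultdict


def build_inverted_file(doc_vectors):
    """Turn docid -> {term: weight} vectors into term -> [(docid, weight), ...].

    Postings carry a weight but no positions, and every key is a single term.
    """
    inverted = defaultdict(list)
    for docid in sorted(doc_vectors):  # keep postings ordered by docid
        for term, weight in doc_vectors[docid].items():
            inverted[term].append((docid, weight))
    return dict(inverted)


def score(query_weights, inverted):
    """Accumulate inner-product scores by walking only the query terms' postings."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for docid, d_weight in inverted.get(term, []):
            scores[docid] += q_weight * d_weight
    return dict(scores)


if __name__ == "__main__":
    vectors = {2: {"text": 0.5}, 1: {"text": 0.2, "trec": 0.9}}
    index = build_inverted_file(vectors)
    print(index["text"])
    print(score({"text": 1.0, "trec": 0.5}, index))
```

Storing postings by term means a query touches only the lists for its own terms rather than every document vector, which is the usual reason for building this structure.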
5. other data structures built from TREC text (what?)
Map from docid to text location (also gives title for each doc)
a. total amount of storage (megabytes) 68 Mbytes.
b. total computer time to build (approximate number of hours)