NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
Universitaet Dortmund
Phrase automatic ad hoc (Fuhr learning)
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that
your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list
a. how many words in list? 570
2. is a controlled vocabulary used?
Not for single terms.
A phrase list was automatically constructed from phrases occurring 25 times or more
in the first doc set (D1). Only those phrases were used.
3. stemming yes
a. standard stemming algorithms
which ones? SMART
b. morphological analysis
4. term weighting
In docs, linear combination of several factors
In queries, tf * idf, cosine normalization (ntc)
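The "ntc" label is SMART's triple notation: natural term frequency, idf, and cosine normalization. A minimal sketch of such query-term weighting, assuming standard definitions of the three components (function and variable names are illustrative, not from the system described here):

```python
import math

def ntc_weights(query_terms, doc_freq, num_docs):
    """SMART 'ntc' weighting: natural tf, times idf, cosine-normalized.

    query_terms: list of (possibly repeated) terms in the query
    doc_freq:    dict mapping term -> number of documents containing it
    num_docs:    total number of documents in the collection
    """
    # n: natural (raw) term frequency
    tf = {}
    for t in query_terms:
        tf[t] = tf.get(t, 0) + 1
    # t: multiply by idf = log(N / df)
    w = {t: f * math.log(num_docs / doc_freq[t]) for t, f in tf.items()}
    # c: cosine normalization -- divide by the vector's Euclidean length
    norm = math.sqrt(sum(v * v for v in w.values()))
    return {t: v / norm for t, v in w.items()} if norm else w
```

After normalization the query vector has unit length, so rarer terms (lower document frequency) carry proportionally more weight.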
5. phrase discovery
a. what kind of phrase?
Adjacent non-stopwords, components stemmed, that occurred at least 25
times in the D1 document set.
b. using statistical methods
c. using syntactic methods
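A minimal sketch of the phrase-discovery step described above, assuming adjacency is taken after stopword removal and that a stemming function is supplied by the caller (both assumptions, since the questionnaire does not specify them):

```python
from collections import Counter

def discover_phrases(documents, stopwords, stem, min_count=25):
    """Collect adjacent non-stopword pairs, components stemmed, and keep
    those occurring at least min_count times (25 in the run described)."""
    counts = Counter()
    for doc in documents:
        # drop stopwords, stem what remains
        stems = [stem(w) for w in doc.lower().split() if w not in stopwords]
        # count each pair of now-adjacent stems
        counts.update(zip(stems, stems[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}
```

The frequency threshold keeps the phrase list small and statistically reliable at the cost of missing rare but meaningful phrases.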
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description)
Coefficients for linear combinations used in weighting were determined automatically
using Q1, D1, and judgments of Q1 on D1. This took 2.4 hours (not including 5.6 hours
to index Q1, D1).
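The questionnaire does not say how the coefficients were fitted. Purely as an illustration, weights for a linear combination of indexing factors could be estimated from relevance judgments by ordinary least squares (all names here are hypothetical):

```python
def fit_coefficients(features, relevance):
    """Fit weights w for score = w . x by ordinary least squares
    via the normal equations, using only the standard library.

    features:  list of factor vectors, one per (term, document) pair
    relevance: list of 0/1 relevance labels from the judgments
    """
    k = len(features[0])
    # build X^T X and X^T y
    xtx = [[sum(f[i] * f[j] for f in features) for j in range(k)]
           for i in range(k)]
    xty = [sum(f[i] * r for f, r in zip(features, relevance))
           for i in range(k)]
    # solve the k x k system by Gauss-Jordan elimination
    for col in range(k):
        pivot = xtx[col][col]
        for row in range(k):
            if row != col and pivot:
                factor = xtx[row][col] / pivot
                xtx[row] = [a - factor * b
                            for a, b in zip(xtx[row], xtx[col])]
                xty[row] -= factor * xty[col]
    return [xty[i] / xtx[i][i] for i in range(k)]
```

Fitting on one query/document set and reusing the weights on new data, as described above, is what makes the 2.4-hour training cost a one-time expense.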
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 840
b. total computer time to build (approximate number of hours)
9.7 hours to create doc vectors from text
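A sketch of inverting per-document term vectors into the postings structure whose size is reported above (the representation is assumed; the questionnaire gives only totals):

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Invert doc_id -> {term: weight} vectors into an inverted index
    mapping each term to its postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index
```

The inversion lets a search touch only the postings lists of the query terms rather than scanning every document vector.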