SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
University of Central Florida
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that
your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list yes
a. how many words in list?
166 stop words, 122 abbreviations, 47 hyphenated words, 24 entries for
abbreviations and alternate notions for months, 35 entries for legitimate
words not to be prefixed, and 6 entries for legitimate prefixes.
2. is a controlled vocabulary used? no
3. stemming yes
a. standard stemming algorithms
which ones? J.B. Lovins' Stemming Algorithm (modified).
b. morphological analysis none
4. term weighting yes
5. phrase discovery
6. syntactic parsing no
7. word sense disambiguation
Yes. The semantic lexicon we used is based on word senses found in Roget's
Thesaurus.
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) yes
a. which patterns are tokenized?
The QA System recognizes dates. But we felt it was not useful for the NIST
experiment so we removed this feature to improve text processing speed.
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description)
The QA System uses B-tree storage structures for inverted index file access and
semantic lexicon access. But for the NIST experiments, we used the QA System text
scanning ability and coupled it with hash table access (replacing the B-tree access)
and the use of 32-bit codes for text strings.
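The combination of hash table access and 32-bit codes for text strings described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the QA System's actual implementation: the form does not say how the 32-bit codes were computed, so CRC-32 is used here as a stand-in, and the names code32, build_table, and lookup are hypothetical.

```python
import zlib


def code32(term: str) -> int:
    """Map a text string to an unsigned 32-bit code.

    CRC-32 is an assumption made for this sketch; the source does not
    say which coding scheme was used.
    """
    return zlib.crc32(term.encode("utf-8")) & 0xFFFFFFFF


def build_table(docs: dict) -> dict:
    """Scan each document once and build a hash table that maps a
    32-bit term code to the sorted list of document ids containing it."""
    table = {}
    for doc_id, text in docs.items():
        for token in text.lower().split():
            table.setdefault(code32(token), set()).add(doc_id)
    return {code: sorted(ids) for code, ids in table.items()}


def lookup(table: dict, term: str) -> list:
    """Return the document ids stored under the term's 32-bit code."""
    return table.get(code32(term.lower()), [])


# Toy collection, invented for illustration only.
docs = {
    1: "information retrieval systems",
    2: "retrieval of text",
    3: "hash table access",
}
table = build_table(docs)
retrieval_docs = lookup(table, "retrieval")
```

Storing a fixed-width code instead of the string keeps the table keys small and cheap to compare, at the cost that two distinct strings can share a code. A real system would need some way to detect or tolerate such collisions; the source does not say how this was handled.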
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index yes
a. total amount of storage (megabytes)
For Vol. 1 the index storage was 385 megabytes.
b. total computer time to build (approximate number of hours)
73 hours using nine IBM 50 MHz 486 PCs running in parallel.
c. is the process completely automatic? yes