SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
GTE Laboratories
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that
your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list
a. how many words in list? 28[OCRerr] words
2. is a controlled vocabulary used? no
3. stemming
a. standard stemming algorithms
which ones? Paice conflation
b. morphological analysis no
4. term weighting yes
5. phrase discovery no
6. syntactic parsing no
7. word sense disambiguation no
8. heuristic associations no
9. spelling checking (with manual correction) no
10. spelling correction no
11. proper noun identification algorithm no
12. tokenizer (recognizes dates, phone numbers, common patterns) no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description)
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 336[OCRerr] (for the 24[OCRerr][OCRerr] MB [OCRerr] text)
b. total computer time to build (approximate number of hours) 672
c. is the process completely automatic? yes
d. are term positions within documents stored? yes
e. single terms only? yes
5. other data structures built from TREC text (what?) statistics files
a. total amount of storage (megabytes) 400
b. total computer time to build (approximate number of hours) 24
c. is the process completely automatic? yes
d. brief description of methods used
Index is scanned for frequency, location, popularity and record size
statistics. The results are used in normalizing the weighting attributes.
C. Data built from sources other than the input text --no
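As a rough illustration of the structures reported in section I (a single-term inverted index that stores term positions, plus statistics scanned from that index and used to normalize weights), a minimal Python sketch follows. It is not the GTE implementation, which the form does not describe in detail. The stopword set, the tokenizer, every function name, and the choice of dividing term frequency by record size are assumptions made only for illustration, and the stemming step is left out.

```python
import re
from collections import defaultdict

# Hypothetical stopword set; the form gives only the size of the real list.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(docs):
    """Return (index, lengths).

    index maps each non-stopword term to {doc_id: [token positions]};
    lengths maps each doc_id to its record size in tokens.
    """
    index = defaultdict(dict)
    lengths = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        lengths[doc_id] = len(tokens)
        for pos, tok in enumerate(tokens):
            if tok in STOPWORDS:
                continue  # stopwords are not indexed but still occupy a position
            index[tok].setdefault(doc_id, []).append(pos)
    return index, lengths


def scan_statistics(index, lengths):
    """Scan the index once, collecting per-term frequencies and the mean record size."""
    stats = {}
    for term, postings in index.items():
        stats[term] = {
            "doc_freq": len(postings),  # number of records containing the term
            "coll_freq": sum(len(p) for p in postings.values()),  # total occurrences
        }
    avg_len = sum(lengths.values()) / len(lengths) if lengths else 0.0
    return stats, avg_len


def normalized_weight(term, doc_id, index, lengths):
    """Term frequency divided by record size: one possible normalization, assumed here."""
    positions = index.get(term, {}).get(doc_id, [])
    size = lengths.get(doc_id, 0)
    return len(positions) / size if size else 0.0
```

Calling build_positional_index on a small dictionary of {doc_id: text} and passing the result to scan_statistics yields the kind of frequency and record-size figures that a "statistics file" might hold; any real weighting scheme would likely combine them differently.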
II. Query construction
(please fill out a section for each query construction method used)