NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

System Summary and Timing
GTE Laboratories

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

   A. Which of the following were used to build your data structures?
      1. stopword list
         a. how many words in list? 28[illegible] words
      2. is a controlled vocabulary used? no
      3. stemming
         a. standard stemming algorithms
            which ones? Paice conflation
         b. morphological analysis no
      4. term weighting yes
      5. phrase discovery no
      6. syntactic parsing no
      7. word sense disambiguation no
      8. heuristic associations no
      9. spelling checking (with manual correction) no
      10. spelling correction no
      11. proper noun identification algorithm no
      12. tokenizer (recognizes dates, phone numbers, common patterns) no
      13. are the manually-indexed terms used? no
      14. other techniques used to build data structures (brief description)

   B. Statistics on data structures built from TREC text (please fill out each applicable section)
      1. inverted index
         a. total amount of storage (megabytes) 336[illegible] (for the 24[illegible] MB [illegible] text)
         b. total computer time to build (approximate number of hours) 672
         c. is the process completely automatic? yes
         d. are term positions within documents stored? yes
         e. single terms only? yes
      5. other data structures built from TREC text (what?) statistics files
         a. total amount of storage (megabytes) 400
         b. total computer time to build (approximate number of hours) 24
         c. is the process completely automatic? yes
         d. brief description of methods used
            Index is scanned for frequency, location, popularity and record size statistics. The results are used in normalizing the weighting attributes.

   C. Data built from sources other than the input text: no

II. Query construction (please fill out a section for each query construction method used)