NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

System Summary and Timing
ConQuest Software, Inc.

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

A. Which of the following were used to build your data structures?
   1. stopword list: yes
      a. how many words in list? 70
   2. is a controlled vocabulary used? no
   3. stemming
      a. standard stemming algorithms: no
      b. morphological analysis: yes
   4. term weighting: yes
   5. phrase discovery: yes
      a. what kind of phrase? Paraphrase of Query
      b. using statistical methods: Statistical proximity match
      c. using syntactic methods: Limited
   6. syntactic parsing: Limited--PoS assignment
   7. word sense disambiguation: In query by user, & in explosion of terms
   8. heuristic associations: yes
      a. short definition of these associations: Terms associated via semantic net
   9. spelling checking (with manual correction): In query only
   10. spelling correction: no
   11. proper noun identification algorithm: If identified by lexicon
   12. tokenizer (recognizes dates, phone numbers, common patterns)
      a. which patterns are tokenized? Many
   13. are the manually-indexed terms used? no
   14. other techniques used to build data structures (brief description):
      Index organized hierarchically so that best documents (based on a coarse grained ranking algorithm) are returned to user while search continues on very large databases. Linked lists are used to connect and identify idioms. Semantic network term explosion is controlled by "weighted" links where weights are selected as either numerical or fuzzy sets based upon the link source and relationship.

B. Statistics on data structures built from TREC text (please fill out each applicable section)
   1. inverted index
      a. total amount of storage (megabytes): 1.2 Gb for 2.3 Gb text, 52%
      b. total computer time to build (approximate number of hours): 150
      c. is the process completely automatic? yes
         if not, approximately how many hours of manual labor? Setup--4 hours
      d. are term positions within documents stored? yes
      e. single terms only? no
   3. knowledge bases
      a. total amount of storage (megabytes): 12 Mbytes
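
Illustrative note on item A.14: the form describes semantic network term explosion that is controlled by "weighted" links, but gives no algorithm or values. The following is only a minimal sketch of one way such an expansion could work, assuming purely numerical weights (the fuzzy-set variant is not shown). The relation names, weight values, toy network, pruning threshold, and the function name expand_term are all hypothetical and do not come from the ConQuest system.

```python
from collections import deque

# Hypothetical weight per link relation; the source gives no actual values.
RELATION_WEIGHT = {"synonym": 0.9, "hypernym": 0.6, "related": 0.4}

# Hypothetical toy semantic network: term -> list of (neighbor, relation).
SEMANTIC_NET = {
    "car": [("automobile", "synonym"), ("vehicle", "hypernym")],
    "vehicle": [("truck", "related")],
    "automobile": [("engine", "related")],
}


def expand_term(term, net=SEMANTIC_NET, weights=RELATION_WEIGHT, threshold=0.3):
    """Return {term: weight} for terms reachable with weight >= threshold.

    A path's weight is the product of its link weights, so the expansion
    fades with distance and stops once it falls below the threshold.
    """
    best = {term: 1.0}          # the query term itself has full weight
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for neighbor, relation in net.get(current, []):
            weight = best[current] * weights[relation]
            # Keep a neighbor only if it is strong enough and improves on
            # any weight already recorded for it.
            if weight >= threshold and weight > best.get(neighbor, 0.0):
                best[neighbor] = weight
                queue.append(neighbor)
    return best


if __name__ == "__main__":
    for expanded, weight in sorted(expand_term("car").items()):
        print(f"{expanded}: {weight:.2f}")
```

In this toy network, "truck" is reached only through a hypernym link followed by a related link (0.6 x 0.4 = 0.24), which falls below the assumed threshold of 0.3, so it is left out of the expansion. Raising or lowering the threshold is the knob that limits how far the explosion spreads.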