SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
ConQuest Software, Inc.
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe
all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list yes
a. how many words in list? 70
2. is a controlled vocabulary used? no
3. stemming
a. standard stemming algorithms no
b. morphological analysis yes
4. term weighting yes
5. phrase discovery yes
a. what kind of phrase? Paraphrase of Query
b. using statistical methods Statistical proximity match
c. using syntactic methods Limited
6. syntactic parsing Limited--PoS assignment
7. word sense disambiguation In query by user, & in explosion of terms
8. heuristic associations yes
a. short definition of these associations Terms associated via semantic net
9. spelling checking (with manual correction) In query only
10. spelling correction no
11. proper noun identification algorithm If identified by lexicon
12. tokenizer (recognizes dates, phone numbers, common patterns)
a. which patterns are tokenized? Many
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description)
Index organized hierarchically so that best documents (based on a coarse grained
ranking algorithm) are returned to user while search continues on very large
databases. Linked lists are used to connect and identify idioms. Semantic network
term explosion is controlled by "weighted" links where weights are selected as either
numerical or fuzzy sets based upon the link source and relationship.
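
The weighted-link control of term explosion described above can be sketched as follows. This is only an illustration, not ConQuest's implementation: the toy network, the function name explode_terms, the min_weight cutoff, and the rule that a path's weight is the product of its link weights are all assumptions. Only numerical weights are shown; the fuzzy-set variant is not.

```python
import heapq

# Hypothetical toy semantic network: term -> [(related term, link weight)].
SEMANTIC_NET = {
    "car": [("automobile", 0.9), ("vehicle", 0.7)],
    "automobile": [("vehicle", 0.8)],
    "vehicle": [("truck", 0.6)],
}

def explode_terms(query_terms, net, min_weight=0.5):
    """Expand query terms along weighted links.

    Assumption: a path's weight is the product of its link weights, and a
    term is added only if its best path weight is at least min_weight.
    Returns {term: best path weight}; query terms themselves get 1.0.
    """
    best = {}
    # Max-heap via negated weights, so the strongest paths are expanded first.
    heap = [(-1.0, term) for term in query_terms]
    heapq.heapify(heap)
    while heap:
        neg_weight, term = heapq.heappop(heap)
        if term in best:
            continue  # already reached through a path at least as strong
        weight = -neg_weight
        best[term] = weight
        for neighbor, link_weight in net.get(term, []):
            path_weight = weight * link_weight
            if path_weight >= min_weight and neighbor not in best:
                heapq.heappush(heap, (-path_weight, neighbor))
    return best

expanded = explode_terms(["car"], SEMANTIC_NET)
```

In this toy network, "truck" is not added for the query "car" because every path to it falls below the cutoff, which is the sense in which the link weights keep the explosion bounded.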
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
a. total amount of storage (megabytes) 1.2 Gb for 2.3 Gb text, 52%
b. total computer time to build (approximate number of hours) 150
c. is the process completely automatic? yes
if not, approximately how many hours of manual labor? Setup--4 hours
d. are term positions within documents stored? yes
e. single terms only? no
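
For readers unfamiliar with the terms in this section, the following is a minimal sketch of an inverted index that stores term positions and supports a simple proximity match. It is a generic illustration, not the ConQuest index: the stopword set, the function names, and the window size are assumptions, and the hierarchical organization described under item 14 is not modeled.

```python
from collections import defaultdict

# Hypothetical stopword list; the system reported a list of 70 words.
STOPWORDS = {"the", "of", "a", "and", "to"}

def build_index(docs):
    """Map each non-stopword term to {doc_id: [positions]}.

    Positions count every whitespace-separated token, including stopwords,
    so distances between indexed terms reflect the original text.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            if token in STOPWORDS:
                continue
            index[token].setdefault(doc_id, []).append(pos)
    return index

def within_window(index, term_a, term_b, window=3):
    """Return sorted ids of documents where the two terms occur within
    `window` token positions of each other."""
    docs_a = index.get(term_a, {})
    docs_b = index.get(term_b, {})
    hits = []
    for doc_id in docs_a.keys() & docs_b.keys():
        if any(abs(pa - pb) <= window
               for pa in docs_a[doc_id] for pb in docs_b[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

docs = {
    "d1": "the retrieval of text",
    "d2": "text one two three four five retrieval",
}
index = build_index(docs)
matches = within_window(index, "text", "retrieval")
```

Storing positions is what makes this kind of proximity test possible, at the cost of a larger index. The 52% figure reported above is the ratio of index size to text size: 1.2 Gb / 2.3 Gb is approximately 0.52.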
3. knowledge bases
a. total amount of storage (megabytes) 12 Mbytes