SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
d. are term positions within documents stored? no
e. single terms only? yes
C. Data built from sources other than the input text yes
1. internally-built auxiliary files yes, a semantic lexicon
a. domain independent or domain specific (if two separate files, please fill out one set
of questions for each file) Domain independent
b. type of file (thesaurus, knowledge base, lexicon, etc.)
Semantic lexicon built by examination of Roget's Thesaurus.
c. total amount of storage (megabytes) 0.34 megabytes.
d. total number of concepts represented
There are 36 semantic categories and there are approximately 24,000 words
in two lexicons with the categories they trigger. The probability of each
triggered category is also stored.
e. type of representation (frames, semantic nets, rules, etc.) It could be viewed as rules.
f. total computer time to build (approximate number of hours)
(1) if already built, how much time to modify for TREC?
Since the 1911 edition of Roget's Thesaurus became public domain
recently, we spent approximately 16 hours creating the software to
process the 1911 Thesaurus. Approximately 6 hours of processing
time were required to automatically extract 20,000 lexicon entries.
However, we did not have time to explore the use of these entries.
g. total manual time to build (approximate number of hours)
(1) if already built, how much time to modify for TREC?
Prior to TREC, there were 3,000 entries in the lexicon established
by manual processing of approximately 6,000 words in 300 hours.
For TREC, we made 1,000 new entries (in 85 hours) by examination
of 1,700 frequently occurring words found in the training topics and
the training text. So, the lexicon we used had 4,000 entries in it.
h. use of manual labor
(4) other (describe) Refer to (f) and (g).
2. externally-built auxiliary file no
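
The semantic lexicon described in I.C.1 stores, for each word, the categories it triggers and the probability of each triggered category. A minimal Python sketch of one way to hold such data follows. The words, category labels, and probabilities are invented for illustration and are not taken from the actual lexicon; the function name is likewise hypothetical.

```python
# Hypothetical entries: word -> {semantic category: probability}.
# The labels and values below are illustrative assumptions only;
# the real lexicon has 36 categories that are not listed in this form.
LEXICON = {
    "bank": {"MONEY": 0.75, "PLACE": 0.25},
    "ship": {"MOTION": 0.5, "PLACE": 0.5},
}


def triggered_categories(word, lexicon=LEXICON):
    """Return (category, probability) pairs for a word, most probable first.

    Words that are not in the lexicon trigger no categories.
    """
    entry = lexicon.get(word.lower(), {})
    return sorted(entry.items(), key=lambda pair: pair[1], reverse=True)
```

Each entry can be read as a set of rules of the form "word triggers category with probability p", which matches the answer given under item e.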
II. Query construction
(please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc) yes
1. topic fields used All fields
2. total computer time to build query (cpu seconds) 1 second
3. which of the following were used?
f. tokenizer (recognizes dates, phone numbers, common patterns)
Dates recognizable but not used.
h. expansion of queries using previously-constructed data structure (from part I)
(1) which structure? Semantic lexicon described in I.C.1.
j. other (describe) Term weighting based on terms in training text.
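
Item h states that queries are expanded using the semantic lexicon from I.C.1, but the form does not say how. The Python sketch below shows one plausible reading, in which each query term contributes the probabilities of the categories it triggers to a query-level category vector. The summation scheme, the function name, and the sample lexicon entries are all assumptions made for illustration, not the system's documented method.

```python
from collections import defaultdict

# Hypothetical lexicon entries: word -> {category: probability}.
SAMPLE_LEXICON = {
    "bank": {"MONEY": 0.75, "PLACE": 0.25},
    "ship": {"MOTION": 0.5, "PLACE": 0.5},
}


def expand_query(terms, lexicon):
    """Map query terms to a {category: accumulated probability} vector.

    Assumption: category probabilities are simply summed over terms.
    Terms missing from the lexicon contribute nothing.
    """
    vector = defaultdict(float)
    for term in terms:
        for category, probability in lexicon.get(term.lower(), {}).items():
            vector[category] += probability
    return dict(vector)
```

For example, `expand_query(["bank", "ship", "treaty"], SAMPLE_LEXICON)` accumulates weight for three categories and ignores "treaty", which has no entry in the sample lexicon.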
D. Automatically built queries (routing) yes
1. topic fields used All fields
2. total computer time to build query (cpu seconds) 1 second.
3. which of the following were used in building the query?
a. terms selected from
(1) topic
b. term weighting
(2) with weights based on terms in all training documents
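
The form says only that routing query terms are taken from the topic and weighted using terms in all training documents; the weighting formula is not given. As a hedged illustration, the Python sketch below uses a smoothed inverse document frequency, a common choice that is swapped in here as an assumption. The function name, the whitespace tokenization, and the sample documents are hypothetical.

```python
import math


def idf_weights(topic_terms, training_docs):
    """Weight each topic term by smoothed inverse document frequency.

    Assumption: weight = log((N + 1) / (df + 1)), where N is the number
    of training documents and df is the number containing the term.
    Documents are tokenized by lowercasing and splitting on whitespace.
    """
    n_docs = len(training_docs)
    doc_word_sets = [set(doc.lower().split()) for doc in training_docs]
    weights = {}
    for term in topic_terms:
        key = term.lower()
        doc_freq = sum(1 for words in doc_word_sets if key in words)
        weights[key] = math.log((n_docs + 1) / (doc_freq + 1))
    return weights
```

With this scheme, a topic term that appears in fewer training documents receives a larger weight than one that appears in many.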