NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
University of Illinois at Chicago
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.
Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
Each document is represented as a set of word pairs. Pairs were formed from all adjacent
words, plus all words separated by one and two intermediate words. Documents were the
unit of organization for the data structure. If a pair occurred only once in a document it was
dropped from the data structure for that document only.
A sample record is as follows:
MULTIMEDIA ENCYCLOPEDIA 2 WSJ880815-0014
The number of times the pair occurred in the document appears in the third field, just before
the document id.
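As a rough illustration of the scheme described above, the following sketch (an assumed reconstruction, not the original implementation) forms pairs from each word and the words up to three positions ahead of it, drops pairs occurring only once in a document, and emits records in the same field order as the sample; the document id used here is hypothetical.

```python
# Sketch of the per-document word-pair structure described in the text:
# pairs from adjacent words plus words separated by one or two
# intermediate words; singleton pairs are dropped per document.
from collections import Counter

def pair_index(doc_id, words, max_gap=2):
    """Return records of the form (word_a, word_b, count, doc_id)."""
    counts = Counter()
    for i, w in enumerate(words):
        # Pair w with the next word and the 1- and 2-gap words after it.
        for j in range(i + 1, min(i + 2 + max_gap, len(words))):
            counts[(w, words[j])] += 1
    # Keep only pairs occurring more than once in this document.
    return [(a, b, n, doc_id) for (a, b), n in counts.items() if n > 1]

words = "multimedia encyclopedia release multimedia encyclopedia".split()
for rec in pair_index("DOC1", words):     # "DOC1" is a made-up id
    print(*rec)   # e.g. multimedia encyclopedia 2 DOC1
```

Note that with this dropping rule a short document can contribute no records at all, which keeps the structure compact at the cost of losing rare pairs.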
A. Which of the following were used to build your data structures?
1. stopword list
The stopword list from SMART version 10 was used. Some additional stop words
from TREC markup codes were used.
a. how many words in list? The total size of the stoplist was 631 words.
2. is a controlled vocabulary used? none
3. stemming none
a. standard stemming algorithms
which ones?
Some small stemming experiments were later performed using the code from
SMART version 10 and three training queries. For query 002 stemming
had no effect, while for query 006 it resulted in a 43% increase in recall,
and for query 009 a 73% improvement in recall.
b. morphological analysis none
4. term weighting
None. Weighting was planned but could not be implemented given limitations that
arose.
5. phrase discovery
a. what kind of phrase?
Word pairs occurring within three word positions of one another.
b. using statistical methods All such pairs were identified.
c. using syntactic methods
6. syntactic parsing none
7. word sense disambiguation none
8. heuristic associations
a. short definition of these associations Only the basic pairing associations were used.
9. spelling checking (with manual correction) none
10. spelling correction none