NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
University of Illinois at Chicago
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.
Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
Each document is represented as a set of word pairs. Pairs were formed from all adjacent
words, plus all words separated by one and two intermediate words. Documents were the
unit of organization for the data structure. If a pair occurred only once in a document it was
dropped from the data structure for that document only.
A sample record is as follows:
MULTIMEDIA ENCYCLOPEDIA 2 WSJ880815-0014
The number of times the pair occurred in the document appears in the third field, just before
the document id.
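As a rough illustration of the scheme described above, the following sketch (an assumed reconstruction, not the original implementation) forms pairs from each word and the words up to three positions ahead of it, drops pairs occurring only once in a document, and emits records in the same field order as the sample; the document id used here is hypothetical.

```python
# Sketch of the per-document word-pair structure described in the text:
# pairs from adjacent words plus words separated by one or two
# intermediate words; singleton pairs are dropped per document.
from collections import Counter

def pair_index(doc_id, words, max_gap=2):
    """Return records of the form (word_a, word_b, count, doc_id)."""
    counts = Counter()
    for i, w in enumerate(words):
        # Pair w with the next word and the 1- and 2-gap words after it.
        for j in range(i + 1, min(i + 2 + max_gap, len(words))):
            counts[(w, words[j])] += 1
    # Keep only pairs occurring more than once in this document.
    return [(a, b, n, doc_id) for (a, b), n in counts.items() if n > 1]

words = "multimedia encyclopedia release multimedia encyclopedia".split()
for rec in pair_index("DOC1", words):     # "DOC1" is a made-up id
    print(*rec)   # e.g. multimedia encyclopedia 2 DOC1
```

Note that with this dropping rule a short document can contribute no records at all, which keeps the structure compact at the cost of losing rare pairs.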
A. Which of the following were used to build your data structures?
1. stopword list
The stopword list from SMART version 10 was used. Some additional stop words
from TREC markup codes were used.
a. how many words in list? The total size of the stoplist was 631 words.
2. is a controlled vocabulary used? none
3. stemming none
a. standard stemming algorithms
which ones?
Some small stemming experiments were later performed using the code from
SMART version 10 and three training queries. For query 002 stemming
had no effect, while for query 006 it resulted in a 43% increase in recall,
and for query 009 a 73% improvement in recall.
b. morphological analysis none
4. term weighting
None. Weighting was planned but could not be implemented given limitations that
arose.
5. phrase discovery
a. what kind of phrase?
Word pairs occurring within three word positions of one another.
b. using statistical methods All such pairs were identified.
c. using syntactic methods
6. syntactic parsing none
7. word sense disambiguation none
8. heuristic associations
a. short definition of these associations Only the basic pairing associations were used.
9. spelling checking (with manual correction) none
10. spelling correction none