NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

    d. are term positions within documents stored? no
    e. single terms only? yes
  C. Data built from sources other than the input text: yes
    1. internally-built auxiliary files: yes, a semantic lexicon
      a. domain independent or domain specific (if two separate files, please fill out one set of questions for each file)
         Domain independent
      b. type of file (thesaurus, knowledge base, lexicon, etc.)
         Semantic lexicon built by examination of Roget's Thesaurus.
      c. total amount of storage (megabytes)
         0.34 megabytes.
      d. total number of concepts represented
         There are 36 semantic categories and there are approximately 24,000 words in two lexicons with the categories they trigger. The probability of each triggered category is also stored.
      e. type of representation (frames, semantic nets, rules, etc.)
         It could be viewed as rules.
      f. total computer time to build (approximate number of hours)
        (1) if already built, how much time to modify for TREC?
         Since the 1911 edition of Roget's Thesaurus became public domain recently, we spent approximately 16 hours creating the software to process the 1911 Thesaurus. Approximately 6 hours of processing time was required to automatically extract 20,000 lexicon entries. However, we did not have time to explore the use of these entries.
      g. total manual time to build (approximate number of hours)
        (1) if already built, how much time to modify for TREC?
         Prior to TREC, there were 3,000 entries in the lexicon established by manual processing of approximately 6,000 words in 300 hours.
         For TREC, we made 1,000 new entries (in 85 hours) by examination of 1,700 frequently occurring words found in the training topics and the training text. So, the lexicon we used had 4,000 entries in it.
      h. use of manual labor
        (4) other (describe)
         Refer to (f) and (g).
    2. externally-built auxiliary file: no

II. Query construction (please fill out a section for each query construction method used)
  A. Automatically built queries (ad hoc): yes
    1. topic fields used
       All fields
    2. total computer time to build query (cpu seconds)
       1 second
    3. which of the following were used?
      f. tokenizer (recognizes dates, phone numbers, common patterns)
         Dates recognizable but not used.
      h. expansion of queries using previously-constructed data structure (from part I)
        (1) which structure?
         Semantic lexicon described in I.C.1.
      j. other (describe)
         Term weighting based on terms in training text.
  D. Automatically built queries (routing): yes
    1. topic fields used
       All fields
    2. total computer time to build query (cpu seconds)
       1 second.
    3. which of the following were used in building the query?
      a. terms selected from
        (1) topic
      b. term weighting
        (2) with weights based on terms in all training documents
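
The semantic lexicon described in I.C.1 stores, for each word, the categories it triggers and the probability of each triggered category. The questionnaire does not say how these entries were combined when expanding a query, so the following Python fragment is only a minimal sketch under stated assumptions: the lexicon is modeled as a dictionary, the words, category names, and probabilities are invented for illustration, and summing probabilities per category is one plausible way to turn a query into a category profile, not the participants' documented method.

```python
from collections import defaultdict

# Hypothetical lexicon: word -> {category: probability of that category}.
# All entries and category names are invented; the real lexicon had
# 36 semantic categories derived from Roget's Thesaurus.
LEXICON = {
    "bank": {"MONEY": 0.75, "LAND": 0.25},
    "loan": {"MONEY": 1.0},
    "river": {"WATER": 0.5, "LAND": 0.5},
}


def tokenize(text):
    """Lowercase and split on whitespace (single terms only)."""
    return text.lower().split()


def category_profile(text, lexicon=LEXICON):
    """Sum the stored probability of every category triggered by the text.

    Words missing from the lexicon trigger nothing. The summation rule is
    an assumption made for this sketch.
    """
    totals = defaultdict(float)
    for word in tokenize(text):
        for category, probability in lexicon.get(word, {}).items():
            totals[category] += probability
    return dict(totals)


if __name__ == "__main__":
    print(category_profile("Bank loan terms"))
```

Viewed this way, each lexicon entry behaves like a rule of the form "word triggers category with probability p", which matches the answer given under "type of representation". A query and a document could then be compared on their category profiles in addition to their single terms.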