NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

algorithm requires that at least two original text words agree on a synonym before it is added to the vector. The effect of this is to do a poor man's version of sense disambiguation for the synonyms.

    8. heuristic associations
        a. short definition of these associations
           WordNet synonymy relation only association used.
    9. spelling checking (with manual correction)
       no
    10. spelling correction
       no
    11. proper noun identification algorithm
       no
    12. tokenizer (recognizes dates, phone numbers, common patterns)
       no
    13. are the manually-indexed terms used?
       no
    14. other techniques used to build data structures (brief description)
       no

  B. Statistics on data structures built from TREC text (please fill out each applicable section)
    1. inverted index
        a. total amount of storage (megabytes)
           947 megabytes of disk storage
        b. total computer time to build (approximate number of hours)
           5 hours to build index given document vectors; document vectors took 37 hours to build from text. Thus, approximately 42 hours to go from text to inverted index.
        c. is the process completely automatic?
           yes
        d. are term positions within documents stored?
           No term position information maintained.
        e. single terms only?
           Single terms only (although, as stated above, a single term from WordNet may be a collocation such as `electrical_discharge').
    2. n-grams, suffix arrays, signature files
       N-grams and signature files not used. SMART stemmer algorithm incorporates a (static) trie of suffixes.
    3. knowledge bases
       No knowledge base used other than WordNet (described under I.C.2).

  C. Data built from sources other than the input text
    1. internally-built auxiliary files
       none
    2. externally-built auxiliary file
        a. type of file (Treebank, WordNet, etc.)
           WordNet (noun portion only)
        b. total amount of storage (megabytes)
           5 megabytes
        c. total number of concepts represented
           35,155 synonym sets (67,293 word senses)
        d. type of representation (frames, semantic nets, rules, etc.)
           We used only the synonymy relation that WordNet contains. However, WordNet contains many other lexical relationships making it similar to a semantic net.

II. Query construction (please fill out a section for each query construction method used)
    [We submitted one set of results; those results were for automatically built ad hoc queries.]

  A. Automatically built queries (ad hoc)
    1. topic fields used
       Concepts (<con>), Description (<desc>), Factors (<fac>), Narrative (<narr>), Nationality (<nat>), Title (<title>)
    2. total computer time to build query (cpu seconds)
       1 second, on average (5[illegible] seconds to index 50 queries)
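The topic-field selection listed under II.A.1 can be illustrated with a minimal sketch. It assumes that each field tag in a topic is unclosed and that a field runs until the next tag appears; the form does not describe how the fields were actually extracted, so the function name extract_fields, the SAMPLE topic, and the parsing approach are all hypothetical.

```python
import re

# Field tags named in the form. Assumption: tags are unclosed and each
# field's text runs up to the next tag of any kind.
FIELDS = ("con", "desc", "fac", "narr", "nat", "title")
TAG = re.compile(r"<(/?[a-z]+)>")


def extract_fields(topic_text, fields=FIELDS):
    """Map each selected field tag to its whitespace-normalized text."""
    # re.split with one capturing group yields [pre, tag1, body1, tag2, body2, ...].
    parts = TAG.split(topic_text)
    out = {}
    for tag, body in zip(parts[1::2], parts[2::2]):
        if tag not in fields:
            continue  # skip fields the run did not use, and closing tags
        text = " ".join(body.split())
        if text:
            out[tag] = (out.get(tag, "") + " " + text).strip()
    return out


# Hypothetical topic text, written only to exercise the parser.
SAMPLE = """<top>
<title> Topic: Example Subject
<desc> Description:
Document discusses the example subject.
<narr> Narrative:
A relevant document must mention the subject.
<con> Concept(s):
1. example, sample
<dom> Domain: Hypothetical
</top>"""

if __name__ == "__main__":
    for tag, text in extract_fields(SAMPLE).items():
        print(tag, "->", text)
```

The text of the selected fields could then be concatenated and indexed in the same way as a document; field labels such as "Description:" are left in place here and would normally be handled by a stop list or stripped separately.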
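The synonym-agreement rule described at the top of this section (a synonym is added to the vector only when at least two original text words agree on it) can be sketched as follows. This is an illustration under stated assumptions, not the system's code: the function name agreed_synonyms, the min_support parameter, and the TOY_SYNONYMS table are hypothetical, and the table is invented toy data rather than actual WordNet content.

```python
from collections import defaultdict


def agreed_synonyms(text_words, synonyms_of, min_support=2):
    """Return the synonyms proposed by at least `min_support` distinct text words."""
    support = defaultdict(set)
    for word in set(text_words):  # repeated occurrences of a word count once
        for syn in synonyms_of.get(word, ()):
            if syn != word:
                support[syn].add(word)
    return {syn for syn, words in support.items() if len(words) >= min_support}


# Hypothetical toy synonym table (not actual WordNet data).
TOY_SYNONYMS = {
    "bank": ["depository", "riverbank", "slope"],
    "vault": ["depository", "jump"],
    "shore": ["riverbank", "coast"],
}

if __name__ == "__main__":
    # "bank" and "vault" both propose "depository", so only it is kept.
    print(sorted(agreed_synonyms(["bank", "vault", "loan"], TOY_SYNONYMS)))
```

In this toy example "bank" alone proposes three candidates drawn from different senses; requiring a second supporting word keeps only the candidate consistent with the surrounding text, which is the sense-filtering effect the form describes.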