SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
VPI & SU
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that
your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list  yes
   a. how many words in list?  41
2. is a controlled vocabulary used?  no
3. stemming
4. term weighting
   Vector and p-norm runs were done with no term weights. Vector runs were also
   performed with aug-norm * idf weighting.
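The "aug-norm * idf" weighting mentioned above is the standard augmented, max-normalized term frequency multiplied by inverse document frequency. A minimal sketch follows; the function name and the exact idf base are illustrative assumptions, not SMART internals:

```python
# Hedged sketch of aug-norm * idf weighting: augmented tf normalized by the
# maximum tf in the document, times idf. Names here are illustrative only.
import math

def aug_norm_idf(tf, max_tf, df, n_docs):
    """Augmented normalized tf (0.5 + 0.5 * tf/max_tf) times idf (natural log assumed)."""
    aug_tf = 0.5 + 0.5 * (tf / max_tf)
    idf = math.log(n_docs / df)
    return aug_tf * idf

# A term occurring 3 times (max tf in the document is 6),
# appearing in 10 of 1000 documents:
w = aug_norm_idf(3, 6, 10, 1000)
```

The augmented form keeps weights in a bounded range (0.5 to 1.0 before the idf factor), so long documents do not dominate purely by repeating terms.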
5. phrase discovery no
6. syntactic parsing  no
7. word sense disambiguation  no
8. heuristic associations  no
9. spelling checking (with manual correction)  no
10. spelling correction  no
11. proper noun identification algorithm  As provided in SMART
12. tokenizer (recognizes dates, phone numbers, common patterns)  As provided in SMART
13. are the manually-indexed terms used?  not used as suggested in guidelines
14. other techniques used to build data structures (brief description)
    1983 version of SMART, enhanced with VPI&SU routines
B. Statistics on data structures built from TREC text (please fill out each applicable section)
   Except if you want us to answer under 4 here re the knowledge base used to help build our
   Boolean queries, please advise.
5. other data structures built from TREC text (what?)  Document vector file and term dictionary
   a. total amount of storage (megabytes)
      Approx. 15 MB for the dictionary and 121 MB for the document vector file
      for the entire Wall Street Journal collection.
   b. total computer time to build (approximate number of hours)
      Approx. time to build above: 10 hours on ccrdl (DECstation 5000 Model 25,
      i.e., a MIPS R3000 chip running at 25 MHz)
   c. is the process completely automatic?  yes
   d. brief description of methods used
      The document text is tokenized, stop words are thrown out, and non-noise
      words are kept in the term dictionary along with their occurrence frequencies.
      Each term in the dictionary has a unique identification number. The vector
      file contains, for each document, its unique ID and a vector of term IDs and
      weights. The weighting scheme is flexible and can be changed
      to one of several schemes after the indexing is complete. (If necessary we
      can fill in details here. Please advise.)
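The indexing method described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the SMART implementation: the stopword list, tokenization by whitespace, and all names are hypothetical, and raw term frequencies stand in for the deferred weighting step.

```python
# Hedged sketch: tokenize, drop stop words, assign each surviving term a
# unique ID in a term dictionary, and build per-document vectors of
# (term ID, frequency) pairs. Weights can be recomputed from these later.
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "is"}  # tiny illustrative list

term_ids = {}          # term dictionary: word -> unique identification number
doc_freq = Counter()   # number of documents each term occurs in

def index_document(doc_id, text):
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    counts = Counter(tokens)
    vector = []
    for term, tf in counts.items():
        tid = term_ids.setdefault(term, len(term_ids))
        doc_freq[term] += 1
        vector.append((tid, tf))  # raw tf stored; reweight after indexing
    return doc_id, sorted(vector)

doc = index_document("WSJ870101-0001", "the price of gold and the price of oil")
```

Keeping raw frequencies in the vector file and applying the weighting scheme afterward is what makes it possible, as the answer notes, to switch among several weighting schemes after indexing is complete.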