NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman

System Summary and Timing
Siemens Corporate Research, Inc.

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

Summary of method: Completely automatic vector matching where both document and query vectors have been expanded using synonyms extracted from WordNet.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

   A. Which of the following were used to build your data structures?

      1. stopword list
         a. how many words in list?
            571 word stopword list used (standard SMART stopword list)

      2. is a controlled vocabulary used? no

      3. stemming
         a. standard stemming algorithms: which ones?
         b. morphological analysis
            Extremely simple suffix stripper to look words up in WordNet. (Checks for one of 22 suffixes and possibly modifies end of stem if a matching suffix is found. This was in code I inherited--I don't know the source of the suffix list, but the list is a subset of that used by SMART, so it probably comes from some "standard" algorithm.) All words also pass through the "triestem" stemmer of SMART. This stemmer was originally based on Lovin's CACM article, but has evolved over the years.

      4. term weighting
            A tf*idf weight is used for both query and document terms, where the weight is further normalized so that an inner product computation produces the cosine ("tfc" weights using the terminology of "Term Weighting Approaches in Automatic Text Retrieval" by Salton and Buckley). A term is counted as appearing in a document (for idf purposes) if it was in the original text or if it was added as a synonym. The tf*idf portion of an added term's weight is multiplied by .8 to produce its final weight.

      5. phrase discovery
         a. what kind of phrase?
         b. using statistical methods
         c. using syntactic methods
            WordNet contains collocations as members of synonym sets, so some phrases may be added as synonyms. However, such a collocation is assigned a unique concept number and will only match that exact collocation (so I don't consider it to be "phrasing"). No other phrasing used.

      6. syntactic parsing: no

      7. word sense disambiguation
            No specific sense disambiguation procedure used. If a term occurs in more than one WordNet synonym set (which, by definition, means that it is polysemous), the synonyms from all of its senses may potentially be added to the vector. The
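
The term weighting described under item 4 and the all-senses synonym expansion described under item 7 can be illustrated with a minimal Python sketch. This is not the Siemens or SMART code. The toy synonym sets below are hypothetical stand-ins for WordNet, and the function names are invented for illustration. Two further points are assumptions because the text does not specify them: an added synonym inherits the term frequency of the word that triggered it, and the .8 factor is applied before cosine normalization.

```python
import math
from collections import Counter

# Hypothetical toy synonym sets standing in for WordNet synsets.
# "bank" appears in two sets, i.e. it is polysemous.
SYNSETS = [
    {"car", "automobile"},
    {"bank", "shore"},
    {"bank", "depository"},
]
ADDED_TERM_FACTOR = 0.8  # the ".8" multiplier stated in the text


def expand(tokens):
    """Return (original term counts, added synonym counts).

    Synonyms from every set containing a term are added (no sense
    disambiguation). Assumption: an added synonym inherits the tf of
    the term that triggered it, and terms already present are not re-added.
    """
    original = Counter(tokens)
    added = Counter()
    for term, tf in original.items():
        for synset in SYNSETS:
            if term in synset:
                for synonym in synset - {term}:
                    if synonym not in original:
                        added[synonym] += tf
    return original, added


def document_frequencies(docs):
    """Count a term for a document if it is original OR an added synonym."""
    df = Counter()
    for tokens in docs:
        original, added = expand(tokens)
        df.update(set(original) | set(added))
    return df


def tfc_vector(tokens, df, n_docs):
    """tf*idf weights, added terms scaled by .8, then cosine-normalized."""
    original, added = expand(tokens)
    raw = {}
    for term, tf in original.items():
        if df[term] > 0:
            raw[term] = tf * math.log(n_docs / df[term])
    for term, tf in added.items():
        if df[term] > 0:
            raw[term] = ADDED_TERM_FACTOR * tf * math.log(n_docs / df[term])
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return raw
    return {term: w / norm for term, w in raw.items()}


def similarity(u, v):
    """Inner product; equals the cosine because both vectors are unit length."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())


docs = [
    ["car", "bank"],
    ["automobile", "river"],
    ["river", "shore"],
]
df = document_frequencies(docs)
vectors = [tfc_vector(d, df, len(docs)) for d in docs]
query = tfc_vector(["automobile"], df, len(docs))
scores = [similarity(query, v) for v in vectors]
```

In this toy collection the first two documents share no original words, yet the query "automobile" scores above zero against both, because "car" and "automobile" are each added to the other's vector at the reduced weight.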