SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
Siemens Corporate Research, Inc.
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
Summary of method: Completely automatic vector matching where both document and query vectors have
been expanded using synonyms extracted from WordNet.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that
your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list
a. how many words in list?
571 word stopword list used (standard SMART stopword list)
2. is a controlled vocabulary used? no
3. stemming
a. standard stemming algorithms
which ones?
b. morphological analysis
Extremely simple suffix stripper to look words up in WordNet. (Checks for
one of 22 suffixes and possibly modifies end of stem if a matching suffix is
found. This was in code I inherited--I don't know the source of the suffix
list, but the list is a subset of that used by SMART, so it probably comes
from some "standard" algorithm.) All words also pass through the
"triestem" stemmer of SMART. This stemmer was originally based on
Lovins' CACM article, but has evolved over the years.
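The suffix-stripping lookup described above might be sketched roughly as follows. This is only an illustration under stated assumptions: the actual 22-suffix list is not given in the text, so SUFFIX_RULES below is a hypothetical stand-in, the `lexicon` set is a stand-in for WordNet, and trying each rule until a candidate is found in the lexicon is an assumed control flow, not the documented one.

```python
# Hypothetical (suffix, replacement) rules. The real system used 22 suffixes
# that are not listed in the text; these few are for illustration only.
SUFFIX_RULES = [
    ("ies", "y"),
    ("es", ""),
    ("s", ""),
    ("ing", ""),
    ("ing", "e"),
    ("ed", ""),
    ("ed", "e"),
]


def lookup_form(word, lexicon):
    """Return a form of `word` that is present in `lexicon`, or None.

    `lexicon` is any set-like collection of known words; here it stands in
    for WordNet, which is an assumption of this sketch.
    """
    word = word.lower()
    if word in lexicon:
        return word
    for suffix, replacement in SUFFIX_RULES:
        # Require at least one character to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix):
            # Strip the suffix and possibly modify the end of the stem.
            candidate = word[: -len(suffix)] + replacement
            if candidate in lexicon:
                return candidate
    return None


if __name__ == "__main__":
    toy_lexicon = {"query", "index", "retrieve"}
    for w in ["queries", "indexes", "retrieved", "Indexing", "xyz"]:
        print(w, "->", lookup_form(w, toy_lexicon))
```

In this sketch the stripper serves only to find a form that can be looked up; the separate "triestem" pass over all words is not modeled.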
4. term weighting
A tf*idf weight is used for both query and document terms, where the weight is
further normalized so that an inner product computation produces the cosine ("tfc"
weights using the terminology of "Term Weighting Approaches in Automatic Text
Retrieval" by Salton and Buckley). A term is counted as appearing in a document
(for idf purposes) if it was in the original text or if it was added as a synonym. The
tf*idf portion of an added term's weight is multiplied by .8 to produce its final
weight.
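As a rough illustration of the weighting just described, the sketch below computes cosine-normalized tf*idf weights with the .8 factor applied to added synonyms. Several details are assumptions not stated in the text: the idf form log(N/n), applying the .8 factor before normalization, the tf value assigned to an added synonym, and letting a term that is already in the original text keep its full weight. The function and parameter names are hypothetical.

```python
import math

SYNONYM_FACTOR = 0.8  # from the text: added terms keep .8 of their tf*idf


def tfc_vector(original_tf, added_tf, doc_freq, num_docs):
    """Build a unit-length tf*idf vector for one document or query.

    original_tf: {term: tf} for terms in the original text
    added_tf:    {term: tf} for terms added as synonyms (tf value is assumed)
    doc_freq:    {term: n}, counting a document if the term is original
                 or was added to it as a synonym
    num_docs:    collection size N
    """
    raw = {}
    for term, tf in original_tf.items():
        raw[term] = tf * math.log(num_docs / doc_freq[term])
    for term, tf in added_tf.items():
        if term not in raw:  # assumption: original occurrences take precedence
            raw[term] = SYNONYM_FACTOR * tf * math.log(num_docs / doc_freq[term])
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}


def inner_product(u, v):
    """Inner product of two sparse vectors; equals the cosine for unit vectors."""
    return sum(w * v[term] for term, w in u.items() if term in v)


if __name__ == "__main__":
    df = {"car": 10, "automobile": 10, "engine": 50}
    doc = tfc_vector({"car": 2, "engine": 1}, {"automobile": 2}, df, 1000)
    query = tfc_vector({"automobile": 1}, {"car": 1}, df, 1000)
    print(inner_product(doc, query))
```

Because every vector is scaled to unit length, the inner product of a document vector and a query vector is their cosine, which is the property the "tfc" normalization is meant to give.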
5. phrase discovery
a. what kind of phrase?
b. using statistical methods
c. using syntactic methods
WordNet contains collocations as members of synonym sets, so some phrases
may be added as synonyms. However, such a collocation is assigned a
unique concept number and will only match that exact collocation (so I
don't consider it to be "phrasing"). No other phrasing used.
6. syntactic parsing no
7. word sense disambiguation
No specific sense disambiguation procedure used. If a term occurs in more than one
WordNet synonym set (which, by definition, means that it is polysemous), the
synonyms from all of its senses may potentially be added to the vector. The