NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman
norm), the terms are assigned IDF/TF scores, and each word in the term is broken
out and assigned an independent IDF/TF score.
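The term- and word-level weighting just described can be sketched as follows. The smoothed logarithmic IDF and the set-of-units document representation are illustrative assumptions; the appendix does not give CLARIT's exact formula.

```python
import math
from collections import Counter

def tf_idf_scores(doc_terms, all_docs):
    """Assign a TF*IDF score to each multi-word term and, independently,
    to each of its component words (a hypothetical weighting scheme).
    `all_docs` is a list of sets of units (terms and words)."""
    n_docs = len(all_docs)

    def idf(unit):
        df = sum(1 for d in all_docs if unit in d)
        return math.log((n_docs + 1) / (df + 1))  # smoothed IDF

    # Break each term out into its component words as well,
    # so the words receive their own independent scores.
    units = list(doc_terms)
    for term in doc_terms:
        units.extend(term.split())

    tf = Counter(units)
    return {u: tf[u] * idf(u) for u in tf}
```

A word that occurs in every document (here, a distribution-wide word) receives an IDF of zero, while rarer component words and full terms keep positive weight.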
5. phrase discovery Yes
a. what kind of phrase?
Simplex noun phrases (= all modifiers and the head of the NP but no
determiners, quantifiers, or post-head-position modifying phrases or
clauses).
b. using statistical methods
No. NPs retained for thesaurus creation are scored using statistically-based
measures of expected `rarity' (based on component words), distribution,
frequency, and coverage. But NPs are not identified in texts based on
statistical parsing, for example.
c. using syntactic methods
Yes. NPs are discovered using a parser that implements a `heuristic'
grammar. In particular, following word-for-word morphological analysis
(resulting in a set of syntactic-category tags for each word encountered in
a text), the parser identifies the subsequences that form NPs. Identification
of NPs is based on rules that perform NP-boundary-condition tests.
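The tag-then-chunk pipeline for simplex NPs can be illustrated with a toy rule set. The tag inventory and the single boundary rule below are invented for illustration; CLARIT's actual heuristic grammar is more elaborate.

```python
def simplex_nps(tagged_words):
    """Collect maximal runs of adjective/noun tags that end in a noun:
    a toy approximation of simplex-NP boundary-condition tests
    (modifiers + head, no determiners or post-head modifiers).
    `tagged_words` is a list of (word, tag) pairs."""
    nps, current = [], []
    for word, tag in tagged_words:
        if tag in ("ADJ", "NOUN"):
            current.append((word, tag))
        else:
            # Boundary test: keep the run only if it is headed by a noun.
            if current and current[-1][1] == "NOUN":
                nps.append(" ".join(w for w, _ in current))
            current = []
    if current and current[-1][1] == "NOUN":
        nps.append(" ".join(w for w, _ in current))
    return nps
```

For example, `[("the","DET"), ("automatic","ADJ"), ("text","NOUN"), ("retrieval","NOUN"), ("is","VERB")]` yields the single simplex NP "automatic text retrieval": the determiner is excluded and the verb closes the phrase.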
6. syntactic parsing Yes (see above). A single-pass parser follows morphological analysis.
7. word sense disambiguation
No. No attempt is made to control for word senses in morphological or syntactic
analysis. As noted above, disambiguation of grammatical categories is facilitated by
restricting possible categories for selected items. In addition, absolute preferences
are established for grammatical categories appearing in noun phrases.
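The category-preference idea can be sketched as a simple ranked choice. The preference order below is an invented example, not CLARIT's actual list.

```python
# Invented preference order for tags inside noun phrases; the appendix
# does not list CLARIT's actual absolute preferences.
NP_TAG_PREFERENCE = ["NOUN", "ADJ", "VERB"]

def pick_np_category(possible_tags):
    """Resolve a grammatical-category ambiguity inside a noun phrase
    by absolute preference, with no word-sense analysis."""
    for tag in NP_TAG_PREFERENCE:
        if tag in possible_tags:
            return tag
    return possible_tags[0]  # fall back to the first candidate tag
```

Under this ordering, an ambiguous noun/verb word inside an NP is always read as a noun.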
8. heuristic associations
a. short definition of these associations
Yes. The principal relation the system currently uses is that of `similarity'
of terms. `Similarity' is determined by different procedures in different
contexts. For example, partial or `fuzzy' matching of terms is facilitated by
noting whether terms share words or attested subphrases. For example, in
vector-space modeling of documents, the contained words of all terms (in the
document vector as well as the query vector) are broken out, giving, in
effect, the possibility of matching parts of terms (though, technically, the
individual words are realized as independent dimensions of the space). In
addition, in nominating terms for inclusion in thesauri and in matching
terms to thesauri, CLARIT processing takes account of contained words and
attested subphrases.
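The word-level fuzzy matching of terms can be sketched as follows. The Jaccard-style overlap measure is an illustrative assumption; the appendix only states that shared words and attested subphrases are noted.

```python
def term_similarity(term_a, term_b):
    """Fuzzy-match two multi-word terms by their shared component words.
    Jaccard overlap is an illustrative choice, not CLARIT's documented
    measure; subphrase matching is omitted for brevity."""
    a, b = set(term_a.split()), set(term_b.split())
    return len(a & b) / len(a | b) if a | b else 0.0
```

Breaking terms into words this way is what allows a partial match: "retrieval system" overlaps "text retrieval system" even though the full terms differ.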
9. spelling checking (with manual correction) No
10. spelling correction No
11. proper noun identification algorithm
Yes/No. The system provides for identification of `candidate proper nouns' based
on morphological analysis. (Essentially, since the morphological analysis is virtually
exhaustive for English, words that cannot be mapped to specific lexical items are
given the provisional label "cpn"--`candidate proper noun'--and parsing proceeds
accordingly.) There is a facility in CLARIT for highly-reliable proper name
(including acronym) identification, but it was not used in this round of TREC
processing.
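The "cpn" labeling step can be sketched as a lexicon lookup. The tiny lexicon here stands in for CLARIT's near-exhaustive English lexicon, which is an assumption of scale only.

```python
def tag_candidate_proper_nouns(words, lexicon):
    """Label words that morphological analysis cannot map to any
    lexical item with the provisional tag 'cpn' (candidate proper
    noun), so that parsing can proceed over unknown words."""
    return [(w, "known" if w.lower() in lexicon else "cpn")
            for w in words]
```

Because the lookup is case-insensitive, only genuinely unknown words (not merely capitalized ones) fall through to the "cpn" label.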
12. tokenizer (recognizes dates, phone numbers, common patterns)
a. which patterns are tokenized?
Certain common abbreviations are included in the lexicon and, under
morphological processing, are rendered into normalized forms. The system
can utilize--and even partially discover--supplemental lexicons of
domain-specific abbreviations and other phrasal-lexical patterns, but this
facility was not used for TREC processing.
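The abbreviation normalization described above can be sketched as a table-driven rewrite. The entries below are invented examples; the actual CLARIT lexicon is not listed in the appendix.

```python
# Illustrative abbreviation lexicon; the real CLARIT entries are
# not given in the appendix.
ABBREVIATIONS = {
    "corp.": "corporation",
    "govt.": "government",
    "dept.": "department",
}

def normalize_tokens(tokens):
    """Render lexicon-listed abbreviations into their normalized
    forms, leaving all other tokens unchanged."""
    return [ABBREVIATIONS.get(t.lower(), t) for t in tokens]
```

A supplemental, domain-specific table could be merged into `ABBREVIATIONS` to model the optional facility the appendix mentions.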