NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
National Institute of Standards and Technology
Donna K. Harman
norm), the terms are assigned IDF/TF scores, and each word in the term is broken
out and assigned an independent IDF/TF score.
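The term- and word-level weighting just described can be sketched as follows. The smoothed logarithmic IDF and the set-of-units document representation are illustrative assumptions; the appendix does not give CLARIT's exact formula.

```python
import math
from collections import Counter

def tf_idf_scores(doc_terms, all_docs):
    """Assign a TF*IDF score to each multi-word term and, independently,
    to each of its component words (a hypothetical weighting scheme).
    `all_docs` is a list of sets of units (terms and words)."""
    n_docs = len(all_docs)

    def idf(unit):
        df = sum(1 for d in all_docs if unit in d)
        return math.log((n_docs + 1) / (df + 1))  # smoothed IDF

    # Break each term out into its component words as well,
    # so the words receive their own independent scores.
    units = list(doc_terms)
    for term in doc_terms:
        units.extend(term.split())

    tf = Counter(units)
    return {u: tf[u] * idf(u) for u in tf}
```

A word that occurs in every document (here, a distribution-wide word) receives an IDF of zero, while rarer component words and full terms keep positive weight.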
5. phrase discovery Yes
a. what kind of phrase?
Simplex noun phrases (= all modifiers and the head of the NP but no
determiners, quantifiers, or post-head-position modifying phrases or
clauses).
b. using statistical methods
No. NPs retained for thesaurus creation are scored using statistically-based
measures of expected `rarity' (based on component words), distribution,
frequency, and coverage. But NPs are not identified in texts based on
statistical parsing, for example.
c. using syntactic methods
Yes. NPs are discovered using a parser that implements a `heuristic'
grammar. In particular, following word-for-word morphological analysis
(resulting in a set of syntactic-category tags for each word encountered in
a text), the parser identifies the subsequences that form NPs. Identification
of NPs is based on rules that perform NP-boundary-condition tests.
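The tag-then-chunk pipeline for simplex NPs can be illustrated with a toy rule set. The tag inventory and the single boundary rule below are invented for illustration; CLARIT's actual heuristic grammar is more elaborate.

```python
def simplex_nps(tagged_words):
    """Collect maximal runs of adjective/noun tags that end in a noun:
    a toy approximation of simplex-NP boundary-condition tests
    (modifiers + head, no determiners or post-head modifiers).
    `tagged_words` is a list of (word, tag) pairs."""
    nps, current = [], []
    for word, tag in tagged_words:
        if tag in ("ADJ", "NOUN"):
            current.append((word, tag))
        else:
            # Boundary test: keep the run only if it is headed by a noun.
            if current and current[-1][1] == "NOUN":
                nps.append(" ".join(w for w, _ in current))
            current = []
    if current and current[-1][1] == "NOUN":
        nps.append(" ".join(w for w, _ in current))
    return nps
```

For example, `[("the","DET"), ("automatic","ADJ"), ("text","NOUN"), ("retrieval","NOUN"), ("is","VERB")]` yields the single simplex NP "automatic text retrieval": the determiner is excluded and the verb closes the phrase.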
6. syntactic parsing Yes (see above). A single-pass parser follows morphological analysis.
7. word sense disambiguation
No. No attempt is made to control for word senses in morphological or syntactic
analysis. As noted above, disambiguation of grammatical categories is facilitated by
restricting possible categories for selected items. In addition, absolute preferences
are established for grammatical categories appearing in noun phrases.
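The category-preference idea can be sketched as a simple ranked choice. The preference order below is an invented example, not CLARIT's actual list.

```python
# Invented preference order for tags inside noun phrases; the appendix
# does not list CLARIT's actual absolute preferences.
NP_TAG_PREFERENCE = ["NOUN", "ADJ", "VERB"]

def pick_np_category(possible_tags):
    """Resolve a grammatical-category ambiguity inside a noun phrase
    by absolute preference, with no word-sense analysis."""
    for tag in NP_TAG_PREFERENCE:
        if tag in possible_tags:
            return tag
    return possible_tags[0]  # fall back to the first candidate tag
```

Under this ordering, an ambiguous noun/verb word inside an NP is always read as a noun.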
8. heuristic associations
a. short definition of these associations
Yes. The principal relation the system currently uses is that of `similarity'
of terms. `Similarity' is determined by different procedures in different
contexts. For example, partial or `fuzzy' matching of terms is facilitated by
noting whether terms share words or attested subphrases. For example, in
vector-space modeling of documents, the contained words of all terms (in the
document vector as well as the query vector) are broken out, giving, in
effect, the possibility of matching parts of terms (though, technically, the
individual words are realized as independent dimensions of the space). In
addition, in nominating terms for inclusion in thesauri and in matching
terms to thesauri, CLARIT processing takes account of contained words and
attested subphrases.
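The word-level fuzzy matching of terms can be sketched as follows. The Jaccard-style overlap measure is an illustrative assumption; the appendix only states that shared words and attested subphrases are noted.

```python
def term_similarity(term_a, term_b):
    """Fuzzy-match two multi-word terms by their shared component words.
    Jaccard overlap is an illustrative choice, not CLARIT's documented
    measure; subphrase matching is omitted for brevity."""
    a, b = set(term_a.split()), set(term_b.split())
    return len(a & b) / len(a | b) if a | b else 0.0
```

Breaking terms into words this way is what allows a partial match: "retrieval system" overlaps "text retrieval system" even though the full terms differ.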
9. spelling checking (with manual correction) No
10. spelling correction No
11. proper noun identification algorithm
Yes/No. The system provides for identification of `candidate proper nouns' based
on morphological analysis. (Essentially, since the morphological analysis is virtually
exhaustive for English, words that cannot be mapped to specific lexical items are
given the provisional label "cpn"--`candidate proper noun'--and parsing proceeds
accordingly.) There is a facility in CLARIT for highly-reliable proper name
(including acronym) identification, but it was not used in this round of TREC
processing.
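The "cpn" labeling step can be sketched as a lexicon lookup. The tiny lexicon here stands in for CLARIT's near-exhaustive English lexicon, which is an assumption of scale only.

```python
def tag_candidate_proper_nouns(words, lexicon):
    """Label words that morphological analysis cannot map to any
    lexical item with the provisional tag 'cpn' (candidate proper
    noun), so that parsing can proceed over unknown words."""
    return [(w, "known" if w.lower() in lexicon else "cpn")
            for w in words]
```

Because the lookup is case-insensitive, only genuinely unknown words (not merely capitalized ones) fall through to the "cpn" label.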
12. tokenizer (recognizes dates, phone numbers, common patterns)
a. which patterns are tokenized?
Certain common abbreviations are included in the lexicon and, under
morphological processing, are rendered into normalized forms. The system
can utilize--and even partially discover--supplemental lexicons of
domain-specific abbreviations and other phrasal-lexical patterns, but this
facility was not used for TREC processing.
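The abbreviation normalization described above can be sketched as a table-driven rewrite. The entries below are invented examples; the actual CLARIT lexicon is not listed in the appendix.

```python
# Illustrative abbreviation lexicon; the real CLARIT entries are
# not given in the appendix.
ABBREVIATIONS = {
    "corp.": "corporation",
    "govt.": "government",
    "dept.": "department",
}

def normalize_tokens(tokens):
    """Render lexicon-listed abbreviations into their normalized
    forms, leaving all other tokens unchanged."""
    return [ABBREVIATIONS.get(t.lower(), t) for t in tokens]
```

A supplemental, domain-specific table could be merged into `ABBREVIATIONS` to model the optional facility the appendix mentions.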