SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Appendix C: System Features
appendix
National Institute of Standards and Technology
Donna K. Harman
System Summary and Timing
Carnegie Mellon University
General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do your best.
I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that
your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list
No. But the NLP/morphological-analysis components of the system do limit the
possible lexical categories of some English words to eliminate useless ambiguities.
For example, "but" is given lexical category "cnj" (conjunction) and not alternative,
possible categories, such as "sn" (singular.noun); "can" is limited to category
"auxm" (modal.auxiliary.verb) and not "sn"; etc. Such selective restrictions have
some of the effects of "stop-word" lists, since spurious (or irrelevant) categories will
not enter into later indexing stages.
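The effect of such category restrictions can be illustrated with a minimal
sketch. It is not the CLARIT implementation: the lexicon entries, the
RESTRICTIONS table, the "v" label, and the function name candidate_categories
are hypothetical. Only the labels "cnj", "auxm", and "sn" come from the answer
above.

```python
# Hypothetical illustration only: the tables and names below are assumptions,
# not CLARIT's actual data structures.

# Unrestricted lexicon: every category a word could take in principle.
FULL_LEXICON = {
    "but": {"cnj", "sn"},
    "can": {"auxm", "sn"},
    "report": {"sn", "v"},  # "v" is an assumed label for verbs
}

# Selective restrictions: words pinned to a single lexical category.
RESTRICTIONS = {
    "but": {"cnj"},
    "can": {"auxm"},
}


def candidate_categories(word):
    """Return the lexical categories passed on to later stages for a word."""
    word = word.lower()
    if word in RESTRICTIONS:
        return RESTRICTIONS[word]
    return FULL_LEXICON.get(word, set())
```

With these assumed tables, "can" never reaches later stages as a singular
noun, while an unrestricted word such as "report" keeps all of its categories.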
Furthermore, the NLP/parsing components of the system return simplex noun
phrases (NPs) as candidate terms in which some components of the NP are
eliminated, such as quantifiers (e.g., "many", "one", etc.), determiners (e.g., "the",
"a", etc.), and conjunctions (e.g., "and", "or", etc.). In addition, in normal CLARIT
NP processing, the parser does not return prepositions, non-NP adverbs, and
extra-NP elements. This practice, therefore, also has the effect of eliminating items
that normally appear on "stop-word" lists. It clearly goes beyond that practice in
eliminating all extra-NP words as well.
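As a rough sketch of how dropping NP-internal function words yields a
candidate term, assume the parser hands back a simplex NP as a list of
(word, category) pairs. The input format, the labels "qnt", "det", and "pn",
and the function name np_to_term are assumptions made for illustration; the
text above does not specify how the parser represents its output.

```python
# Hypothetical illustration only: the category labels "qnt" and "det" and the
# (word, category) input format are assumptions, not CLARIT's actual interface.

# Categories removed from inside a simplex noun phrase.
DROPPED_CATEGORIES = {"qnt", "det", "cnj"}


def np_to_term(tagged_np):
    """Build a candidate term from a tagged NP, skipping dropped categories."""
    kept = [word for word, category in tagged_np
            if category not in DROPPED_CATEGORIES]
    return " ".join(kept)


# Example: a tagged NP for "the many retrieval systems".
example_np = [
    ("the", "det"),
    ("many", "qnt"),
    ("retrieval", "sn"),
    ("systems", "pn"),  # "pn" is an assumed label for plural nouns
]
term = np_to_term(example_np)
```

Here term holds "retrieval systems": the determiner and quantifier never reach
the indexing stage, much as if they had been on a stop-word list.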
a. how many words in list?
Approximately 1([OCRerr]() lexical items have been given restrictive syntactic
treatment, in addition to the words with unambiguously empty categories.
2. is a controlled vocabulary used? No
3. stemming No
a. standard stemming algorithms
which ones?
b. morphological analysis
Yes. The Morph component of the system provides for comprehensive
inflectional.morphological analysis. In practice, the morph-normal form of
nouns and adjectives is used in the NP-based terms of the system.
Participles are not morphologically reduced (though it is possible to do so).
Derivational.morphological analysis is not used. A lexicon of approximately
[OCRerr] 'root-form' items (English words) is the principal resource used by
Morph in addition to its morphological rule set.
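The combination of a root-form lexicon with a rule set can be sketched as
follows. This is a toy illustration under stated assumptions, not the Morph
component itself: the four-word lexicon, the suffix rules, and the function
name normalize are all hypothetical.

```python
# Hypothetical illustration only: the lexicon, rules, and names below are
# assumptions and do not reproduce CLARIT's Morph component.

# Tiny stand-in for the root-form lexicon.
ROOT_LEXICON = {"index", "query", "system", "small"}

# (suffix, replacement) pairs for inflectional endings of nouns and adjectives.
# No "-ing" or "-ed" rules are included, so participles are left unreduced.
INFLECTION_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("est", ""), ("er", "")]


def normalize(word):
    """Return the root form if a rule maps the word onto a lexicon entry."""
    word = word.lower()
    if word in ROOT_LEXICON:
        return word
    for suffix, replacement in INFLECTION_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            # Accept a rule only when its result is a known root form.
            if candidate in ROOT_LEXICON:
                return candidate
    return word  # unchanged when no rule yields a lexicon entry
```

Checking each candidate against the lexicon keeps the rules from stripping
endings blindly, which is the main difference from a purely suffix-based
stemmer.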
4. term weighting
Yes/No. The CLARIT process uses NLP to identify candidate terms en route to
indexing, development of associated resources (e.g., thesauri), and analysis of queries
or topics. These are taken as the 'information units' of interest and are analyzed
statistically and heuristically. 'Weights' are associated with terms at various stages
of processing. In indexing TREC documents, for example, an IDF/TF score was
associated with terms for each document. In the case of multi-word terms (the