NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Appendix C: System Features

National Institute of Standards and Technology
Donna K. Harman

System Summary and Timing
Carnegie Mellon University

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)

A. Which of the following were used to build your data structures?

1. stopword list

No. But the NLP/morphological-analysis components of the system do limit the possible lexical categories of some English words to eliminate useless ambiguities. For example, "but" is given lexical category "cnj" (conjunction) and not alternative, possible categories, such as "sn" (singular.noun); "can" is limited to category "auxm" (modal.auxiliary.verb) and not "sn"; etc. Such selective restrictions have some of the effects of "stop-word" lists, since spurious (or irrelevant) categories will not enter into later indexing stages.

Furthermore, the NLP/parsing components of the system return simplex noun phrases (NPs) as candidate terms in which some components of the NP are eliminated, such as quantifiers (e.g., "many", "one", etc.), determiners (e.g., "the", "a", etc.), and conjunctions (e.g., "and", "or", etc.). In addition, in normal CLARIT NP processing, the parser does not return prepositions, non-NP adverbs, and extra-NP elements. This practice, therefore, also has the effect of eliminating items that normally appear on "stop-word" lists.
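The category restriction and NP filtering described in this answer can be illustrated with a minimal sketch. It is not CLARIT code: the function names, the data layout, and the category labels "quant" and "det" are assumptions made for illustration; only "cnj", "sn", and "auxm" appear in the text above.

```python
# Hypothetical sketch. Only the labels "cnj", "sn" and "auxm" come from the
# questionnaire answer; "quant" and "det" are assumed labels for illustration.

# Ambiguous words restricted to a single lexical category.
RESTRICTED = {"but": "cnj", "can": "auxm"}

# Categories removed from inside a simplex noun phrase (assumed labels).
DROPPED_IN_NP = {"quant", "det", "cnj"}


def restrict(word, candidates):
    """Return the categories allowed for a word after restriction."""
    forced = RESTRICTED.get(word.lower())
    return [forced] if forced else list(candidates)


def np_term(tagged_np):
    """Build a candidate term from a tagged simplex NP, dropping
    quantifiers, determiners and conjunctions."""
    kept = [word.lower() for word, cat in tagged_np if cat not in DROPPED_IN_NP]
    return " ".join(kept)
```

For instance, `restrict("but", ["cnj", "sn"])` keeps only "cnj", and `np_term([("the", "det"), ("one", "quant"), ("retrieval", "sn"), ("system", "sn")])` yields the candidate term "retrieval system". The words usually found on a stop-word list never reach the indexing stage, although no such list is consulted.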
It clearly goes beyond that practice in eliminating all extra-NP words as well.

a. how many words in list?

Approximately 1[illegible] lexical items have been given restrictive syntactic treatment, in addition to the words with unambiguously empty categories.

2. is a controlled vocabulary used?

No

3. stemming

No

a. standard stemming algorithms
which ones?

b. morphological analysis

Yes. The Morph component of the system provides for comprehensive inflectional-morphological analysis. In practice, the morph-normal form of nouns and adjectives is used in the NP-based terms of the system. Participles are not morphologically reduced (though it is possible to do so). Derivational-morphological analysis is not used. A lexicon of approximately [illegible] 'root-form' items (English words) is the principal resource used by Morph in addition to its morphological rule set.

4. term weighting

Yes/No. The CLARIT process uses NLP to identify candidate terms en route to indexing, development of associated resources (e.g., thesauri), and analysis of queries or topics. These are taken as the 'information units' of interest and are analyzed statistically and heuristically. 'Weights' are associated with terms at various stages of processing. In indexing TREC documents, for example, an IDF/TF score was associated with terms for each document. In the case of multi-word terms (the