NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
CLARIT TREC Design, Experiments, and Results
D. Evans
R. Lefferts
G. Grefenstette
S. Handerson
W. Hersh
A. Archbold
National Institute of Standards and Technology
Donna K. Harman
A sample of a document after morphological analysis is given in Figure 7. A sample of the
same document after simplex-NP extraction is given in Figure 8. Note that "owned holding
company" and "$295" or "$370" are treated as NPs along with legitimate phrases like
"computerized industrial control system". While CLARIT does have facilities to discover and
eliminate inappropriate participles (such as "owned" in isolation) and can recognize nonce
adjectives, such as "state-owned", such processing was not employed in the TREC tasks. Hence,
the correct expression, "Italian state-owned holding company", was not found or used in this
case. In addition, as noted previously, the CLARIT-TREC system did not `tokenize' company
names, dates, or other `regular-expression'-like phrases; there was no time in our schedule
for such processing.
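The participle-elimination facility described above can be illustrated with a minimal sketch. This is not the CLARIT implementation: the participle list, the NP representation (plain strings), and the filtering rule are all assumptions made for illustration.

```python
# Illustrative sketch only: drop candidate simplex NPs that consist of a
# single bare participle (e.g. "owned" in isolation), while keeping
# multi-word phrases. The participle list below is an invented example.

BARE_PARTICIPLES = {"owned", "based", "related"}  # assumed example list

def filter_simplex_nps(nps):
    """Keep each NP unless it is a lone bare participle."""
    kept = []
    for np in nps:
        words = np.split()
        if len(words) == 1 and words[0] in BARE_PARTICIPLES:
            continue  # e.g. "owned" extracted in isolation
        kept.append(np)
    return kept

print(filter_simplex_nps(
    ["owned", "computerized industrial control system"]))
```

Applied to the examples in the text, such a filter would discard "owned" while retaining "computerized industrial control system".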
All NLP (and other) processing steps were piped through the system; intermediate files
were not retained. The parsed representation of all the texts took up approximately 98% of the
space occupied by the original text. Intermediate (but unretained) files generated in CLARIT
processing included a file of the words in each text, in their original order, annotated with
morphological categories. Other files contained the output of the parser as a list of NPs in the
order in which they occurred in each text. The parsed representation of the text was retained
and used at all subsequent steps of processing. Indeed, hereafter, unless otherwise specified,
any reference to a document or collection of documents refers to the CLARIT representation
of the text, viz., a sequence of normalized simplex NPs.15
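The retained representation can be sketched as follows. The field names, the sample phrases, and the normalization step (lowercasing and whitespace collapsing) are assumptions standing in for CLARIT's actual normalization, which the text does not specify in detail.

```python
# Minimal sketch: a document reduced to an ordered sequence of normalized
# simplex NPs. The "docno" identifier and the normalization used here are
# illustrative assumptions, not the CLARIT internals.

def normalize_np(np):
    """Lowercase and collapse whitespace; a stand-in for NP normalization."""
    return " ".join(np.lower().split())

doc = {
    "docno": "WSJ-EXAMPLE",  # hypothetical document identifier
    "nps": [normalize_np(np) for np in
            ["Italian State", "Holding Company",
             "Computerized  Industrial Control System"]],
}
print(doc["nps"])
```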
4.3 Identifying Terms from Topics
All fields of topic statements, such as the one given in Figure 9, were similarly processed for NPs.
Team members reviewed the NPs and assigned weights of "1", "2", or "3" to each NP according
to whether the term was central or peripheral to the topic. (Some extracted NPs were discarded
as irrelevant or ill-formed; the vast majority were retained.) A sample set of weighted terms
for the topic in Figure 9 is given in Figure 10. The manual review and weighting of terms from
the topic statement took less than 5 minutes per topic. All subsequent processing of the query
was performed automatically.
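The weighting step above can be pictured with a small sketch. The paper does not state which end of the 1–3 scale marks a central term, so the orientation below (3 = central) is an assumption, as are the sample terms.

```python
# Hedged illustration of manual topic-term weighting: each retained NP gets
# a weight of 1, 2, or 3. Orientation (3 = central) and terms are assumed.

weighted_terms = {
    "industrial control system": 3,  # central to the topic (assumed)
    "holding company": 2,
    "annual report": 1,              # peripheral (assumed)
}

# NPs judged irrelevant or ill-formed are simply not entered.
for term, w in sorted(weighted_terms.items(), key=lambda kv: -kv[1]):
    print(w, term)
```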
4.4 Establishing Sets of `Relevant' Documents
Given the need to `evoke' candidate documents and to `partition' the database into subsets
that were easier to manage, we were naturally interested in identifying features in the topics
that would be useful as discriminators. We had little confidence, however, that the specific
terms in topics, which constitute the "source query", were either most representative of the
domain of the topic (= the `satisfaction class') or reasonably comprehensive. We thus decided
to supplement the source query with additional terms.
In particular, we used the CLARIT thesaurus-discovery technique on known relevant
documents to identify terminology that might better represent the satisfaction-class
documents than the source query alone. The process produced a list of terms from the
available topic-relevant documents (or from a small sample of relevant documents that we may
have found) and automatically nominated the top-ranked terms (approximately 20%) to
supplement the original query (as derived from the topic statement), producing a
"routing/partitioning thesaurus" for the topic.
Since the routing topics already had accompanying relevant documents, we used these as
a source of additional terminology. Ad-hoc queries, on the other hand, had no associated
relevant documents, so we designed a preliminary, partial `retrieval' step that would help us
15From the point of view of the CLARIT system, the information in a document is entirely represented by the
extracted noun phrases.