NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Design and Evaluation of the CLARIT-TREC-2 System
D. Evans, R. Lefferts
National Institute of Standards and Technology, D. K. Harman

2.2.2 Query Augmentation

CLARIT thesaurus discovery is a process that can identify a set of core, representative terminology in a collection of documents. If the documents are relatively homogeneous topically, the resulting `first-order' thesaurus can be regarded as a broad `signature' of the topic of the collection. Typically, the procedure is most reliable when the collection is large (2 megabytes or more of text). In the CLARIT-TREC-2 system, however, thesaurus discovery was used on the relatively small collections of relevant documents for each topic. For each routing topic, the document collection used to establish a first-order thesaurus consisted of the training set of relevant documents for that topic; for each ad-hoc topic, it consisted of a set of the top-ranked sub-documents from among the documents automatically retrieved via an initial pass of querying over the corpus. Terms in the discovered thesaurus were used to supplement the terms in the source query (topic). Thus, in the case of ad-hoc topics, the procedure represents an approach to query augmentation based on fully automatic feedback.

2.2.3 Text Analysis Heuristics

Texts can have complex and interesting discourse structure, which often provides clues about the `topic(s)' and the `important' information in a document. In general, it is very difficult to exploit text-structure information reliably in retrieval tasks over large collections of heterogeneous documents. Nevertheless, the CLARIT-TREC-2 system applied two simple processing techniques to topics and corpus documents to attempt to capture information encoded in rhetorical structure.

First, all training and corpus documents were divided into paragraph-sized units called "sub-documents".
This procedure is sensitive both to the `normal' demarcation of paragraphs (successive blank lines) and to the total length of the text (measured in numbers of sentences). After documents are partitioned into sub-documents, the sub-document is taken as the basic unit for subsequent processing, viz., in collecting statistics and in scoring retrieval.

Second, terms extracted from a topic description were assigned importance coefficients based on their locations in the topic text. Terms found in the first paragraph were given a weight of 3; terms in the second paragraph, 2; and all other terms, 1.

Of course, both techniques exploit possibly idiosyncratic characteristics of the TREC-2 processing task. The use of scoring over sub-documents is clearly sensitive to the TREC definition of document relevancy, viz., that a document is relevant, regardless of length, if it contains a single relevant sentence. Furthermore, automatic term-importance weighting is possible only because of the formal discourse structure of TREC topics; it would not necessarily apply to other presentations of topics or queries and certainly would not apply to free text in general.

We observe, however, that there is no one set of techniques that will perform optimally in every information-retrieval (IR) situation. We emphasize, instead, that one important measure of a system's utility and adaptability is its ability to take advantage of exploitable features in a given IR task.

2.3 System Performance Notes

All CLARIT-TREC-2 processing took place on DEC 3000/400 (Alpha AXP) workstations running DEC OSF/1. One system had 128 megabytes of RAM, one 64 megabytes, and two 32 megabytes. Realized performance was considerably slower than would be expected given the clock rate (133.33 MHz) of the DEC 3000/400's CPU. In fact, because of suboptimal compilers for the 64-bit architecture, performance was perhaps a factor of two slower than the maximum possible.
One pass over the entire TREC-2 collection for 50 topics (processed simultaneously) required approximately four hours, running on the four machines in parallel. The processing of a single topic is proportionally faster, requiring approximately 20 minutes on a single machine.

3 Results

The CLARIT team processed all the topics in both the "routing" and "ad-hoc" categories and worked with the full set of data. Two sets of results were submitted for each category, corresponding to the "manual" ("CLARTM") and fully "automatic" ("CLARTA") processing approaches taken with the topics.

3.1 General Summary of Official Results

Table 1 gives the official CLARIT-TREC-2 system routing results as reported by NIST. A graph of the precision-recall curves for the two sets of results is given in Figure 2. The total number of relevant documents retrieved under the routing task was 6,785 (CLARTM) and 6,811 (CLARTA), representing, respectively, 64.69% and 64.93% of the total known relevant documents (10,489).
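The query-augmentation step of Section 2.2.2 can be illustrated with a minimal sketch. This is not the actual CLARIT thesaurus-discovery procedure, which relies on NLP-based term extraction; the sketch below is a simplified stand-in that treats the most frequent content tokens across a set of relevant documents as a crude "first-order thesaurus" and appends them to the source query. The stop-word list and the `size`/`n_new` parameters are illustrative assumptions, not values from the paper.

```python
import re
from collections import Counter

def first_order_thesaurus(relevant_docs, size=20):
    """Crude 'first-order thesaurus': the most frequent content
    terms across a set of topically related documents.
    (Simplified stand-in for CLARIT thesaurus discovery; the real
    system extracts terms with NLP, not raw tokenization.)"""
    stop = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}
    counts = Counter()
    for doc in relevant_docs:
        counts.update(t for t in re.findall(r"[a-z]+", doc.lower())
                      if t not in stop)
    return [term for term, _ in counts.most_common(size)]

def augment_query(query_terms, relevant_docs, n_new=10):
    """Supplement the source query with discovered thesaurus terms."""
    discovered = first_order_thesaurus(relevant_docs)
    extra = [t for t in discovered if t not in query_terms][:n_new]
    return list(query_terms) + extra
```

For a routing topic, `relevant_docs` would be the training set of relevant documents; for an ad-hoc topic, the top-ranked sub-documents from an initial retrieval pass, making the augmentation fully automatic feedback.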
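The two text-analysis heuristics of Section 2.2.3 can be sketched as follows. The paper does not give the exact sentence threshold for splitting long paragraphs, nor the precise tokenization; `max_sentences` and the whitespace tokenizer below are hypothetical stand-ins, but the weighting scheme (3/2/1 by paragraph position) is as described.

```python
import re

def split_subdocuments(text, max_sentences=8):
    """Split a document into paragraph-sized 'sub-documents'.

    Paragraphs are delimited by successive blank lines; a long
    paragraph is further split once it exceeds max_sentences
    (a hypothetical threshold -- the paper notes only that the
    procedure is sensitive to length in sentences)."""
    subdocs = []
    for para in re.split(r"\n\s*\n", text.strip()):
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", para.strip()) if s]
        for i in range(0, len(sentences), max_sentences):
            subdocs.append(" ".join(sentences[i:i + max_sentences]))
    return subdocs

def term_weights(topic_text):
    """Assign importance coefficients by paragraph position:
    first paragraph -> 3, second -> 2, all others -> 1."""
    weights = {}
    for idx, para in enumerate(re.split(r"\n\s*\n", topic_text.strip())):
        w = 3 if idx == 0 else 2 if idx == 1 else 1
        for term in para.lower().split():
            # keep the highest weight seen for a repeated term
            weights[term] = max(weights.get(term, 0), w)
    return weights
```

Sub-documents, rather than whole documents, then become the units over which collection statistics are gathered and retrieval scores are computed.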
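The recall figures reported in Section 3.1 follow directly from the retrieved and known-relevant counts; a quick arithmetic check:

```python
# Known relevant documents for the routing task, per Section 3.1.
known_relevant = 10489

# Relevant documents retrieved by each submitted run.
for run, retrieved in (("CLARTM", 6785), ("CLARTA", 6811)):
    pct = 100 * retrieved / known_relevant
    print(f"{run}: {pct:.2f}% of known relevant documents")
```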