Using TREC-2 processing results as a baseline, we have begun exploring several of the parameters of query augmentation in the CLARIT-TREC-2 system. The first set of experiments focuses on three parameters:

* The use of query augmentation compared to a procedure involving no query augmentation;
* The size of relevant-document samples as source collections for thesaurus extraction; and
* The `threshold' chosen for the thesaurus-discovery process.

5.1 Query Augmentation vs. No Augmentation

In order to verify that CLARIT query-augmentation techniques do have a positive effect on retrieval results, we have compared the official TREC-2 results against experimental runs that do not take advantage of any augmentation. For the routing queries, we re-ran the routing task using only the raw queries as vectors (consisting of terms from the topic statement only). The effects for manual and automatic modes are shown in Figure 7. In the case of the ad-hoc queries, the unaugmented results are simply the results of the initial round of querying, before automatic feedback has taken place. The effects for both modes are shown in Figure 8.

Query augmentation using known-relevant documents, as in the routing experiment, has a dramatic effect on system performance. The manual routing task showed a 69% improvement with augmentation, while automatic routing improved by 76%. Figure 9 shows the effect of query augmentation for individual routing topics, calculated as the difference in average precision between augmented and unaugmented queries. Note that very few queries fail to improve with augmentation; most show great improvement.

Query augmentation is not as effective when used for automatic feedback in processing ad-hoc queries, but it still results in substantial improvements. For manual ad-hoc queries, we found a 21% improvement when using query augmentation; for automatic ad-hoc queries, a 22% improvement. Obviously, the use of known-relevant documents in the case of routing topics has an impact on query augmentation, but it is not the only important factor. The query-by-query effect of augmentation for ad-hoc queries is shown in Figure 10. While the effect is not as great as with routing topics, most queries show improvement with augmentation. By comparing the query-by-query results for automatic and manual modes, one can see that manual intervention in query formulation improves the results of augmentation in specific cases, such as query 121, but the positive effect is difficult to predict. Even without user review, we can have confidence in the ability of query augmentation to improve query results, given reasonably accurate initial query formulation.

We believe that the techniques we used in CLARIT-TREC-2 automatic feedback can be refined to give better documents as input to query augmentation. First, due to engineering and time constraints in TREC-2 processing, we did not select the best sub-documents from the entire corpus. Instead, we selected the best sub-documents from a pool of the best relevant documents, which might not always correspond to the optimal set of sub-documents for query augmentation. Second, the TREC-2 process used an absolute number of sub-documents (the top N) in query augmentation, regardless of the `similarity' scores of those sub-documents to the source query. While it may not be possible to determine `absolute' relevance on a query-by-query basis, minimum thresholds might be applied to exclude clearly irrelevant sub-documents, should any fall within the top N otherwise used in augmentation. The sketch below illustrates such a selection step.
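To make this selection step concrete, the following is a minimal sketch of ranking sub-documents against a raw topic vector and keeping the top N, with an optional minimum-similarity floor of the kind proposed above. The names and scoring here (select_subdocuments, min_similarity, term-frequency vectors with cosine similarity) are illustrative assumptions, not the CLARIT implementation, which builds its vectors from NLP-extracted terms. With min_similarity left at zero, the routine reduces to the absolute top-N selection used in our TREC-2 runs and in the source-text sampling of Section 5.2.

    from collections import Counter
    from math import sqrt

    def vectorize(text):
        # Simple term-frequency vector over whitespace tokens; CLARIT's
        # own vectors are built from NLP-extracted terms and phrases.
        return Counter(text.lower().split())

    def cosine(u, v):
        # Cosine similarity between two sparse term vectors.
        dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
        norm = (sqrt(sum(c * c for c in u.values()))
                * sqrt(sum(c * c for c in v.values())))
        return dot / norm if norm else 0.0

    def select_subdocuments(topic_text, sub_documents, n=500,
                            min_similarity=0.0):
        # Rank sub-documents (e.g., paragraphs of known-relevant
        # documents) against the raw topic vector; keep the top n,
        # dropping any below the similarity floor.  If fewer than n
        # sub-documents exist, all sufficiently similar ones are kept.
        topic = vectorize(topic_text)
        scored = sorted(((cosine(topic, vectorize(sd)), sd)
                         for sd in sub_documents),
                        key=lambda pair: pair[0], reverse=True)
        return [sd for score, sd in scored[:n] if score >= min_similarity]

Setting min_similarity above zero realizes the proposed threshold refinement; using n alone reproduces the fixed top-N behavior of the submitted runs.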
5.2 Source Sample Size

In the case of the submitted results of CLARIT-TREC-2 processing, the thesaurus for each routing query was extracted from the set of all its known-relevant training documents. However, we can imagine that many such documents contain some information, perhaps a great deal, that is not relevant to the topic at hand. (This is especially expected with long documents.) Therefore, we have experimented with alternative approaches to sampling text from known-relevant documents. In particular, we have used a ranking of sub-documents (paragraphs) to nominate candidate `good' relevant texts and have used variable numbers of sub-documents as source text in thesaurus extraction.

In practice, to select source text for the thesaurus-discovery process for a topic, we run the raw topic vector as a query over the collection of relevant documents, which are partitioned into sub-documents, and return only the top N relevant sub-documents. (In cases where the topic has fewer than N sub-documents in the collection of relevants, all sub-documents are selected.) Our hypothesis is that thesauri extracted from relevant sub-documents will contain greater numbers of `true-positive' terms related to the topic. We have run the experiment for N = 100, 200, 300, 500, and 1000 as the cutoff for selected sub-documents.

Figure 11 gives a sample of the results for routing topics, including the best case of 500 sub-documents. While the differences for different N are not great, certain trends are evident. First, it is clear that using only the more similar sub-documents from a collection of relevants gives better results than using all relevant full documents. (This represents an important im-