Using TREC-2 processing results as a baseline, we have begun exploring several of the parameters of query augmentation in the CLARIT-TREC-2 system. The first set of experiments focuses on three parameters:
* The use of query augmentation compared to a procedure involving no query augmentation;
* The size of relevant document samples as source collections for thesaurus extraction; and
* The `threshold' chosen for the thesaurus discovery process.
5.1 Query Augmentation vs. No Augmentation
In order to verify that CLARIT query augmentation techniques do have a positive effect on retrieval results, we have compared the official TREC-2 results against experimental runs that do not take advantage of any augmentation. For the routing queries, we re-ran the routing task using only the raw queries as vectors (consisting of terms from the topic statement only). The effects for manual and automatic modes are shown in Figure 7. In the case of the ad-hoc queries, the unaugmented results are simply the results of the initial round of querying, before automatic feedback has taken place. The effects for both modes are shown in Figure 8.
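To make the unaugmented condition concrete, the following sketch shows one way to score documents against a raw topic vector. It is illustrative only: it assumes whitespace tokenization, term-frequency vectors, and cosine similarity, whereas the actual CLARIT term extraction and weighting (NLP-based noun-phrase processing) are not reproduced here.

    import math
    from collections import Counter

    def raw_query_vector(topic_statement):
        # Term-frequency vector over the topic-statement terms. Whitespace
        # tokenization is an assumption standing in for CLARIT's NLP-based
        # term extraction.
        return Counter(topic_statement.lower().split())

    def cosine(query_vec, doc_vec):
        # Cosine similarity between two sparse term-frequency vectors.
        dot = sum(query_vec[t] * doc_vec[t] for t in query_vec)
        q_norm = math.sqrt(sum(v * v for v in query_vec.values()))
        d_norm = math.sqrt(sum(v * v for v in doc_vec.values()))
        return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0

    def rank_unaugmented(topic_statement, documents):
        # Rank documents against the raw (unaugmented) topic vector alone.
        query = raw_query_vector(topic_statement)
        scored = [(cosine(query, Counter(d.lower().split())), d)
                  for d in documents]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored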
Query augmentation using known-relevant documents, as in the routing experiment, has a dramatic effect on system performance. The manual routing task showed a 69% improvement with augmentation, while automatic routing improved by 76%. Figure 9 shows the effect of query augmentation for individual routing topics, calculated as the difference in average precision between augmented and unaugmented queries. Note that very few queries do not improve with augmentation; most show great improvement.
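The per-topic differences plotted in Figure 9 amount to a subtraction of average-precision scores. A minimal sketch, assuming ranked result lists and relevance judgments keyed by topic id (hypothetical structures, not CLARIT's internal format):

    def average_precision(ranked_ids, relevant_ids):
        # Standard (non-interpolated) average precision over a ranked list.
        hits, total = 0, 0.0
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                hits += 1
                total += hits / rank
        return total / len(relevant_ids) if relevant_ids else 0.0

    def augmentation_effect(augmented_runs, raw_runs, qrels):
        # Per-topic difference in average precision (augmented minus
        # unaugmented), the quantity plotted in Figure 9. All three
        # arguments are hypothetical dicts keyed by topic id.
        return {topic: average_precision(augmented_runs[topic], qrels[topic])
                       - average_precision(raw_runs[topic], qrels[topic])
                for topic in augmented_runs}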
Query augmentation is not as effective when used for automatic feedback in processing ad-hoc queries, but it still results in substantial improvements. For manual ad-hoc queries, we found a 21% improvement when using query augmentation; for automatic ad-hoc queries, a 22% improvement. Obviously, the use of known-relevant documents in the case of routing topics has an impact on query augmentation, but it is not the only important factor. The query-by-query effect of augmentation for ad-hoc queries is shown in Figure 10. While the effect is not as great as with routing topics, most queries show improvement with augmentation. By comparing the query-by-query results for automatic and manual modes, one can see that manual intervention in query formulation improves the results of augmentation in specific cases, such as query 121, but the positive effect is difficult to predict. Even without user review, we can have confidence in the ability of query augmentation to improve query results, given reasonably accurate initial query formulation.
We believe that the techniques we used in CLARIT-TREC-2 automatic feedback can be refined to give better documents as input to query augmentation. First, due to engineering and time constraints in TREC-2 processing, we did not select the best sub-documents from the entire corpus. Instead, we selected the best sub-documents from a pool of the best relevant documents, which might not always correspond to the optimal set of sub-documents for query augmentation. Second, the TREC-2 process used an absolute number of sub-documents (the top N) in query augmentation, regardless of the `similarity' scores of those sub-documents to the source query. While it may not be possible to determine `absolute' relevance on a query-by-query basis, minimum thresholds might be applied to exclude clearly irrelevant sub-documents, even when they fall within the top N that would otherwise be used in augmentation.
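A minimal sketch of the thresholded selection proposed above follows. The min_score cutoff is a hypothetical parameter, not a value reported here; leaving it unset corresponds to the TREC-2 behavior of taking the top N unconditionally.

    def select_feedback_subdocs(scored_subdocs, n, min_score=None):
        # scored_subdocs: (similarity, subdoc) pairs from the initial
        # retrieval round. TREC-2 processing took the top N unconditionally;
        # this variant also drops anything scoring below min_score (a
        # hypothetical threshold), even when it falls within the top N.
        top_n = sorted(scored_subdocs, key=lambda pair: pair[0],
                       reverse=True)[:n]
        if min_score is not None:
            top_n = [(score, sub) for score, sub in top_n
                     if score >= min_score]
        return top_n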
5.2 Source Sample Size
In the case of the submitted results of CLARIT-TREC-2 processing, the thesaurus for each routing query was extracted from the set of all its known-relevant training documents. However, we can imagine that many such documents contain some (perhaps a great deal of) information that is not relevant to the topic at hand. (This is especially expected with long documents.) Therefore, we have experimented with alternative approaches to sampling text from known-relevant documents. In particular, we have used a ranking of sub-documents (paragraphs) to nominate candidate `good' relevant texts and have used variable numbers of sub-documents as source text in thesaurus extraction.
In practice, to select source text for the thesaurus-discovery process for a topic, we run the raw topic vector as a query over the collection of relevant documents, which are partitioned into sub-documents, and return only the top N relevant sub-documents. (In cases where the topic has fewer than N sub-documents in the collection of relevants, all sub-documents are selected.) Our hypothesis is that thesauri extracted from relevant sub-documents will contain greater numbers of `true-positive' terms related to the topic. We have run the experiment for N = 100, 200, 300, 500, and 1000 as the cutoff for selected sub-documents. Figure 11 gives a sample of the results for routing topics, including the best case of 500 sub-documents.
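The selection step just described can be sketched as follows, reusing the raw_query_vector and cosine helpers from the earlier sketch. Splitting documents on blank lines is an assumption standing in for CLARIT's actual sub-document (paragraph) segmentation.

    from collections import Counter

    def thesaurus_source_text(topic_statement, relevant_docs, n):
        # Rank the sub-documents (paragraphs) of the known-relevant
        # training documents against the raw topic vector and keep the
        # top N as source text for thesaurus discovery. If a topic has
        # fewer than N sub-documents, all of them are selected.
        query = raw_query_vector(topic_statement)  # from earlier sketch
        subdocs = [para for doc in relevant_docs
                   for para in doc.split("\n\n") if para.strip()]
        scored = [(cosine(query, Counter(p.lower().split())), p)
                  for p in subdocs]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [para for _, para in scored[:n]]

Varying n in this sketch corresponds to the N = 100 through 1000 sweep reported above.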
While the differences for different N are not great, certain trends are evident. First, it is clear that using only the more similar sub-documents from a collection of relevants gives better results than using all relevant full documents. (This represents an important im-