NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Design and Evaluation of the CLARIT-TREC-2 System
D. Evans
R. Lefferts
National Institute of Standards and Technology
D. K. Harman
2.2.2 Query Augmentation
CLARIT thesaurus discovery is a process that can identify a set of core, representative terminology in a collection of documents. If the documents are relatively homogeneous topically, the resulting `first-order' thesaurus can be regarded as a broad `signature' of the topic of the collection. Typically, the procedure is most reliable when the collection is large (2 megabytes or more of text). In the CLARIT-TREC-2 system, however, thesaurus discovery was used on the relatively small collections of relevant documents for each topic. For each routing topic, the document collection used to establish a first-order thesaurus consisted of the training set of relevant documents for that topic; for each ad-hoc topic, it consisted of a set of the top-ranked sub-documents from among the documents automatically retrieved via an initial pass of querying over the corpus. Terms in the discovered thesaurus were used to supplement the terms in the source query (topic). Thus, in the case of ad-hoc topics, the procedure represents an approach to query augmentation based on fully automatic feedback.
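The essence of the procedure can be sketched as follows. This is a minimal illustration, not the CLARIT implementation: the actual system used NLP-based term extraction, whereas the sketch below approximates "thesaurus discovery" with simple frequency counting over the relevant documents, and the stopword list and `top_k` cutoff are assumptions introduced for the example.

```python
from collections import Counter
import re

# Small illustrative stopword list (an assumption, not CLARIT's).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "for", "is", "on"}

def discover_thesaurus(relevant_docs, top_k=20):
    """Collect the most frequent content terms across the relevant
    documents as a crude 'first-order thesaurus' for the topic."""
    counts = Counter()
    for doc in relevant_docs:
        terms = re.findall(r"[a-z]+", doc.lower())
        counts.update(t for t in terms if t not in STOPWORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(top_k)]

def augment_query(query_terms, relevant_docs, top_k=10):
    """Supplement the source-query terms with discovered thesaurus terms."""
    discovered = discover_thesaurus(relevant_docs, top_k)
    return list(query_terms) + [t for t in discovered if t not in query_terms]
```

For a routing topic, `relevant_docs` would be the training set of relevant documents; for an ad-hoc topic, the top-ranked sub-documents from the initial retrieval pass.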
2.2.3 Text Analysis Heuristics
Texts can have complex and interesting discourse struc-
ture, which often provides clues about the `topic(s)' and
the `important' information in a document. In general,
it is very difficult to exploit text-structure information
reliably in retrieval tasks over large collections of het-
erogeneous documents. Nevertheless, the CLARIT-
TREC-2 system applied two simple processing tech-
niques to topics and corpus documents to attempt to
capture information encoded in rhetorical structure.
First, all training and corpus documents were divided into paragraph-sized units called "sub-documents". This procedure is sensitive both to the `normal' demarcation of paragraphs (successive blank lines) and to the total length of the text (measured in numbers of sentences). After documents are partitioned into sub-documents, the sub-document is taken as the basic unit for subsequent processing, viz., in collecting statistics and in scoring retrieval.
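A simple version of this partitioning can be sketched as below. The exact sentence limit and the sentence-segmentation rule are assumptions for the example; the paper specifies only that the procedure is sensitive to blank-line paragraph boundaries and to length in sentences.

```python
import re

def split_subdocuments(text, max_sentences=8):
    """Divide a document into paragraph-sized sub-documents.  Paragraph
    boundaries are successive blank lines; paragraphs exceeding the
    sentence limit are split further so no unit grows too long."""
    subdocs = []
    for para in re.split(r"\n\s*\n", text.strip()):
        # Crude sentence segmentation on terminal punctuation.
        sentences = re.split(r"(?<=[.!?])\s+", para.strip())
        for i in range(0, len(sentences), max_sentences):
            chunk = " ".join(sentences[i:i + max_sentences]).strip()
            if chunk:
                subdocs.append(chunk)
    return subdocs
```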
Second, terms extracted from a topic description were assigned importance coefficients based on their locations in the topic text. Terms found in the first paragraph were given a weight of 3; terms in the second paragraph, 2; and all other terms, 1.
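This weighting scheme can be expressed directly in code. The sketch below assumes the topic arrives pre-split into paragraphs and that a term appearing in several paragraphs keeps its highest weight; the `extract_terms` function stands in for CLARIT's term extraction and is a placeholder.

```python
def weight_topic_terms(topic_paragraphs, extract_terms):
    """Assign importance coefficients by paragraph position: weight 3 for
    the first paragraph, 2 for the second, 1 for all others.  A term that
    occurs in several paragraphs keeps its highest weight."""
    weights = {}
    for idx, para in enumerate(topic_paragraphs):
        w = 3 if idx == 0 else 2 if idx == 1 else 1
        for term in extract_terms(para):
            weights[term] = max(weights.get(term, 0), w)
    return weights
```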
Of course, both techniques exploit possibly idiosyncratic characteristics of the TREC-2 processing task. The use of scoring over sub-documents is clearly sensitive to the TREC definition of document relevancy, viz., that a document is relevant, regardless of length, if it contains a single relevant sentence. Furthermore, automatic term-importance weighting is possible only because of the formal discourse structure of TREC topics; it would not necessarily apply to other presentations of topics or queries and certainly would not apply to free text in general.
We observe, however, that there is no one set
of techniques that will perform optimally in every
information-retrieval (IR) situation. We emphasize, in-
stead, that one important measure of a system's utility
and adaptability is its ability to take advantage of ex-
ploitable features in a given IR task.
2.3 System Performance Notes
All CLARIT-TREC-2 processing took place on DEC 3000/400 (Alpha AXP) workstations running DEC OSF/1. One system had 128 megabytes of RAM, one 64 megabytes, and two 32 megabytes. Realized performance was considerably slower than would be expected given the clock rate (133.33 MHz) of the DEC 3000/400's CPU. In fact, because of suboptimal compilers for the 64-bit architecture, performance was perhaps two times slower than the maximum possible. One pass over the entire TREC-2 collection for 50 topics (processed simultaneously) required approximately four hours, running on the four machines in parallel. The processing of a single topic is proportionally faster, requiring approximately 20 minutes on a single machine.
3 Results
The CLARIT team processed all the topics in both the "routing" and "ad-hoc" categories and worked with the full set of data. Two sets of results were submitted for each category, corresponding to the "manual" ("CLARTM") and fully "automatic" ("CLARTA") processing approaches taken with the topics.
3.1 General Summary of Official Results
Table 1 gives the official CLARIT-TREC-2 system routing results as reported by NIST. A graph of the precision-recall curves for the two sets of results is given in Figure 2. The total number of documents retrieved under the routing task was 6,785 (CLARTM) and 6,811 (CLARTA), representing, respectively, 64.69% and 64.93% of the total known relevant documents (10,489).