NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Overview of the Second Text REtrieval Conference (TREC-2)
D. K. Harman
National Institute of Standards and Technology
5.3 Routing Results
The routing evaluation used a subset of the training topics (topics 51-100) against the new disk of test documents (disk 3). There were 40 sets of results for routing evaluation, with 32 of them based on runs for the full data set. Of the 32 systems using the full data set, 23 used automatic construction of queries and 9 used manual construction.
Figure 9 shows the recall/precision curves for the six TREC-2 groups with the highest non-interpolated average precision using automatic construction of the routing queries. Again three systems are based on the Cornell SMART system. The plot marked "crnlC1" is the actual SMART system, using the basic Rocchio relevance feedback algorithms and adding many terms (up to 500) from the relevant training documents to the terms in the topic. The "dortP1" results come from using probabilistically-based relevance feedback instead of the vector-space algorithm, and adding only 20 terms from the relevant documents to each query. These two systems have the best routing results. The "Brkly5" system uses logistic regression on both the general frequency variables used in their adhoc approach and on the query-specific relevance data available for training with the routing topics. The results marked "cityr2" are from City University, London (see Robertson, Walker, Jones, Hancock-Beaulieu & Gatford paper). This group automatically selected variable numbers of terms (1-25) from the training documents for each topic (the topics themselves were not used as term sources), and then used traditional probabilistic reweighting to weight these terms. The "INQ003" results also use probabilistic reweighting, but use the topic terms, expanded by 30 new terms per topic from the training documents. The results marked "lsir2" are more latent semantic indexing results from Bellcore. This run was made by creating a filter from the singular-value decomposition vector sum, or centroid, of all relevant documents for a topic (and ignoring the topic itself).
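To make the flavor of these routing methods more concrete, the following Python sketch shows one way a Rocchio-style routing query could be expanded with terms drawn from the relevant training documents, in the spirit of the "crnlC1" and "dortP1" runs described above. The weighting scheme, the alpha/beta parameters, and the 500-term cap are illustrative assumptions only, not the settings of any actual TREC-2 system.

    from collections import Counter

    def expand_routing_query(topic_terms, relevant_docs,
                             alpha=1.0, beta=0.75, max_new_terms=500):
        # Illustrative Rocchio-style expansion: the query starts from the
        # topic terms and is augmented with the highest-weighted terms in
        # the relevant training documents.  All weights and parameters here
        # are hypothetical, not taken from any TREC-2 run.
        #   topic_terms:   dict of term -> weight from the topic statement
        #   relevant_docs: list of dicts of term -> weight (e.g., tf-idf)
        #                  for the known relevant training documents

        # Centroid (average term vector) of the relevant documents.
        centroid = Counter()
        for doc in relevant_docs:
            for term, weight in doc.items():
                centroid[term] += weight / len(relevant_docs)

        # Original topic terms, weighted by alpha.
        query = {term: alpha * weight for term, weight in topic_terms.items()}

        # Existing query terms get a relevance-feedback boost ...
        for term in query:
            query[term] += beta * centroid.get(term, 0.0)

        # ... and up to max_new_terms new terms are added from the
        # relevant-document centroid, highest weights first.
        new_terms = [(t, w) for t, w in centroid.most_common() if t not in query]
        for term, weight in new_terms[:max_new_terms]:
            query[term] = beta * weight

        return query

The resulting weighted term list can then be matched against each incoming document, which is the essence of a routing (standing) query.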
Figure 10 shows the recall/precision curves for the six TREC-2 groups with the highest non-interpolated average precision using manual construction of the routing queries. The results marked "INQ001" are from the INQUERY system using an inferential combination of the "INQ003" queries and manually modified queries created from the topic. The "trw2" results represent an adaptation of the TRW Fast Data Finder pattern matching system to allow use of term weighting (see Mettler paper). The queries were manually constructed and the term weighting was learned from the training data. The "geerdi" results from the GE Research and Development Center (see Jacobs paper) also come from manually constructed queries, but using a general-purpose lexicon and the training data to suggest input to the Boolean pattern matcher.
The results marked "CLARThI" are similar to the
"CLARTM" adhoc results except that the training docu-
ments were used as the source for thesaurus building, as
opposed to using the top set of retrieved documents. The
"rutcombx" results from Rutgers University (see Belitin,
Kantor, Cool & Quatrain paper) come from combining 5
sets of manually generated Boolean queries to optimize
performance for each topic. The results marked
"TOPIC2" are from the TOPIC system and reflect the use
of an expert system working off specially-constructed
knowledge bases to improve performance.
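The combination behind "rutcombx" can be read as a simple data-fusion step over several query formulations for the same topic. A minimal sketch along those lines appears below, assuming a plain vote count across the Boolean result sets; the actual per-topic optimization is described in the Rutgers paper and is not reproduced here.

    from collections import Counter

    def combine_boolean_runs(retrieved_sets):
        # Hypothetical fusion of several Boolean query formulations for a
        # topic: each formulation contributes one vote per document it
        # retrieves, and documents with more votes rank higher.  This
        # illustrates the idea of combining query sets only; it is not the
        # specific combination rule used in the rutcombx run.
        votes = Counter()
        for retrieved in retrieved_sets:
            votes.update(set(retrieved))
        return [doc_id for doc_id, _ in votes.most_common()]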
As was the case with the adhoc topics, the automatic query construction methods continue to perform as well as, or in this case better than, the manual construction methods. A comparison of the two INQUERY runs illustrates this point, and all six results with manually generated queries perform worse than the six runs with automatically generated queries. The availability of the training data allows an automatic tuning of the queries that would be difficult to duplicate manually without extensive analysis.
Unlike the adhoc results, there are two runs ("crnlC1" and "dortP1") that are clearly better than the others, with a significant difference between the "crnlC1" results and the "dortP1" results, and also significant differences between these results and the rest of the automatically generated query results. In particular, the Cornell group's ability to effectively use many terms (up to 500) for query expansion was one of the most interesting findings in TREC-2 and represents a departure from past results (see Buckley, Allan & Salton paper for more on this).
As a final point, it should be noted that the routing results also represent significant improvements over the results from TREC-1. Figure 11 shows a comparison of results for a typical system in TREC-1 and TREC-2. Some of this improvement is due to the improved evaluation techniques, but the difference between the curve marked "TREC-1" and the curve marked "TREC-2 looking at top 200 only" shows significant performance improvement. There is even more improvement for the routing results than for the adhoc results, due to better training data (mostly non-existent for TREC-1) and to major efforts by many groups in new routing algorithm experiments.
Only four groups worked with less than the full document collection. Figure 12 shows the results for two of the groups in category B compared with a category B version of the Cornell SMART results. These curves show the results of runs from New York University (done in a manner similar to that used for the adhoc results) and results from Dalhousie University.