Cornell University (see Buckley, Allan & Salton paper),
but using less than optimal term weightings (by mistake).
The "dortQ2" results from the University of Dortmund
come from using polynomial regression on the training
data to find weights for various pre-set term features (see
Fuhr, Pfeifer, Bremkamp, Pollmann & Buckley paper).
The "Brkly3" results from the University of California at
Berkeley come from performing logistic regression analysis
to learn optimal weighting for various term frequency
measures (see Cooper, Chen & Gey paper).
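Both regression-based runs fit a relevance model to training
judgments using a small set of per-term features; the Dortmund
runs use a polynomial fit and the Berkeley runs a logistic one.
As a rough illustration of the general idea only, and not of the
features or fitting procedures actually used by either group, a
logistic fit over invented query-document features might look
like this:

    # Sketch only: fit weights for a few query-document features from
    # training relevance judgments, then score new pairs with the model.
    # The features and training data below are invented examples.
    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def fit_logistic(features, labels, lr=0.1, epochs=500):
        """features: list of feature vectors; labels: 1 = relevant, 0 = not."""
        w = [0.0] * len(features[0])
        bias = 0.0
        for _ in range(epochs):
            for x, y in zip(features, labels):
                p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)
                err = y - p                      # gradient of the log-likelihood
                for i, xi in enumerate(x):
                    w[i] += lr * err * xi
                bias += lr * err
        return w, bias

    def relevance_score(w, bias, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)

    # Toy features per query-document pair: [mean log tf, mean idf, overlap].
    train_x = [[1.2, 3.1, 0.6], [0.1, 1.0, 0.1], [0.9, 2.7, 0.5], [0.0, 0.5, 0.0]]
    train_y = [1, 0, 1, 0]
    w, b = fit_logistic(train_x, train_y)
    print(relevance_score(w, b, [1.0, 2.9, 0.4]))  # higher = more likely relevant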
"CLARTA" system from the CLARIT Corporation
expands each topic with noun phrases found in a the-
saurus that is automatically generated for each topic (see
Evans & Lefferts paper). The "lsiasm" results are from
Bellcore (see Dumals pape[OCRerr]. This group uses latent
semantic indexing to create much larger vectors than the
more traditional vector-space models such as SMART.
The run marked "lsiasm" represents only the base
SMART pre-processing results, however. Due to processing
errors the "improved" LSI run produced unexpectedly
poor results.
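Latent semantic indexing factors the term-document matrix
with a singular value decomposition and represents both
documents and queries in the resulting reduced space; the
details of the Bellcore processing are in the Dumais paper.
A toy sketch of the basic mechanics (illustrative matrix,
weights, and dimensions only) is:

    # Sketch only: latent semantic indexing on a toy term-document matrix.
    # Rows are terms, columns are documents; real systems would use
    # weighted term frequencies rather than raw counts.
    import numpy as np

    A = np.array([[2, 0, 1, 0],      # term "bank"
                  [1, 1, 0, 0],      # term "loan"
                  [0, 2, 0, 1],      # term "river"
                  [0, 1, 0, 2]],     # term "water"
                 dtype=float)

    k = 2                            # number of latent dimensions (illustrative)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

    # Documents in the reduced space (one row per document).
    doc_vecs = (np.diag(sk) @ Vtk).T

    # Fold a query (raw term vector) into the same space and rank by cosine.
    q = np.array([1, 1, 0, 0], dtype=float)       # query mentions "bank loan"
    q_vec = q @ Uk                                 # project onto latent axes
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-12)
    print(np.argsort(-sims))                       # documents, best first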
Figure 6 shows the recall/precision curve for the six
TREC-2 groups with the highest non-interpolated average
precision using manual construction of queries. It should
be noted that varying amounts of manual intervention
were used. The results marked "INQ002", "siems2", and
"CLARTM" are automatically generated queries with
manual modifications. The "INQ002" results reflect various
manual modifications made to the "INQ001" queries,
with those modifications guided by strict rules. The
"siems2" results from Siemens Corporate Research, Inc.
(see Voorhees paper) are based on the use of the Cornell
SMART system, but with the topics manually modified
(the "not" phrases removed). These results were meant to
be the base run for improvements using WordNet, but the
improvements did not materialize. The "CLARTM"
results represent manual weighting of the query terms, as
opposed to the automatic weighting of the terms that was
used in "CLARTA." The results marked "Vtcms2",
"CnQst2", and "TOPIC2" are produced from queries
constructed completely manually. The "Vtcms2" results are
from Virginia Tech (see Fox & Shaw paper) and show the
effects of combining the results from SMART vector-space
queries with the results from manually-constructed
soft Boolean P-Norm type queries.
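Combination schemes of this kind generally merge the
individual runs by summing normalized per-document scores;
the particular combination functions studied are described in
the Fox & Shaw paper. A minimal sketch of one such summation
(document identifiers and scores invented) is:

    # Sketch only: combine two retrieval runs by summing min-max normalized
    # scores per document (documents missing from a run contribute zero).
    def normalize(run):
        """run: dict mapping document id -> raw score."""
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0
        return {doc: (score - lo) / span for doc, score in run.items()}

    def combine(*runs):
        fused = {}
        for run in map(normalize, runs):
            for doc, score in run.items():
                fused[doc] = fused.get(doc, 0.0) + score
        return sorted(fused.items(), key=lambda item: -item[1])

    vector_run = {"DOC-1": 12.4, "DOC-2": 9.8, "DOC-3": 3.1}
    pnorm_run = {"DOC-2": 0.91, "DOC-1": 0.40, "DOC-4": 0.35}
    print(combine(vector_run, pnorm_run))   # fused ranking, best first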
The "CnQst2" results,
from ConQuest Software (see Nelson paper), use a very
large general-purpose semantic net to aid in constructing
better queries from the topics, along with sophisticated
morphological analysis of the topics. The results marked
"TOPIC2" are from the TOPIC system by Verity Corp.
(see Lehman paper) and reflect the use of an expert system
working off specially-constructed knowledge bases to
improve performance.
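For reference, the non-interpolated average precision used to
select the runs shown in Figure 6 averages the precision
observed at the rank of each relevant document, with relevant
documents that are never retrieved contributing zero; the
per-topic values are then averaged over the topic set. A small
sketch of the per-topic computation (toy data) is:

    # Sketch only: non-interpolated average precision for one topic.
    def average_precision(ranked_docs, relevant):
        """ranked_docs: retrieved ids in rank order; relevant: set of ids."""
        hits, total = 0, 0.0
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in relevant:
                hits += 1
                total += hits / rank          # precision at this relevant doc
        return total / len(relevant) if relevant else 0.0

    # Relevant documents retrieved at ranks 1, 3 and 6; 4 relevant in total.
    print(average_precision(["d1", "d9", "d4", "d7", "d2", "d5"],
                            {"d1", "d4", "d5", "d8"}))  # (1/1 + 2/3 + 3/6) / 4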
Several comments can be made with respect to these
adhoc results. First, the better results (most of the automatic
results and the three top manual results) are very similar,
and it is unlikely that there are any statistically significant
differences between them. There is clearly no "best" method,
and the fact that these systems take very different approaches
to retrieval, including different term weighting schemes,
different query construction methods, and different similarity
matching methods, implies that there is much more to be
learned about effective retrieval techniques. As will be seen
in section 6, although the averages for the systems may be
similar, the systems do better on different topics and retrieve
different subsets of the relevant documents.
A second point that should be made is that the automatic
query construction methods continue to perform as well
as the manual construction methods. Two groups (the
INQUERY system and the CLARIT system) made explicit
comparisons of manually-modified queries versus unmodified
queries and concluded that manual modification provided no
benefit. The three sets of results based on completely
manually-generated queries performed even worse than the
manually-modified queries.
Note that this result is specific to the very rich TREC top-
ics; it is not clear that this will hold for the short topics
normally seen in other retrieval environments.
As a final point, it should be noted that these adhoc results
represent significant improvements over the results from
TREC-1. Figure 7 shows a comparison of results for a
typical system in TREC-1 and TREC-2. Some of this
improvement is due to improved evaluation, but the differ-
ence between the curve marked "TREC-1" and the curve
marked "TREC-2 looking at top 200 only" shows significant
performance improvement. Although this
improvement could represent a difference in topics (the
TREC-1 curve is for topics 51-100 and the TREC-2
curves are for topics 101-150), the TREC-2 topics are
generally felt to be more difficult and therefore this
improvement is likely to be an understatement of the
actual improvements.
Only two groups worked with less than the full document
collection. Figure 9 shows the results for the one group
with official TREC-2 category B results (the results from
UCLA were received after the deadline). This figure
shows the best results from New York University (see
Strzalkowski & Carballo paper), compared with a cate-
gory B version of the Cornell SMART results. The
"nyuir3" results reflect a very intensive use of natural lan-
guage processing (NLP) techniques, including a parse of
the documents to help locate syntactic phrases, context-
sensitive expansion of the queries, and other NLP
improvements on statistical techniques.
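As a very rough illustration of phrase-based indexing of this
general kind, and not of the parser-based extraction actually
used in the NYU runs, pairing each noun with the adjective or
noun that precedes it in part-of-speech-tagged text might look
like:

    # Sketch only: a crude stand-in for syntactic phrase indexing.  Real
    # systems of this kind parse each sentence; here we simply pair an
    # adjective or noun modifier with the noun that follows it in
    # already-tagged text.  The tags and tokens below are invented examples.
    def head_modifier_pairs(tagged_tokens):
        """tagged_tokens: list of (word, pos) tuples for one sentence."""
        pairs = []
        for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
            if t1 in ("ADJ", "NOUN") and t2 == "NOUN":
                pairs.append((w2.lower(), w1.lower()))   # (head, modifier)
        return pairs

    sentence = [("joint", "ADJ"), ("venture", "NOUN"), ("agreements", "NOUN"),
                ("were", "VERB"), ("signed", "VERB")]
    print(head_modifier_pairs(sentence))
    # [('venture', 'joint'), ('agreements', 'venture')]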