NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Overview of the Second Text REtrieval Conference (TREC-2)
D. Harman
National Institute of Standards and Technology
high accuracy or precision, and at the final stage of
retrieval where there is usually a low accuracy, but more
complete retrieval. Note that the use of these curves
assumes a ranked output from a system. Systems that
provide an unranked set of documents are known to be
less effective and therefore were not tested in the TREC
program.
The curves in figure 2 show that system A has a much
higher precision at the low recall end of the graph and
therefore is more accurate. System B, however, has higher
precision at the high recall end of the curve and therefore
will give a more complete set of relevant documents,
assuming that the user is willing to look further in the
ranked list.
A second set of curves was calculated using the
recall/fallout measures, where recall is defined as before
and fallout is defined as
fallout = (number of nonrelevant items retrieved) / (total number of nonrelevant items in collection)
Note that recall has the same definition as the probability
of detection and that fallout has the same definition as the
probability of false alarm, so that the recall/fallout curves
are also the ROC (Relative Operating Characteristic)
curves used in signal processing. A sample set of curves
corresponding to the recall/precision curves is shown in
figure 3. These curves show the same order of perfor-
mance as do the recall/precision curves and are provided
as an alternative method of viewing the results. The pre-
sent version of the curves is experimental as the curve cre-
ation is particularly sensitive to scaling (what range is
used for calculating fallout). The high precision section
of the curves does not show well in figure 3; the high
recall area dominates the curves.
Whereas the recall/precision curves show the retrieval
system results as they might be seen by a user (since pre-
cision measures the accuracy of each retrieved document
as it is retrieved), the recall/fallout curves emphasize the
ability of these systems to screen out non-relevant mate-
rial. In particular the fallout measure shows the discrimination
power of these systems on a large document collection.
For example, system A has a fallout of 0.02 at a
recall of about 0.48; this means that this system has
found almost 50% of the relevant documents, while only
retrieving 2% of the non-relevant documents.
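To make these quantities concrete, a minimal sketch in Python is given below; it computes recall, precision, and fallout at each rank of a single ranked result list, following the definitions above. The input names (ranking, relevant, collection_size) are hypothetical illustrations, not part of the TREC evaluation software.

    # Sketch (not the official TREC code): walk a ranked list and record
    # recall, precision, and fallout at each rank.
    def recall_precision_fallout(ranking, relevant, collection_size):
        total_rel = len(relevant)                   # relevant items in the collection
        total_nonrel = collection_size - total_rel  # nonrelevant items in the collection
        rel_seen = 0
        points = []
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                rel_seen += 1
            nonrel_seen = rank - rel_seen
            points.append((rel_seen / total_rel,            # recall
                           rel_seen / rank,                 # precision
                           nonrel_seen / total_nonrel))     # fallout
        return points

For system A in the example above, the point at which recall first reaches about 0.48 would show a fallout near 0.02.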
4.2 Single-Value Evaluation Measures
In addition to recall/precision and recall/fallout curves,
there were two single-value measures used in TREC-2.
The first measure, the non-interpolated average precision,
corresponds to the area under an ideal (non-interpolated)
recall/precision curve. To compute this average, a
precision average for each topic is first calculated. This is
done by computing the precision after every retrieved rel-
evant document and then averaging these precisions over
the total number of retrieved relevant documents for that
topic. These topic averages are then combined (averaged)
across all topics in the appropriate set to create the non-
interpolated average precision for that set.
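A minimal sketch of this computation is shown below, assuming (hypothetically) that results maps each topic to its ranked list of document identifiers and qrels maps each topic to its set of relevant documents; neither name is taken from the TREC software.

    # Sketch: non-interpolated average precision for one topic,
    # then the mean across topics.
    def average_precision(ranking, relevant):
        rel_seen = 0
        precisions = []
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                rel_seen += 1
                precisions.append(rel_seen / rank)  # precision after each relevant document
        # average over the retrieved relevant documents for this topic
        return sum(precisions) / len(precisions) if precisions else 0.0

    def non_interpolated_average_precision(results, qrels):
        per_topic = [average_precision(results[topic], qrels[topic]) for topic in results]
        return sum(per_topic) / len(per_topic)      # combine (average) across topics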
The second measure used is an average of the precision
for each topic after 100 documents have been retrieved for
that topic. This measure is useful because it reflects a
clearly comprehended retrieval point. It took on added
importance in the TREC environment because only the
top 100 documents retrieved for each topic were actually
assessed. For this reason it produces a guaranteed evalua-
tion point for each system.
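The same hypothetical results and qrels structures can be used to sketch this second measure: precision is taken after exactly 100 retrieved documents for each topic and then averaged across topics.

    # Sketch: precision after 100 retrieved documents, averaged over topics.
    def precision_at_100(ranking, relevant):
        top = ranking[:100]
        return sum(1 for doc_id in top if doc_id in relevant) / 100

    def mean_precision_at_100(results, qrels):
        scores = [precision_at_100(results[topic], qrels[topic]) for topic in results]
        return sum(scores) / len(scores)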
4.3 Problems with Evaluation
Since this was the first time that such a large collection of
text had been used in open system evaluation, there were
some problems with the existing methods of evaluation.
The major problem concerned a thresholding effect
caused by the inability to evaluate ALL documents
retrieved by a given system.
For TREC-1 the groups were asked to send in only the top
200 documents retrieved by their systems. This artificial
document cutoff is relatively low and systems did not
retrieve all the relevant documents for most topics within
the cutoff. All documents retrieved beyond the 200 mark
were considered nonrelevant by default and therefore the
recall/precision curves became inaccurate after about 40%
recall on average. TREC-2 used the top 1000 documents
for evaluation. Figure 4 shows the difference in the
curves produced by various evaluation thresholds, includ-
ing a curve for no threshold (similar to the way evaluation
has been done on the smaller collections). These curves
show that the use of a 1000-document cutoff has solved
most of the thresholding problem.
Two more issues in evaluation have become important.
The first issue involves the need for more statistical evalu-
ation. As will be seen in the results, the recall/precision
curves are often close, and there is a need to check whether
there are truly any statistically significant differences between
two systems' results or two sets of results from the same
system. This problem is currently under investigation in
collaboration with statistical groups experienced in the
evaluation of information retrieval systems.
Another issue involves getting beyond the averages to bet-
ter understand system performance. Because of the huge
number of documents and the long topics, it is very