high accuracy or precision, and at the final stage of retrieval where there is usually a low accuracy, but more complete retrieval. Note that the use of these curves assumes a ranked output from a system. Systems that provide an unranked set of documents are known to be less effective and therefore were not tested in the TREC program. The curves in figure 2 show that system A has a much higher precision at the low recall end of the graph and therefore is more accurate. System B, however, has higher precision at the high recall end of the curve and therefore will give a more complete set of relevant documents, assuming that the user is willing to look further in the ranked list.

A second set of curves was calculated using the recall/fallout measures, where recall is defined as before and fallout is defined as

    fallout = (number of nonrelevant items retrieved) / (total number of nonrelevant items in collection)

Note that recall has the same definition as the probability of detection and that fallout has the same definition as the probability of false alarm, so that the recall/fallout curves are also the ROC (Relative Operating Characteristic) curves used in signal processing. A sample set of curves corresponding to the recall/precision curves is shown in figure 3. These curves show the same order of performance as do the recall/precision curves and are provided as an alternative method of viewing the results. The present version of the curves is experimental, as the curve creation is particularly sensitive to scaling (what range is used for calculating fallout). The high precision section of the curves does not show well in figure 3; the high recall area dominates the curves.

Whereas the recall/precision curves show the retrieval system results as they might be seen by a user (since precision measures the accuracy of each retrieved document as it is retrieved), the recall/fallout curves emphasize the ability of these systems to screen out non-relevant material. In particular, the fallout measure shows the discrimination powers of these systems on a large document collection. For example, system A has a fallout of 0.02 at a recall of about 0.48; this means that this system has found almost 50% of the relevant documents, while only retrieving 2% of the non-relevant documents.

4.2 Single-Value Evaluation Measures

In addition to recall/precision and recall/fallout curves, there were two single-value measures used in TREC-2. The first measure, the non-interpolated average precision, corresponds to the area under an ideal (non-interpolated) recall/precision curve. To compute this average, a precision average for each topic is first calculated. This is done by computing the precision after every retrieved relevant document and then averaging these precisions over the total number of retrieved relevant documents for that topic. These topic averages are then combined (averaged) across all topics in the appropriate set to create the non-interpolated average precision for that set. The second measure used is an average of the precision for each topic after 100 documents have been retrieved for that topic. This measure is useful because it reflects a clearly comprehended retrieval point. It took on added importance in the TREC environment because only the top 100 documents retrieved for each topic were actually assessed. For this reason it produces a guaranteed evaluation point for each system.
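To make these computations concrete, the sketch below shows, in Python, how the fallout, the per-topic non-interpolated average precision, and the precision after a fixed number of retrieved documents described above can be calculated for a single topic. This is an illustrative sketch only, not the evaluation software used in TREC; the function names and the toy ranking data are assumptions introduced for the example.

```python
# Illustrative sketch (not the official TREC evaluation code) of the
# measures described above, computed for a single topic.
# `ranking` is a list of document identifiers ordered by decreasing
# system score; `relevant` is the set of documents judged relevant.

def fallout(ranking, relevant, collection_size):
    """Fraction of the collection's nonrelevant documents that were retrieved."""
    nonrel_retrieved = sum(1 for doc in ranking if doc not in relevant)
    nonrel_total = collection_size - len(relevant)
    return nonrel_retrieved / nonrel_total if nonrel_total else 0.0

def average_precision(ranking, relevant):
    """Non-interpolated average precision for one topic: precision is
    recorded after each retrieved relevant document and those values
    are averaged, as described in the text."""
    precisions = []
    rel_seen = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            rel_seen += 1
            precisions.append(rel_seen / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def precision_at(ranking, relevant, k=100):
    """Precision after the top k retrieved documents (k = 100 in TREC-2)."""
    top_k = ranking[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

# Toy example: five retrieved documents, two of which are relevant.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d9"}
print(average_precision(ranking, relevant))   # (1/1 + 2/4) / 2 = 0.75
print(precision_at(ranking, relevant, k=5))   # 2/5 = 0.4
```

Averaging the per-topic average_precision values over all topics in a set gives the non-interpolated average precision reported for that set, and averaging precision_at(..., k=100) over topics gives the second single-value measure.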
4.3 Problems with Evaluation

Since this was the first time that such a large collection of text has been used in open system evaluation, there were some problems with the existing methods of evaluation. The major problem concerned a thresholding effect caused by the inability to evaluate ALL documents retrieved by a given system. For TREC-1 the groups were asked to send in only the top 200 documents retrieved by their systems. This artificial document cutoff is relatively low, and systems did not retrieve all the relevant documents for most topics within the cutoff. All documents retrieved beyond the 200 mark were considered nonrelevant by default, and therefore the recall/precision curves became inaccurate after about 40% recall on average. TREC-2 used the top 1000 documents for evaluation. Figure 4 shows the difference in the curves produced by various evaluation thresholds, including a curve for no threshold (similar to the way evaluation has been done on the smaller collections). These curves show that the use of a 1000-document cutoff has solved most of the thresholding problem.

Two more issues in evaluation have become important. The first issue involves the need for more statistical evaluation. As will be seen in the results, the recall/precision curves are often close, and there is a need to check whether there is truly any statistically significant difference between two systems' results or two sets of results from the same system. This problem is currently under investigation in collaboration with statistical groups experienced in the evaluation of information retrieval systems.

Another issue involves getting beyond the averages to better understand system performance. Because of the huge number of documents and the long topics, it is very