SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Multilevel Ranking in Large Text Collections Using FAIRS

S-C. Chang, H. Dediu, H. Azzam, M-W. Du

National Institute of Standards and Technology
Donna K. Harman

Table 6: AW2 Recall/Precision Performance

            Best        Worst       Tie
  Recall    17 (34%)    22 (44%)    11
  11-pt.    22 (44%)    26 (52%)     2

Among the queries with tied recall values, two had no records judged relevant. For the remaining 9, we used the 11-pt. average as a tie-breaker; the result was 6 best, 3 worst. Combining the recall and 11-pt. averages for AW2, FAIRS had 23 submissions on or above the median (46%) and 25 below (50%), with 2 (4%) undetermined.

3.4.3 RB

Routing, Category B results were submitted by 7 systems. For the topics judged, 3,766 documents were considered relevant; FAIRS submitted 5,000 documents in response to 25 queries, and 1,124 of those submissions were among the relevant. The distribution of relevant retrieved (recall) over the 25 topics was: 2 ranked best, 13 above the median, 6 on the median, and 4 below, for a total of 21 on or above the median and 4 below.

The following graph illustrates the performance index (PI) of the recall rates of FAIRS compared to the group. It shows that FAIRS is above average most of the time. The average recall PI of FAIRS is 65.8.

[Figure: performance index of FAIRS recall rates, queries 1-25]

The next graph shows the performance index of the 11-point average of FAIRS compared to the group. It again shows FAIRS to be above average most of the time. The average 11-point-average PI of FAIRS is 61.8.

[Figure: performance index of FAIRS 11-point averages, queries 1-25]

Table 7: RB Recall/Precision Performance

  Relation to Median    >           =          <
  Recall                15 (60%)    6 (24%)    4 (16%)
  11-pt.                15 (60%)    6 (24%)    4 (16%)

This is the only group that had enough participants to make a comparative-performance analysis meaningful.
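The per-query performance index used in the graphs and tables above (100 = best in the group, 50 = on the median, 0 = worst) can be sketched as a small helper; the function name and the handling of scores exactly on the median are our own choices:

```python
def performance_index(score, best, median, worst):
    """Piecewise-linear performance index (PI).

    Maps a query's score onto a 0-100 scale relative to the group:
    100 at the group's best score, 50 at the median, 0 at the worst.
    """
    if score > median:
        # Interpolate between the median (50) and the best (100).
        return 50 + 50 * (score - median) / (best - median)
    if score < median:
        # Interpolate between the worst (0) and the median (50).
        return 50 * (score - worst) / (median - worst)
    return 50.0  # exactly on the median
```

For example, with a group whose best, median, and worst recall are 100, 50, and 0, a score of 75 maps to a PI of 75.0, halfway between the median and the best.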
We compared our 11-point average and recall rates for each query to the best, the median, and the worst scores for that query. The performance index (PI) is calculated as follows:

  PI = 50 + 50 * (score - median) / (best - median),   if score > median
  PI = 50 * (score - worst) / (median - worst),        if score < median

PI has the property that a value of 100 means the best score is achieved, 50 means the performance is on the median, and 0 means it is the worst.

3.5 Failure Analysis

Based on the feedback from the relevance judgements, we are considering several improvements to the query handling and ranking methods. These changes include:

1. Expanding abbreviated terms in the topics via an abbreviation dictionary. Initial investigation of topics containing abbreviations reveals that those abbreviations had an appreciably negative impact on recall. Topic 17 is a good example, where the term "United States" is abbreviated as "U.S." In a later trial, this simple expansion alone significantly improved the recall rate for this topic.

2. Using better term weighting based on heuristics. Up to 50% improvement was observed when term weighting was modified more intuitively (by hand).
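The abbreviation expansion in improvement 1 can be sketched as a simple dictionary substitution over the topic text; the dictionary entries and function name here are illustrative, not FAIRS's actual data:

```python
# Illustrative abbreviation dictionary; a real one would be much larger.
ABBREVIATIONS = {
    "U.S.": "United States",
    "U.K.": "United Kingdom",
}

def expand_abbreviations(topic_text):
    """Replace known abbreviations with their full forms before querying."""
    for abbrev, full in ABBREVIATIONS.items():
        topic_text = topic_text.replace(abbrev, full)
    return topic_text
```

Expanding "U.S." to "United States" this way lets the query match documents that spell the term out, which is what improved recall on Topic 17.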