Retrieval Experiments with a Large Collection using PIRCS
K. Kwok, L. Papadopoulos, K. Kwan

In: NIST Special Publication 500-207, The First Text REtrieval Conference (TREC-1), D. K. Harman (ed.), National Institute of Standards and Technology

PIRCS3f overwhelmingly improves over PIRCS1, 119 to 1 with 5 ties. This is because PIRCS1 starts with low performance. PIRCS4f improves over PIRCS3f, but not by such a wide margin, because PIRCS3f already achieves better results. It can be seen that feedback, with or without query expansion, is definitely worthwhile within this collection.

Looking at the recall and precision at various retrieved-document cut-offs in the Appendix of this volume, it can be seen that PIRCS4f performs better than PIRCS3f, except at the high-precision cut-off of 5. This reflects that query expansion is more of a recall-enhancement tool. This can also be seen from the number of relevant documents retrieved at the 200-document cut-off: 1403, 1526 and 1582 respectively for PIRCS1, 3f and 4f. In the high-precision region, the added terms could lead to more noise. The effect, however, is small, and may be related to the number and type of added terms. Thus, query expansion behaves like an automatic thesaurus, the added terms being indirectly based on user relevance.

(f) It is tempting to compare the results of PIRCS in this large WSJ collection with those in small collections. We have listed below the ad hoc 10-pt Avg precision of the four standard collections popular with IR research [2], together with that of the WSJ PIRCS1 results:

            MED            CRAN            CACM           WSJ            ISI
            (30q, 1.03Kd)  (225q, 1.40Kd)  (52q, 3.2Kd)   (25q, 350Kd)   (76q, 1.46Kd)
10-pt Avg:  0.472          0.374           0.297          0.263          0.174

MED (medicine) and CRAN (aerodynamics) use more specific, scientific vocabulary and can be expected to have better performance. CACM (computer science) terminology is generally less 'scientific' and can often involve general vocabulary; its performance and that of WSJ are similar. ISI (information science) has very broad, nonspecific queries and quite general vocabulary, and has the worst retrieval results. We were afraid WSJ would have low performance like ISI. Listed below is a comparison between the CACM [2] and WSJ PIRCS1 P-R curves:

Recall:  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
WSJ:     .78  .66  .55  .41  .31  .28  .19  .11  .07  .04  .01
CACM:    .62  .50  .41  .36  .30  .24  .18  .15  .11  .09

A characteristic of WSJ is that precision falls off to much smaller values in the high-recall region, compared with CACM or other small collections. One should expect a large collection to have more noise, and it is interesting that this noise impacts predominantly the low-signal region. For example, WSJ has a generality ratio of only 0.00077, versus 0.00478 for CACM, i.e. more than six times as much noise in WSJ. Another reason for this phenomenon, perhaps to a lesser degree, is the incomplete evaluation procedure for documents ranked between 101 and 200, as discussed earlier.

In the low-recall region, however, the precision of WSJ is comparable to or better than that of the small collections. Why is that? Our interpretation is that, first, queries in WSJ are much richer and better formed than those in CACM. Second, when a collection is large, there is a very good chance that a number of relevant documents exist that describe their content using closely the same terms as the queries, especially if the queries are well worded. These documents will rank high, and hence precision at low recall does not suffer in spite of the adverse generality ratio.
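As a side note on the measures used above, the following minimal Python sketch shows one way to compute the interpolated P-R points and the 10-pt Avg precision for a single query, together with the generality ratio. It assumes the usual SMART-style conventions (interpolated precision at recall level r is the best precision attained at any recall >= r, and the 10-point average is taken over recall 0.1 through 1.0); every name in it is illustrative and not part of PIRCS itself.

    def interpolated_pr(ranked, relevant):
        """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query.

        ranked   -- document ids in rank order
        relevant -- non-empty set of ids judged relevant
        """
        pts = []                      # (recall, precision) at each relevant hit
        hits = 0
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                hits += 1
                pts.append((hits / len(relevant), hits / rank))
        # precision at level r is the best precision at any recall >= r
        return [max((p for rec, p in pts if rec >= i / 10), default=0.0)
                for i in range(11)]

    def ten_pt_avg(ranked, relevant):
        # mean of the interpolated values at recall 0.1 through 1.0
        return sum(interpolated_pr(ranked, relevant)[1:]) / 10

    def generality(avg_relevant_per_query, collection_size):
        # generality ratio, e.g. roughly 0.00478 for CACM vs 0.00077 for WSJ
        return avg_relevant_per_query / collection_size

Averaging ten_pt_avg over all queries of a collection yields numbers of the kind tabulated above, and the generality ratio is simply the average number of relevant documents per query divided by the collection size, which is why the much larger WSJ is the noisier retrieval environment.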
In the high-recall region, relevant documents do not express their content in terms similar to the queries and have few term matches, and interference from the noise of the poor generality ratio magnifies. Techniques to improve precision at high recall would therefore be very important if one needs exhaustive search.

It has been quite popular to criticise IR research on the grounds that small-collection results do not reflect those of large collections. Our hope is that, as experience is gained with more of these large-scale collections, we might be able to predict their behavior based on small-collection results. Small-collection experiments can be performed in a matter of hours, while large collections take days using current technology. It should be noted that all databases considered here are from one homogeneous field of knowledge. Many commercial database providers produce CDROMs that are homogeneous and of about the same size as WSJ, and these