NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Retrieval Experiments with a Large Collection using PIRCS
K. Kwok, L. Papadopoulos, K. Kwan
National Institute of Standards and Technology
Donna K. Harman
PIRCS3f overwhelmingly improves over PIRCS1, 119 queries to 1 with 5 ties, because PIRCS1 starts from
low performance. PIRCS4f improves over PIRCS3f, but not by such a wide margin, because PIRCS3f
already achieves good results. It can be seen that feedback, with or without query expansion, is
definitely worthwhile in this collection. Looking at the recall and precision at various retrieved-document
cut-offs in the Appendix of this volume, it can be seen that PIRCS4f performs better than PIRCS3f, except at
the high-precision cut-off of 5. This reflects that query expansion is more of a recall enhancement tool.
This can also be seen from the number of relevant documents retrieved at the 200-document cut-off: 1403, 1526 and
1582 respectively for PIRCS1, 3f and 4f. In the high-precision region, the added terms could introduce more noise.
The effect, however, is small, and may be related to the number and type of added terms. Thus, query
expansion behaves like an automatic thesaurus, with the added terms indirectly based on user relevance.
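The 119-to-1 tally above comes from comparing two runs query by query. A minimal sketch of such a per-query sign comparison follows; the function name and the per-query scores are illustrative only, not the paper's data or code.

```python
# Per-query sign comparison between two retrieval runs: count, for each
# query, which run scores higher, yielding a (wins_a, wins_b, ties) tally
# such as "119 to 1 with 5 ties".

def sign_comparison(scores_a, scores_b, eps=1e-9):
    """Compare paired per-query scores; return (wins_a, wins_b, ties)."""
    wins_a = wins_b = ties = 0
    for a, b in zip(scores_a, scores_b):
        if abs(a - b) <= eps:
            ties += 1
        elif a > b:
            wins_a += 1
        else:
            wins_b += 1
    return wins_a, wins_b, ties

# Illustrative per-query average precisions (made-up numbers):
run_feedback = [0.42, 0.31, 0.55, 0.10, 0.10]
run_baseline = [0.20, 0.31, 0.40, 0.05, 0.12]
print(sign_comparison(run_feedback, run_baseline))  # -> (3, 1, 1)
```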
(f) It is tempting to compare the results of PIRCS in this large WSJ collection with those in small
collections. We have listed the ad hoc 10-pt Avg precision of the four standard collections popular in
IR research [2] below, together with that of the WSJ PIRCS1 results:
MED            CRAN            CACM          WSJ           ISI
(30q, 1.03Kd)  (225q, 1.40Kd)  (52q, 3.2Kd)  (25q, 350Kd)  (76q, 1.46Kd)
0.472          0.374           0.297         0.263         0.174
MED (medicine) and CRAN (aerodynamics) use more specific, scientific vocabulary and can be expected
to perform better. CACM (computer science) terminology is generally less `scientific' and can
often overlap with general vocabulary; its performance and that of WSJ are similar. ISI (information science) has
very broad, nonspecific queries and quite general vocabulary, and it has the worst retrieval results.
We were afraid WSJ would have low performance like ISI. Listed below is a comparison between the
CACM [2] and WSJ PIRCS1 P-R curves (the CACM value at recall 0.0 was not given):
Recall:  0.0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
WSJ:     .78  .66  .55  .41  .31  .28  .19  .11  .07  .04  .01
CACM:     --  .62  .50  .41  .36  .30  .24  .18  .15  .11  .09
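Under the usual definition, the 10-pt Avg precision quoted earlier is the mean of the interpolated precision at the ten recall levels 0.1 through 1.0. A short Python check using the WSJ row above reproduces the 0.263 figure from the collection table:

```python
# 10-point average precision: mean of interpolated precision at the ten
# recall levels 0.1, 0.2, ..., 1.0 (the recall-0.0 point is not averaged).

wsj_precision = {0.0: .78, 0.1: .66, 0.2: .55, 0.3: .41, 0.4: .31,
                 0.5: .28, 0.6: .19, 0.7: .11, 0.8: .07, 0.9: .04, 1.0: .01}

def ten_pt_avg(pr):
    levels = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    return sum(pr[r] for r in levels) / len(levels)

print(round(ten_pt_avg(wsj_precision), 3))  # -> 0.263
```

The recall-0.0 point is excluded, which is why WSJ's high value there (.78) does not lift its average above CACM's despite WSJ's stronger low-recall end.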
A characteristic of WSJ is that precision falls off to much smaller values in the high-recall region, compared
with CACM or other small collections. One should expect a large collection to have more noise, and it
is interesting that this noise impacts predominantly the low-signal region. For example, WSJ has a
generality ratio of only 0.00077 versus 0.00478 for CACM, i.e. more than six times as much noise in
WSJ. Another reason for this phenomenon, perhaps to a lesser degree, is the incomplete evaluation
procedure for documents ranked between 101 and 200, as discussed earlier. In the low-recall region, however,
the precision of WSJ is comparable to or better than that of the small collections. Why is that? Our interpretation is
that, first, queries are much richer and better formed in WSJ than those in CACM. Second, when
a collection is large, there is a very good chance that a number of relevant documents describe their
content using nearly the same terms as the queries, especially if the queries are well worded. These
documents will rank high, and hence precision at low recall does not suffer in spite of the adverse generality
ratio. In the high-recall region, relevant documents do not express their content in terms similar to the queries
and have few term matches, and the interference from the poor generality ratio magnifies. Techniques to
improve precision at high recall would therefore be very important if one needs an exhaustive search. It has
been quite popular to criticise IR research on the grounds that small-collection results do not reflect those of large
collections. Our hope is that, as experience is gained with more of these large-scale collections, we might
be able to predict their behavior from small-collection results. Small-collection experiments can be
performed in a matter of hours, while large-collection runs take days using current technology. It should be
noted that all databases considered here are from one homogeneous field of knowledge. Many commercial
database providers produce CDROMs that are homogeneous and of about the same size as WSJ, and these