NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Retrieval Experiments with a Large Collection using PIRCS
K. Kwok
L. Papadopoulos
K. Kwan
National Institute of Standards and Technology
Donna K. Harman
(b) Some of the topics have very specific requirements for documents to be relevant. For example: Topic
#1 needs antitrust cases as a result of complaint, not routine review; #2 needs acquisitions between a U.S.
company and another non-U.S. company; #53 needs leveraged buyout cases valued at or above $200
million; while #60 requires a policy change from merit-pay vs. seniority or vice versa. Data like `above
$200 million' or other numerics are removed either because they are on our stopword list or because of
high frequency. Other topics involve very general concepts that require the system to understand their
specific inferences. Examples are: course of action to decrease the U.S. deficit (#7); the body of water
being polluted (#12); specific commercial applications of superconductors (#21); hypocritical and
conflicting policies of the U.S. government (#74). The possible `course of action', `commercial
applications', `conflicting policies', etc. are essentially open-ended. Yet others need synonym lists or other
aids to interpret proper terms in order not to miss documents. Examples are: Japanese, U.S. or foreign
companies (#2, #3), European Community or countries (#5, #69), third world or developing countries (#4, #6),
or economic indicators (#8). When does a proper noun, if identifiable, represent a company? And if
it does, is it a foreign company? Also, a few of the topics, like #3, 15, 53, 56, 66, have short descriptions
with many general words, so that after stemming and stop-word processing, these queries end up with few
terms. PIRCS does not have tools for these problems. Its precision values at 50% recall and at 11-pt Avg
for ad hoc and routing retrievals are tabulated below:
                         Ad Hoc               Routing
                    PIRCS1   PIRCS2      PIRCS1   PIRCS2
------------------------------------------------------------
50% Recall           .276     .278        .340     .342
11-pt Avg            .311     .322        .343     .369
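The 11-pt Avg figures in the table are interpolated average precision values. As a rough illustration (not code from the paper), the measure can be sketched as follows, where `ranked_relevance` is a hypothetical list of 0/1 relevance judgements for a ranked retrieval run:

```python
def eleven_point_avg(ranked_relevance, total_relevant):
    """Interpolated precision averaged over the eleven recall levels
    0.0, 0.1, ..., 1.0 for a single query.  Interpolated precision at
    a recall level is the maximum precision at any recall >= that level."""
    precisions, recalls = [], []
    hits = 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
        precisions.append(hits / rank)
        recalls.append(hits / total_relevant)
    avg = 0.0
    for level in [i / 10 for i in range(11)]:
        # best precision achievable at recall >= level (0 if unreachable)
        avg += max((p for p, r in zip(precisions, recalls) if r >= level),
                   default=0.0)
    return avg / 11

# Toy ranking with 3 relevant documents in total:
print(eleven_point_avg([1, 0, 1, 1, 0], 3))
```

In the actual evaluation this per-query value is averaged over all queries.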
The ad hoc precision value of about 0.28 at 50% recall says that, averaged over 25 queries, if one wants
to retrieve half of all relevant documents, one would have to read about eleven documents to get three
relevant ones; for routing, about nine documents yield three relevant. Routing queries receive some help
from the few relevant documents provided for training purposes, just as in experiments first reported in [8]
where we simulate users posing queries equipped with some known relevant documents. The 11-pt Avg
precision values sample eleven recall points and simulate a uniform distribution of users with different
recall needs, and may reflect actual usage better. Effectiveness improves to between three in ten and three
in nine retrieved documents being relevant for ad hoc, and slightly better for routing. These results
naturally leave much room for improvement; but considering that PIRCS1 is fully automatic and relies only
on statistical methods, the results seem reasonable for this large WSJ collection. See also the analysis in (c) and (f).
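The "documents read per relevant document" interpretation above is simple arithmetic: at precision p, obtaining r relevant documents requires reading about r/p documents. A quick illustrative check (not from the paper):

```python
def docs_to_read(precision, wanted_relevant):
    """Expected number of documents one must read to see `wanted_relevant`
    relevant ones, assuming relevant documents arrive at rate `precision`."""
    return wanted_relevant / precision

# Ad hoc: precision ~0.28 at 50% recall -> about 11 reads per 3 relevant.
print(round(docs_to_read(0.28, 3)))   # 11
# Routing: precision ~0.34 -> about 9 reads per 3 relevant.
print(round(docs_to_read(0.34, 3)))   # 9
```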
(c) Another evaluation of the system is to look at the precision-recall values at different cut-off points
of 5, 15, 30, 100 and 200 retrieved documents. This may give users a better `feel' than the hypothetical
11-pt Avg. A question is what these values should be compared to. The theoretical limit is of course
1.0, for perfect recall and precision. However, this would punish the system unfairly. For example, at
15 retrieved documents, many queries have x relevant documents with x > 15. Hence the best recall at this
cut-off should be 15/x. This we call the best operational recall, in contrast to the theoretical best of 1.0.
Similarly, if x < 15, these queries would have the best operational precision at this cut-off of x/15, instead
of 1.0. We have listed below for PIRCS1 the precision-recall values at various cut-offs and also the best
operational values for comparison:
Cut-off:                 5      15      30     100     200
Ad Hoc
  Best Oper. Recall:   .146    .298    .456    .828    .916
  Recall:              .066    .130    .204    .419    .586
  Recall/Best:          45%     44%     45%     51%     64%
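The best operational measures defined above follow directly from the cut-off: a perfect system places min(cutoff, x) relevant documents in the top ranks. A minimal sketch of this calculation (illustrative, not code from the paper):

```python
def best_operational(cutoff, num_relevant):
    """Best achievable (recall, precision) at a given cut-off for a query
    with `num_relevant` relevant documents: an ideal system retrieves
    min(cutoff, num_relevant) relevant documents in the top `cutoff`."""
    best_hits = min(cutoff, num_relevant)
    return best_hits / num_relevant, best_hits / cutoff

# A hypothetical query with 40 relevant documents:
print(best_operational(15, 40))   # recall 15/40 = 0.375, precision 1.0
print(best_operational(100, 40))  # recall 1.0, precision 40/100 = 0.4
```

Averaging these per-query values over all queries gives the "Best Oper." row in the table above.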