NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Classification Trees for Document Routing, A Report on the TREC Experiment
R. Tong, A. Winkler, P. Gage
National Institute of Standards and Technology, Donna K. Harman

to separate relevant from non-relevant documents.

Finally, the tree for Topic 17 "Measures to Control Agrochemicals" performs best of all the Category B systems, detecting 53 of the 69 relevant documents. The tree has one decision node:

    class 0 (0.042)
    pesticide<=0.50
    class 1 (0.920)

That is, a test for the presence of the word pesticide, a surprisingly simple structure given its performance.

4.3 Sensitivity to the Choice of Optimal Tree

Because the choice of optimal tree from the sequence of nested sub-trees depends on the cross-validation error estimates, and because the number of training samples is rather small, the tree selection process may in fact be in error. To explore the sensitivity of the system's performance to errors in selecting the optimal tree, we ran an auxiliary experiment for Topic 22 "Counternarcotics". As in the official experiments, we used the adsbal dataset to generate a sequence of sub-trees, but instead of selecting just one (i.e., T*), we saved all the trees. We then used each tree to classify the test data and tabulated the results. These are shown in Table 4.

Table 4: Performance as a Function of Tree Size

Tree No.   Tree Size   CV Error Est.   Rel-Ret @ 200   Recall @ 200   Precision @ 200
1          6           0.3932          12              0.1132         0.0600
2          4           0.5004          1               0.0094         0.0050
3          3           0.3931          24              0.2264         0.1200
4 (T*)     2           0.4645          28              0.2642         0.1400
5          1           0.7866          -               -              -

The columns of the table record the tree identifier (Tree No.), the size of the tree in terms of the number of terminal nodes in the tree (Tree Size), the cross-validation error estimate (CV Error Est.), the number of relevant documents retrieved (Rel-Ret @ 200), and the recall and precision at the cut-off (Recall @ 200 and Precision @ 200).

For this topic, the maximum tree (T1) has six terminal nodes and an estimated error rate of 39%. The minimum tree (T5) has only one terminal node, which in this case classifies all documents as non-relevant, and an estimated error rate of 79%. The optimal tree (T4) has two terminal nodes but an estimated error rate of 46%. We see that the optimal tree did in fact generate the best result, increasing our confi-