SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
chapter
R. Tong
A. Winkler
P. Gage
National Institute of Standards and Technology
Donna K. Harman
to separate relevant from non-relevant documents.
Finally, the tree for Topic 17 "Measures to Control Agrochemicals" does the best of all
the Category B systems, detecting 53 of the 69 relevant documents. The tree has one
decision node:
class 0 (0.042)
pesticide<=0.50
class 1 (0.920)
That is, a test for the presence of the word pesticide, a surprisingly simple structure
given its performance.
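A one-node tree of this kind amounts to a single threshold test on one feature. The following is a minimal sketch, assuming the feature is a raw count of the term in the document and that a count at or below 0.50 leads to class 0 (non-relevant). The tokenizer and the function names are illustrative assumptions and not part of the original system, and the values in parentheses at the leaves are not modeled.

```python
import re
from collections import Counter


def term_counts(text):
    """Count lowercase alphabetic tokens (hypothetical tokenizer, no stemming)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def classify(text, term="pesticide", threshold=0.5):
    """Single decision node: class 0 if count <= threshold, otherwise class 1."""
    count = term_counts(text)[term]
    return 0 if count <= threshold else 1


if __name__ == "__main__":
    docs = [
        "New rules restrict pesticide use on farms.",
        "Grain exports rose last quarter.",
    ]
    for doc in docs:
        print(classify(doc), doc)
```

Because the threshold lies between 0 and 1, the test reduces to checking whether the term occurs at all.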
4.3 Sensitivity to the Choice of Optimal Tree
Since the choice of optimal tree from the sequence of nested sub-trees depends
on the cross-validation error estimates, and since the number of training samples is
rather small, the tree selection process may itself be in error. To explore the
sensitivity of the system's performance to errors in selecting the optimal tree, we
ran an auxiliary experiment for Topic 22 "Counternarcotics".
As in the official experiments, we used the adsbal dataset to generate a sequence of
sub-trees, but instead of selecting just one (i.e., T*), we saved all the trees. We then used
each tree to classify the test data and tabulated the results. These are shown in Table 4.
Table 4: Performance as a Function of Tree Size
Tree No.  Tree Size  CV Error Est.  Rel-Ret @ 200  Recall @ 200  Precision @ 200
1         6          0.3932         12             0.1132        0.0600
2         4          0.5004         1              0.0094        0.0050
3         3          0.3931         24             0.2264        0.1200
4 (T*)    2          0.4645         28             0.2642        0.1400
5         1          0.7866                        -             -
The columns of the table record the tree identifier (Tree No.), the size of the tree in terms of
the number of terminal nodes in the tree (Tree Size), the cross-validation error estimate
(CV Error Est.), the number of relevant documents retrieved (Rel-Ret @ 200), and the recall and
precision at the cut-off (Recall @ 200 and Precision @ 200).
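The recall and precision columns can be reproduced from the Rel-Ret counts. The sketch below assumes that precision is computed over the full cut-off of 200 retrieved documents and that the topic has 106 relevant documents in the test data. That total is not stated in the text; it is inferred from the table (12 / 0.1132 is approximately 106). The names used are illustrative.

```python
CUTOFF = 200
# Assumption: inferred from the table (12 / 0.1132 ~ 106), not stated in the text.
TOTAL_RELEVANT = 106

# Relevant documents retrieved at the cut-off, per tree (from Table 4).
REL_RET = {"T1": 12, "T2": 1, "T3": 24, "T4": 28}


def recall_precision(relevant_retrieved, total_relevant=TOTAL_RELEVANT, cutoff=CUTOFF):
    """Return (recall, precision) at the cut-off."""
    recall = relevant_retrieved / total_relevant
    precision = relevant_retrieved / cutoff
    return recall, precision


if __name__ == "__main__":
    for tree, count in REL_RET.items():
        recall, precision = recall_precision(count)
        print(f"{tree}: recall={recall:.4f} precision={precision:.4f}")
```

Under these assumptions the computed values agree with the table to four decimal places for trees 1 through 4.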
For this topic, the maximum tree (T1) has six terminal nodes and an estimated error
rate of 39%. The minimum tree (T5) has only one terminal node (which in this case
classifies all documents as non-relevant) and an estimated error rate of 79%. The optimal tree
(T4) has two terminal nodes but an estimated error rate of 46%.
We see that the optimal tree did in fact generate the best result, increasing our confi-