NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1). Donna K. Harman (ed.), National Institute of Standards and Technology.

Classification Trees for Document Routing, A Report on the TREC Experiment
R. Tong, A. Winkler, P. Gage

* the number of training examples (Table 2 shows the size of the training sets; notice that for adsbal the set size was approximately 30 for each topic, with approximately 10 relevant instances per topic), and
* the size of the optimal tree (Table 2 also shows the size of the optimal tree generated for each topic; this varied from a tree with only one terminal node to a tree with 10 terminal nodes, the median being 2 terminal nodes).

In fact, given the paucity of information actually used to generate the classification trees, perhaps the most surprising aspect of these results is how well some of the trees perform. There were many instances in which the adsbal trees generated the minimum response. However, there were several results around the median score and even one significantly above the median.

As an example of a tree that performed moderately, consider the optimal tree for Topic 22, "Counternarcotics." This had only one decision node: a split on the word coca (see footnote 6). The actual tree is:

    coca <= 0.50 ?
      yes: class 0 (0.050)
      no:  class 1 (0.862)

That is, the classification is based on the presence or absence of the word coca. If it is present, then the document is marked as "relevant" and the estimated probability of its being relevant is 0.862; if it is not present, then the document is marked as "non-relevant" but in fact still has a small probability (0.050) of being relevant. As we see from the table, this tree identifies 28 out of a possible 106 relevant documents by the 200-document cut-off point.

A tree that performed poorly is the one for Topic 9, "Candidate Sightings."
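The one-node Topic 22 tree amounts to a single threshold test on a term count. A minimal sketch of how such a tree classifies a document, assuming a tokenized word list as input and substring matching as described in footnote 6 (the function name and input format are illustrative, not from the paper):

```python
# Sketch of the one-node Topic 22 tree ("Counternarcotics").
# Per footnote 6, the feature extractor matches any word containing
# "coca" as a substring; names and inputs here are assumptions.

def classify_topic22(doc_words):
    """Return (class label, estimated probability of relevance)."""
    coca_count = sum(1 for w in doc_words if "coca" in w.lower())
    if coca_count <= 0.5:      # split: coca <= 0.50, i.e. coca absent
        return 0, 0.050        # "non-relevant", but P(relevant) = 0.050
    return 1, 0.862            # "relevant", P(relevant) = 0.862
```

For instance, a document tokenized as `["cocaine", "seized"]` would be routed as relevant, since the substring match on coca fires.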
Here the optimal tree had three decision nodes:

    camp <= 0.50 ?
      yes: class 0 (0.000)
      no:  ran <= 1.50 ?
        yes: class 0 (0.000)
        no:  sen <= 9.00 ?
          yes: class 1 (0.948)
          no:  class 0 (0.000)

Thus a relevant document is one which contains the word camp, contains the word sen (an abbreviation for Senator) nine times or fewer, and has at least two occurrences of the word ran. This tree identified only 13 of 157 relevant documents and was the worst performing of the Category B systems.

On the other hand, the tree for Topic 10, "MDS Treatments," performed very well. It

6. Recall that our feature extraction algorithm would match coca with any word with coca as a substring. So coca, cocaine, and Coca-Cola would all match.
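The three-node Topic 9 tree can be sketched as nested threshold tests on term counts, again using the substring matching described in footnote 6. The function names and inputs below are assumptions for illustration, not the authors' implementation:

```python
# Sketch of the three-node Topic 9 tree ("Candidate Sightings") as
# nested splits on term-count features. Per footnote 6, a feature
# matches any word containing it as a substring.

def count_feature(doc_words, feature):
    """Count words in which `feature` occurs as a case-insensitive substring."""
    return sum(1 for w in doc_words if feature in w.lower())

def classify_topic9(doc_words):
    """Return (class label, estimated probability of relevance)."""
    if count_feature(doc_words, "camp") <= 0.5:   # no occurrence of camp
        return 0, 0.000
    if count_feature(doc_words, "ran") <= 1.5:    # fewer than two occurrences of ran
        return 0, 0.000
    if count_feature(doc_words, "sen") <= 9.0:    # sen appears nine times or fewer
        return 1, 0.948
    return 0, 0.000
```

Note that a word like "campaign" would satisfy the camp split under substring matching, which helps explain why so literal a rule generalizes poorly: this tree found only 13 of the 157 relevant documents.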