NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
R. Tong
A. Winkler
P. Gage
National Institute of Standards and Technology
Donna K. Harman
* the number of training examples (Table 2 shows the size of the training
sets; notice that for adsbal the set size was approximately 30 for each
topic, with approximately 10 relevant instances per topic), and
* the size of the optimal tree (Table 2 also shows the size of the optimal tree
generated for each topic; this varied from a tree with only one terminal
node to a tree with 10 terminal nodes, with the median being 2 terminal
nodes).
In fact, given the paucity of information actually used to generate the classification
trees, perhaps the most surprising aspect of these results is how well some of the trees
perform. There were many instances in which the adsbal trees generated the minimum
response. However, there were several results around the median score and even one
significantly above the median.
As an example of a tree that performed moderately well, consider the optimal tree for
Topic 22 "Counternarcotics." This had only one decision node: a split on the word
coca.6 The actual tree is:
class 0 (0.050)
coca <= 0.50
class 1 (0.862)
That is, the classification is based on the presence or absence of the word coca. If it is
present, then the document is marked as "relevant" and the estimate of the probability of
it being relevant is 0.862; if it is not present, then the document is marked as "non-relevant"
but in fact still has a small probability (0.050) of being relevant. As we see from the
table, this tree identifies 28 out of a possible 106 relevant documents by the 200-document
cut-off point.
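The behavior of this one-node tree can be sketched as a single threshold test. The sketch below is an illustrative reconstruction, not the authors' implementation: the function names are invented here, and the feature is a simple token count using the substring matching described in the footnote on coca.

```python
def coca_count(document: str) -> int:
    """Count tokens containing 'coca' as a substring (case-insensitive),
    mirroring the substring-based feature extraction described in the text,
    so 'cocaine' and 'Coca-Cola' both match."""
    return sum(1 for tok in document.lower().split() if "coca" in tok)

def classify_topic22(document: str) -> tuple[int, float]:
    """One-node tree for Topic 22 'Counternarcotics': split on the word coca.
    Returns (class label, estimated probability of relevance)."""
    if coca_count(document) <= 0.50:   # word absent: left branch
        return 0, 0.050                # "non-relevant", small residual probability
    return 1, 0.862                    # word present: "relevant"
```

A document mentioning cocaine even once is routed to class 1 with estimated relevance probability 0.862; anything else falls to class 0 with probability 0.050.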
A tree that performed poorly is the one for Topic 9 "Candidate Sightings." Here the
optimal tree had three decision nodes:
class 0 (0.000)
camp <= 0.50
class 0 (0.000)
ran <= 1.50
class 1 (0.948)
sen <= 9.00
class 0 (0.000)
Thus a relevant document is one which contains the word camp, contains the word sen (an
abbreviation for Senator) nine times or fewer, and has at least two occurrences of the word
ran. This tree identified only 13 of 157 relevant documents and was the worst-performing
of the Category B systems.
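The three nested splits above can be sketched as a chain of threshold tests on token counts. Again this is an illustrative reconstruction under simplifying assumptions: the function name is invented, and whole-token counting is used rather than the substring matching of the actual feature extractor.

```python
from collections import Counter

def classify_topic9(document: str) -> tuple[int, float]:
    """Three-node tree for Topic 9 'Candidate Sightings' as described in the
    text: a document is relevant iff camp is present, ran occurs at least
    twice, and sen occurs nine times or fewer.
    Returns (class label, estimated probability of relevance)."""
    counts = Counter(document.lower().split())
    if counts["camp"] <= 0.50:   # camp absent
        return 0, 0.000
    if counts["ran"] <= 1.50:    # fewer than two occurrences of ran
        return 0, 0.000
    if counts["sen"] <= 9.00:    # sen nine times or fewer
        return 1, 0.948
    return 0, 0.000              # sen ten or more times
```

The conjunction of such specific word-count conditions illustrates why this tree generalized poorly: it memorized incidental properties of a small training set rather than the topic itself.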
On the other hand, the tree for Topic 10 "AIDS Treatments" performed very well. It
6. Recall that our feature extraction algorithm would match coca with any word containing coca as a
substring, so coca, cocaine, and Coca-Cola would all match.