SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
chapter
R. Tong
A. Winkler
P. Gage
National Institute of Standards and Technology
Donna K. Harman
dence that the tree selection process is working as intended despite the small sample
sizes.
4.4 The Use of Surrogate Split Information
A basic problem with the use of classification trees in the TREC experimental environment
is that although we use the probability of being relevant as a mechanism for
ordering the system's output, this is a very weak method for generating a ranking. In
fact, since we typically have very few "output bins" for any given tree, for most topics
we effectively generated no ordering that would discriminate over the top 200 documents.
In this situation we need to look for a mechanism for generating a "finer-grained"
ranking.
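The coarseness of a leaf-probability ranking can be illustrated with a minimal sketch. It assumes a two-leaf tree whose leaf values (0.050 and 0.862, borrowed from the Topic 22 tree given later in this section) are read as probabilities of relevance; that reading, the toy documents, and the names `LEAF_PROBABILITY` and `relevance_score` are hypothetical and are not part of the original experiment.

```python
from collections import Counter

# Assumed leaf probabilities for a two-leaf tree that splits on "coca".
LEAF_PROBABILITY = {"coca_absent": 0.050, "coca_present": 0.862}


def relevance_score(doc_words):
    """Return the leaf probability for a document given as a set of words."""
    leaf = "coca_present" if "coca" in doc_words else "coca_absent"
    return LEAF_PROBABILITY[leaf]


# Toy documents, invented for illustration only.
docs = {
    "d1": {"coca", "cocaine", "colombia"},
    "d2": {"coca"},
    "d3": {"coca", "drug"},
    "d4": {"wheat", "exports"},
}

scores = {doc_id: relevance_score(words) for doc_id, words in docs.items()}

# Count how many documents share each score: a tree with two leaves can
# produce at most two distinct scores, so documents within a leaf are tied.
distinct_scores = Counter(scores.values())
print(scores)
print(distinct_scores)
```

Every document that reaches the same leaf receives the same score, so the ordering within a leaf is arbitrary however many documents land there.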
Our second auxiliary experiment was designed to explore the use of the surrogate
split information generated by the CART experiment as a way of ordering the output.
Surrogate splits are a feature of CART used to deal with the problem of missing data.
They define alternative splitting criteria in the case that the primary feature measurement
is not available [8]. In TREC trees these appear as additional tests on the frequency of
occurrence of words. So, for example, in adsbal the optimal tree for Topic 22
"Counternarcotics" is:
class 0 (0.050)
coca<=0.50
class 1 (0.862)
and the alternatives to the split on the word coca are:
cocaine<=0.50 [1.00]
colombia<=0.50 [0.84]
drug<=0.5 [0.78]
where the numbers in brackets indicate how correlated each surrogate split is with the
optimal split. Thus in this example splits on coca and cocaine are identical in terms of
their ability to classify the documents. We also show the next two most highly correlated
splits.
To use this information for output ranking we took the output from the adsbal tree
for Topic 22 (i.e., the one shown above) and gave each document classified as relevant
by this tree a weighted score based on the number of surrogate split tests it
also satisfied. Thus if a document contained only the word coca it received a score of
1.00. If a document contained coca and colombia then it received a score of 1.84. In general
a document received a score that was the sum of the correlation coefficients for those
tests that were satisfied [9].
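The scoring rule just described can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: it assumes that a test is "satisfied" when the word occurs in the document, that the primary split on coca contributes 1.00 (as the worked scores of 1.00 and 1.84 suggest), and that documents not classified as relevant receive 0.0. The names `surrogate_score` and `rank_documents` and the toy documents are hypothetical.

```python
# Primary split word and surrogate correlations from the Topic 22 example.
PRIMARY_WORD = "coca"
PRIMARY_WEIGHT = 1.00
SURROGATE_WEIGHTS = {"cocaine": 1.00, "colombia": 0.84, "drug": 0.78}


def surrogate_score(doc_words):
    """Sum the weights of the primary and surrogate words found in a document.

    Documents that lack the primary word are not classified as relevant by
    the tree; returning 0.0 for them is an assumption of this sketch.
    """
    if PRIMARY_WORD not in doc_words:
        return 0.0
    score = PRIMARY_WEIGHT
    for word, weight in SURROGATE_WEIGHTS.items():
        if word in doc_words:
            score += weight
    return score


def rank_documents(docs):
    """Return document ids ordered by descending surrogate score."""
    return sorted(docs, key=lambda doc_id: surrogate_score(docs[doc_id]),
                  reverse=True)


# Toy documents, invented for illustration only.
docs = {
    "d1": {"coca", "cocaine", "colombia"},
    "d2": {"coca"},
    "d3": {"coca", "drug"},
    "d4": {"wheat", "exports"},
}

for doc_id in rank_documents(docs):
    print(doc_id, round(surrogate_score(docs[doc_id]), 2))
```

Under these assumptions, documents that all fall in the same "relevant" leaf are separated by how many of the correlated words they also contain, which gives the finer-grained ordering the experiment was looking for.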
8. Of course in the document routing problem addressed by TREC we do not have the notion of a
missing measurement: a word is either present or not. However, it is easy to imagine document
routing scenarios in which this does apply (for example, when working with noisy transmissions),
so that the surrogate split information would be extremely useful.