SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Classification Trees for Document Routing, A Report on the TREC Experiment chapter R. Tong A. Winkler P. Gage National Institute of Standards and Technology Donna K. Harman

dence that the tree selection process is working as intended despite the small sample sizes.

4.4 The Use of Surrogate Split Information

A basic problem with the use of classification trees in the TREC experimental environment is that, although we use the probability of being relevant as a mechanism for ordering the system's output, this is a very weak method for generating a ranking. In fact, since we typically have very few "output bins" for any given tree, for most topics we effectively generated no ordering that would discriminate over the top 200 documents. In this situation we need to look for a mechanism for generating a "finer-grained" ranking.

Our second auxiliary experiment was designed to explore the use of the surrogate split information generated by the CART experiment as a way of ordering the output.

Surrogate splits are a feature of CART used to deal with the problem of missing data. They define alternative splitting criteria in the case that the primary feature measurement is not available (footnote 8). In TREC trees these appear as additional tests on the frequency of occurrence of words. So, for example, in adsbal the optimal tree for Topic 22 "Counternarcotics" is:

            class 0 (0.050)
    coca<=0.50
            class 1 (0.862)

and the alternatives to the split on the word coca are:

    cocaine<=0.50 [1.00]
    colombia<=0.50 [0.84]
    drug<=0.5 [0.78]

where the numbers in brackets indicate how correlated the surrogate split is with the optimal split. Thus in this example splits on coca and cocaine are identical in terms of their ability to classify the documents. We also show the next two most highly correlated splits.
To use this information for output ranking we took the output from the adsbal tree for Topic 22 (i.e., the one shown above) and then gave each document classified as relevant by this tree a weighted score based on the number of surrogate split tests it also satisfied. Thus if a document contained only the word coca it received a score of 1.00. If a document contained coca and colombia then it received a score of 1.84. In general a document received a score that was the sum of the correlation coefficients for those tests that were satisfied (footnote 9).

8. Of course in the document routing problem addressed by TREC we do not have the notion of a missing measurement: a word is either present or not. However, it is easy to imagine document routing scenarios in which this does apply (for example, when working with noisy transmissions), so that the surrogate split information would be extremely useful.
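The surrogate-split scoring rule described for Topic 22 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `TOPIC_22_WEIGHTS`, `surrogate_score`, and `rank_documents` are hypothetical, and the sketch rests on three assumptions: the primary split on coca carries a weight of 1.00 (inferred from the worked scores of 1.00 and 1.84 in the text), a test counts as "satisfied" when the word occurs in the document (frequency above the 0.5 threshold), and the input contains only documents the tree has already classified as relevant.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Correlation weights taken from the Topic 22 example in the text.
# ASSUMPTION: the primary split word "coca" is given weight 1.00,
# inferred from the worked scores (coca alone -> 1.00).
TOPIC_22_WEIGHTS: Dict[str, float] = {
    "coca": 1.00,
    "cocaine": 1.00,
    "colombia": 0.84,
    "drug": 0.78,
}


def surrogate_score(
    tokens: List[str],
    weights: Dict[str, float] = TOPIC_22_WEIGHTS,
    threshold: float = 0.5,
) -> float:
    """Sum the correlation weights of every word test the document satisfies.

    ASSUMPTION: a test is "satisfied" when the word's frequency in the
    document exceeds the split threshold, i.e. the word is present.
    Tokens are expected to be lowercased already.
    """
    counts = Counter(tokens)
    total = sum(w for word, w in weights.items() if counts[word] > threshold)
    return round(total, 2)  # weights have two decimals; avoid float noise


def rank_documents(docs: Dict[str, List[str]]) -> List[Tuple[str, float]]:
    """Order documents (already classified as relevant) by descending score."""
    scored = [(doc_id, surrogate_score(toks)) for doc_id, toks in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    relevant_docs = {
        "doc_a": ["coca", "farmers", "protest"],
        "doc_b": ["coca", "crop", "colombia"],
    }
    for doc_id, score in rank_documents(relevant_docs):
        print(doc_id, score)
```

Because the score is a sum over several weighted tests rather than one leaf probability, documents that fall into the same output bin of the tree can still be separated, which is the finer-grained ordering this section is after.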