NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
National Institute of Standards and Technology, Donna K. Harman (Ed.)

Classification Trees for Document Routing, A Report on the TREC Experiment
R. Tong, A. Winkler, P. Gage

matic) is at least encouraging and definitely acceptable in several instances. Some specific observations on the performance of the current implementation of the CART algorithm are:

* Relying on the re-substitution estimates for the terminal nodes is a very weak method for producing an output ranking. The estimates themselves are not very good and, when combined with optimal trees that emphasize recall over precision, give a largely undifferentiated output. As we noted above, a scheme that makes use of surrogate split information to generate a post hoc ranking shows much promise as a technique for improving our scores in the TREC context.

* While our approach is totally automatic, it is restricted to using as features only those words that appear in the information need statement. This is obviously a limitation, since the use of even simple query expansion techniques (e.g., stemming and/or a synonym dictionary) is likely to provide a richer and more effective set of initial features.

* Using words as features is possibly too "low-level" to ever allow stable, robust classification trees to be produced. At a minimum, we probably need to consider working with concepts rather than individual words. Not only would this reduce the size of the feature space, but it would probably also result in more intuitive trees. The disadvantage of this is that it is not clear where the concepts would come from, other than from a manually constructed knowledge base of some sort.

* We need to work with much bigger and more representative training sets. Our preliminary experiment in this area shows, not surprisingly, that adding more training examples can lead to dramatic changes in the classification trees.
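The query-expansion idea raised in the second observation can be sketched as follows. This is a minimal illustration only: the crude suffix list, the toy synonym dictionary, and the function names are our own assumptions, not part of the system described in the paper.

```python
# Hypothetical sketch: expanding the feature set drawn from an information
# need statement with simple suffix stripping plus a small synonym lookup.

def simple_stem(word):
    """Crude suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def expand_features(query_terms, synonyms):
    """Return the original terms plus their stems and any listed synonyms."""
    expanded = set()
    for term in query_terms:
        term = term.lower()
        expanded.add(term)
        expanded.add(simple_stem(term))
        expanded.update(synonyms.get(term, []))
    return sorted(expanded)

# Toy synonym dictionary -- a stand-in for a manually built resource.
SYNONYMS = {"ship": ["vessel", "boat"]}

print(expand_features(["Shipping", "routed"], SYNONYMS))
```

Even this simple expansion gives the tree-growing step more candidate splits than the literal words of the topic statement alone, which is the point the authors make about initial features.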
As a final comment, we would like to suggest that the overall evaluation paradigm used in TREC does not properly assess the performance of systems on the routing task. Although ad hoc retrieval and routing are similar when viewed in terms of the basic technology, systems designed and built to support these two applications have significantly different requirements. In particular, operational routing systems do not usually emphasize output ordering but instead focus on optimizing the trade-off between detection and false alarm rate. In this respect, at least, we believe that recall and fallout are better indicators of routing performance than recall and precision. Furthermore, artificially limiting the reported output to the first 200 documents automatically discriminates against those routing systems that actually do attempt to perform the recall/fallout trade-off. A fairer set-up for the routing component of TREC would be to allow systems to report exactly those documents marked as relevant. Comparison of systems would then be more complex, since different systems will produce different numbers of documents, but individual scores would give a better picture of routing performance.
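The recall/fallout distinction argued for above can be made concrete with the standard contingency-table definitions (TP relevant-retrieved, FN relevant-missed, FP non-relevant-retrieved, TN non-relevant-rejected). The counts in the example are invented purely for illustration.

```python
# Standard retrieval-effectiveness measures over a 2x2 contingency table.

def recall(tp, fn):
    """Fraction of relevant documents retrieved: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def fallout(fp, tn):
    """Fraction of non-relevant documents retrieved: FP / (FP + TN)."""
    return fp / (fp + tn) if (fp + tn) else 0.0

def precision(tp, fp):
    """Fraction of retrieved documents that are relevant: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Invented example: a router that reports only documents it marks relevant
# retrieves 80 of 100 relevant documents and 50 of 9900 non-relevant ones.
tp, fn, fp, tn = 80, 20, 50, 9850
print(f"recall={recall(tp, fn):.3f} "
      f"fallout={fallout(fp, tn):.4f} "
      f"precision={precision(tp, fp):.3f}")
```

Note that fallout is normalized by the (typically very large) non-relevant set, so it directly measures the false alarm rate that an operational routing system tries to control, whereas precision depends on how many documents the system chooses to report.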