SP500207 NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1) Classification Trees for Document Routing, A Report on the TREC Experiment chapter R. Tong A. Winkler P. Gage National Institute of Standards and Technology Donna K. Harman CLASSIFICATION TREES FOR DOCUMENT ROUTING A Report on the TREC Experiment Richard Tong, Adam Winkler, Pamela Gage Advanced Decision Systems (a division of Booz.Allen & Hamilton) 1500 Plymouth Street, Mountain View, CA 94043 1. Introduction In this paper we describe an approach to document routing on the TREC corpus that employs a technique for the automatic construction of classification trees. The approach makes use of the Classification and Regression Trees (CART) algorithm that has seen application in various areas of machine learning [1,2]. Our initial work with this algorithm [3] has demonstrated that probabilistic struc- tures can be automatically acquired from a training set of documents with respect to a single target concept, or a set of related concepts. These structures can then be applied to individual documents to derive a posterior probability that the document is about a par- ticular target concept. In addition, CART provides the user with a mechanism for explic- itly controlling the precision and recall performance trade-offs. In the CART approach, the task is to provide a direct assessment of the probability that a given concept is in a document given all possible combinations of evidence. Classi- fication trees are used as the probabilistic structure for representing this direct assess- ment. When used to perform the document routing task, the CART algorithm takes as input a collection of example documents (the training set), each of which consists of a known class assignment and a vector of features (e.g., whether or not the document is about the topic of interest, together with a list of the feature words found in the docu- ment). In the general case, the feature values may be boolean, integer, real or nominal and, in addition, may be both noisy and incomplete. The output from CART is a binary decision tree that can be used to classify subsequent documents for which the class is not known. When the training set is noisy, trees of maximal size are invariably too large in that a smaller tree will perform better in terms of classification accuracy. This is the stan- dard statistical "overfitting" problem. CART addresses the overfitting problem via tree pruning and error estimation algorithms that locate the "right sized" tree, ensuring a parsimonious, accurate classifier. The remainder of the paper is divided into a number of sections. In Section 2 we briefly describe the processing steps performed by our system, first to generate the opti- mal classification tree and then to use the tree for classification. In Section 3 we illustrate how the TREC data is processed. In Section 4 we discuss our "official" TREC results. 209