NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Machine Learning for Knowledge-Based Document Routing
(A Report on the TREC-2 Experiment)

Richard M. Tong, Lee A. Appelbaum
Advanced Decision Systems (a division of Booz-Allen & Hamilton, Inc.)
1500 Plymouth Street, Mountain View, CA 94043

1 Introduction

This paper describes the experiments performed by Advanced Decision Systems as part of the Second Text REtrieval Conference (TREC-2).1 The overall system we developed for TREC-2 demonstrates how we can combine statistically-oriented machine learning techniques with a commercially available knowledge-based information retrieval system. As in TREC-1, the tool we use for the fully automatic construction of routing queries is based on the Classification and Regression Trees (CART) algorithm.2 However, in a departure from TREC-1, we have expanded our definition of what constitutes a document "feature" within the CART algorithm, and have also explored how the CART output can be used as the basis of topic definitions that can be interpreted by the TOPIC® retrieval system developed by Verity, Inc. of Mountain View, CA.

Section 2 of the paper reviews the CART algorithm itself and the data structures it produces. Section 3 describes the two basic algorithms we have devised for converting CART output into TOPIC-readable files. Section 4 describes our experimental procedure and analyzes the official results, along with data from a series of auxiliary experiments. Section 5 contains some general comments on overall performance. We conclude in Section 6 with a brief discussion of possible future research directions.

1.
Requests for further information about the TREC-2 experiments should be directed to the authors at the address above, or electronically to either rtong@ads.com or lee@ads.com.
2. A comprehensive discussion of the CART algorithm can be found in [1]. Details of previous work on the use of CART for information retrieval are presented in [2] and [3].

2 The CART Algorithm

CART has been shown to be useful when one has access to datasets describing known classes of observations and wishes to obtain rules for classifying future observations of unknown class, exactly as in the document routing problem. CART is particularly attractive when the dataset is "messy" (i.e., noisy, with many missing values) and thus unsuitable for use with more traditional classification techniques. In addition, and particularly important for the document routing application, if it is important to be able to specify both the misclassification costs and the prior probabilities of class membership, then CART has a direct way of incorporating such information into the tree-building process. Finally, CART can generate auxiliary information, such as the expected misclassification rate for the classifier as a whole and for each terminal node in the tree, that is useful for the document routing problem.

2.1 CART Processing Flow

Figure 1 shows how the CART algorithm is used to construct the optimal classification tree based on the training data provided to it. The diagram shows the four sub-processes used to generate the optimal tree, T*. The "raw" training data (i.e., the original texts of the articles), together with the class specifications (i.e., the training data relevance judgments) and the feature specifications (i.e., the words defined to be features), are input to the Feature Extraction Module. The output is a set of vectors that record the class membership and the features contained in the training data.
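The Feature Extraction Module described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: it assumes binary word-presence features and simple whitespace tokenization, and all function and variable names are hypothetical.

```python
# Illustrative sketch of the Feature Extraction Module: each training
# document is turned into a binary vector recording which of the
# specified feature words it contains, paired with its relevance
# judgment (the class label). Names and tokenization are assumptions.

def extract_features(documents, relevance_judgments, feature_words):
    """Return (feature_vector, class_label) pairs for tree growing."""
    vectors = []
    for doc_id, text in documents.items():
        tokens = set(text.lower().split())
        vector = [1 if word in tokens else 0 for word in feature_words]
        vectors.append((vector, relevance_judgments[doc_id]))
    return vectors

docs = {"d1": "oil prices rose sharply", "d2": "election results announced"}
judgments = {"d1": "relevant", "d2": "not-relevant"}
features = ["oil", "prices", "election"]
pairs = extract_features(docs, judgments, features)
print(pairs)
```

The resulting list of labeled vectors corresponds to the training input consumed by the Tree Growing Module in Figure 1.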
These vectors, together with class priors and the cost function (both optional), are input to the Tree Growing Module, which then constructs the maximal tree Tmax that characterizes the training data. Since this tree overfits the data, the next step is to construct a series of nested sub-trees by pruning Tmax back to the root. The sequence of sub-trees (Tmax > T1 > ... > Tn) is input to the Tree Selection Module, which per-
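The grow/prune/select pipeline above can be sketched with scikit-learn's CART-style decision tree, whose cost-complexity pruning path plays the role of the nested sub-tree sequence. This is an approximation under assumed synthetic data, not the authors' CART implementation.

```python
# Sketch of grow -> prune -> select using scikit-learn's CART-style
# DecisionTreeClassifier. The data here is synthetic; the procedure
# mirrors the paper's pipeline only loosely.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))     # binary word-presence vectors
y = (X[:, 0] | X[:, 3]).astype(int)        # synthetic relevance labels

# Grow the maximal tree T_max and compute the nested pruning sequence
# (each ccp_alpha corresponds to one sub-tree in T_max > T1 > ... > Tn).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Tree Selection Module: pick the sub-tree with the best cross-validated
# misclassification estimate, analogous to selecting the optimal tree T*.
best_alpha, best_score = max(
    ((a, cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                         X, y, cv=5).mean())
     for a in path.ccp_alphas),
    key=lambda t: t[1],
)
t_star = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(t_star.get_n_leaves(), round(best_score, 3))
```

Class priors and misclassification costs, which the paper notes CART accepts directly, could be approximated here through the `class_weight` parameter.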