NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Classification Trees for Document Routing: A Report on the TREC Experiment
R. Tong, A. Winkler, P. Gage

Donna K. Harman (Ed.), National Institute of Standards and Technology

The upper part of the diagram shows the four sub-processes used to generate the optimal tree, T*. The "raw" training data (i.e., the original texts of the articles), together with the class specifications (i.e., the training-data relevance judgments) and the feature specifications (i.e., the words defined to be features), are input to the Feature Extraction Module. The output is a set of vectors that record the class membership and the features contained in the training data. These vectors, together with the (optional) class priors and cost function, are input to the Tree Growing Module, which constructs the maximal tree (Tmax) that characterizes the training data. Since this tree overfits the data, the next step is to construct a series of nested sub-trees by pruning Tmax back to the root. The resulting sequence of sub-trees (Tmax > T1 > ... > Tn) is input to the Tree Selection Module, which performs a cross-validation analysis on the sequence and selects the tree with the lowest cross-validation error.1 This is T*.

The lower part of the diagram shows how T* is used to classify documents of unknown class. As in tree building, the "raw" test data (i.e., the unprocessed texts of the test documents) are passed, together with the feature specifications, to the Feature Extraction Module, which in turn generates a feature vector for each text. These vectors are passed to the Classifier Module, which uses the optimal tree to make the classification decision (i.e., whether the document is relevant or not).
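The grow/prune/select cycle described above can be sketched in a few lines. This is a minimal illustration using scikit-learn's CART implementation as a stand-in for the paper's own modules; the synthetic data, the `ccp_alpha` parameterization of the pruned sub-tree sequence, and all identifiers are scikit-learn's conventions, not the system reported here.

```python
# Sketch of the grow/prune/select pipeline, with scikit-learn's CART
# as a stand-in for the Tree Growing and Tree Selection Modules.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the Feature Extraction Module's output: feature
# vectors X and relevance judgments y (synthetic data here).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 1. Grow the maximal tree Tmax that fully characterizes the training data.
t_max = DecisionTreeClassifier(random_state=0).fit(X, y)

# 2. Prune Tmax back to the root: cost-complexity pruning yields the
#    nested sequence of sub-trees Tmax > T1 > ... > Tn, one per alpha.
alphas = t_max.cost_complexity_pruning_path(X, y).ccp_alphas

# 3. Select T*: the sub-tree with the lowest cross-validation error.
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a),
                          X, y, cv=5).mean() for a in alphas]
t_star = DecisionTreeClassifier(
    random_state=0, ccp_alpha=alphas[int(np.argmax(scores))]).fit(X, y)

# 4. Classify documents of unknown class with the optimal tree.
predictions = t_star.predict(X[:5])
```

Note that scikit-learn breaks ties on alpha rather than applying the paper's complexity-aware rule (footnote 1), which prefers the smaller tree when error rates are statistically indistinguishable.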
3. Working with the TREC Corpus

Since the application of the CART algorithm to document routing is a relatively unexplored area of research (see [3] for some of our preliminary results with a small test corpus), we chose to participate in TREC as a Category B group. That is, we worked only with the Wall Street Journal articles and used only the first 25 query topics. In addition, since CART is intended to be used with training data, it lends itself most readily to the document routing problem; we therefore did not address the ad hoc retrieval problem in our experiments.

Our primary goal in working with the TREC data was to develop a fully automatic approach to generating document classification trees, working only with the information need statements and the training data provided. That is, we wanted to test the hypothesis that CART would be a valuable tool in situations where the user can formulate an information need in some detail and can provide examples of documents that are relevant and examples of those that are not. Such a scenario is common in organizations that monitor large volumes of real-time electronic documents. To illustrate how we used the information need statements, consider the following topic description:

1. The algorithm actually minimizes with respect to both the cross-validation error and the tree complexity, so that if two trees have statistically indistinguishable error rates, the smaller of the two trees is selected as optimal.