NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment)
chapter
R. Tong
L. Appelbaum
National Institute of Standards and Technology
D. K. Harman
Machine Learning for Knowledge-Based Document Routing
(A Report on the TREC-2 Experiment)
Richard M. Tong, Lee A. Appelbaum
Advanced Decision Systems
(a division of Booz-Allen & Hamilton, Inc.)
1500 Plymouth Street, Mountain View, CA 94043
1 Introduction
This paper contains a description of the experi-
ments performed by Advanced Decision Systems as part of
the Second Text Retrieval Conference (TREC-2).1
The overall system we have developed for TREC-2
demonstrates how we can combine statistically-oriented
machine learning techniques with a commercially available
knowledge-based information retrieval system. As in TREC-
1, the tool we are using for the fully automatic construction of
routing queries is based on the Classification and Regression
Trees (CART) algorithm.2 However, in a departure from
TREC-1, we have expanded our definition of what consti-
tutes a document "feature" within the CART algorithm, and
also explored how the CART output can be used as the basis
of topic definitions that can be interpreted by the TOPIC®
retrieval system developed by Verity, Inc. of Mountain View,
CA.
Section 2 of the paper contains a review of the
CART algorithm itself and the data structures it produces.
Section 3 of the paper describes the two basic algorithms we
have devised for converting CART output into TOPIC-read-
able files. Section 4 of the paper contains a description of our
experimental procedure and an analysis of the official
results, as well as data from a series of auxiliary experi-
ments. Section 5 contains some general comments on overall
performance. We conclude in Section 6 with a brief discus-
sion of possible future research directions.
1. Requests for further information about the TREC-2 experiments
should be directed to the authors at the address above, or electroni-
cally to either rtong@ads.com or lee@ads.com.
2. A comprehensive discussion of the CART algorithm can be
found in [1]. Details of previous work on the use of CART for
information retrieval are presented in [2] and in [3].
2 The CART Algorithm
CART has been shown to be useful when one has
access to datasets describing known classes of observations,
and wishes to obtain rules for classifying future observations
of unknown class, exactly as in the document routing prob-
lem. CART is particularly attractive when the dataset is
"messy" (i.e., is noisy and has many missing values) and
thus unsuitable for use with more traditional classification
techniques. In addition, and particularly important for the
document routing application, if it is important to be able to
specify both the misclassification costs and the prior proba-
bilities of class membership then CART has a direct way of
incorporating such information into the tree building pro-
cess. Finally, CART can generate auxiliary information, such
as the expected misclassification rate for the classifier as a
whole and for each terminal node in the tree, that is useful
for the document routing problem.
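CART itself, as described in [1], builds a full tree with pruning, cost weighting, and priors. As a hypothetical illustration of just the core split-selection idea, the following sketch (all names and the toy data are our own, not from the paper) shows how a single CART-style split over binary word features could be chosen by minimizing weighted Gini impurity:

```python
# Hypothetical sketch of one CART split step for document routing.
# Documents are binary feature vectors (1 = word present) and labels
# are relevance judgments; CART greedily picks the feature whose
# split minimizes the weighted Gini impurity of the two children.

def gini(labels):
    """Gini impurity of a set of binary class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_rel = sum(labels) / n  # fraction judged relevant
    return 1.0 - p_rel ** 2 - (1.0 - p_rel) ** 2

def best_split(vectors, labels):
    """Return (feature index, weighted impurity) of the best split."""
    n = len(vectors)
    best = (None, float("inf"))
    for f in range(len(vectors[0])):
        left = [labels[i] for i in range(n) if vectors[i][f] == 1]
        right = [labels[i] for i in range(n) if vectors[i][f] == 0]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (f, score)
    return best

# Toy training set: feature 0 = "contains query term", feature 1 = noise.
docs = [(1, 0), (1, 1), (0, 0), (0, 1)]
judged = [1, 1, 0, 0]  # relevance judgments for the four documents
print(best_split(docs, judged))  # -> (0, 0.0): feature 0 splits cleanly
```

Recursing on each child until a stopping criterion is met yields the maximal tree; misclassification costs and class priors would enter by reweighting the terms in the impurity score.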
2.1 CART Processing Flow
Figure 1 shows how the CART algorithm is used to
construct the optimal classification tree based on the training
data provided to it. The diagram shows the four sub-pro-
cesses used to generate the optimal tree, T*. The "raw" train-
ing data (i.e., the original texts of the articles), together with
the class specifications (i.e., the training data relevance judg-
ments) and the feature specifications (i.e., the words defined
to be features), are input to the Feature Extraction Module.
The output is a set of vectors that record the class member-
ship and the features contained in the training data. These
vectors, together with class priors and the cost function
(these are optional), are input to the Tree Growing Module
which then constructs the maximal tree [OCRerr] that charac-
terizes the training data. Since this tree overfits the data, the
next step is to construct a series of nested sub-trees by prun-
ing Tmax to the root. The sequence of sub-trees (Tmax > T1 >
... > Tn) is input to the Tree Selection Module which per-