SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
chapter
R. Tong
A. Winkler
P. Gage
National Institute of Standards and Technology
Donna K. Harman
CLASSIFICATION TREES FOR DOCUMENT ROUTING
A Report on the TREC Experiment
Richard Tong, Adam Winkler, Pamela Gage
Advanced Decision Systems
(a division of Booz.Allen & Hamilton)
1500 Plymouth Street, Mountain View, CA 94043
1. Introduction
In this paper we describe an approach to document routing on the TREC corpus that
employs a technique for the automatic construction of classification trees. The approach
makes use of the Classification and Regression Trees (CART) algorithm that has seen
application in various areas of machine learning [1,2].
Our initial work with this algorithm [3] has demonstrated that probabilistic struc-
tures can be automatically acquired from a training set of documents with respect to a
single target concept, or a set of related concepts. These structures can then be applied to
individual documents to derive a posterior probability that the document is about a par-
ticular target concept. In addition, CART provides the user with a mechanism for explic-
itly controlling the precision and recall performance trade-offs.
In the CART approach, the task is to provide a direct assessment of the probability
that a given concept is in a document given all possible combinations of evidence. Classi-
fication trees are used as the probabilistic structure for representing this direct assess-
ment. When used to perform the document routing task, the CART algorithm takes as
input a collection of example documents (the training set), each of which consists of a
known class assignment and a vector of features (e.g., whether or not the document is
about the topic of interest, together with a list of the feature words found in the docu-
ment). In the general case, the feature values may be boolean, integer, real or nominal
and, in addition, may be both noisy and incomplete. The output from CART is a binary
decision tree that can be used to classify subsequent documents for which the class is not
known. When the training set is noisy, trees of maximal size are invariably too large in
that a smaller tree will perform better in terms of classification accuracy. This is the stan-
dard statistical "overfitting" problem. CART addresses the overfitting problem via tree
pruning and error estimation algorithms that locate the "right sized" tree, ensuring a
parsimonious, accurate classifier.
The remainder of the paper is divided into a number of sections. In Section 2 we
briefly describe the processing steps performed by our system, first to generate the opti-
mal classification tree and then to use the tree for classification. In Section 3 we illustrate
how the TREC data is processed. In Section 4 we discuss our "official" TREC results.
209