NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment)
R. Tong
L. Appelbaum
National Institute of Standards and Technology
D. K. Harman
CART also generates the following table of error
estimates, which is used in selecting the optimal tree.
Here k indexes the sequence of pruned subtrees, |T| is the
size of the tree (measured as the number of terminal
nodes), R(T) is the resubstitution error rate for the
overall tree, and Rcv(T) is the cross-validation error
rate for the overall tree. Recall that CART minimizes with
respect to both size and cross-validation rate, so for our
example the optimal tree is the one at k=4.
 k  |T|  R(T)    Rcv(T)
 1   16  0.3100  0.3037
 2   11  0.3140  0.2866
 3    8  0.3206  0.2353
*4    5  0.3323  0.2296
 5    2  0.4043  0.3891
 6    1  0.4975  0.6285
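The selection rule just described can be sketched in code. This is an illustration of the rule, not CART itself: given the rows of the error table, pick the subtree with the lowest cross-validation rate, breaking ties in favor of the smaller tree.

```python
# Sketch of optimal-subtree selection from a CART error table.
# Rows are (k, |T|, R(T), Rcv(T)) from the stemmed-feature table above.
rows = [
    (1, 16, 0.3100, 0.3037),
    (2, 11, 0.3140, 0.2866),
    (3,  8, 0.3206, 0.2353),
    (4,  5, 0.3323, 0.2296),
    (5,  2, 0.4043, 0.3891),
    (6,  1, 0.4975, 0.6285),
]

def optimal_k(rows):
    # Minimize cross-validation rate; resolve ties toward the smaller tree.
    return min(rows, key=lambda r: (r[3], r[1]))[0]

print(optimal_k(rows))  # k = 4 for this table
```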
So, pruning the maximal tree so that all nodes have k>4
gives the optimal tree shown in Figure 3.
class 0 (k=5,Rt=0.261725)
VIA<=0.50 (k=5,ac=0,Rt=0.303601)
class 1 (k=5,Rt=0.007834)
SIGN<=0.50 (k=5,ac=0,Rt=0.345477)
class 1 (k=5,Rt=0.003917)
LIGHT<=0.50 (k=6,ac=0,Rt=0.376884)
class 0 (k=5,Rt=0.031407)
CONTRACT<=0.50 (k=7,ac=1,Rt=0.497487)
class 1 (k=6,Rt=0.027421)
Figure 3: Optimal Tree Using Stemmed Features
Notice that by pruning away the deeper nodes in the tree we
are left with a tree that tests on just five of the original 26
stems, and has three paths that lead to class-1 nodes (i.e.,
a decision that the document is relevant).
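A tree of this kind encodes a simple routing decision over binary term-presence features: a document is classed relevant if it satisfies the conjunction of tests along any root-to-class-1 path. The paths below are hypothetical, since the exact branch structure of the tree in Figure 3 is not fully recoverable from the printed listing.

```python
# Sketch of the decision function implied by a CART tree over binary
# term-presence features. Each path is a list of (stem, required_presence)
# tests; the paths themselves are illustrative, not the actual tree.
HYPOTHETICAL_PATHS = [
    [("VIA", True)],
    [("VIA", False), ("SIGN", True)],
    [("VIA", False), ("SIGN", False),
     ("LIGHT", True), ("CONTRACT", True)],
]

def relevant(doc_stems, paths=HYPOTHETICAL_PATHS):
    """True if any path's conjunction of presence tests holds."""
    return any(
        all((stem in doc_stems) == required for stem, required in tests)
        for tests in paths
    )

print(relevant({"VIA", "FIBER"}))  # True: first path satisfied
print(relevant({"GLASS"}))         # False: no path satisfied
```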
The unstemmed version of our procedure is identical
except that we do no stemming of the unique words
extracted from the information need statement. For the
example, this produces the following list of features:
ACTUAL APPLICATION CONCERNING CONTRACTS
DESCRIBE DESCRIBING DOCUMENT EMPLOYED
FIBER FIBERS FUTURE GLASS INFORMATION LAN
LASER LIGHT OPERATIONAL OPTIC OPTICS
PASSED PLASTIC REFERS RELEVANT SIGNED
SITUATIONS TECHNOLOGY TELEPHONE TELEVISION
TRANSMITTED VIA
and the following error table:
 k  |T|  R(T)    Rcv(T)
 1   36  0.0932  0.3168
 2   33  0.0972  0.2763
 3   20  0.1313  0.2935
 4   10  0.1761  0.2758
 5    7  0.1971  0.2815
*6    6  0.2050  0.2929
 7    5  0.2154  0.3101
 8    4  0.2273  0.3101
 9    2  0.2681  0.3328
10    1  0.4975  0.6264
so that the optimal tree is as shown in Figure 4.
3 Transforming CART Trees into TOPIC
Outline Specifications
The trees produced by CART are not directly
usable by TOPIC because they represent the information we
need (i.e., the decision function and the associated decision
variables) in a form that is incompatible with the TOPIC
knowledge-representation. The main problem then is to
define a transformation from the CART representation to the
TOPIC representation that at least preserves the decision
information and perhaps augments it so that we get
improved routing performance.
In this section we explore two possible strategies for
constructing these transformations. The first is a strict re-coding
of the information in the CART tree. The second generalizes
the intent of the CART tree but adds no new information.
Both of these techniques are "automatic," in the sense that,
once various parameters have been chosen, the algorithms
work without human intervention.
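The strict re-coding strategy can be sketched as follows: re-express the set of root-to-class-1 paths as an OR over ANDs of term tests, which is the logical shape an outline representation needs. The `<Or>`/`<And>` outline syntax below is generic pseudo-output for illustration, not actual TOPIC outline file format, and the paths are hypothetical.

```python
# Sketch of the "strict re-coding" idea: render a set of CART paths
# (each a conjunction of term tests) as an OR-of-ANDs outline.
# The outline syntax is illustrative, not real TOPIC format.
def paths_to_outline(topic_name, paths):
    lines = [f"{topic_name} <Or>"]
    for i, tests in enumerate(paths, start=1):
        lines.append(f"  path_{i} <And>")
        for stem, required in tests:
            prefix = "" if required else "NOT "
            lines.append(f"    {prefix}{stem}")
    return "\n".join(lines)

# Hypothetical paths, for illustration only.
paths = [
    [("VIA", True)],
    [("VIA", False), ("SIGN", True)],
]
print(paths_to_outline("fiber_optics", paths))
```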
3.1 First Canonical Form
The first canonical form completely preserves the
decision function generated by CART and involves a simple
mapping of the CART tree into TOPIC outline file format.7
To do this, we begin by observing that each path in the
CART tree from the root to a class-1 leaf node
constitutes a conjunction of tests of decision variables. Since we