SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment)
chapter
R. Tong
L. Appelbaum
National Institute of Standards and Technology
D. K. Harman
Given this list of features we can generate the
CART training vectors by testing all the documents in the
training set for the presence of absence of the feature.6 The
resulting vectors, together with the ground truth classifica-
tion, are the input to CART. The output from CART con-
tains information about the sequence of nested decision
trees and a specification of the optiaal tree.
For our example, the maximal tree is shown in
Figure 2. This tree is to be read from left to right. Thus the
Associated with each decision node is information
that tells:
* what level in the tree the test occurs; the k value -
in all cases k ranges from k=1, which corresponds to
the maximal tree, to k[OCRerr]kmax, which corresponds to
the null tree and where kmax is dependent on the
actual tree grown by CART;
class 0 (k=2,Rt=0.230318)
CONCERN<=0.50 (k--2,ac=0,R[OCRerr]=0.25l256)
class 1 (k=2,R[OCRerr]=0.0l9586)
TRANSMIT<=0.50 (k=2,ac=O1Rt=0.020938)
class 0 (k=2,Rt=0.000000)
GLASS<=0.50 (k=5,ac=0,Rt=0.261725)
class 1 (k=2,Rt=0.007834)
TRANSMIT<=0.50 (k=2,ac=0,R[OCRerr]=0.0l0469)
class 0 (k=2,Rt=0.000000)
CONCERN<=0.50 (k--2,ac=0,Rt=0.010469)
class 0 (k=2,Rt=0.000000)
VIA<=0.50 (k=5,ac=0,R[OCRerr]=0.30360l)
class 1 (k=41Rt=0.003917)
TRANSMIT<[OCRerr]0.50 (k=5,ac=l,Rt=0.007834)
class 0 (k=4,R[OCRerr]=0.000000)
SIGN<--0.50 (k=5,ac=0,Rt=0.345477)
class 1 (k--5,Rt=0.003917)
LIGHT<=0.50 (k=6,ac=0,Rt=0.376884)
class 0 (k=3,Rt=0.000000)
GLASS<=0.50 (k=5,ac=0,R[OCRerr]=0.03l407)
class 0 (k=3,R[OCRerr]=0.0l0469)
TRANSMIT<=0.50 (k=3,ac=0,R[OCRerr]-0.03l407)
class 0 (k=3,Rt=0.010469)
VIA<=0.50 (k=3,ac=l,Rt=0.0l95[OCRerr]6)
class 1 (k--3,Rt=0.003917)
CONTRACT<=0.50 (k=7,ac=l,Rt=0.497487)
class 1 (k=4,Rt=0.019586)
SIGN<=0.50 (k=4,ac=l,Rt=0.023503)
class 0 (k=4,R[OCRerr]=0.000000)
LIGHT<=0.50 (k=6,ac=l,R[OCRerr]=0.02742l)
class 0 (k=4,Rt=0.000000)
Figure 2: Maximal Tree
first test is on the presence of the stem CONTRACT. If the
stem is present (i.e., if the test CONTRACT<=O .5 fails)
then the tree branches downwards (this is leftward in CART
jargon); if the stem is not present (i.e., the test succeeds)
then the tree branches upwards (rightward).
5. We only use the <nan>, <con> and <del> fields, and the stop
word list is taken from (4]. The stemming algorithm is taken from
[5).
6. In our TREC-1 experiments we also made use of the frequency
of occurrence of the features in the documents. Since our ultimate
goal here is to generate `IDPIC trees we restrict ourselves to just
testing for the presence or absence of the feature - this allows us
to perform a straightforward conversion between CART and
F)PIC.
255
the class to be assigned to this node if the tree were
pruned here; the ac value - in the present case
ac= 1 for the class that corresponds to a relevant doc-
ument and ac 0 for the class that corresponds to a
non-relevant document; and
* the error rate of the node; the Rt value - this indi-
cates the resubstitution error rate fbr the specific
node.
The terminal nodes in the tree have similar infor-
mation associated with them. Notice though that terminal
nodes, by definition, specify to which class the node corre-
sponds.