SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment)
R. Tong and L. Appelbaum
National Institute of Standards and Technology, D. K. Harman

CART also generates the following table of error estimates, which are used in selecting the optimal tree. Here k is the k-value, |T| is the size of the tree (measured in terms of the number of terminal nodes), R(T) is the resubstitution rate for the overall tree, and Rcv(T) is the cross-validation rate for the overall tree. Recall that CART minimizes with respect to both size and cross-validation rate, so that for our example the optimal tree is at k=4.

    k   |T|   R(T)     Rcv(T)
    1   16    0.3100   0.3037
    2   11    0.3140   0.2866
    3    8    0.3206   0.2353
   *4    5    0.3323   0.2296
    5    2    0.4043   0.3891
    6    1    0.4975   0.6285

So, pruning the maximal tree so that all nodes have k>4 gives us the optimal tree shown in Figure 3.

[Figure 3: Optimal Tree Using Stemmed Features. A decision tree whose internal-node tests include VIA<=0.50, SIGN<=0.50, LIGHT<=0.50, and CONTRACT<=0.50, with class 0 and class 1 leaf nodes.]

Notice that by pruning away the deeper nodes in the tree we are left with a tree that tests on just five of the original 26 stems, and has three paths that lead to class 1 nodes (i.e., a decision that the document is relevant).

The unstemmed version of our procedure is identical except that we do no stemming of the unique words extracted from the information need statement. For the example, this produces the following list of features:

ACTUAL APPLICATION CONCERNING CONTRACTS DESCRIBE DESCRIBING DOCUMENT EMPLOYED FIBER FIBERS FUTURE GLASS INFORMATION LAN LASER LIGHT OPERATIONAL OPTIC OPTICS PASSED PLASTIC REFERS RELEVANT SIGNED SITUATIONS TECHNOLOGY TELEPHONE TELEVISION TRANSMITTED VIA

and the following error table:

    k   |T|   R(T)     Rcv(T)
    1   36    0.0932   0.3168
    2   33    0.0972   0.2763
    3   20    0.1313   0.2935
    4   10    0.1761   0.2758
    5    7    0.1971   0.2815
   *6    6    0.2050   0.2929
    7    5    0.2154   0.3101
    8    4    0.2273   0.3101
    9    2    0.2681   0.3328
   10    1    0.4975   0.6264

so that the optimal tree is as shown in Figure 4.

3 Transforming CART Trees into TOPIC Outline Specifications

The trees produced by CART are not directly usable by TOPIC because they represent the information we need (i.e., the decision function and the associated decision variables) in a form that is incompatible with the TOPIC knowledge representation. The main problem, then, is to define a transformation from the CART representation to the TOPIC representation that at least preserves the decision information and perhaps augments it so that we get improved routing performance. In this section we explore two possible strategies for constructing these transformations. The first is a strict re-coding of the information in the CART tree. The second generalizes the intent of the CART tree but adds no new information.
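The first strategy, re-coding each path from the root of the CART tree to a relevant (class 1) leaf as a conjunction of feature tests, can be sketched as follows. This is a minimal illustration, not the actual CART or TOPIC data format: the `Node` structure is hypothetical, and the toy tree below borrows split features from the stemmed example only for concreteness.

```python
# Sketch of the strict re-coding idea: each root-to-class-1-leaf path in a
# CART-style binary tree becomes a conjunction of feature tests.
# The Node structure and the toy tree are hypothetical illustrations.

class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 label=None):
        self.feature = feature      # split feature (internal nodes only)
        self.threshold = threshold  # split threshold (internal nodes only)
        self.left = left            # branch taken when feature <= threshold
        self.right = right          # branch taken when feature > threshold
        self.label = label          # class label (leaf nodes only)

def class1_conjunctions(node, tests=()):
    """Collect, for every class-1 leaf, the conjunction of tests on its path."""
    if node.label is not None:                      # leaf node
        return [list(tests)] if node.label == 1 else []
    left = class1_conjunctions(
        node.left, tests + ((node.feature, "<=", node.threshold),))
    right = class1_conjunctions(
        node.right, tests + ((node.feature, ">", node.threshold),))
    return left + right

# A toy two-level tree using split features from the stemmed example.
tree = Node("VIA", 0.50,
            left=Node("SIGN", 0.50, left=Node(label=0), right=Node(label=1)),
            right=Node(label=1))

for conj in class1_conjunctions(tree):
    print(" AND ".join(f"{f} {op} {t}" for f, op, t in conj))
```

Each printed conjunction corresponds to one path ending at a relevant leaf; these conjunctions are the raw material that a re-coding into an outline specification would have to preserve.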
Both of these techniques are "automatic," in the sense that, once various parameters have been chosen, the algorithms work without human intervention.

3.1 First Canonical Form

The first canonical form completely preserves the decision function generated by CART and involves a simple mapping of the CART tree into the TOPIC outline file format.7 To do this, we begin by observing that each path in the CART tree from the root to a class-1 leaf node constitutes a conjunction of tests of decision variables. Since we