NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

SP500215 NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2) Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment) chapter R. Tong L. Appelbaum National Institute of Standards and Technology D. K. Harman class 0 (k--lO,Rt-0.0628l4) TELEPHONE<=0.50 (k=ll,ac=l,R[OCRerr]=0.497487) class 1 (k=8,R[OCRerr]-0.l42l39) TRANSMITTED<=0.50 (k=9,ac=l,Rt=0.153984) class 0 (k=8,R[OCRerr]=0.000000) LIGHT<=0.50 (k=l0,ac=l,R[OCRerr]=0.2053l2) class 0 (k=7,Rt=0.000000) TRANSMITTEL<=0.50 (k=9,ac=0,R[OCRerr]=0.0l0469) class 1 (k=7,R[OCRerr]=0.000000) ACTUAL<=0.50 (k=9,ac=0,R[OCRerr]=0.03l407) class 1 (k=9,R[OCRerr]=0.000000) Figure 4: Optimal Tree Using Unstemmed Features have constrained the tests to be of the form "Is word X present or not?" we can easily model this conjunction in TOPIC using AND and NOT operators. Since there will gen- erally be multiple paths to class - 1 leaf nodes, the algo- rithm combines the separate paths in the TOPIC definition using the OR operator. To illustrate, the optimal tree for our example con- tains the following conjunctive descriptions of class-i leaf nodes: 1. CONTRACT 2. SIGN and not LIGHT and not CONTRACT 3. VIA and not SIGN and not LIGHT and not CONTRACT which leads directly to a TOPIC outline of the form: Topic[OCRerr]097 <Or> * 0.77 TopicStyle[OCRerr]097[OCRerr]l <Or> ** 0.99 TopicPath[OCRerr]097_1-1 <And> `CONTRACT' `LIGHT' `SIGN' `VIA' ** 1.00 TopicPath[OCRerr]097_1-2 <And> `CONTRACT' `LIGHT' `SIGN' ** 0.97 TopicPath[OCRerr]097_1-3 <And> `CONTRACT' Here we use a notation for topic names that ensures uniqueness. Notice that in this model we use the resubstim- tion rates (actually 1-Rt) to define a weight for the individ- ual conjuncts and the overall cross-validation rate (actually 1-[OCRerr][OCRerr]) as the weight for the disjuncts. Note also that since TOPIC only uses weights defined to two decimal places we have rounded the weights derived from the CART tree. 7. TOPIC uses a representation for ooncepts that can be recorded in so[OCRerr]alled outline format. A collection of such specifications is used by TOPIC to built the concept trees used for retrieval. 257 3.2 Second Canonical Form The second canonical form makes use of the deci- sion variables chosen by CART but not the actual decision function. This model is based on two observations: * first, that the set of variables used by CART are, when taken as a whole, indicative of the general topic of the information need statement, and * second, that every variable used in the tree is on the path to at least one class-O node and at least one class-i node. Thus, from an information retrieval perspective, all the decision variables have some contribution to make to an assessment of the relevance of a document. So rather than use the specific decision function constructed by CART we can replace this with one that can be thought of as "The more features present in a document the better." This is modelled straightforwardly in TOPIC using the ACCRUE operator. To illustrate, the maximal tree for our example contains the following set of decision variables: CONCERN CONTRACT GLASS LIGHT SIGN TRANSMIT VIA which leads directly to a TOPIC outline of the form: Topic[OCRerr]097 <Or> * 0.70 TopicSty1e[OCRerr]097[OCRerr]2 <Accrue> ** 0.25 `CONCERN' ** 0.75 `CONTRACT' ** 0.75 `GLASS' ** 0.75 `LIGHT' ** 0.75 `SIGN' ** 0.75 `TRANSMIT' ** 0.75 `VIA The weighting scheme we have used gives higher values to variables (features) that are in the optimal tree, intermediate values to variables on the fringe of the optimal tree (there