SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment)
chapter
R. Tong
L. Appelbaum
National Institute of Standards and Technology
D. K. Harman
class 0 (k--lO,Rt-0.0628l4)
TELEPHONE<=0.50 (k=ll,ac=l,R[OCRerr]=0.497487)
class 1 (k=8,R[OCRerr]-0.l42l39)
TRANSMITTED<=0.50 (k=9,ac=l,Rt=0.153984)
class 0 (k=8,R[OCRerr]=0.000000)
LIGHT<=0.50 (k=l0,ac=l,R[OCRerr]=0.2053l2)
class 0 (k=7,Rt=0.000000)
TRANSMITTEL<=0.50 (k=9,ac=0,R[OCRerr]=0.0l0469)
class 1 (k=7,R[OCRerr]=0.000000)
ACTUAL<=0.50 (k=9,ac=0,R[OCRerr]=0.03l407)
class 1 (k=9,R[OCRerr]=0.000000)
Figure 4: Optimal Tree Using Unstemmed Features
have constrained the tests to be of the form "Is word X
present or not?" we can easily model this conjunction in
TOPIC using AND and NOT operators. Since there will gen-
erally be multiple paths to class - 1 leaf nodes, the algo-
rithm combines the separate paths in the TOPIC definition
using the OR operator.
To illustrate, the optimal tree for our example con-
tains the following conjunctive descriptions of class-i leaf
nodes:
1. CONTRACT
2. SIGN and not LIGHT and not CONTRACT
3. VIA and not SIGN and not LIGHT and not
CONTRACT
which leads directly to a TOPIC outline of the form:
Topic[OCRerr]097 <Or>
* 0.77 TopicStyle[OCRerr]097[OCRerr]l <Or>
** 0.99 TopicPath[OCRerr]097_1-1 <And>
`CONTRACT'
`LIGHT'
`SIGN'
`VIA'
** 1.00 TopicPath[OCRerr]097_1-2 <And>
`CONTRACT'
`LIGHT'
`SIGN'
** 0.97 TopicPath[OCRerr]097_1-3 <And>
`CONTRACT'
Here we use a notation for topic names that ensures
uniqueness. Notice that in this model we use the resubstim-
tion rates (actually 1-Rt) to define a weight for the individ-
ual conjuncts and the overall cross-validation rate (actually
1-[OCRerr][OCRerr]) as the weight for the disjuncts. Note also that since
TOPIC only uses weights defined to two decimal places we
have rounded the weights derived from the CART tree.
7. TOPIC uses a representation for ooncepts that can be recorded
in so[OCRerr]alled outline format. A collection of such specifications is
used by TOPIC to built the concept trees used for retrieval.
257
3.2 Second Canonical Form
The second canonical form makes use of the deci-
sion variables chosen by CART but not the actual decision
function. This model is based on two observations:
* first, that the set of variables used by CART are,
when taken as a whole, indicative of the general topic
of the information need statement, and
* second, that every variable used in the tree is on the
path to at least one class-O node and at least one
class-i node.
Thus, from an information retrieval perspective, all
the decision variables have some contribution to make to an
assessment of the relevance of a document. So rather than
use the specific decision function constructed by CART we
can replace this with one that can be thought of as "The
more features present in a document the better." This is
modelled straightforwardly in TOPIC using the ACCRUE
operator.
To illustrate, the maximal tree for our example
contains the following set of decision variables:
CONCERN CONTRACT GLASS LIGHT SIGN TRANSMIT
VIA
which leads directly to a TOPIC outline of the form:
Topic[OCRerr]097 <Or>
* 0.70 TopicSty1e[OCRerr]097[OCRerr]2 <Accrue>
** 0.25 `CONCERN'
** 0.75 `CONTRACT'
** 0.75 `GLASS'
** 0.75 `LIGHT'
** 0.75 `SIGN'
** 0.75 `TRANSMIT'
** 0.75 `VIA
The weighting scheme we have used gives higher values to
variables (features) that are in the optimal tree, intermediate
values to variables on the fringe of the optimal tree (there