SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment)
chapter
R. Tong
L. Appelbaum
National Institute of Standards and Technology
D. K. Harman
[Figure 1: CART Processing Flow. Diagram labels: Raw Training Data; Feature Specs.; Class Specs.; Training Vectors; Tree Growing; Class Priors; Cost Function; Largest Tree (Tmax); Nested Sub-Trees (Tmax > T1 > T2 ... > Tn); Optimal Tree (T*).]
forms a cross-validation analysis on the sequence and
selects that tree with the lowest cross-validation error.3 This
is T*.
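The selection step can be sketched in a few lines of Python. This is an illustrative sketch only, not the CART implementation: the `PrunedTree` record, the `select_optimal` function, and all numbers are hypothetical. It also assumes that "statistically indistinguishable" means "within one standard error of the minimum cross-validation error", a common convention that the text here does not confirm.

```python
from dataclasses import dataclass


@dataclass
class PrunedTree:
    """One member of the nested sequence Tmax > T1 > T2 ... > Tn (hypothetical record)."""
    name: str        # label such as "Tmax" or "T1"
    n_leaves: int    # tree complexity, measured here as the number of terminal nodes
    cv_error: float  # cross-validation error estimate
    cv_se: float     # standard error of that estimate


def select_optimal(trees):
    """Pick the smallest tree whose CV error is within one SE of the minimum.

    Assumption: the one-standard-error threshold stands in for
    "statistically indistinguishable error rates".
    """
    best = min(trees, key=lambda t: t.cv_error)
    threshold = best.cv_error + best.cv_se
    candidates = [t for t in trees if t.cv_error <= threshold]
    return min(candidates, key=lambda t: t.n_leaves)


# Made-up values for illustration; they are not results from the paper.
sequence = [
    PrunedTree("Tmax", 40, 0.21, 0.02),
    PrunedTree("T1", 22, 0.18, 0.02),
    PrunedTree("T2", 9, 0.19, 0.02),
    PrunedTree("T3", 3, 0.27, 0.03),
]

t_star = select_optimal(sequence)
print(t_star.name, t_star.n_leaves)
```

In this made-up sequence T1 has the lowest error, but T2 falls within the threshold and is smaller, so T2 is returned as T*.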
In our TREC-2 system, we take T* and convert it
into a TOPIC outline file as described in Section 3. That is,
rather than use CART itself to perform the routing on the
unseen documents (as we did in TREC-1), we use the
CART trees as skeletons for TOPIC concepts, and then have
TOPIC do the routing. An advantage of this is that we can
make use of TOPIC's extensively optimized text database
capabilities, thus allowing us to easily generate the output
files needed for the official scoring program.
2.2 Data Structures Generated by CART
To illustrate the processing that CART performs,
we will use an example taken from the TREC-2 corpus. An
example query is as follows:
<top>
<head> Tipster Topic Description
<num> Number: 097
<dom> Domain: Science and Technology
<title> Topic: Fiber Optics Applications
<desc> Description:
Document must identify instances of fiber
optics technology actually in use.
<narr> Narrative:
To be relevant, a document must describe
actual operational situations in which
fiber optics are being employed, or will
be employed. A document describing future
fiber optics use will be relevant only if
contracts have been signed concerning the
future application.
<con> Concept(s):
1. fiber optic, light
2. telephone, LAN, television
<fac> Factor(s)
<def> Definition(s)
1. Fiber optics refers to technology in
which information is passed via laser
light transmitted through glass or plastic
fibers.
</top>
This is a very comprehensive statement of information
need and provides a rich set of features that we can
use for CART.4 Our basic procedure is to extract from the
information need statement all the unique content words
and then stem them, which gives the following list:
ACT APPL CONCERN CONTRACT DESCRIB DOCU
EMPLOY FIB FUT GLASS INFORM LAN LAS LIGHT
OPER OPT PAS PLAST REF RELEV SIGN SITU
TECHNOLOG TELEPHON TELEV TRANSMIT VIA5
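The extract-and-stem procedure can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the stop list, the suffix table, and the functions `crude_stem` and `extract_features` are hypothetical, since the stemmer and stop list actually used are not specified here. The stems it produces will therefore differ from those in the list above.

```python
import re

# Illustrative stop list only; the one actually used is not given in the text.
STOP_WORDS = {
    "a", "an", "the", "in", "is", "are", "be", "being", "been", "or",
    "will", "to", "of", "if", "only", "which", "must", "have",
}

# Hypothetical suffix table, checked longest first.
SUFFIXES = ("ations", "ation", "ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(text):
    """Return the sorted, unique, upper-cased stems of the content words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = {crude_stem(t).upper() for t in tokens if t not in STOP_WORDS}
    return sorted(stems)


description = (
    "Document must identify instances of fiber optics technology "
    "actually in use."
)
print(extract_features(description))
```

Each resulting stem would then serve as one candidate feature when the training documents are turned into vectors for CART.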
3. The algorithm actually minimizes with respect to both the
cross-validation error and the tree complexity, so that if two trees
have statistically indistinguishable error rates, the smaller of
the two is selected as optimal.
4. In general, determining what set of features to use for CART is
a matter of experience and judgement. For the TREC-2 corpus we
have made use of the information need statements, but other
approaches, such as using all the unique words in the training set,
are equally valid. In fact, it is this freedom of choice of features
that gives CART a great deal of its flexibility.