NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment)
R. Tong
L. Appelbaum
National Institute of Standards and Technology
D. K. Harman
seemed unsatisfactory. We have not been able to identify
any specific conditions under which we could expect the
CART trees to perform well.
We believe, nevertheless, that further experiments
to assess the utility of using a variety of extensions to the
basic CART technique would still be of interest. In particu-
lar, the use of "low-level" concepts as features, surrogate
split information, larger training sets, and features drawn
from the complete corpus rather than the information need
statements, are all likely to assist in the construction of more
effective trees.
The main focus of our TREC-2 effort has been to
explore the idea that the CART trees can form the basis of a
semi-automated approach to building knowledge-based
descriptions of routing topics. To that end, we developed
two techniques for converting CART output into a form that
can be used by the TOPIC retrieval engine from Verity, Inc.
In the first technique we perform a "lossless" transformation
of the CART tree into a TOPIC tree. In the second we generalize
the CART tree, but add no new features.
As the examples contained in the paper show, the
TOPIC trees generated in the first canonical form are "skel-
etal." This is just as we would expect. Since CART is a par-
simonious classifier, rather than a broad-based information
retrieval tool, it produces minimally complex decision trees
with respect to the expected misclassification rate.
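To make the flavor of the "lossless" transformation concrete, here is a minimal Python sketch of converting a CART-style decision tree into a nested topic outline. The node structure and the operator names (<IF>, <ELSE>, <ACCRUE>) are hypothetical stand-ins for illustration, not actual TOPIC syntax.

```python
# Sketch of a "lossless" CART-to-outline conversion. The dict-based node
# structure and the operator names are invented stand-ins, not the real
# Verity TOPIC file format.

def cart_to_topic(node, depth=0):
    """Recursively render a CART-style tree as an indented topic outline."""
    pad = "  " * depth
    if "label" in node:                      # leaf: relevance decision
        return f"{pad}<ACCRUE> {node['label']}\n"
    # internal node: split on presence or absence of a term
    out = f"{pad}<IF> {node['term']}\n"
    out += cart_to_topic(node["present"], depth + 1)
    out += f"{pad}<ELSE>\n"
    out += cart_to_topic(node["absent"], depth + 1)
    return out

# Toy two-split tree of the parsimonious, "skeletal" kind CART produces.
tree = {
    "term": "superconductivity",
    "present": {"label": "relevant"},
    "absent": {
        "term": "magnet",
        "present": {"label": "relevant"},
        "absent": {"label": "not-relevant"},
    },
}
print(cart_to_topic(tree))
```

Because the conversion is purely structural, every split and leaf of the original tree survives in the outline, which is what makes the transformation lossless.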
From an information retrieval perspective, the sec-
ond canonical form seems more like the ones we might have
built by hand. It uses a range of features and gives them
weights based on their ability to discriminate among the
training data. In effect, we have used CART to indicate
which of the features are of use in defining the topic and
then generalized the CART decision function using our
(external) knowledge of the information retrieval problem.
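As a toy illustration of weighting features by their ability to discriminate among the training data (not the authors' actual procedure), each candidate term can be scored by information gain over a small labeled collection:

```python
import math

# Toy illustration: score each term by information gain, i.e. the reduction
# in label entropy obtained by splitting documents on presence of the term.
# The training set below is invented for demonstration.

def entropy(labels):
    if not labels:
        return 0.0
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(docs, labels, term):
    """Entropy reduction from splitting on presence of `term`."""
    with_term = [l for d, l in zip(docs, labels) if term in d]
    without = [l for d, l in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = ((len(with_term) / n) * entropy(with_term)
                   + (len(without) / n) * entropy(without))
    return entropy(labels) - conditional

# Each document is represented as a set of terms.
docs = [{"laser", "optics"}, {"laser"}, {"tax", "policy"}, {"tax"}]
labels = ["relevant", "relevant", "not-relevant", "not-relevant"]

weights = {t: info_gain(docs, labels, t) for t in ("laser", "optics", "tax")}
print(weights)
```

Here "laser" separates the two classes perfectly and so receives the maximum weight, while "optics" occurs in only one relevant document and scores lower.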
Using these automatically constructed TOPIC trees
as a starting point, we conducted a limited series of tests to
assess the impact of performing minimal "editing" of these
trees. For the two topics selected, we were able to produce
significant performance gains with edits that added no new
information (at least at the level of the features used) and
that took on the order of only a few minutes to implement.
We are encouraged by these results, and, while
generalizing them will require a more carefully
controlled series of experiments, we are now of the opinion that
the most effective role for machine learning techniques in
information retrieval is as a tool for producing candidate
descriptions of information need. These candidates can then
be reviewed by end-users who can easily make obvious
corrections and modifications. We intend to explore this idea in
more detail and report on our results in TREC-3.
6 Future Research
There are a number of directions in which we might
develop the basic research ideas presented in this paper. We
briefly consider several of them here.
We currently use just two classes (relevant and not
relevant), but nothing in CART prevents it from working with
multiple classes. For the document routing problem, there is
a case to be made for adding a third class: unknown
relevance. Adding such a class might allow us to make use of
larger training sets without the costs associated with
developing large ground truths.
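The impurity measures that drive CART splits extend to a third class without modification. A minimal pure-Python sketch, using an invented toy training set in which unjudged documents carry an explicit "unknown" label:

```python
# Sketch of a three-class split selection: the Gini impurity used to pick
# CART splits handles any number of classes, so unjudged documents can be
# kept with an "unknown" label instead of being discarded. The training
# set below is invented for demonstration.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(docs, labels, vocab):
    """Choose the term whose presence/absence split minimizes weighted Gini."""
    n = len(labels)
    best = None
    for term in vocab:
        left = [l for d, l in zip(docs, labels) if term in d]
        right = [l for d, l in zip(docs, labels) if term not in d]
        score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if best is None or score < best[1]:
            best = (term, score)
    return best

docs = [{"laser"}, {"laser", "optics"}, {"tax"}, {"budget"}, {"optics"}]
labels = ["relevant", "relevant", "not-relevant", "not-relevant", "unknown"]
print(best_split(docs, labels, {"laser", "tax", "optics", "budget"}))
```

In this toy set the term "laser" isolates the two relevant documents, so it yields the lowest weighted impurity even with the third class present.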
One way of extending the skeleton TOPIC trees
produced by our tool is to make use of external lexical
resources. For example, we might investigate the use of
WordNet as a way to expand each of the classification fea-
tures into a set of related words. Similarly, we might investi-
gate the use of TOPIC's own lexical resources (e.g., the
thesaurus and Soundex tools) by replacing the unstemmed
words in the topic outline files with the appropriate TOPIC
operator (e.g., <SOUNDEX> word, or <THESAURUS> word).
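A sketch of the feature-expansion idea, with a hand-built synonym table standing in for a lexical resource such as WordNet, and a hypothetical <OR> grouping standing in for whatever operator the retrieval engine would actually use:

```python
# Sketch of expanding each classification feature into a set of related
# words before emitting the topic outline. SYNONYMS is a hand-built
# stand-in for WordNet, and <OR> is an invented grouping operator, not
# real TOPIC syntax.

SYNONYMS = {
    "ship": {"vessel", "boat"},
    "cargo": {"freight", "shipment"},
}

def expand_feature(term):
    """Return the term plus any related words as a grouped disjunction."""
    related = SYNONYMS.get(term, set())
    if not related:
        return term
    words = [term] + sorted(related)
    return "<OR> (" + ", ".join(words) + ")"

print(expand_feature("ship"))
print(expand_feature("tax"))   # no entry: passes through unchanged
```

Terms without an entry in the lexical resource pass through unchanged, so the expansion can only broaden, never narrow, the skeleton tree's coverage.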
Although we have used CART as the module for
building the initial classification tree, we might be interested
in exploring other tree building tools that have been used in
the machine learning community. For example, the C4.5
algorithm by Quinlan [6], or various algorithms based on
Bayesian methods such as Minimum Message Length models
and decision graphs [7]. All of these tools generate
decision trees from training data but offer different
mathematical philosophies to justify their approaches.
Finally, as in all machine learning problems, the
initial choice of features over which to learn is critical
to the overall success of the process. An investigation
of various extended feature definition tools (e.g., recognizing
key phrases and proper names), as well as exploring
the impact of making different assumptions from TOPIC
about how the lexical tokens in the texts are to be treated,
would almost certainly yield important insights.
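As one crude illustration of an extended feature definition, runs of capitalized words can be treated as candidate proper-name features rather than single word tokens. The regular expression below is our own heuristic sketch, not a tool from the paper, and real name recognition would need far more care:

```python
import re

# Crude heuristic: treat maximal runs of two or more capitalized words as
# candidate proper-name features. Sentence-initial words are swept in too,
# which a real recognizer would have to handle.

def proper_names(text):
    """Return maximal runs of two or more capitalized words."""
    return re.findall(r"\b(?:[A-Z][a-z]+\s+)+[A-Z][a-z]+\b", text)

print(proper_names(
    "The National Institute of Standards issued a report on Text Retrieval."))
```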
References
1. Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone,
C. J. Classification and regression trees. Pacific
Grove, CA: Wadsworth & Brooks; 1984.