NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
R. Tong
A. Winkler
P. Gage
National Institute of Standards and Technology
Donna K. Harman
matic) is at least encouraging and definitely acceptable in several instances.
Some specific observations on the performance of the current implementation of the
CART algorithm are:
* Relying on the re-substitution estimates for the terminal nodes is a very
weak method for producing an output ranking. The estimates themselves
are not very good and, when combined with optimal trees that emphasize
recall over precision, give a largely undifferentiated output. As we noted
above, a scheme that makes use of surrogate split information to generate
a post hoc ranking shows much promise as a technique for improving our
scores in the TREC context.
* While our approach is totally automatic, it is restricted to using as fea-
tures only those words that appear in the information need statement.
This is obviously a limitation since the use of even simple query expan-
sion techniques (e.g., stemming and/or a synonym dictionary) is likely to
provide a richer and more effective set of initial features.
* Using words as features is possibly too "low-level" to ever allow stable,
robust classification trees to be produced. At a minimum, we probably
need to consider working with concepts rather than individual words.
Not only would this reduce the size of the feature space, but it would
probably also result in more intuitive trees. The disadvantage of this
approach is that it is not clear where the concepts would come from, other
than from a manually constructed knowledge base of some sort.
* We need to work with much bigger and more representative training sets.
Our preliminary experiment in this area shows, not surprisingly, that
adding more training examples can lead to dramatic changes in the classi-
fication trees.
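To make the first of these observations concrete, the following is a minimal sketch of a CART-style split over binary word-presence features, with the re-substitution estimate computed at each resulting node. This is illustrative only, not the authors' implementation; the query words, documents, and relevance labels are invented.

```python
# Minimal CART-style split sketch (illustrative, not the paper's system).

def gini(labels):
    """Gini impurity of a list of 0/1 relevance labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(docs, labels, words):
    """Pick the word whose presence/absence split minimizes weighted Gini."""
    best = None
    for w in words:
        left = [l for d, l in zip(docs, labels) if w in d]
        right = [l for d, l in zip(docs, labels) if w not in d]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (w, score, left, right)
    return best

# Features are restricted to words from the information-need statement,
# as in the paper; these particular words are hypothetical.
words = ["satellite", "launch", "commercial"]
docs = [
    {"commercial", "satellite", "launch"},
    {"launch", "vehicle", "payload"},
    {"satellite", "television"},
    {"election", "results"},
]
labels = [1, 1, 0, 0]  # invented relevance judgments

word, score, left, right = best_split(docs, labels, words)

# Re-substitution estimate at a terminal node: the fraction of *training*
# documents in that node that are relevant. Every document falling into
# the same node receives the same score, which is why an output ranking
# built from these estimates tends to be largely undifferentiated.
left_estimate = sum(left) / len(left) if left else 0.0
right_estimate = sum(right) / len(right) if right else 0.0
```

Because the estimate is computed on the same data the tree was grown from, it is also optimistically biased, compounding the ranking problem noted above.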
As a final comment, we would like to suggest that the overall evaluation paradigm
used in TREC does not properly assess the performance of systems on the routing task.
Although ad hoc retrieval and routing are similar when viewed in terms of the basic tech-
nology, systems designed and built to support these two applications have significantly
different requirements. In particular, operational routing systems do not usually empha-
size output ordering but instead focus on optimizing the trade-off between detection and
false alarm rate. In this respect, at least, we believe that recall and fallout are better indi-
cators of routing performance than recall and precision. Furthermore, artificially limiting
reported output to the first 200 documents automatically discriminates against those
routing systems that actually do attempt to perform the recall/fallout trade-off. A fairer
set-up for the routing component of TREC would be to allow systems to report exactly
those documents marked as relevant. Comparison of systems would be more complex
since different systems will produce different numbers of documents, but individual
scores would give a better picture of routing performance.
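The contrast between the two metric pairs can be sketched numerically. The counts below are invented for illustration: a 1000-document collection with 20 relevant documents, where a routing system marks 30 documents as relevant and catches 15 of them.

```python
# Sketch of recall/precision versus recall/fallout for a routing system
# that reports a fixed set of documents. All counts are invented.

def routing_metrics(retrieved, relevant, collection_size):
    """Compute recall, precision, and fallout for a reported document set."""
    tp = len(retrieved & relevant)   # relevant documents retrieved
    fp = len(retrieved - relevant)   # non-relevant retrieved (false alarms)
    nonrelevant = collection_size - len(relevant)
    recall = tp / len(relevant)
    precision = tp / len(retrieved)
    fallout = fp / nonrelevant       # false-alarm rate over non-relevant docs
    return recall, precision, fallout

# Hypothetical document IDs: the system reports IDs 0-29; the 20 relevant
# documents are IDs 0-14 (caught) and 995-999 (missed).
retrieved = set(range(30))
relevant = set(range(15)) | set(range(995, 1000))
r, p, f = routing_metrics(retrieved, relevant, 1000)
```

Note that fallout is normalized by the number of non-relevant documents in the collection, so a system that keeps false alarms low scores well on recall/fallout even when its reported set is small, whereas a 200-document cutoff forces every system into the same output size regardless of how it manages that trade-off.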