NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
R. Tong
A. Winkler
P. Gage
National Institute of Standards and Technology
Donna K. Harman
For example, the tree for Topic 22 "Counternarcotics" becomes:
class 0 (0.000)
drug<=0.50
class 1 (0.905)
Thus, although the tree size is unchanged, the test is now on the word drug instead of
coca. As we might expect, this turns out to be a much less generally useful test, and the
tree identifies only 8 of the relevant articles, which is in fact the minimum retrieved by
a Category B system.
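To make the reading of such a listing concrete, here is a minimal sketch of the one-test tree above. It assumes that the branch listed above a test is taken when the test holds, and that a document is available as a mapping from word to occurrence count. The function name and the input format are illustrative and are not taken from the system described here.

```python
def route_topic22(counts):
    """Classify a document with the single-test tree for Topic 22.

    `counts` maps a word to its number of occurrences in the document
    (a hypothetical input format chosen for this sketch).
    """
    # The test drug<=0.50 holds only when the word does not occur at all.
    if counts.get("drug", 0) <= 0.50:
        return 0  # class 0: not relevant
    return 1      # class 1: relevant


if __name__ == "__main__":
    print(route_topic22({"drug": 3, "coca": 1}))
    print(route_topic22({"coca": 2}))
```

Under this reading the threshold 0.50 on an integer count is simply a presence test for the word.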
The tree for Topic 15 "CEO" is one that shows marked change from the adsbal version,
growing from two decision nodes to twelve and retrieving 49 relevant documents
instead of 29. The tree is:
class 0 (0.000)
chief<=0.50
class 0 (0.000)
executive<=0.50
class 1 (1.000)
company<=0.50
class 0 (0.000)
executive<=1.50
class 0 (0.000)
name<=0.50
class 1 (1.000)
company<=1.50
class 0 (0.000)
resign<=0.50
class 0 (0.000)
chief<=1.50
class 1 (1.000)
name<=3.50
class 0 (0.000)
executive<=9.00
class 0 (0.000)
appoint<=0.50
class 0 (0.000)
ceo<=0.50
class 0 (0.000)
This is a much more complex structure than the other trees illustrated so far. Note that
there are three terminal nodes that lead to a document being classified as relevant.
Although they are in the same sub-tree defined by the expression:
chief>0 & ¬ceo & ¬appoint & executive>0 & executive<10 & name<4
they make minor distinctions based on the words company, name, resign and chief.
Thus we have three further tests:
executive<2 & ¬company
executive>1 & company>1 & resign>0 & chief>1
executive>1 & company<2 & name>0
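The combined rule can be sketched as a single predicate over word counts. This is an illustration only: it reads each threshold directly off the tree listing, so that a test such as ceo<=0.50 is taken to mean the word is absent, and it assumes the branch listed above a test is taken when the test holds. The function names and the word-count input are hypothetical and are not part of the system described here.

```python
from collections import Counter


def in_shared_subtree(c):
    """Conditions common to all three relevant leaves."""
    return (c["chief"] > 0 and c["ceo"] == 0 and c["appoint"] == 0
            and 0 < c["executive"] < 10 and c["name"] < 4)


def topic15_relevant(counts):
    """Return True if the word counts reach one of the three class 1 leaves."""
    c = Counter(counts)  # missing words count as zero
    if not in_shared_subtree(c):
        return False
    # The three further tests, one per relevant leaf.
    test_a = c["executive"] < 2 and c["company"] == 0
    test_b = (c["executive"] > 1 and c["company"] > 1
              and c["resign"] > 0 and c["chief"] > 1)
    test_c = c["executive"] > 1 and c["company"] < 2 and c["name"] > 0
    return test_a or test_b or test_c


if __name__ == "__main__":
    print(topic15_relevant({"chief": 1, "executive": 1}))
    print(topic15_relevant({"chief": 1, "executive": 1, "ceo": 1}))
```

Writing the rule this way shows that the three tests are mutually exclusive: the first requires executive<2, and the other two are separated by the count of company.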