NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
R. Tong, A. Winkler, P. Gage
National Institute of Standards and Technology
Donna K. Harman (editor)
above, the optimal tree is:
class 0 (0.000)
company<=0.50
class 1 (0.981)
federal<=9.00
class 0 (0.000)
takeover<=1.50
class 0 (0.000)
securities<=1.50
class 0 (0.000)
company<=4.50
class 0 (0.000)
fair<=0.50
class 0 (0.000)
This is the binary classification tree against which new documents can be tested. The
tree is read from left to right. Branches upwards correspond to "yes" answers to the test
at the node; branches downwards correspond to "no" answers. Class 0 terminals cor-
respond to the "not relevant" decision; Class 1 terminals correspond to the "relevant"
decision. The numbers in parentheses are estimates of the probability that the document
is relevant.⁴
Thus, in this example, if a document has fewer than 0.5 occurrences of the word com-
pany (i.e., the word does not appear) then the document is not relevant. On the other
hand, if the document has more than 0.5 occurrences (i.e., the word does appear) then
additional features are tested to lead to a classification decision. Note that a test often
requires multiple occurrences of the feature (e.g., for the word federal). Also note that
this tree has only one Class 1 terminal, and that the pattern of features that indicates a rel-
evant document can be written as a logical expression of feature counts. Thus we have:
company>0 & fair>0 & company<5
& securities<2 & takeover<2 & federal<10
as a description of a relevant document.
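Read as a decision procedure over word-occurrence counts, this rule can be sketched in Python (a minimal illustration under our reading of the tree; the function name and dictionary representation are ours, not part of the TREC system):

```python
def is_relevant(counts):
    """Apply the relevance rule extracted from the tree.

    `counts` maps each feature word to its occurrence count in a
    document; a word missing from the dictionary counts as zero.
    """
    c = lambda w: counts.get(w, 0)
    return (c("company") > 0 and c("fair") > 0 and c("company") < 5
            and c("securities") < 2 and c("takeover") < 2
            and c("federal") < 10)

# A document containing "company" twice and "fair" once (and none of the
# other features) is routed as relevant:
print(is_relevant({"company": 2, "fair": 1}))                  # True
print(is_relevant({"company": 2, "fair": 1, "takeover": 3}))   # False
```

Note that because the rule is a pure conjunction, the order in which the conjuncts are evaluated does not affect the decision.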
Other overall system features of note are:
* The exact word is used as the basis of the feature, although in our imple-
mentation for TREC the actual test is case-insensitive and is a substring
match, so the feature fair would match with Fair or FAIR or fairs
as well as with fairground. We do not, though, perform any stemming
per se.
* No additional data structures are needed by CART. That is, we did not
use any external data (e.g., thesauri or knowledge-bases) to help with tree
construction.
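The substring test in the first point above can be illustrated with a short Python sketch (our illustration of the behavior described, not the TREC code itself):

```python
def feature_count(document, feature):
    """Count case-insensitive, non-overlapping substring occurrences.

    Because this is a substring match, "fair" is found inside "Fair",
    "FAIR", "fairs", and "fairground" alike; no stemming is applied.
    """
    return document.lower().count(feature.lower())

print(feature_count("The Fair was fair; the fairground had stalls.", "fair"))  # 3
```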
We spent approximately one person-month of effort in developing the
4. These are generated by CART from re-substitution error estimates and are, therefore, overly
optimistic.