NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
R. Tong, A. Winkler, P. Gage
National Institute of Standards and Technology
Donna K. Harman (editor)
above, the optimal tree is:
class 0 (0.000)
company<=0.50
class 1 (0.981)
federal<=9.00
class 0 (0.000)
takeover<=1.50
class 0 (0.000)
securities<=1.50
class 0 (0.000)
company<=4.50
class 0 (0.000)
fair<=0.50
class 0 (0.000)
This is the binary classification tree against which new documents can be tested. The
tree is read from left to right. Branches upwards correspond to "yes" answers to the test
at the node; branches downwards correspond to "no" answers. Class 0 terminals cor-
respond to the "not relevant" decision; Class 1 terminals correspond to the "relevant"
decision. The numbers in parentheses are estimates of the probability that the document
is relevant.⁴
Thus, in this example, if a document has fewer than 0.5 occurrences of the word com-
pany (i.e., the word does not appear) then the document is not relevant. On the other
hand, if the document has more than 0.5 occurrences (i.e., the word does appear) then
additional features are tested to lead to a classification decision. Note that a test often
requires multiple occurrences of the feature (e.g., for the word federal). Also note that
this tree has only one Class 1 terminal, and that the pattern of features that indicates a rel-
evant document can be written as a logical expression of feature counts. Thus we have:
company>0 & fair>0 & company<5
& securities<2 & takeover<2 & federal<10
as a description of a relevant document.
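Read as a decision procedure over word-occurrence counts, this rule can be sketched in Python (a minimal illustration under our reading of the tree; the function name and dictionary representation are ours, not part of the TREC system):

```python
def is_relevant(counts):
    """Apply the relevance rule extracted from the tree.

    `counts` maps each feature word to its occurrence count in a
    document; a word missing from the dictionary counts as zero.
    """
    c = lambda w: counts.get(w, 0)
    return (c("company") > 0 and c("fair") > 0 and c("company") < 5
            and c("securities") < 2 and c("takeover") < 2
            and c("federal") < 10)

# A document containing "company" twice and "fair" once (and none of the
# other features) is routed as relevant:
print(is_relevant({"company": 2, "fair": 1}))                  # True
print(is_relevant({"company": 2, "fair": 1, "takeover": 3}))   # False
```

Note that because the rule is a pure conjunction, the order in which the conjuncts are evaluated does not affect the decision.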
Other overall system features of note are:
* The exact word is used as the basis of the feature, although in our imple-
mentation for TREC the actual test is case-insensitive and is a substring
match, so the feature fair would match with Fair or FAIR or fairs
as well as with fairground. We do not, though, perform any stemming
per se.
* No additional data structures are needed by CART. That is, we did not
use any external data (e.g., thesauri or knowledge-bases) to help with tree
construction.
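The substring test in the first point above can be illustrated with a short Python sketch (our illustration of the behavior described, not the TREC code itself):

```python
def feature_count(document, feature):
    """Count case-insensitive, non-overlapping substring occurrences.

    Because this is a substring match, "fair" is found inside "Fair",
    "FAIR", "fairs", and "fairground" alike; no stemming is applied.
    """
    return document.lower().count(feature.lower())

print(feature_count("The Fair was fair; the fairground had stalls.", "fair"))  # 3
```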
We spent approximately one person-month of effort in developing the
4. These are generated by CART from re-substitution error estimates and are, therefore, overly
optimistic.