NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Edited by Donna K. Harman, National Institute of Standards and Technology

Classification Trees for Document Routing: A Report on the TREC Experiment
R. Tong, A. Winkler, P. Gage

Note that we do no further processing of the topic text: we do no phrasal analysis, no stemming, no term expansion and, in particular, we do nothing with the NOT-clauses. Applying this procedure to the topic above results in the following list of feature words:

acquisition, antitrust, commerce, commission, department, exchange, ftc, fair, federal, icc, identification, interstate, japanese, justice, laws, merger, sec, securities, trade, acquiring, action, alleged, bid, business, buyout, called, companies, company, complaint, controlling, corporation, discuss, dispute, dissolves, document, entity, fixing, friendly, government, identify, identity, illegal, immunity, industry, insider, investigating, investigation, involved, merger, monopolies, monopoly, objections, optional, pending, practices, price, protect, relevant, restraint, restraints, result, retains, retires, review, rigging, routine, stock, suit, takeover, taking, trading, unfair, unfriendly, unlawful

This feature specification is used as input to the Feature Extraction module (see upper part of Figure 1), along with the original training data and the relevance judgments associated with the topic. A key characteristic of the training data provided for TREC is that there were very few relevance judgments for each topic; thus we were only able to generate a limited number of training vectors to be used by the CART algorithm. For each document in the training set for which there is a relevance judgment, we determine the number of occurrences of each feature (using only the <text> fields).
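The counting step just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `feature_vector` function and the truncated feature list are assumptions for demonstration, and the real run would use the full feature list for the topic.

```python
import re

# Illustrative subset of the feature words listed above (assumption:
# the actual experiment uses the complete per-topic list).
FEATURES = ["acquisition", "antitrust", "merger", "takeover", "stock"]

def feature_vector(text, features=FEATURES):
    """Count occurrences of each feature word in a document's <text> field."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return [counts.get(f, 0) for f in features]

doc = ("The merger was blocked on antitrust grounds; "
       "a rival takeover bid followed the merger.")
print(feature_vector(doc))  # [0, 1, 2, 1, 0]
```

Each position in the resulting vector is the raw count of one feature word, matching the vector form shown next.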
Thus for each topic we get a set of training vectors that looks like those shown below (augmented here, for presentation purposes, with the topic and document ids):

001 Q0 WSJ870320-0188 (0 (0,1,0,5,0,0,3,...,6))
001 Q0 WSJ870320-0069 (0 (2,0,9,1,0,4,1,...,0))
001 Q0 WSJ870424-0084 (1 (1,3,0,5,2,0,0,...,2))

In each vector the first element indicates whether the document is relevant to the topic (0 means no, 1 means yes) and the remaining elements are counts of the occurrences of the features in the document. Thus document WSJ870320-0188 is not relevant to Topic 1 and contains zero occurrences of the first feature, one occurrence of the second feature, etc. This training data is input to the Tree Growing module (see Figure 1) along with the cost function and class priors.³ CART then grows the largest tree, prunes it into a set of nested sub-trees, and uses cross-validation to suggest the optimal tree. For the example

3. In our "official" TREC experiments we set the priors to be P(rel)=0.01 and P(non-rel)=0.99, and the ratio of the cost of misclassifying a relevant document to the cost of misclassifying a non-relevant document to be 100:1. That is, we assumed that there are relatively few relevant documents in the collection and that we wanted to emphasize recall significantly over precision.
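The effect of the 100:1 cost ratio can be illustrated with a minimal sketch of a single cost-weighted split, in the spirit of the CART splitting step. This is an assumption-laden toy, not the CART package the experiments used: CART grows a full tree and prunes it, whereas this sketch only searches for the one (feature, threshold) split that minimizes weighted misclassification cost on vectors of the form shown above.

```python
# Costs from the footnote: misclassifying a relevant document is
# 100 times worse than misclassifying a non-relevant one.
COST_REL = 100.0
COST_NONREL = 1.0

def split_cost(vectors, feature, threshold):
    """Weighted misclassification cost of the rule:
    predict 'relevant' iff counts[feature] >= threshold."""
    total = 0.0
    for label, counts in vectors:
        predicted = 1 if counts[feature] >= threshold else 0
        if predicted != label:
            total += COST_REL if label == 1 else COST_NONREL
    return total

def best_split(vectors):
    """Exhaustively pick the (cost, feature, threshold) with lowest cost."""
    n_features = len(vectors[0][1])
    best = None
    for f in range(n_features):
        top = max(c[f] for _, c in vectors)
        for t in range(1, top + 1):
            cost = split_cost(vectors, f, t)
            if best is None or cost < best[0]:
                best = (cost, f, t)
    return best

# Toy training vectors in the (relevance, counts) form shown above.
train = [
    (0, [0, 1, 0, 5]),
    (0, [2, 0, 9, 1]),
    (1, [1, 3, 0, 5]),
]
print(best_split(train))  # (0.0, 1, 2): split on feature 1 at count >= 2
```

Because relevant errors cost 100 each, the search strongly prefers splits that keep every relevant document on the "relevant" side, which is the recall-over-precision emphasis described in the footnote.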