NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
chapter
R. Tong
A. Winkler
P. Gage
National Institute of Standards and Technology
Donna K. Harman
Note that we do no further processing of the topic text: no phrasal analysis, no
stemming, no term expansion, and, in particular, nothing with the NOT-clauses.
Applying this procedure to the topic above results in the following list of feature
words:
acquisition: antitrust: commerce: commission: department: exchange: ftc: fair:
federal: icc: identification: interstate: japanese: justice: laws: merger: sec:
securities: trade: acquiring: action: alleged: bid: business: buyout: called:
companies: company: complaint: controlling: corporation: discuss: dispute:
dissolves: document: entity: fixing: friendly: government: identify: identity:
illegal: immunity: industry: insider: investigating: investigation: involved:
merger: monopolies: monopoly: objections: optional: pending: practices: price:
protect: relevant: restraint: restraints: result: retains: retires: review:
rigging: routine: stock: suit: takeover: taking: trading: unfair: unfriendly:
unlawful:
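As a sketch of this step (the function name and topic string are illustrative, not from the paper), extracting feature words from a topic reduces to collecting the unique lowercase words of the topic text, with no stemming or expansion:

```python
import re

def feature_words(topic_text):
    """Collect the unique lowercase words of a topic as feature words.
    No phrasal analysis, stemming, or term expansion is applied."""
    tokens = re.findall(r"[a-z]+", topic_text.lower())
    return sorted(set(tokens))

# Hypothetical fragment of topic text, for illustration only.
print(feature_words("Antitrust Cases Pending: document will discuss pending antitrust cases"))
# → ['antitrust', 'cases', 'discuss', 'document', 'pending', 'will']
```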
This feature specification is used as input to the Feature Extraction module (see upper
part of Figure 1) along with the original training data and the relevance judgments
associated with the topic.
A key characteristic of the training data provided for TREC is that there were very
few relevance judgments for each topic. Thus we were only able to generate a limited
number of training vectors to be used by the CART algorithm. For each document in the
training set for which there is a relevance judgment, we determine the number of occur-
rences of each feature (using only the <text> fields). Thus for each topic we get a set of
training vectors that look like those shown below (augmented here, for presentation
purposes, with the topic and document ids):
001 Q0 WSJ870320-0188 (0 (0,1,0,5,0,0,3,...,6))
001 Q0 WSJ870320-0069 (0 (2,0,9,1,0,4,1,...,0))
001 Q0 WSJ870424-0084 (1 (1,3,0,5,2,0,0,...,2))
where, in each vector, the first element indicates whether the document is relevant to the
topic (0 means no, 1 means yes) and the remaining elements are counts of the
occurrences of the features in the document. Thus document WSJ870320-0188 is not
relevant to Topic 1 and contains zero occurrences of the first feature, one occurrence of the
second feature, and so on.
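The vector-building step described above can be sketched as follows (a minimal illustration, assuming simple whitespace-and-punctuation tokenization of the <text> field; the function and variable names are hypothetical):

```python
import re
from collections import Counter

def training_vector(features, doc_text, relevant):
    """Count occurrences of each feature word in a document's <text>
    field and pair the count vector with the relevance label (1 = relevant)."""
    counts = Counter(re.findall(r"[a-z]+", doc_text.lower()))
    return (1 if relevant else 0, [counts[f] for f in features])

# Illustrative feature list and document snippet, not from the TREC data.
features = ["antitrust", "merger", "takeover"]
print(training_vector(features,
                      "The merger drew antitrust review; the merger stalled.",
                      False))
# → (0, [1, 2, 0])
```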
This training data is input to the Tree Growing module (see Figure 1) along with the
cost function and class priors³. CART then grows the largest tree, prunes this into a set of
nested sub-trees and uses cross-validation to suggest the optimal tree. For the example
3. In our "official" TREC experiments we set the priors to be P(rel)=0.01 and P(non-rel)=0.99, and
the ratio of the cost of misclassifying a relevant document to the cost of misclassifying a non-relevant
document to be 100:1. That is, we assumed that there are relatively few relevant documents in
the collection and that we want to emphasize recall significantly over precision.
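One standard way CART combines such a cost ratio with the priors is the altered-priors device from Breiman et al.: weight each class prior by that class's misclassification cost, then renormalize. A minimal sketch, assuming this reading of the footnote's settings (the function name is illustrative):

```python
def altered_priors(priors, costs):
    """Fold per-class misclassification costs into the class priors and
    renormalize — the altered-priors device of Breiman et al.'s CART."""
    weighted = {c: priors[c] * costs[c] for c in priors}
    total = sum(weighted.values())
    return {c: w / total for c, w in weighted.items()}

# Priors and cost ratio quoted in the footnote above.
print(altered_priors({"rel": 0.01, "non-rel": 0.99},
                     {"rel": 100.0, "non-rel": 1.0}))
# rel ≈ 0.503, non-rel ≈ 0.497: the 100:1 cost ratio roughly cancels
# the 1:99 prior, so the two classes end up nearly balanced.
```

This makes the recall emphasis concrete: under these settings a relevant document carries about the same effective weight in the splitting criterion as all the non-relevant mass combined.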