SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
chapter
R. Tong
A. Winkler
P. Gage
National Institute of Standards and Technology
Donna K. Harman
also had only one decision node, but correctly identified 94 of 149 relevant documents:
            class 0 (0.126)
azt<=0.50
            class 1 (1.000)
Thus relevant documents are those that contain the word azt (the name of a drug for
treating AIDS patients).
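
The one-node tree above amounts to a single threshold test on one word feature. The following is a minimal sketch of such a rule, assuming the feature is a per-document count of the word azt and that the azt<=0.50 branch leads to class 0. The function names and sample texts are hypothetical, and the tokenization is a simplification rather than the feature extraction actually used in the experiment.

```python
import re


def azt_count(text: str) -> int:
    """Count whole-word occurrences of 'azt', ignoring case."""
    return len(re.findall(r"\bazt\b", text, flags=re.IGNORECASE))


def route(text: str, threshold: float = 0.50) -> int:
    """Single decision node: class 0 if the count <= threshold, else class 1."""
    return 0 if azt_count(text) <= threshold else 1


def recall(retrieved_relevant: int, total_relevant: int) -> float:
    """Fraction of the relevant documents that the rule identified."""
    return retrieved_relevant / total_relevant


# Hypothetical sample texts, not taken from the TREC collection.
samples = [
    "Doctors reported new results for AZT in AIDS patients.",
    "The company posted higher quarterly earnings.",
]
labels = [route(s) for s in samples]

# Recall implied by the figures in the text: 94 of 149 relevant documents.
azt_recall = recall(94, 149)  # about 0.63
```

With a count feature, the 0.50 threshold simply separates documents with no occurrence of the word from those with at least one.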
4.2 The Effect of Additional Training Data
Our second set of official results was designed to investigate the sensitivity of system
performance to the size of the training sets. To provide additional training samples we
randomly selected a block of 50 Wall Street Journal articles (footnote 7) for which we generated
relevance judgements for the first 25 topics. In practice this gave us some additional relevant
articles, but mostly contributed to the non-relevant examples. Table 2 shows the effect of
adding these documents. Note that some articles in the new set were already
included in the original training data.
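
Because some of the new articles were already in the original training data, adding the block of 50 can enlarge a topic's training set by fewer than 50 documents. The following sketch shows one way such a merge could be done, assuming training examples are keyed by document identifier and labeled 1 (relevant) or 0 (non-relevant). The identifiers, labels, and function names are invented for illustration and do not come from the experiment.

```python
from typing import Dict, Tuple


def merge_training_sets(original: Dict[str, int], extra: Dict[str, int]) -> Dict[str, int]:
    """Add new judged documents, keeping the original label for any overlap."""
    merged = dict(original)
    for doc_id, label in extra.items():
        merged.setdefault(doc_id, label)
    return merged


def summarize(training: Dict[str, int]) -> Tuple[int, int]:
    """Return (total documents, relevant documents) for one topic."""
    return len(training), sum(training.values())


# Hypothetical judgements for a single topic: 1 = relevant, 0 = non-relevant.
original = {"doc-001": 1, "doc-002": 0, "doc-003": 0}
extra = {"doc-003": 0, "doc-004": 0, "doc-005": 1}  # doc-003 overlaps

merged = merge_training_sets(original, extra)
total, relevant = summarize(merged)  # 5 documents, 2 relevant
```

Keeping the original label on overlap is one possible policy; the source does not say how duplicate articles were handled.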
Table 2: Size of Training Sets and Optimal Trees

           Size of Optimal Tree          Size of Training Set
                                       adsba1           adsba2
Topic #    adsba1    adsba2        Total    Rel     Total    Rel
   1          7         7            34       7       83      7a
   2          8        14            31       7       81      8
   3          4         5            30       9       80      9
   4          3         5            29      12       79     12
   5          3         3            30      18       79     18a
   6          3         2            28      15       78     15
   7          2         3            28      10       78     10
   8          1        19            32       7       82      8
   9          4         4            23       4       73      4
  10          2         4            20      12       69     12b
  11          2         2            21      12       71     12
  12          8        11            35       8       85      8
  13          2         2            12       8       62      8
  14         10        13            30       5       80      6
7. The articles were in the block starting with WSJ870311-0102 and ending with WSJ870324-0001.