SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
chapter
R. Tong
A. Winkler
P. Gage
National Institute of Standards and Technology
Donna K. Harman
Table 2: Size of Training Sets and Optimal Trees

         Size of Optimal Tree       Size of Training Set
                                   adsba1          adsba2
Topic #    adsba1    adsba2     Total    Rel    Total    Rel
15              3        13        27      7       77     10
16              2         3        36      3       86      3
17              2         2        30     12       80     12
18              2         3        18     14       68     14
19              2         3        30     12       80     15
20              2         3        30     14       80     14
21              2         2        33     12       82     12b
22              2         2        28     10       78     10
23              2         2        29     10       79     10
24              2         2        48     10       98     10
25              2         2        39      6       89      6
a. Some relevant documents in the augmented training set already in the original training set.
b. Some non-relevant documents in the augmented training set already in the original training
set.
The new training data did not have a significant impact on the size of the optimal tree,
except in those cases where there were additional relevant documents, that is, for topics
6, 8, 14, 15 and 19. The changes here were quite dramatic. For example, in the case of
Topic 8, "Economic Projections," the addition of just one more relevant article changed the
optimal tree from one with a single terminal node to one with 19! This suggests, of
course, that for this topic the training data do not provide a very representative sample
of texts.
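As a quick check on the observation above, the following minimal sketch uses only the Table 2 rows for topics 15 to 25 and lists the topics whose augmented training set contains more relevant documents than the original one, together with the change in optimal tree size. The tuple layout and the function names are illustrative assumptions, not part of the original system.

```python
# Rows transcribed from Table 2 (topics 15-25 only). Layout is an assumption
# made for this sketch:
# (topic, tree_orig, tree_aug, total_orig, rel_orig, total_aug, rel_aug)
ROWS = [
    (15, 3, 13, 27, 7, 77, 10),
    (16, 2, 3, 36, 3, 86, 3),
    (17, 2, 2, 30, 12, 80, 12),
    (18, 2, 3, 18, 14, 68, 14),
    (19, 2, 3, 30, 12, 80, 15),
    (20, 2, 3, 30, 14, 80, 14),
    (21, 2, 2, 33, 12, 82, 12),
    (22, 2, 2, 28, 10, 78, 10),
    (23, 2, 2, 29, 10, 79, 10),
    (24, 2, 2, 48, 10, 98, 10),
    (25, 2, 2, 39, 6, 89, 6),
]


def topics_with_new_relevant(rows):
    """Return topics whose augmented set has more relevant documents."""
    return [topic for (topic, _, _, _, rel_orig, _, rel_aug) in rows
            if rel_aug > rel_orig]


def tree_growth(rows):
    """Map each topic to the change in optimal tree size (augmented - original)."""
    return {topic: tree_aug - tree_orig
            for (topic, tree_orig, tree_aug, *_rest) in rows}


gained = topics_with_new_relevant(ROWS)
growth = tree_growth(ROWS)
print("Topics with additional relevant documents:", gained)
print("Tree size change for those topics:", {t: growth[t] for t in gained})
```

For the rows shown here, only topics 15 and 19 gain relevant documents, which agrees with the list given in the text; topics 6, 8 and 14 fall outside this part of the table.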
Of more interest, however, is whether these additional training data had any effect on
the overall performance of the system. Table 3 shows the official results for adsba2. The
results for adsba1 are retained for comparison, as are the results for the other Category B
systems.
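The Rel-Ret @ 200 figures in Table 3 can be read with the help of a small sketch. It assumes that the measure counts the relevant documents among the top 200 documents a system returns for a topic; the function name, the document identifiers and the toy ranking are hypothetical and serve only as illustration.

```python
def rel_ret_at_k(ranked_doc_ids, relevant_ids, k=200):
    """Count relevant documents among the first k entries of a ranking.

    Assumption for this sketch: "Rel-Ret @ k" is the number of relevant
    documents retrieved within the top k results.
    """
    relevant = set(relevant_ids)
    return sum(1 for doc_id in ranked_doc_ids[:k] if doc_id in relevant)


# Toy data (hypothetical identifiers, not taken from the TREC collection).
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d4", "d8"}

score_at_3 = rel_ret_at_k(ranking, relevant, k=3)
score_at_5 = rel_ret_at_k(ranking, relevant, k=5)
print(score_at_3, score_at_5)
```

Under this reading, a value is bounded above by both the cutoff and the number of relevant documents for the topic (the #Rel column).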
Table 3: Performance with Additional Training Data

                           Rel-Ret @ 200
Topic #   #Rel   adsba1   adsba2   Max   Median   Min
1          131        2       25    67       32     2
2          172       15        9    33       21     9