NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Edited by D. K. Harman, National Institute of Standards and Technology

Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment)
R. Tong and L. Appelbaum
are none of these in the current example), and lower values to features outside the optimal tree.8 At this point the specific values chosen represent our "best guess" at a weighting scheme; further experimentation will undoubtedly reveal a better strategy. As in the first canonical form, the overall weight for the TOPIC tree is based on the cross-validation rate for the maximal tree.
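To make this weighting concrete, the following sketch shows, in Python, one way the per-feature weights and the overall tree weight could be assigned; the three weight values and the use of (1 - error rate) as the tree weight are illustrative assumptions of ours, not the figures used in the runs.

    def feature_weight(k_value, k_star, w_in=0.8, w_fringe=0.5, w_out=0.2):
        # Assign a TOPIC weight to a word feature from its CART pruning
        # k-value (cf. footnote 8): highest inside the optimal tree,
        # intermediate on the fringe, lowest outside it. The three
        # default weights are placeholder assumptions.
        if k_value > k_star:
            return w_in
        if k_value == k_star:
            return w_fringe
        return w_out

    def tree_weight(cv_error_rate):
        # Overall weight for the TOPIC tree, derived from the
        # cross-validation rate of the maximal tree: the lower the
        # estimated error, the more the tree's scores are trusted.
        return 1.0 - cv_error_rate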
4 The TREC-2 Experiments
For TREC-2 we again focused only on the document routing problem. Since our technique requires training data, it does not easily lend itself to the ad hoc retrieval problem, and so rather than "force-fit" it we chose to generate four sets of results for the routing queries (topics 51-100). Each set of results was generated totally automatically. The result sets are labelled ads1, ads2, ads3, and ads4, and Table 1 below shows to which combinations of features and TOPIC models they correspond.
Table 1: Results Identification

    Result Set    Word Features    TOPIC Model
    ads1          stemmed          model-1
    ads2          unstemmed        model-1
    ads3          stemmed          model-2
    ads4          unstemmed        model-2
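The runs thus form a simple 2x2 design over feature treatment and canonical form; restated directly in code (a restatement of Table 1, nothing more):

    # The four routing runs: word-feature treatment x TOPIC canonical form.
    RUNS = {
        "ads1": ("stemmed",   "model-1"),
        "ads2": ("unstemmed", "model-1"),
        "ads3": ("stemmed",   "model-2"),
        "ads4": ("unstemmed", "model-2"),
    }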
4.1 The Experimental Procedure

The experimental procedure for TREC-2 consists of five basic steps. We briefly describe each of these:

* First, we generated the CART training data from the information need statements and the ground truth files (i.e., the qrels) provided by NIST. This produced two feature sets for each topic (corresponding to the stemmed and unstemmed versions of the features), and two sets of training vectors labelled with the ground truth information.9 Since CART is a statistically-oriented classifier, we decided to minimize the "noise" in the training sets by using only the Wall Street Journal articles identified in the qrel files. Further, for all but topics 80 and 81, we used just the Wall Street Journal articles on Disk 2.
* Second, we grew the CART trees from this training
data. Since we had two sets of training data for each
topic, we grew two trees for each topic.
* Third, we used the algorithms described in Section 3 to convert the CART trees into a TOPIC-readable form. This produced four TOPIC definitions for each of the information need statements. Table 1 above shows the various combinations.
* Fourth, we ran the TOPIC definitions against the indexed unseen data.10 Again, to minimize noise effects, we used only the Associated Press articles on Disk 3 to generate our official results.
* Fifth, we sorted and merged the results generated by TOPIC and converted them into the TREC format for scoring by NIST (a sketch of this step appears after this list).
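As an illustration of the fifth step, the sketch below assumes the conventional "qid Q0 docno rank score tag" submission layout; the authoritative layout is whatever the NIST scoring program expects, so treat the field order here as our assumption.

    def write_trec_results(results_by_topic, run_tag, path, depth=1000):
        # Merge the per-topic result lists produced by TOPIC, sort each by
        # descending score, and write them in a TREC submission layout:
        #   qid  Q0  docno  rank  score  run_tag
        # results_by_topic maps a topic number (51-100) to a list of
        # (docno, score) pairs; depth caps the documents kept per topic.
        with open(path, "w") as out:
            for topic in sorted(results_by_topic):
                ranked = sorted(results_by_topic[topic],
                                key=lambda pair: pair[1], reverse=True)
                for rank, (docno, score) in enumerate(ranked[:depth], 1):
                    out.write(f"{topic} Q0 {docno} {rank} {score:.4f} {run_tag}\n")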
4.2 Discussion of Official Results

The official results for ads1 and ads2, together with the unofficial results for ads3 and ads4, are shown in Table 2.

Although we generated four sets of results, resource constraints at NIST resulted in only ads1 and ads2 being officially scored. References in the remainder of the paper to scores associated with ads3 and ads4 are to the unofficial scores generated by us using the TREC-2 scoring program and the published qrels for the routing topics.
Table 2: TREC-2 Results (AP Only)

    Run ID    No. Retr.    No. Rel.    Rel. Ret.    Avg. Prec.    Exact Prec.
    ads1      40,423       5,677         822        0.0195        0.0390
    ads2      33,034       5,677       1,468        0.0821        0.1092
    ads3      49,006       5,677       1,182        0.0168        0.0374
    ads4      50,000       5,677       1,847        0.0630        0.0868
The first observation is that the trees built using exact words as features (i.e., results ads2 and ads4) had higher precision than those built using word stems.
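For readers wishing to reproduce such numbers from a ranked run and the published qrels, the sketch below assumes that "Avg. Prec." is non-interpolated average precision over all relevant documents and that "Exact Prec." is R-precision (precision after R documents, where R is the number of relevant documents for the topic); both readings follow the usual TREC scoring labels and are our assumptions rather than anything stated above.

    def average_precision(ranked_docnos, relevant):
        # Non-interpolated average precision: the mean of the precision
        # values at the rank of each relevant document retrieved, taken
        # over the full set of relevant documents.
        hits, total = 0, 0.0
        for rank, docno in enumerate(ranked_docnos, 1):
            if docno in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant) if relevant else 0.0

    def r_precision(ranked_docnos, relevant):
        # Precision after R documents retrieved, where R is the number of
        # relevant documents for the topic.
        r = len(relevant)
        if r == 0:
            return 0.0
        return sum(1 for docno in ranked_docnos[:r] if docno in relevant) / r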
8. A variable is in the optimal tree if its k-value is greater than k*; it is on the fringe if k = k*; and it is outside the optimal tree if k < k*. Note that in general the individual features appear at multiple locations in the tree. Our strategy is to remove duplicates by retaining the instance with the highest k-value.

9. The feature specification and extraction procedure we used is identical to that used in TREC-1 and is described in detail in the TREC-1 proceedings. The only differences are the addition of a stemmed version of the features and the fact that we do not make use of the feature count information.
10. We are grateful to Verity Inc. for giving us access to their computer systems and databases.