NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Edited by D. K. Harman, National Institute of Standards and Technology

Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment)
R. Tong and L. Appelbaum
are none of these in the current example), and lower values to features outside the optimal tree.8 At this point the specific values chosen represent our "best guess" at a weighting scheme; further experimentation will undoubtedly reveal a better strategy. As in the first canonical form, the overall weight for the TOPIC tree is based on the cross-validation rate for the maximal tree.
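To make this weighting concrete, the following sketch shows, in Python, one way the per-feature weights and the overall tree weight could be assigned; the three weight values and the use of (1 - error rate) as the tree weight are illustrative assumptions of ours, not the figures used in the runs.

    def feature_weight(k_value, k_star, w_in=0.8, w_fringe=0.5, w_out=0.2):
        # Assign a TOPIC weight to a word feature from its CART pruning
        # k-value (cf. footnote 8): highest inside the optimal tree,
        # intermediate on the fringe, lowest outside it. The three
        # default weights are placeholder assumptions.
        if k_value > k_star:
            return w_in
        if k_value == k_star:
            return w_fringe
        return w_out

    def tree_weight(cv_error_rate):
        # Overall weight for the TOPIC tree, derived from the
        # cross-validation rate of the maximal tree: the lower the
        # estimated error, the more the tree's scores are trusted.
        return 1.0 - cv_error_rate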
4 The TREC-2 Experiments
For TREC-2 we again focused only on the document routing problem. Since our technique requires training data, it does not easily lend itself to the ad hoc retrieval problem, and so rather than "force-fit" it we chose to generate four sets of results for the routing queries (topics 51-100). Each set of results was generated totally automatically. The result sets are labelled ads1, ads2, ads3, and ads4, and Table 1 below shows to which combinations of features and TOPIC models they correspond.
Table 1: Results Identification

    Result Set    Word Features    TOPIC Model
    ads1          stemmed          model-1
    ads2          unstemmed        model-1
    ads3          stemmed          model-2
    ads4          unstemmed        model-2
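The runs thus form a simple 2x2 design over feature treatment and canonical form; restated directly in code (a restatement of Table 1, nothing more):

    # The four routing runs: word-feature treatment x TOPIC canonical form.
    RUNS = {
        "ads1": ("stemmed",   "model-1"),
        "ads2": ("unstemmed", "model-1"),
        "ads3": ("stemmed",   "model-2"),
        "ads4": ("unstemmed", "model-2"),
    }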
4.1 The Experimental Procedure

The experimental procedure for TREC-2 consists of five basic steps. We briefly describe each of these:

* First, we generated the CART training data from the information need statements and the ground truth files (i.e., the qrels) provided by NIST. This produced two feature sets for each topic (corresponding to the stemmed and unstemmed versions of the features), and two sets of training vectors labelled with the ground truth information.9 Since CART is a statistically-oriented classifier, we decided to minimize the "noise" in the training sets by using only the Wall Street Journal articles identified in the qrel files. Further, for all but topics 80 and 81, we used just the Wall Street Journal articles on Disk 2.
* Second, we grew the CART trees from this training
data. Since we had two sets of training data for each
topic, we grew two trees for each topic.
* Third, we used the algorithms described in Section 3 to convert the CART trees into a TOPIC-readable form. This produced four TOPIC definitions for each of the information need statements. Table 1 above shows the various combinations.
* Fourth, we ran the TOPIC definitions against the indexed unseen data.10 Again, to minimize noise effects, we used only the Associated Press articles on Disk 3 to generate our official results.
* Fifth, we sorted and merged the results generated by TOPIC and converted them into the TREC format for scoring by NIST (a sketch of this step appears after this list).
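As an illustration of the fifth step, the sketch below assumes the conventional "qid Q0 docno rank score tag" submission layout; the authoritative layout is whatever the NIST scoring program expects, so treat the field order here as our assumption.

    def write_trec_results(results_by_topic, run_tag, path, depth=1000):
        # Merge the per-topic result lists produced by TOPIC, sort each by
        # descending score, and write them in a TREC submission layout:
        #   qid  Q0  docno  rank  score  run_tag
        # results_by_topic maps a topic number (51-100) to a list of
        # (docno, score) pairs; depth caps the documents kept per topic.
        with open(path, "w") as out:
            for topic in sorted(results_by_topic):
                ranked = sorted(results_by_topic[topic],
                                key=lambda pair: pair[1], reverse=True)
                for rank, (docno, score) in enumerate(ranked[:depth], 1):
                    out.write(f"{topic} Q0 {docno} {rank} {score:.4f} {run_tag}\n")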
4.2 Discussion of Official Results

The official results for ads1 and ads2, together with the unofficial results for ads3 and ads4, are shown in Table 2.

Although we generated four sets of results, resource constraints at NIST resulted in only ads1 and ads2 being officially scored. References in the remainder of the paper to scores associated with ads3 and ads4 are to the unofficial scores generated by us using the TREC-2 scoring program and the published qrels for the routing topics.
Table 2: TREC-2 Results (AP Only)

    Run ID    No. Retr.    No. Rel.    Rel. Ret.    Avg. Prec.    Exact Prec.
    ads1      40,423       5,677         822        0.0195        0.0390
    ads2      33,034       5,677       1,468        0.0821        0.1092
    ads3      49,006       5,677       1,182        0.0168        0.0374
    ads4      50,000       5,677       1,847        0.0630        0.0868
The first observation is that the trees built using exact words as features (i.e., results ads2 and ads4) had higher precision than those built using word stems.
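For readers wishing to reproduce such numbers from a ranked run and the published qrels, the sketch below assumes that "Avg. Prec." is non-interpolated average precision over all relevant documents and that "Exact Prec." is R-precision (precision after R documents, where R is the number of relevant documents for the topic); both readings follow the usual TREC scoring labels and are our assumptions rather than anything stated above.

    def average_precision(ranked_docnos, relevant):
        # Non-interpolated average precision: the mean of the precision
        # values at the rank of each relevant document retrieved, taken
        # over the full set of relevant documents.
        hits, total = 0, 0.0
        for rank, docno in enumerate(ranked_docnos, 1):
            if docno in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant) if relevant else 0.0

    def r_precision(ranked_docnos, relevant):
        # Precision after R documents retrieved, where R is the number of
        # relevant documents for the topic.
        r = len(relevant)
        if r == 0:
            return 0.0
        return sum(1 for docno in ranked_docnos[:r] if docno in relevant) / r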
8. A variable is in the optimal tree if its k-value is greater than k*; it is on the fringe if k = k*; and it is outside the optimal tree if k < k*. Note that in general the individual features appear at multiple locations in the tree. Our strategy is to remove duplicates by retaining the instance with the highest k-value.

9. The feature specification and extraction procedure we used is identical to that used in TREC-1 and is described in detail in the TREC-1 proceedings. The only differences are the addition of a stemmed version of the features and the fact that we do not make use of the feature count information.
10. We are grateful to Verity Inc. for giving us access to their computer systems and databases.