NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment)
R. Tong, L. Appelbaum
National Institute of Standards and Technology, D. K. Harman

[Figure 1: CART Processing Flow. Legible labels: Raw Training Data; Feature Specs.; Class Specs.; Class Priors; Cost Function; CART Training Vectors; Tree Growing; Largest Tree (Tmax); Nested Sub-Trees (Tmax > T1 > T2 > ... > Tn); Optimal Tree (T*).]

[...] performs a cross-validation analysis on the sequence and selects the tree with the lowest cross-validation error.[3] This is T*.

In our TREC-2 system, we take T* and convert it into a TOPIC outline file as described in Section 3. That is, rather than use CART itself to perform the routing on the unseen documents (as we did in TREC-1), we use the CART trees as skeletons for TOPIC concepts, and then have TOPIC do the routing. An advantage of this approach is that we can use TOPIC's extensively optimized text database capabilities, which lets us easily generate the output files needed for the official scoring program.

2.2 Data Structures Generated by CART

To illustrate the processing that CART performs, we will use an example taken from the TREC-2 corpus. An example query is as follows:

<top>
<head> Tipster Topic Description
<num> Number: 097
<dom> Domain: Science and Technology
<title> Topic: Fiber Optics Applications
<desc> Description:
Document must identify instances of fiber optics technology actually in use.
<narr> Narrative:
To be relevant, a document must describe actual operational situations in which fiber optics are being employed, or will be employed. A document describing future fiber optics use will be relevant only if contracts have been signed concerning the future application.
<con> Concept(s):
1. fiber optic, light
2. telephone, LAN, television
<fac> Factor(s)
<def> Definition(s)
1. Fiber optics refers to technology in which information is passed via laser light transmitted through glass or plastic fibers.
</top>

This is a very comprehensive statement of information need and provides a rich set of features that we can use for CART.[4] Our basic procedure is to extract all the unique content words from the information need statement and then stem them, which gives the following list:

ACT APPL CONCERN CONTRACT DESCRIB DOCU EMPLOY FIB FUT GLASS INFORM LAN LAS LIGHT OPER OPT PAS PLAST REF RELEV SIGN SITU TECHNOLOG TELEPHON TELEV TRANSMIT VIA [5]

3. The algorithm actually minimizes with respect to both the cross-validation error and the tree complexity, so that if two trees have statistically indistinguishable error rates, the smaller of the two is selected as optimal.

4. In general, determining what set of features to use for CART is a matter of experience and judgement. For the TREC-2 corpus we have made use of the information need statements, but other approaches, such as using all the unique words in the training set, are equally valid. In fact, it is this freedom of choice of features that gives CART a great deal of its flexibility.
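
The feature-extraction procedure described above (collect the unique content words of the information need statement, then stem them) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' code: the stop-word list `STOP_WORDS`, the suffix rules in `crude_stem`, and the helper names `extract_features` and `to_vector` are all hypothetical. The stemmer actually used is evidently more aggressive (it yields FIB, LAS, TECHNOLOG), so the stems produced here will not match the list above exactly. The binary presence encoding in `to_vector` is likewise an assumption, since the text here does not say how feature values in the CART training vectors are computed.

```python
import re

# Hypothetical stop-word list; the paper does not say which one was used.
STOP_WORDS = {
    "a", "an", "the", "in", "is", "are", "be", "to", "of", "or", "if",
    "must", "will", "which", "only", "have", "been", "being", "through",
}

# Hypothetical suffix rules, tried in order (longest first).
# A stand-in for the real stemmer, which is not specified.
SUFFIXES = ("ations", "ation", "ings", "ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        # Do not turn words such as "glass" into "glas".
        if suffix == "s" and word.endswith("ss"):
            continue
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(text):
    """Return the sorted, upper-cased unique stems of the content words."""
    words = re.findall(r"[a-z]+", text.lower())
    stems = {crude_stem(w) for w in words if w not in STOP_WORDS}
    return sorted(s.upper() for s in stems)


def to_vector(document, features):
    """Encode a document as a binary presence vector over the features.

    Binary presence is an assumption made for this sketch.
    """
    present = set(extract_features(document))
    return [1 if f in present else 0 for f in features]


if __name__ == "__main__":
    definition = (
        "Fiber optics refers to technology in which information is passed "
        "via laser light transmitted through glass or plastic fibers."
    )
    features = extract_features(definition)
    print(features)
    print(to_vector("Fiber optic cables carry telephone traffic", features))
```

Keeping extraction and vector encoding as separate functions mirrors the separation between the feature specification and the training vectors: the feature list is derived once from the information need statement, and each training document is then encoded against that fixed list.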