NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
D. K. Harman, ed. National Institute of Standards and Technology

Machine Learning for Knowledge-Based Document Routing (A Report on the TREC-2 Experiment)
R. Tong and L. Appelbaum

seemed unsatisfactory. We have not been able to identify any specific conditions under which we could expect the CART trees to perform well. We believe, nevertheless, that further experiments to assess the utility of a variety of extensions to the basic CART technique would still be of interest. In particular, the use of "low-level" concepts as features, surrogate split information, larger training sets, and features drawn from the complete corpus rather than from the information need statements are all likely to assist in the construction of more effective trees.

The main focus of our TREC-2 effort has been to explore the idea that CART trees can form the basis of a semi-automated approach to building knowledge-based descriptions of routing topics. To that end, we developed two techniques for converting CART output into a form that can be used by the TOPIC retrieval engine from Verity, Inc. In the first technique we perform a "lossless" transformation of the CART tree into a TOPIC tree. In the second we generalize the CART tree, although we add no new features.

As the examples contained in the paper show, the TOPIC trees generated in the first canonical form are "skeletal." This is just as we would expect. Since CART is a parsimonious classifier, rather than a broad-based information retrieval tool, it produces decision trees that are minimally complex with respect to the expected misclassification rate. From an information retrieval perspective, the second canonical form seems more like the trees we might have built by hand. It uses a range of features and gives them weights based on their ability to discriminate among the training data.
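The flavor of the second canonical form can be sketched as follows. This is a minimal illustration, not the paper's actual conversion tool or Verity's TOPIC syntax: the nested-dict tree representation and the depth-decay weighting rule are assumptions made for the example, chosen only to show how splits nearer the root of a parsimonious tree can be read as stronger discriminators.

```python
# Sketch: flattening a CART-style decision tree into a weighted feature list,
# in the spirit of the second canonical form described in the text. The tree
# structure (nested dicts) and the 1/(depth+1) weighting rule are illustrative
# assumptions, not the authors' implementation or TOPIC's outline format.

def collect_features(node, depth=0, features=None):
    """Walk a binary decision tree and assign each split feature a weight
    that decays with depth, so splits nearer the root get higher weights."""
    if features is None:
        features = {}
    if node is None or "feature" not in node:   # leaf: nothing to collect
        return features
    weight = 1.0 / (depth + 1)                  # assumed decay rule
    features[node["feature"]] = max(weight, features.get(node["feature"], 0.0))
    collect_features(node.get("left"), depth + 1, features)
    collect_features(node.get("right"), depth + 1, features)
    return features

# A toy routing tree: split first on "tariff", then on "import".
tree = {"feature": "tariff",
        "left": {"feature": "import", "left": None, "right": None},
        "right": None}

print(collect_features(tree))   # {'tariff': 1.0, 'import': 0.5}
```

The resulting feature-to-weight map is the kind of broad, weighted description an information retrieval practitioner might write by hand, in contrast to the skeletal first canonical form.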
In effect, we have used CART to indicate which of the features are of use in defining the topic and then generalized the CART decision function using our (external) knowledge of the information retrieval problem.

Using these automatically constructed TOPIC trees as a starting point, we conducted a limited series of tests to assess the impact of performing minimal "editing" of these trees. For the two topics selected, we were able to produce significant performance gains with edits that added no new information (at least at the level of the features used) and that took on the order of only a few minutes to implement.

We are encouraged by these results, and, while generalizing them will require a more carefully controlled series of experiments, we are now of the opinion that the most effective role for machine learning techniques in information retrieval is as a tool for producing candidate descriptions of information need. These candidates can then be reviewed by end-users, who can easily make obvious corrections and modifications. We intend to explore this idea in more detail and report on our results in TREC-3.

6 Future Research

There are a number of directions in which we might develop the basic research ideas presented in this paper. We briefly consider several of them here.

We currently use just two classes (relevant and not relevant), but nothing in CART prevents it from working with multiple classes. For the document routing problem, there is a case to be made for adding a third class: unknown relevance. Adding such a class might allow us to make use of larger training sets without the costs associated with developing large ground truths.

One way of extending the skeleton TOPIC trees produced by our tool is to make use of external lexical resources. For example, we might investigate the use of WordNet as a way to expand each of the classification features into a set of related words.
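The proposed expansion step can be sketched as follows. A small hand-built synonym map stands in for a real lexical resource here; the map contents and function names are invented for illustration, and with NLTK's WordNet corpus installed one could substitute actual synset lookups.

```python
# Sketch: expanding classification features into sets of related words, as
# suggested for WordNet in the text. SYNONYMS is a hypothetical stand-in for
# a real lexical resource; it is not drawn from WordNet itself.

SYNONYMS = {
    "ship": {"vessel", "boat", "freighter"},
    "trade": {"commerce", "exchange"},
}

def expand_features(features):
    """Return each feature together with its related words, so a skeletal
    topic tree built from single terms can match a broader vocabulary."""
    return {f: {f} | SYNONYMS.get(f, set()) for f in features}

print(expand_features(["ship", "trade", "tariff"]))
```

A feature with no entry in the resource (here, "tariff") simply passes through unexpanded, so the expansion can never lose information already in the tree.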
Similarly, we might investigate the use of TOPIC's own lexical resources (e.g., the thesaurus and Soundex tools) by replacing the unstemmed words in the topic outline files with the appropriate TOPIC operator (e.g., <SOUNDEX> word, or <THESAURUS> word).

Although we have used CART as the module for building the initial classification tree, we might be interested in exploring other tree-building tools that have been used in the machine learning community, for example, the C4.5 algorithm by Quinlan [6], or various algorithms based on Bayesian methods such as Minimum Message Length models and decision graphs [7]. All of these tools generate decision trees from training data but offer different mathematical philosophies to justify their approaches.

Finally, as in all machine learning problems, the initial choice of features over which to learn is critical to the overall success of the process. An investigation of various extended feature-definition tools (e.g., recognizing key phrases and proper names), as well as of the impact of making assumptions different from TOPIC's about how the lexical tokens in the texts are to be treated, would almost certainly yield important insights.

References

1. Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. Classification and regression trees. Pacific Grove, CA: Wadsworth & Brooks; 1984.