NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
R. Tong, A. Winkler, P. Gage
National Institute of Standards and Technology
Donna K. Harman

5. CART as the Kernel of a Document Routing System

Users today are faced with ever-increasing amounts of real-time text that must be sifted for relevant information. This information space is: massive, in both the number of sources and the volume of material available within each source; dynamic, in that sources and their contents change constantly, especially within the time horizon of any specific analysis problem; and heterogeneous, in that each source represents information in its own way and independently of the others.

In this environment, users require tools for document filtering and selection that are easy to learn and use, easily adaptable to changing and ill-defined information needs, and portable across sources and analysis domains, and that give good precision and recall. With our TREC results in hand, we are in a position to briefly consider the potential role of CART as a central component in such an operational document routing system. Figure 2 illustrates how such a system might be organized.

[Figure 2: A Document Routing System Concept. Labeled components: documents, preprocessor, detector, compiled profiles, router, relevant documents (to users), non-relevant documents, learning algorithms, stored documents, user. The real-time processing path is marked.]

In our scenario, we imagine that documents enter the system in real time and, after preprocessing to extract features, are passed to the detection module, where they are either rejected as being non-relevant to any of the user profiles stored in the system, or are marked as relevant and passed to the router for transmission to users.
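The real-time path described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system the authors built: the tree representation (`Node`, `Leaf`), the term-presence features, the function names, and the user names `analyst_a` and `analyst_b` are all hypothetical, and each compiled profile is assumed to be a small binary classification tree of the kind CART produces. Retraining of the profiles is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Union


@dataclass
class Leaf:
    relevant: bool  # decision stored at a terminal node


@dataclass
class Node:
    term: str                     # feature tested here: does the term occur?
    present: Union["Node", Leaf]  # branch taken when the term occurs
    absent: Union["Node", Leaf]   # branch taken when it does not


Tree = Union[Node, Leaf]


def preprocess(text: str) -> Set[str]:
    """Preprocessor: extract features (assumed here: lowercased tokens)."""
    tokens = {tok.strip(".,;:!?()\"'") for tok in text.lower().split()}
    return tokens - {""}


def classify(tree: Tree, features: Set[str]) -> bool:
    """Walk one compiled profile from the root down to a leaf."""
    while isinstance(tree, Node):
        tree = tree.present if tree.term in features else tree.absent
    return tree.relevant


def detect(profiles: Dict[str, Tree], features: Set[str]) -> List[str]:
    """Detector: users whose profile marks the document as relevant."""
    return sorted(u for u, tree in profiles.items() if classify(tree, features))


def route(profiles: Dict[str, Tree], text: str) -> List[str]:
    """Router: recipients for a document; an empty list means rejected."""
    return detect(profiles, preprocess(text))


# Hypothetical compiled profiles, one tree per user.
PROFILES: Dict[str, Tree] = {
    "analyst_a": Node("merger", Node("bank", Leaf(True), Leaf(False)), Leaf(False)),
    "analyst_b": Node("satellite", Leaf(True), Leaf(False)),
}

if __name__ == "__main__":
    print(route(PROFILES, "The bank announced a merger."))
    print(route(PROFILES, "Weather report for Tuesday"))
```

Keeping the profiles in a plain mapping from user to tree means a separate learning process could swap in a retrained tree for one user without touching the detection loop.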
Note that in our proposed architecture, learning takes place in parallel to actual operation of the real-time routing system so that the classification trees can be updated as necessary when new training instances become