SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Classification Trees for Document Routing, A Report on the TREC Experiment
chapter
R. Tong
A. Winkler
P. Gage
National Institute of Standards and Technology
Donna K. Harman
5. CART as the Kernel of a Document Routing System
Users today are faced with ever increasing amounts of real-time text that must be
sifted for relevant information. This information space is massive, both in terms of the
number of sources and the volume of material available within each source; dynamic, in
that sources and their contents change constantly, especially within the time horizon
of any specific analysis problem; and heterogeneous, in that each source represents
information in different ways and independently of the others.
In this environment, users require document filtering and selection tools that are
easy to learn and use, easily adaptable to changing and ill-defined information needs,
and portable across sources and analysis domains, and that give good precision and
recall. With our TREC results in hand, we are in a position to consider briefly the
potential role of CART as a central component in such an operational document routing
system.
Figure 2 illustrates how such a system might be organized.

[Figure 2: A Document Routing System Concept. Labeled elements: document preprocessor,
Detector, Compiled Profiles, Router, relevant documents to users, non-relevant
documents, Learning Algorithms, stored documents, user, real-time processing path.]

In our scenario, we imagine that documents enter the system in real time and, after
preprocessing to extract features, are passed to the detection module, where they are
either rejected as being non-relevant to any of the user profiles stored in the system,
or are marked as relevant and passed to the router for transmission to users. Note that
in our proposed architecture, learning takes place in parallel with the actual operation
of the real-time routing system, so that the classification trees can be updated as
necessary when new training instances become