ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Query-Document Matching Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
with the one on which it is based. Each of these final classification
vectors is again correlated with the entire document collection to
define the resultant set of categories. At this point a document is
associated with a category if it is above the cutoff of the classification
vector of that category, or if it is not above any cutoff but is
closest to said classification vector. Figure 4.7(c) illustrates the
partition class which results in the classification vector of Figure
4.8(b); the correlation distribution of this vector, which specifies
the final category, is shown in Figure 4.7(a).
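The assignment rule described above can be sketched as follows. This is a minimal illustration, not the report's implementation: it assumes the correlation is the cosine between index vectors, and the names `cosine`, `assign_categories`, `class_vectors`, and `cutoffs`, as well as the toy data, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine correlation between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_categories(documents, class_vectors, cutoffs):
    """Return, for each document, the list of category indices it joins.

    A document joins every category whose cutoff it exceeds. If it exceeds
    no cutoff, it joins the single category whose classification vector
    it correlates with most highly.
    """
    assignments = []
    for doc in documents:
        scores = [cosine(doc, c) for c in class_vectors]
        above = [j for j, s in enumerate(scores) if s > cutoffs[j]]
        if not above:
            # Fallback: closest classification vector.
            above = [max(range(len(scores)), key=scores.__getitem__)]
        assignments.append(above)
    return assignments

# Hypothetical toy collection: two classification vectors, three documents.
class_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cutoffs = [0.7, 0.7]
documents = [
    [1.0, 0.1, 0.0],  # above the first cutoff only
    [1.0, 1.0, 0.0],  # above both cutoffs: multiply classified
    [0.3, 0.5, 1.0],  # above neither cutoff: goes to the closest vector
]
result = assign_categories(documents, class_vectors, cutoffs)
```

In this toy case the second document is placed in both categories, while the third, which falls below every cutoff, is placed only in the category of the nearer vector.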
At the end of the classification process, then, each
classification vector represents all the documents with index vectors
within the angular distance corresponding to its cutoff correlation,
and additionally, a few documents outside this radius. Documents of
the latter type, however, are closer to the vectors to which they are
assigned than to any others of the set. Note that the final
classification vectors are not necessarily the centroid vectors of the
vector subset they represent, since the final categories are not in
general identical to the partition class from which the centroid
vector was formed. However, the final categories generally contain the
members of the partition class in addition to documents which are
multiply classified. This strategy provides a convenient means for
generating multiple classifications for some documents, while
maintaining a set of categories balanced over the entire collection.
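Two points in the paragraph above can be made concrete with a short sketch. It assumes a cosine correlation, so that a cutoff correlation corresponds to an angle of arccos(cutoff), and it assumes the centroid is the arithmetic mean of the member vectors. The function names and the toy vectors are hypothetical.

```python
import math

def angular_radius_degrees(cutoff):
    """Angle (in degrees) corresponding to a cutoff cosine correlation."""
    return math.degrees(math.acos(cutoff))

def centroid(vectors):
    """Arithmetic mean of a list of equal-length vectors (assumed definition)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# The classification vector is formed from the partition class only.
partition_class = [[1.0, 0.0], [0.8, 0.2]]
class_vector = centroid(partition_class)

# The final category also contains a document that was multiply classified
# or assigned by the closest-vector rule (hypothetical example).
final_category = partition_class + [[0.6, 0.6]]
final_centroid = centroid(final_category)

# The classification vector is no longer the centroid of what it represents.
differs = final_centroid != class_vector
```

A cutoff correlation of 0.5, for instance, corresponds to an angular radius of about 60 degrees under the cosine assumption.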
Table 4.3 summarizes the main parts of the classification algorithm
and an overall flowchart is given in Figure 4.[OCRerr].