ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Query-Document Matching Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
have been either clustered or marked loose. Note that as the documents
are clustered, they are removed from the process of identifying initial
category subsets. This strategy prevents the generation of classification
subsets with large overlap and materially reduces the number of
correlations required. Since it is reasonable to expect that some
documents "should be" multiply classified, the classification vectors
are themselves correlated with the entire collection. In this manner,
previously clustered documents can appear above the cutoff for a
given classification vector and thus be associated with more than one
category. Figure 4.7(a) illustrates a correlation distribution of
unclustered documents which leads to a classification vector (shown in
Figure 4.8(a)), and part (b) shows a part of the correlation
distribution of this classification vector with the entire collection.
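
The multiple-classification step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the program used in the report: the cosine coefficient stands in for the correlation measure, a document is taken to fall "above the cutoff" when its correlation is greater than or equal to the cutoff, and the names cosine, assign_categories, docs, vectors and the cutoff value 0.6 are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine correlation of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_categories(documents, classification_vectors, cutoff):
    """Correlate every classification vector with the entire collection.

    Documents already clustered elsewhere are not excluded here, so one
    document may be associated with more than one category.
    """
    assignments = {i: [] for i in range(len(documents))}
    for c, class_vec in enumerate(classification_vectors):
        for i, doc in enumerate(documents):
            if cosine(doc, class_vec) >= cutoff:
                assignments[i].append(c)
    return assignments


# Hypothetical two-term collection with two classification vectors.
docs = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
vectors = [[1.0, 0.2], [0.2, 1.0]]
result = assign_categories(docs, vectors, cutoff=0.6)
print(result)
```

In this toy collection the middle document correlates at about 0.83 with both classification vectors and is therefore assigned to both categories, while each of the other two documents is assigned to one category only.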
Since there is no a priori way to establish exactly how many
categories will be formed by this initial pass through the collection,
a second pass is used in case the number formed is less than specified.
(Note that more than the specified number of categories could be
formed during pass 1, but this would imply that the density test could
be made more restrictive or that the category size limit could be
increased.) During pass 1, the initial part of the sorted correlation
list for all documents failing the density test is saved on tape. In
pass 2 this list is scanned and a measure of the unclustered document
density around such documents is computed. The maximum values of this
measure (which is just the sum of a fixed number of the sorted
correlations) are used to select additional classification regions
until the specified number of categories has been formed. The algorithm