Scientific Report No. ISR-10, Information Storage and Retrieval. The Query-Document Matching Function (chapter). Joseph John Rocchio, Harvard University; Gerard Salton. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

have been either clustered or marked loose. Note that as the documents are clustered, they are removed from the process of identifying initial category subsets. This strategy prevents the generation of classification subsets with large overlap and materially reduces the number of correlations required. Since it is reasonable to expect that some documents should be multiply classified, the classification vectors are themselves correlated with the entire collection. In this manner, previously clustered documents can appear above the cutoff for a given classification vector and thus be associated with more than one category. Figure 4.7(a) illustrates a correlation distribution of unclustered documents which leads to a classification vector (shown in Figure 4.8(a)), and part (b) shows a part of the correlation distribution of this classification vector with the entire collection.

Since there is no a priori way to establish exactly how many categories will be formed by this initial pass through the collection, a second pass is used in case the number formed is less than specified. (Note that more than the specified number of categories could be formed during pass 1, but this would imply that the density test could be made more restrictive or that the category size limit could be increased.) During pass 1, the initial part of the sorted correlation list for all documents failing the density test is saved on tape. In pass 2 this list is scanned and a measure of the unclustered document density around such documents is computed.
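As a rough illustration of the multiple-classification step described earlier in this passage (each classification vector is correlated with the entire collection, and every document above the cutoff is associated with that category), the following minimal sketch may help. It rests on assumptions not stated in the text: the correlation is taken to be the cosine of the angle between vectors, a correlation equal to the cutoff counts as a match, and the names `cosine`, `assign_categories`, the sample vectors, and the cutoff value 0.7 are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine correlation between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_categories(docs, class_vectors, cutoff):
    """Map each document index to the categories whose classification
    vector correlates with it at or above the cutoff.

    Every classification vector is compared with the whole collection,
    so a document may end up in more than one category, or in none.
    """
    memberships = {i: [] for i in range(len(docs))}
    for c, class_vec in enumerate(class_vectors):
        for i, doc in enumerate(docs):
            if cosine(class_vec, doc) >= cutoff:
                memberships[i].append(c)
    return memberships


# Hypothetical toy collection: four documents over three terms.
docs = [[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 0, 1]]
# Hypothetical classification vectors for two categories.
class_vectors = [[1, 0.5, 0], [0.5, 1, 0]]

memberships = assign_categories(docs, class_vectors, cutoff=0.7)
print(memberships)
```

In this toy data, document 1 falls above the cutoff for both classification vectors and is therefore multiply classified, while document 3 matches neither and remains loose.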
The maximum values of this measure (which is just the sum of a fixed number of the sorted correlations) are used to select additional classification regions until the specified number of categories has been formed. The algorithm