Scientific Report No. ISR-10
Information Storage and Retrieval
The Query-Document Matching Function
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

cutoff are to be used as an initial category, this algorithm must accomplish several objectives. First, the subset must be constrained by the maximum and minimum size limits. Further, a region of high document density should yield a larger subset than a region of low density. Thus, within the size constraints, documents with correlation above ρ_min are automatically placed above cutoff. If the correlations fall below ρ_min before the size limit is exceeded, the cutoff is chosen at the greatest correlation difference (Figure 4.6) in the distribution. This produces, in effect, the sharpest boundary between the identified subset and neighboring unclustered documents.

A classification vector following equation (4.2) is now formed for the subset so identified, and a scaled, truncated version of it is then correlated with the entire source collection, thereby identifying documents centered around it. The resultant correlation distribution of the classification vector is sorted into descending order and the cutoff algorithm is reapplied. In this case all documents above cutoff with correlation greater than the minimum clustering correlation (an input parameter) are marked clustered. Documents above the cutoff but with correlation lower than this minimum are marked loose. This prevents such documents, which are clearly related to the category just formed (i.e., they are above the cutoff), from becoming candidates for new cluster centers at this stage of the process. At this point control passes to step 1.
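The cutoff rule and the clustered/loose marking described above can be illustrated with a minimal Python sketch. It is one possible reading of the prose, not the report's own program. The function names (`choose_cutoff`, `label_documents`), the parameter names, and the treatment of edge cases (ties, collections smaller than the minimum size) are assumptions made for illustration.

```python
from typing import List, Optional


def choose_cutoff(corrs: List[float], rho_min: float,
                  min_size: int, max_size: int) -> int:
    """Return how many top-ranked documents fall above the cutoff.

    `corrs` must already be sorted into descending order.
    All names here are hypothetical; the source gives no code.
    """
    n = len(corrs)
    # Documents correlating above rho_min go above the cutoff automatically.
    n_above = sum(1 for c in corrs if c > rho_min)
    if n_above >= max_size:
        # The size limit is reached before correlations drop below rho_min.
        return min(max_size, n)
    # Otherwise place the cutoff at the greatest difference between
    # neighboring correlations, staying inside the size limits.
    lo = max(min_size, n_above, 1)
    hi = min(max_size, n - 1)
    if lo > hi:
        # Assumed fallback: no admissible gap exists, so take what is allowed.
        return min(max_size, n)
    return max(range(lo, hi + 1), key=lambda k: corrs[k - 1] - corrs[k])


def label_documents(corrs: List[float], cutoff: int,
                    min_cluster_corr: float) -> List[Optional[str]]:
    """Mark documents above the cutoff as 'clustered' or 'loose'.

    Documents below the cutoff get None (their status is left unchanged).
    """
    labels: List[Optional[str]] = []
    for rank, c in enumerate(corrs):
        if rank >= cutoff:
            labels.append(None)
        elif c > min_cluster_corr:
            labels.append("clustered")
        else:
            labels.append("loose")
    return labels


if __name__ == "__main__":
    ranked = [0.9, 0.85, 0.8, 0.5, 0.45, 0.4]
    k = choose_cutoff(ranked, rho_min=0.7, min_size=2, max_size=5)
    print(k, label_documents(ranked, k, min_cluster_corr=0.82))
```

In this sketch a dense region (many correlations above ρ_min) yields a larger subset simply because every such document is counted before the gap search begins, which matches the stated objective that high-density regions produce larger subsets.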
This first pass through the collection ends when all documents ...

ρ_min represents a correlation significantly above the average document-document correlation of the collection.
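To make the reference point for ρ_min concrete, the following sketch estimates the average document-document correlation of a small collection. It assumes that documents are plain numeric term vectors and that the correlation is the cosine measure. The helper names are hypothetical, and the text does not say how far above the average ρ_min should lie, so no margin is built in.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine correlation of two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pairwise_correlation(docs: Sequence[Sequence[float]]) -> float:
    """Average correlation over all unordered document pairs."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


if __name__ == "__main__":
    collection = [[1, 0], [0, 1], [1, 1]]
    print(mean_pairwise_correlation(collection))
```

A candidate ρ_min can then be compared against this average. The exhaustive pairwise loop is quadratic in the number of documents, so for a large collection a random sample of pairs would be a reasonable substitute.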