ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Query-Document Matching Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
cutoff are to be used as an initial category, this algorithm must
accomplish several objectives. First, the subset must be constrained by
the maximum and minimum size limits. Further, a region of high
document density should yield a larger subset than a region of low
density. Thus within the size constraints, documents with correlation
above ρ_min are automatically placed above cutoff. If the correlations
fall below ρ_min before the size limit is exceeded, the cutoff is chosen
at the greatest correlation difference (Figure 4.6) in the distribution.
This produces, in effect, the sharpest boundary between the identified
subset and neighboring unclustered documents. A classification vector
following equation (4.2) is now formed for the subset so identified,
and a scaled, truncated version of it is then correlated with the entire
source collection, thereby identifying documents centered around it.
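
The cutoff rule described above can be sketched as follows. This is only
an illustrative reading of the text, not the program used in the report:
the names choose_cutoff, n_min, n_max and rho_min are hypothetical, and
the range over which the greatest correlation difference is sought (from
the larger of the minimum size and the count of documents above rho_min,
up to the maximum size) is an assumption, since the text does not state it.

```python
def choose_cutoff(sorted_corr, n_min, n_max, rho_min):
    """Return the number of top-ranked documents placed above the cutoff.

    sorted_corr: correlations in descending order.
    n_min, n_max: minimum and maximum subset sizes (hypothetical names).
    rho_min: threshold correlation (hypothetical name).
    """
    n_max = min(n_max, len(sorted_corr))

    # Documents with correlation above rho_min are automatically included.
    above = sum(1 for c in sorted_corr if c > rho_min)
    if above >= n_max:
        # The size limit is reached before correlations drop below rho_min.
        return n_max

    # Assumption: search candidate subset sizes between the lower bound
    # and n_max for the largest drop between neighboring correlations.
    lo = max(n_min, above, 1)
    candidates = [k for k in range(lo, n_max + 1) if k < len(sorted_corr)]
    if not candidates:
        return n_max

    # The gap for size k is the drop from the last included document
    # to the first excluded one; ties go to the smaller subset.
    return max(candidates, key=lambda k: sorted_corr[k - 1] - sorted_corr[k])


corr = [0.9, 0.8, 0.75, 0.4, 0.35, 0.3, 0.1]
size = choose_cutoff(corr, n_min=2, n_max=5, rho_min=0.7)
```

With these sample values three documents exceed rho_min, and the largest
drop (0.75 to 0.4) also falls after the third document, so the sketch
places the cutoff there.
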
The resultant correlation distribution of the classification vector is
sorted into descending order and the cutoff algorithm is reapplied. In
this case all documents above cutoff with correlation greater than the
minimum clustering correlation (an input parameter) are marked clustered.
Documents above the cutoff but with correlation lower than this minimum
are marked loose. This prevents such documents, which are clearly related
to the category just formed (i.e., they are above the cutoff), from
becoming candidates for new cluster centers at this stage of the process.
At this point control passes to step 1.
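
The marking step can be illustrated with a short sketch. It assumes the
cutoff algorithm has already returned a subset size, and it uses
hypothetical names throughout (mark_documents, cutoff_size,
min_cluster_corr); the treatment of a correlation exactly equal to the
minimum is also an assumption, since the text only distinguishes
"greater" and "lower".

```python
def mark_documents(corr_by_doc, cutoff_size, min_cluster_corr):
    """Label the documents that fall above the cutoff.

    corr_by_doc: mapping of document id to its correlation with the
        classification vector.
    cutoff_size: number of top-ranked documents above the cutoff.
    min_cluster_corr: minimum clustering correlation (input parameter).
    Returns a mapping of document id to "clustered" or "loose";
    documents below the cutoff are left unmarked.
    """
    # Rank documents by descending correlation.
    ranked = sorted(corr_by_doc, key=corr_by_doc.get, reverse=True)

    labels = {}
    for doc in ranked[:cutoff_size]:
        if corr_by_doc[doc] > min_cluster_corr:
            labels[doc] = "clustered"
        else:
            # Above the cutoff but too weakly correlated: marked loose so
            # it is not picked as a new cluster center at this stage.
            labels[doc] = "loose"
    return labels


labels = mark_documents(
    {"d1": 0.9, "d2": 0.6, "d3": 0.3, "d4": 0.1},
    cutoff_size=3,
    min_cluster_corr=0.5,
)
```

Here d1 and d2 are marked clustered, d3 is marked loose, and d4, which
lies below the cutoff, remains unmarked and so stays available as a
candidate center.
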
This first pass through the collection ends when all documents
ρ_min represents a correlation significantly above the average
document-document correlation of the collection.