ISR10 Scientific Report No. ISR-10 Information Storage and Retrieval The Query-Document Matching Function chapter Joseph John Rocchio Harvard University Gerard Salton Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

known about their relative position in the index space. As the process develops, a document may become "clustered", i.e. associated with a particular classification vector, or may be identified with the "loose" state, indicating that it has been found to be oriented in a region of low density in the index space.

Unclustered documents are considered in sequence, and the first step consists in generating a measure of the distribution of document images around the document being considered. This is accomplished by correlating this document with all documents except those which are in the clustered state. The resulting correlation distribution is sorted into descending order (note that the correlation is inverse to the angular distance metric), i.e. into order of increasing distance, and a test is applied to determine if the region being considered (defined by the object document plus those unclustered documents in its immediate vicinity) is dense enough for category formation. The density test employed (a flowchart is given in Figure 4.4) requires that the correlation distribution exceed two test points, as illustrated by Figure 4.5. This test was chosen heuristically after experimenting with typical document-document correlation distributions. If the density test fails, the document under consideration is marked "loose" and control returns to step 1 to consider the next unclustered document. If the density test is satisfied, a cutoff correlation is determined as a function of the category size limits and the distribution of correlation values. The cutoff-determining algorithm is illustrated in Figure 4.6.
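The density-test pass described above can be sketched in Python. This is only an illustrative sketch under stated assumptions, not the report's program. The correlation is assumed to be the cosine of the angle between index vectors. Each "test point" is assumed to be a pair (rank, minimum correlation) that the sorted distribution must reach, since the exact form is given only in Figure 4.5. The cutoff-determination step of Figure 4.6 is omitted. The names `cosine`, `correlation_distribution`, `passes_density_test`, and `density_pass` are hypothetical.

```python
import math


def cosine(u, v):
    """Assumed correlation measure: cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def correlation_distribution(idx, docs, state):
    """Correlate document idx with every document not in the clustered state.

    Returns (correlation, document index) pairs in descending order of
    correlation, i.e. in order of increasing angular distance.
    """
    pairs = [
        (cosine(docs[idx], docs[j]), j)
        for j in range(len(docs))
        if j != idx and state[j] != "clustered"
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)
    return pairs


def passes_density_test(dist, test_points):
    """Assumed form of the test: for each (rank, min_corr) test point, the
    rank-th highest correlation must be at least min_corr."""
    return all(
        len(dist) >= rank and dist[rank - 1][0] >= min_corr
        for rank, min_corr in test_points
    )


def density_pass(docs, test_points):
    """Consider unclustered documents in sequence and mark sparse ones loose.

    Documents that pass are only collected as candidates; cutoff selection
    and category formation are left out of this sketch.
    """
    state = ["unclustered"] * len(docs)
    candidates = []
    for i in range(len(docs)):
        if state[i] != "unclustered":
            continue
        dist = correlation_distribution(i, docs, state)
        if passes_density_test(dist, test_points):
            candidates.append(i)
        else:
            state[i] = "loose"
    return state, candidates
```

With two hypothetical test points such as `[(1, 0.9), (2, 0.8)]`, a document passes only if its nearest neighbour correlates at 0.9 or better and its second nearest at 0.8 or better. A document far from all others fails the first point and is marked loose.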
As the documents above the