ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Query-Document Matching Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
4-24
known about their relative position in the index space. As the process
develops, a document may become "clustered", i.e. associated with a
particular classification vector, or may be identified with the "loose"
state, indicating that it has been found to be oriented in a region of
low density in the index space. Unclustered documents are considered
in sequence, and the first step consists in generating a measure of the
distribution of document "images" around the document being considered.
This is accomplished by correlating this document with all documents
except those which are in the clustered state. The resulting correlation
distribution is sorted into descending order (note that the
correlation is inverse to the angular distance metric), i.e. into order
of increasing distance, and a density test is applied to determine if the
region being considered (defined by the object document plus those
unclustered documents in its immediate vicinity) is dense enough for
category formation. The "density" test employed (a flowchart is given in
Figure 4.4) requires that the correlation distribution exceed two test
points, as illustrated by Figure 4.5. This test was chosen heuristically
after experimenting with typical document-document correlation
distributions.
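
The first step and the density test described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the procedure of Figures 4.4 and 4.5, which are not reproduced here. The cosine coefficient is assumed as the correlation measure (the text says only that correlation is inverse to angular distance). Each test point is assumed to be a pair (rank, minimum correlation), and the sample test-point values are hypothetical. The names `cosine`, `correlation_distribution`, and `passes_density_test` are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine correlation of two index vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def correlation_distribution(index, doc_id, state):
    """Correlate doc_id with every other document that is not in the
    clustered state, and return (correlation, id) pairs in descending
    order of correlation, i.e. in order of increasing distance."""
    target = index[doc_id]
    pairs = [
        (cosine(target, vector), other)
        for other, vector in index.items()
        if other != doc_id and state.get(other) != "clustered"
    ]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs


def passes_density_test(distribution, test_points):
    """Assumed form of the test: for each (rank, min_corr) test point,
    the rank-th highest correlation must reach min_corr. Ranks are
    1-based; a distribution shorter than the rank fails the test."""
    for rank, min_corr in test_points:
        if len(distribution) < rank:
            return False
        if distribution[rank - 1][0] < min_corr:
            return False
    return True


# Hypothetical data: five documents in a two-dimensional index space.
index = {
    "a": [1.0, 0.0],
    "b": [1.0, 0.1],
    "c": [0.9, 0.2],
    "d": [0.0, 1.0],
    "e": [1.0, 0.0],
}
state = {"e": "clustered"}  # "e" is excluded from the correlation step

distribution = correlation_distribution(index, "a", state)
dense = passes_density_test(distribution, [(1, 0.9), (2, 0.9)])
```

In this sketch a document in the "loose" state still takes part in the correlation step, since the text excludes only clustered documents.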
If the density test fails, the document under consideration is
marked "loose" and control returns to step 1 to consider the next
unclustered document. If the density test is satisfied, a cutoff
correlation is determined as a function of the category size limits
and the distribution of correlation values. The cutoff-determining
algorithm is illustrated in Figure 4.6. As the documents above the