ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Query-Document Matching Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
4-2
subset of index vectors for category formation are based on the number
of elements in the subset as well as the mutual distance among the
elements. Under these conditions a region of the index space with a
high density of document vectors will yield categories in which all the
documents are closely related (via the distance function), whereas in
regions of relatively low density, categories covering a wider scope
will be formed. Note that as the mutual distance among the members of
a classification category increases, the classification vector becomes
less representative of the group as a whole. There is therefore a
definite tradeoff in category formation between producing categories
of equal population on the one hand, and maintaining control of the
distance relation among category members on the other.
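The loss of representativeness noted above can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the classification vector is the centroid (component-wise mean) of the member vectors and that closeness is measured by cosine correlation; neither choice is stated in this passage, and the names centroid, cosine, and mean_correlation_to_centroid are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine correlation between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise mean, used here as a stand-in classification vector."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def mean_correlation_to_centroid(vectors):
    """Average correlation of the members with their own centroid."""
    c = centroid(vectors)
    return sum(cosine(v, c) for v in vectors) / len(vectors)

# A dense region: members lie close together in the index space.
tight_group = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.0]]
# A sparse region: members are far apart from one another.
spread_group = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

tight_score = mean_correlation_to_centroid(tight_group)
spread_score = mean_correlation_to_centroid(spread_group)
```

Under these assumptions the tight group correlates more strongly with its centroid than the spread group does, which is the sense in which a category of widely separated members is less well summarized by a single vector.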
Control of the classification categories is achieved by a set
of input parameters to the algorithm which specify:
1. The number of categories desired
2. A lower and upper bound on the number of elements to be
included in any classification subset
3. An upper bound on the distance (lower bound on the
correlation coefficient) between a document and a
classification vector such that the document is still
considered to be associated with that vector.
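The three control parameters listed above can be pictured as a small configuration record. The following is a minimal sketch and not the report's implementation: the names ClusteringParams, is_associated, and size_within_bounds are hypothetical, and cosine correlation is assumed as the correlation coefficient.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class ClusteringParams:
    """Hypothetical container for the three input parameters."""
    n_categories: int        # 1. number of categories desired
    min_size: int            # 2. lower bound on elements per subset
    max_size: int            # 2. upper bound on elements per subset
    min_correlation: float   # 3. lower bound on document/vector correlation

    def __post_init__(self):
        if self.n_categories < 1:
            raise ValueError("n_categories must be at least 1")
        if not 1 <= self.min_size <= self.max_size:
            raise ValueError("require 1 <= min_size <= max_size")
        if not 0.0 <= self.min_correlation <= 1.0:
            raise ValueError("min_correlation must lie in [0, 1]")

def cosine(u, v):
    """Cosine correlation (an assumed choice of correlation coefficient)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def is_associated(document, classification_vector, params):
    """True if the document is still close enough to the vector."""
    return cosine(document, classification_vector) >= params.min_correlation

def size_within_bounds(n_members, params):
    """True if a subset of n_members satisfies the population bounds."""
    return params.min_size <= n_members <= params.max_size

params = ClusteringParams(n_categories=5, min_size=2, max_size=10,
                          min_correlation=0.5)
near = is_associated([1.0, 0.0], [1.0, 1.0], params)   # correlation about 0.71
far = is_associated([1.0, 0.0], [0.0, 1.0], params)    # correlation 0
```

Expressing the third parameter as a lower bound on correlation rather than an upper bound on distance follows the parenthetical in item 3; the two are equivalent whenever distance decreases as correlation increases.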
In the course of the classification process each document may
be associated with one of three possible states. Initially, all
documents are considered to be "unclustered", implying that they have
not been assigned to any classification category, nor is anything
(