ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Query-Document Matching Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
as n-dimensional Cartesian vectors. Using the angular distance
similarity measure, it is clear that a classification category should
consist of a set of document images confined within a localized
hypercone of the index space. Alternatively, if the index images are
pictured as unit vectors terminating on the unit n-sphere, a
classification category should consist of a set of documents represented
by index vectors terminating within some local area on the surface of
the unit n-sphere. In these terms the problem of automatic document
classification is to define the characteristics of such areas and to
establish a procedure for identifying and representing them.
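
As a minimal sketch of this geometric picture, the following Python fragment computes the angular distance between two index vectors and tests whether a document vector falls inside a hypercone about a given axis. The function names, the choice of axis, and the half-angle threshold are illustrative assumptions and do not come from the report.

```python
import math

def unit(v):
    """Scale v to unit length; v must not be the zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]

def angular_distance(a, b):
    """Angle in radians between index vectors a and b."""
    ua, ub = unit(a), unit(b)
    cos = sum(x * y for x, y in zip(ua, ub))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, cos)))

def in_hypercone(axis, doc, half_angle):
    """True if doc lies within half_angle (radians) of axis."""
    return angular_distance(axis, doc) <= half_angle

# Hypothetical data: one axis, two documents, a 15-degree cone.
axis = [1.0, 1.0, 0.0]
docs = [[2.0, 2.0, 0.1], [0.0, 0.0, 1.0]]
members = [in_hypercone(axis, d, math.radians(15)) for d in docs]
```

Here the first document lies nearly along the axis and falls inside the cone, while the second is orthogonal to the axis and falls outside it. Because only direction matters, the length of each vector has no effect on membership.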
5. A Heuristic Classification Algorithm
A. Basic Concepts
Associated with an arbitrary set of document index vectors D,
a classification vector c is defined by the equation

    c = \frac{1}{n} \sum_{i=1}^{n} \frac{d_i}{|d_i|}    (4.2)

where D = {d_1, d_2, ..., d_n}. The vector c is the centroid or center of
gravity of the set of unit vectors d_i/|d_i| derived from the elements
of D and represents, then, a vector with an orientation for which

    \sum_{i=1}^{n} e(c, d_i) = 0
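
The classification vector can be illustrated with a short Python sketch. It assumes that c is the arithmetic mean of the unit vectors d_i/|d_i|, as the description "centroid or center of gravity" suggests; the function names and the sample set D are hypothetical.

```python
import math

def unit(v):
    """Scale v to unit length; v must not be the zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]

def classification_vector(docs):
    """Centroid of the unit vectors derived from the index vectors in docs."""
    if not docs:
        raise ValueError("document set D must be non-empty")
    units = [unit(d) for d in docs]
    n = len(units)
    # Average each coordinate over the n unit vectors.
    return [sum(col) / n for col in zip(*units)]

# Hypothetical set D of two index vectors of different lengths.
D = [[3.0, 0.0], [0.0, 2.0]]
c = classification_vector(D)  # [0.5, 0.5]
```

Normalizing first means each document contributes its direction only, so a long index vector does not pull the centroid toward itself. Note that c is in general shorter than unit length; in this sketch only its orientation is of interest.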