ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Query-Document Matching Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
where [symbol illegible] is defined according to equation (4.1). The classification
vector c (or more precisely its orientation) is the best single
representation for all of the elements in the set D under the assumption
that the information carried by an index vector is contained in its
angular position.
In the geometrical interpretation, the vector c/|c| terminates
at the centroid of the point distribution on the unit N-sphere
representing the vectors d_i/|d_i|. In particular, then, if the elements
in D are sufficiently close to one another, c must be close to all of
them. With respect to the classification problem, if the members of D
are to be grouped into a classification category, c can be considered to
be the best classification "head" or representation for the category.
This property of the centroid vector together with the metric properties
of angular query-document matching will be used as a basis for an
automatic classification algorithm suitable for storage organization in
the vector indexing model.
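The centroid property can be illustrated with a minimal sketch. Equation (4.2) is not reproduced in this section, so the sketch assumes that the classification vector is the sum of the unit-normalized index vectors d_i/|d_i|, and that angular matching is measured by the cosine of the angle between two vectors. The function names and the sample vectors are hypothetical and chosen only for illustration.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def unit(v):
    """Project a vector onto the unit sphere, keeping only its direction."""
    n = norm(v)
    if n == 0.0:
        raise ValueError("a zero vector has no angular position")
    return [x / n for x in v]


def classification_vector(docs):
    """Sum of the unit-normalized vectors (ASSUMED form of equation 4.2)."""
    units = [unit(d) for d in docs]
    return [sum(column) for column in zip(*units)]


def cosine(a, b):
    """Cosine of the angle between a and b (assumed matching function)."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


# Hypothetical index vectors for a small set D of closely related documents.
docs = [[3.0, 1.0, 0.0], [2.0, 2.0, 0.0], [4.0, 1.0, 1.0]]
c = classification_vector(docs)

# Each member of D should correlate strongly with c when D is tightly grouped.
for d in docs:
    print(round(cosine(c, d), 3))
```

Because every vector is normalized before summing, rescaling a document vector leaves c unchanged under this assumption, which is consistent with the statement that only the orientation of c carries information.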
3. Description of the Classification Algorithm
The objective of the classification process is to generate a
set of categories or document subsets, each represented by a classification
vector (equation (4.2)), from the source collection. The properties
of the classification system should result in increased search
efficiency in a document retrieval system. The storage organization
induced by a classification of this type leads to a two-level search
algorithm. Consider an input item which is to be compared with each
member of a collection of N elements so that those elements which