ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Query-Document Matching Function
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
proceeds exactly as in pass 1 except for bypassing the density test. At
the end of pass 2, therefore, at least the required number of initial
categories have been formed.
It should be clear that at the end of pass 2 not every document
has necessarily been used as an element of a classification subset.
However, those which have not can be assumed to be document images which
are relatively isolated in the index space. In general there are several
alternatives for dealing with such documents. In a dynamic environment,
i.e. one in which the collection is growing, there will be new documents
not yet classified. Isolated documents, then, could be grouped with
these in a category which is always searched in detail for all input
queries. At periodic intervals all such documents would be entered
into the classification system with the possibility of generating new
categories as the size of the collection increases. For the current
study, however, the elimination of those documents which are in effect
hard to classify would bias the evaluation of the overall effectiveness
of the technique. The objective here then is to produce a set of
categories suitable for all documents in the test collection. To this
end a third pass was incorporated into the classification process.
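
The assignment step with which such a third pass begins can be illustrated with a minimal sketch. It assumes that "correlation" means the cosine correlation between index vectors, and the names `cosine` and `partition_by_best_vector` are hypothetical, chosen only for this illustration. It shows only how every document, including a relatively isolated one, ends up in exactly one partition class. It does not show how a new classification vector is then formed from each class.

```python
import math


def cosine(u, v):
    """Cosine correlation of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def partition_by_best_vector(documents, class_vectors):
    """Assign each document to the classification vector it correlates with most.

    Returns one list per classification vector; list i holds the indices of
    the documents assigned to class_vectors[i]. Ties go to the lowest index
    (an assumption of this sketch, not something the text specifies).
    """
    classes = [[] for _ in class_vectors]
    for doc_index, doc in enumerate(documents):
        best = max(range(len(class_vectors)),
                   key=lambda i: cosine(doc, class_vectors[i]))
        classes[best].append(doc_index)
    return classes


# Toy data, invented for illustration: four documents, two classification vectors.
documents = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0], [0, 0.2, 1]]
class_vectors = [[1, 0, 0], [0, 1, 1]]
partition = partition_by_best_vector(documents, class_vectors)
```

Because every document is assigned to its best-matching vector regardless of how low that correlation is, no document is left unclassified, which is the property the evaluation requires.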
At the completion of pass 2 each source document is assigned
to the classification vector with which it has the highest correlation.
This assignment induces a partition of the collection such that
partition class i contains all documents which are closer to
classification vector i than to any other classification vector. In
pass 3 each partition class is used as the classification subset for a
new classification vector which will be similar to but not identical