Scientific Report No. ISR-10. Information Storage and Retrieval. The Query-Document Matching Function (chapter). Joseph John Rocchio. Harvard University. Gerard Salton. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

proceeds exactly as in pass 1 except for bypassing the density test. At the end of pass 2, therefore, at least the required number of initial categories have been formed.

It should be clear that at the end of pass 2 not every document has necessarily been used as an element of a classification subset. However, those which have not can be assumed to be document images which are relatively isolated in the index space. In general there are several alternatives for dealing with such documents. In a dynamic environment, i.e. one in which the collection is growing, there will be new documents not yet classified. Isolated documents, then, could be grouped with these in a category which is always searched in detail for all input queries. At periodic intervals all such documents would be entered into the classification system with the possibility of generating new categories as the size of the collection increases.

For the current study, however, the elimination of those documents which are in effect hard to classify would bias the evaluation of the overall effectiveness of the technique. The objective here then is to produce a set of categories suitable for all documents in the test collection. To this end a third pass was incorporated into the classification process. At the completion of pass 2 each source document is assigned to the classification vector with which it has the highest correlation. This assignment induces a partition of the collection such that partition class i contains all documents which are closer to classification vector i than to any other classification vector.
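The assignment step just described can be illustrated with a minimal sketch. It assumes that "correlation" is the cosine of the angle between a document vector and a classification vector, which this passage does not itself state. The function names, the tie-breaking rule, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine correlation of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def partition_by_best_vector(documents, class_vectors):
    """Assign each document to the classification vector it correlates with most.

    Returns a list of partition classes; class i holds the indices of the
    documents whose highest correlation is with class_vectors[i].
    Ties go to the lowest-numbered vector (an arbitrary choice made here,
    not something the report specifies).
    """
    classes = [[] for _ in class_vectors]
    for doc_index, doc in enumerate(documents):
        scores = [cosine(doc, c) for c in class_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        classes[best].append(doc_index)
    return classes


# Toy data (hypothetical): four documents over three index terms,
# and two classification vectors left over from pass 2.
documents = [
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 3],
]
class_vectors = [
    [1, 0, 0],
    [0, 1, 1],
]

partition = partition_by_best_vector(documents, class_vectors)
print(partition)  # prints [[0, 1], [2, 3]]
```

Because every document is assigned to exactly one class, the classes are disjoint and together cover the whole collection, which is what makes the result a partition and ensures that no hard-to-classify document is dropped.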
In pass 3 each partition class is used as the classification subset for a new classification vector which will be similar to but not identical