Scientific Report No. ISR-10, Information Storage and Retrieval. The Query-Document Matching Function. Joseph John Rocchio, Harvard University. Gerard Salton. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

... indicates that 20 categories would be about optimal for a collection of 405 documents (assuming only a single category is searched in detail); however, classifications of 20, 30, and 40 categories were experimentally produced for comparison purposes. The algorithm required from about 6 to 8 minutes, respectively, for these classifications and could undoubtedly be speeded up if it were reprogrammed for this purpose. Descriptive parameters of the classifications include the distributions of the cutoff correlations and the average document-classification vector correlations, which are shown in Figure 4.10.

To evaluate the effectiveness of the search optimization based on the classification-induced storage organization, the parameters of interest are: 1.) the consistency of retrieval with respect to all documents, i.e. does the reduced search lead to retrieving the same documents as the full search, and 2.) the consistency of retrieval with respect to relevant documents, i.e. is the retrieval of relevant documents altered by the reduced search? To this end each of the sample search requests was correlated with the set of classification vectors for the three classifications. Figure 4.11 shows the correlation distributions for one of the test queries with the vectors of each of the classifications. For each of the classifications (20, 30, and 40 categories) the five highest correlating categories for each query were recorded.
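The reduced search and its consistency check can be illustrated with a minimal sketch. It assumes the cosine of the angle between vectors as the correlation measure (the text says only that requests were "correlated" with the classification vectors), and every function name and all of the toy data below are hypothetical, chosen for illustration rather than taken from the report.

```python
import math

def cosine(u, v):
    """Cosine correlation between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_categories(query, class_vectors, k=5):
    """Indices of the k classification vectors correlating most highly with the query."""
    ranked = sorted(range(len(class_vectors)),
                    key=lambda c: cosine(query, class_vectors[c]),
                    reverse=True)
    return ranked[:k]

def reduced_candidates(query, class_vectors, members, k):
    """Union of the documents contained in the k best-matching categories."""
    candidates = set()
    for c in top_categories(query, class_vectors, k):
        candidates.update(members[c])
    return candidates

def full_search(query, docs, n):
    """Indices of the n documents correlating most highly with the query."""
    ranked = sorted(range(len(docs)),
                    key=lambda d: cosine(query, docs[d]),
                    reverse=True)
    return ranked[:n]

def overlap(candidates, full_top):
    """Number of full-search documents that the reduced search can also reach."""
    return sum(1 for d in full_top if d in candidates)

# Hypothetical toy collection: four documents in two categories.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
members = [[0, 1], [2, 3]]                     # documents assigned to each category
class_vectors = [[0.95, 0.05], [0.05, 0.95]]   # one classification vector per category
query = [1.0, 0.2]

candidates = reduced_candidates(query, class_vectors, members, k=1)
full_top = full_search(query, docs, n=2)
print(overlap(candidates, full_top), "of", len(full_top), "full-search documents reachable")
```

Restricting `full_top` to documents judged relevant before calling `overlap` would give the second parameter of interest, consistency with respect to relevant documents.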
The documents contained in the union of the first through fifth of such categories were then compared with the first 15 and first 30 documents retrieved by a full search. In addition the number of relevant documents in each of these category-retrieved subsets was computed. Assuming then that from 1 through 5 categories would be searched in