ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
The Query-Document Matching Function
chapter
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
indicates that 20 categories would be about optimal for a collection of
405 documents (assuming only a single category is searched in detail);
however, classifications of 20, 30, and 40 categories were experimentally
produced for comparison purposes. The algorithm required from about 6
to 8 minutes, respectively, for these classifications and could
undoubtedly be speeded up if it were reprogrammed for this purpose.
Descriptive parameters of the classifications include the
distributions of the cutoff correlations and the average
document-classification vector correlations, which are shown in
Figure 4.10. To evaluate the effectiveness of the search optimization
based on the classification-induced storage organization, the parameters
of interest are: 1.) the consistency of retrieval with respect to all
documents, i.e. does the reduced search lead to retrieving the same
documents as the full search, and 2.) the consistency of retrieval with
respect to relevant documents, i.e. is the retrieval of relevant
documents altered by the reduced search? To this end each of the sample
search requests was correlated with the set of classification vectors
for the three classifications. Figure 4.11 shows the correlation
distributions for one of the test queries with the vectors of each of
the classifications.
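
The correlation of a search request with a set of classification vectors can be sketched as follows. This is only an illustration under stated assumptions: the text above does not specify the correlation coefficient, so the cosine of the angle between term-weight vectors is assumed here, and the names `cosine_correlation` and `rank_categories` are hypothetical.

```python
import math


def cosine_correlation(u, v):
    """Cosine of the angle between two equal-length term-weight vectors.

    Assumption: the report says only that requests are "correlated" with
    the classification vectors; cosine is used here for illustration.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector correlates with nothing
    return dot / (norm_u * norm_v)


def rank_categories(query, classification_vectors):
    """Return (category index, correlation) pairs, highest correlation first."""
    scored = [
        (index, cosine_correlation(query, vector))
        for index, vector in enumerate(classification_vectors)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Applied to one query and the 20, 30, or 40 vectors of a classification, the sorted correlations give a distribution of the kind plotted for each classification, and the leading entries identify the categories a reduced search would examine.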
For each of the classifications (20, 30, and 40 categories) the
five highest correlating categories for each query were recorded. The
documents contained in the union of the first through fifth of such
categories were then compared with the first 15 and first 30 documents
retrieved by a full search. In addition the number of relevant
documents in each of these category-retrieved subsets was computed.
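
The comparison just described can be sketched in a few lines. The sketch assumes that the categories are already ranked by correlation with the query, that `category_members` maps each category to its document identifiers, and that `full_ranking` lists the documents in full-search order; all of these names, and the function `reduced_search_overlap`, are hypothetical and serve only to illustrate the bookkeeping.

```python
def reduced_search_overlap(ranked_categories, category_members, full_ranking,
                           relevant, max_categories=5, cutoffs=(15, 30)):
    """Tabulate reduced-search results for the top 1..max_categories categories.

    For each k, the candidate set is the union of the documents in the k
    highest-correlating categories. Each row reports the size of that set,
    how many relevant documents it holds, and how many of the first n
    full-search documents it contains for each cutoff n.
    """
    relevant = set(relevant)
    rows = []
    candidates = set()
    for k, category in enumerate(ranked_categories[:max_categories], start=1):
        candidates |= set(category_members[category])
        row = {
            "categories": k,
            "candidates": len(candidates),
            "relevant": len(candidates & relevant),
        }
        for n in cutoffs:
            row[f"overlap_top_{n}"] = len(candidates & set(full_ranking[:n]))
        rows.append(row)
    return rows
```

Using a set union means a document assigned to more than one category is counted only once.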
Assuming then that from 1 through 5 categories would be searched in