Scientific Report No. IRS-13 Information Storage and Retrieval

Summary

Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

single number serves as an indication of system performance, and continuous recall-precision curves. The construction of the recall-precision curves is described in detail in section II, as are the methods used for producing curves averaged over many search requests. Extensions of the basic evaluation techniques are also discussed to cover cases where variable relevance grades are assigned to documents, and to indicate the problems inherent in a comparison between experimental and operational retrieval systems.

Detailed test results, covering the correlation methods used to compare analyzed documents with analyzed search requests, are given in section III. The analysis includes, in particular, a comparison between the "overlap" coefficient, which represents a measure proportional to the number of matching terms, and the "cosine" coefficient, which also takes into account the total number of terms present in a given document. In each case, the terms are either weighted in accordance with their presumed importance, or unweighted. The conclusion reached is that the cosine correlation used with the weighted content identifiers produces superior retrieval performance compared with the other possible correlation procedures.

Ten additional correlation measures are examined in section IV by K. Reitsma and J. Sagalyn, using the ADI collection for test purposes. The coefficients used include, in particular, the cosine function, the overlap measure, the inner product, the Maron-Kuhns measure, the Parker-Rhodes-Needham measure, and others. Overall, when all recall levels are taken into account, the cosine measure again produces the best results.
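
The contrast drawn above between the overlap and cosine coefficients can be illustrated with a minimal Python sketch. The summary does not give either formula, so the definitions below are assumptions: overlap is taken as the sum of element-wise minima divided by the smaller of the two total weights, and cosine as the usual normalized inner product. The function names and the sample term vectors are hypothetical and are not taken from the report.

```python
import math

def overlap(query, doc):
    """Assumed overlap form: matching weight over the smaller total weight."""
    matching = sum(min(q, d) for q, d in zip(query, doc))
    smaller_total = min(sum(query), sum(doc))
    return matching / smaller_total if smaller_total else 0.0

def cosine(query, doc):
    """Inner product normalized by the lengths of both vectors."""
    inner = sum(q * d for q, d in zip(query, doc))
    norm = math.sqrt(sum(q * q for q in query) * sum(d * d for d in doc))
    return inner / norm if norm else 0.0

# Hypothetical unweighted (binary) term vectors over a six-term vocabulary.
query = [1, 1, 0, 0, 0, 0]
short_doc = [1, 1, 0, 0, 0, 0]   # contains only the two query terms
long_doc = [1, 1, 1, 1, 1, 1]    # contains the query terms plus four others

for name, doc in (("short", short_doc), ("long", long_doc)):
    print(name, overlap(query, doc), round(cosine(query, doc), 3))
```

Under these assumed definitions, both documents receive the same overlap score because each matches both query terms, while the cosine score of the longer document drops to about 0.577 because its four unmatched terms enter the normalization. Weighted vectors can be passed to the same functions by replacing the 0/1 entries with term weights.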