Scientific Report No. IRS-13 Information Storage and Retrieval
Correlation Measures
K. Reitsma
J. Sagalyn
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
IV. Correlation Measures
K. Reitsma and J. Sagalyn
Abstract
In this study the performance of ten matching functions is investigated.
Performance is measured in terms of recall and precision.
All ten functions are tested on the 82-document ADI collection; the best four
are tested again on the larger 200-document Cranfield collection. It is
shown that the Parker-Rhodes-Needham function has the best performance in
the ADI collection below 0.50 recall; however, this function is the worst in
the Cranfield collection test. Overall, the Cosine function shows the best
performance.
1. Introduction
A document retrieval system, from a user's point of view, takes
a request for information, in the form of a short verbal description, matches
the request against the documents in the collection and returns those which
by some measure are most relevant.
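The matching step described above can be sketched roughly as follows. This is a minimal illustration and not the SMART implementation: it assumes that requests and documents are held as sparse vectors mapping a concept number to a weight, and it uses the standard cosine function as the matching function. The names `cosine`, `rank`, `collection`, and `request`, and the toy data, are hypothetical.

```python
import math


def cosine(request, document):
    """Cosine of the angle between two sparse weight vectors (dicts)."""
    dot = sum(w * document.get(c, 0.0) for c, w in request.items())
    norm_r = math.sqrt(sum(w * w for w in request.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_r == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_r * norm_d)


def rank(request, collection, match=cosine):
    """Return document ids ordered by decreasing match score."""
    scores = {doc_id: match(request, vec) for doc_id, vec in collection.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: concept number -> weight
collection = {
    "d1": {1: 2.0, 4: 1.0},
    "d2": {2: 3.0},
    "d3": {1: 1.0, 2: 1.0},
}
request = {1: 1.0}
ranking = rank(request, collection)
```

Because `rank` takes the matching function as a parameter, any other function with the same signature could be substituted for `cosine` to compare rankings on the same collection.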
Within the SMART system, all the documents have been analyzed automatically
according to word frequency counts of keywords contained in a
thesaurus. Each analyzed document is represented by a description vector of
concept numbers with corresponding weights (the weight being proportional to
the frequency of occurrence of that concept). When a request is received, it