Scientific Report No. IRS-13 Information Storage and Retrieval

Word-Word Associations in Document Retrieval Systems

M. E. Lesk

Harvard University

Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

The major procedures used for evaluation in the SMART system are described elsewhere. [3,4] They are the recall-precision curve and four global measures: rank recall, log precision, normalized recall, and normalized precision. Each measure varies from 0 to 1, with 0 representing the worst possible performance and 1 representing perfect performance. All of these measures reflect both recall and precision, and a value of 1 requires both perfect recall and perfect precision. The rank recall and normalized recall measures, however, reflect recall more than precision, while the log precision and normalized precision measures reflect precision more strongly than recall. The "quasi-Cleverdon" recall-precision curves shown here are recall-precision curves averaged over the set of 42 requests.

3. Results

Table 1 shows the distribution of association pairs as a function of word frequency, with a cosine correlation at a cutoff of .6. The largest number of correlations occurs for words of very low frequency (frequencies 1 and 2). With the correlation measure used, it is very easy for low-frequency words to co-occur significantly, since two words of frequency 1 that occur in the same document will always have a correlation of 1.0. With a collection size of 200 documents, in which 1179 words occur only once, one may expect over 7000 correlations above cutoff between words of frequency 1 purely on a random basis. If the words of frequency 2 are also considered, the total number of random correlations above .6 would be expected to be about 12000. It is clear therefore that the 18000 correlations observed do not actually