Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
The major procedures used for evaluation in the SMART system
are described elsewhere [3,4]. They are the recall-precision curve and
four global measures: rank recall, log precision, normalized recall, and
normalized precision. The measures vary from 0 to 1, with 0 representing
the worst possible performance and 1 representing perfect performance.
These measures all reflect both recall and precision, requiring both
perfect recall and perfect precision to produce a measure of 1, but the
rank recall and normalized recall measures reflect recall more than
precision, while the log precision and normalized precision measures reflect
precision more strongly than recall. The "quasi-Cleverdon" recall-precision curves shown
here are averaged recall-precision curves over the set of 42 requests.
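The four global measures can be illustrated with a short sketch. The formulas below are the definitions commonly cited for the SMART measures and are given here from general knowledge, not from this section; references [3,4] remain the authoritative source. The function name `global_measures` and its arguments are hypothetical. The input is the list of ranks at which the relevant documents were retrieved, together with the collection size.

```python
import math

def global_measures(relevant_ranks, collection_size):
    """Sketch of the four global measures for one request.

    relevant_ranks: 1-based ranks of the relevant documents in the output.
    collection_size: total number of documents ranked (N).
    Assumes 0 < n < N, where n is the number of relevant documents.
    """
    ranks = sorted(relevant_ranks)
    n, N = len(ranks), collection_size
    if not 0 < n < N:
        raise ValueError("need at least one relevant and one non-relevant document")

    ideal = range(1, n + 1)                      # ranks in a perfect ordering
    sum_ideal, sum_ranks = sum(ideal), sum(ranks)
    log_ideal = sum(math.log(i) for i in ideal)
    log_ranks = sum(math.log(r) for r in ranks)

    # log(N! / ((N - n)! n!)), computed with lgamma to avoid huge factorials
    log_binom = math.lgamma(N + 1) - math.lgamma(N - n + 1) - math.lgamma(n + 1)

    return {
        "rank_recall": sum_ideal / sum_ranks,
        # a single relevant document at rank 1 gives 0/0; treat it as perfect
        "log_precision": log_ideal / log_ranks if log_ranks > 0 else 1.0,
        "normalized_recall": 1.0 - (sum_ranks - sum_ideal) / (n * (N - n)),
        "normalized_precision": 1.0 - (log_ranks - log_ideal) / log_binom,
    }

perfect = global_measures([1, 2, 3], 200)
worst = global_measures([198, 199, 200], 200)
```

Under these definitions a perfect ordering yields 1 for every measure, and placing all relevant documents last drives the two normalized measures to 0. The logarithms in the precision measures weight errors near the top of the ranking more heavily, which is consistent with the statement that they reflect precision more strongly than recall.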
3. Results
Table 1 shows the distribution of association pairs as a function
of word frequency, with a cosine correlation at a cutoff of 0.6. The
largest number of correlations occurs for words of very low
frequency (frequencies 1 and 2). With the correlation measure used, it is
very easy for low frequency words to co-occur significantly, since, if two
words of frequency 1 occur in the same document they will always have a
correlation of 1.0. With a collection size of 200 documents, in which
1179 words occur only once, one may expect over 7000 correlations above
the cutoff between pairs of frequency-1 words purely on
a random basis. If the words of frequency 2 are also considered, the total
number of random correlations above 0.6 would be expected to be about 12000.
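The argument about low-frequency words can be checked with a small sketch. It assumes each word is represented by a vector of its occurrence counts over the documents, and, for the chance estimate, that every frequency-1 word falls in any of the documents with equal probability. Neither assumption is stated in this section, and the function names are hypothetical, so this is an illustration and not a reconstruction of the report's own calculation.

```python
import math

def cosine(u, v):
    """Cosine correlation between two word-occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expected_chance_pairs(num_words, num_docs):
    """Expected number of unordered pairs of frequency-1 words that land in
    the same document, assuming uniform random placement (an assumption)."""
    pairs = num_words * (num_words - 1) // 2
    return pairs / num_docs

# Two frequency-1 words in the same document: correlation is exactly 1.0.
same_doc = cosine([0, 1, 0], [0, 1, 0])

# A frequency-2 word sharing one document with a frequency-1 word
# gives 1/sqrt(2), about 0.71, which is still above the 0.6 cutoff.
freq2_vs_freq1 = cosine([1, 1, 0], [1, 0, 0])

unordered = expected_chance_pairs(1179, 200)   # about 3472 unordered pairs
ordered = 2 * unordered                        # about 6944 if each pair is counted twice
```

Under the uniform assumption the estimate is roughly 3472 unordered pairs, or roughly 6944 if each pair is counted in both directions. The counting convention and the distribution of document lengths behind the figure quoted in the text are not given here; uneven document lengths would raise the chance of co-occurrence above the uniform value.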
It is clear therefore that the 18000 correlations observed do not actually