ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Design Criteria for Automatic Information Systems
M. E. Lesk
G. Salton
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
therefore comparable to thesaurus procedures, except that the word associa-
tions reflect strictly the vocabulary statistics of a given collection,
whereas a thesaurus grouping may be expected to have a more general validity.
Many possible procedures exist for the generation of statistical word
associations, leading to the identification of varying numbers of associated
term pairs. Two main parameters are the cut-off value K in the association
coefficient below which a statistical association is not recognized, and the
frequency of occurrence of the terms being correlated. When all terms are
correlated, no matter how low their frequency in the document collection,
a great many spurious associations may be found; on the other hand, some
correct associations will not be observable under any stricter conditions.
The spurious associations result initially in low precision, but the few
important associations will eventually produce improved recall in the high
recall region. This is reflected in the curve for the "null concon all"
process (concept-concept associations performed for all word stems regardless
of frequency) of Fig. 10.
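
The two parameters named above, the cut-off value K and the frequency of the
terms being correlated, can be illustrated with a minimal sketch. Several
things here are assumptions made for illustration and are not taken from the
report: the cosine coefficient over binary document-occurrence vectors as the
association coefficient, the use of document frequency as the term frequency,
the function names, and the four toy documents.

```python
from itertools import combinations
from math import sqrt


def postings_by_term(docs):
    """Map each term to the set of document indices in which it occurs."""
    postings = {}
    for index, doc in enumerate(docs):
        for term in set(doc):
            postings.setdefault(term, set()).add(index)
    return postings


def associations(docs, k, min_freq=1, max_freq=None):
    """Return {(term_a, term_b): coefficient} for pairs with coefficient >= k.

    The coefficient is a cosine over binary occurrence vectors (an assumed
    choice). Only terms whose document frequency lies in the range
    [min_freq, max_freq] are correlated; max_freq=None means no upper limit.
    """
    postings = postings_by_term(docs)
    eligible = sorted(
        term for term, occ in postings.items()
        if len(occ) >= min_freq and (max_freq is None or len(occ) <= max_freq)
    )
    result = {}
    for a, b in combinations(eligible, 2):
        shared = len(postings[a] & postings[b])
        if shared == 0:
            continue
        coefficient = shared / sqrt(len(postings[a]) * len(postings[b]))
        if coefficient >= k:  # associations below the cut-off are not recognized
            result[(a, b)] = coefficient
    return result


# Toy collection, invented for illustration only.
docs = [
    ["wing", "lift", "flow"],
    ["wing", "lift"],
    ["wing", "flow", "shock"],
    ["shock", "plate"],
]

all_terms = associations(docs, k=0.6)               # every term correlated
restricted = associations(docs, k=0.6, min_freq=2)  # rare terms excluded
print(sorted(all_terms))
print(sorted(restricted))
```

In this toy collection the pair ("plate", "shock") passes the cut-off on the
strength of a single shared document when all terms are correlated, and
disappears once terms occurring in fewer than two documents are excluded.
Whether such a pair is spurious or correct cannot be decided from the
statistics alone, which is the trade-off described in the text.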
Increasingly more restrictive association procedures, applied first
only to concepts in the frequency range 3 to 50, and then in the frequency
range 6 to 100, eliminate many spurious associations, but also some correct
ones. This results in a smaller initial loss in precision, but also in a
poorer performance in the high recall region. The output of Fig. 10 then
confirms the following general rule:
Rule 6: Deep indexing procedures which supply new information
identifiers of which some are useful but many are not
usually improve recall but depress precision.
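
The effect stated in Rule 6 can be reproduced on a small scale. The sketch
below is a simplification under stated assumptions: retrieval is unranked
(a document is retrieved if it shares any term with the query), whereas the
curves of Fig. 10 come from ranked output; the six documents, the relevance
judgments, and the added identifiers are invented for illustration.

```python
def retrieve(docs, query):
    """Return indices of documents sharing at least one term with the query."""
    return {index for index, doc in enumerate(docs) if doc & query}


def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy collection and relevance judgments, invented for illustration only.
docs = [
    {"wing", "lift"},       # relevant
    {"wing", "drag"},       # relevant
    {"airfoil", "lift"},    # relevant, but does not contain "wing"
    {"airfoil", "paint"},   # not relevant
    {"feather", "bird"},    # not relevant
    {"engine"},             # not relevant
]
relevant = {0, 1, 2}

original_query = {"wing"}
# Hypothetical added identifiers: "airfoil" is useful, "feather" is not.
expanded_query = original_query | {"airfoil", "feather"}

for name, query in [("original", original_query), ("expanded", expanded_query)]:
    p, r = precision_recall(retrieve(docs, query), relevant)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
# original: precision=1.00 recall=0.67
# expanded: precision=0.60 recall=1.00
```

The useful identifier reaches the one relevant document that the original
query missed, raising recall, while both added identifiers also pull in
non-relevant documents, lowering precision.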
Fig. 11 exhibits the comparison between word-word association procedures