ISR11 Scientific Report No. ISR-11 Information Storage and Retrieval Design Criteria for Automatic Information Systems chapter M. E. Lesk G. Salton Harvard University Gerard Salton Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

V-23

therefore comparable to thesaurus procedures, except that the word associations reflect strictly the vocabulary statistics of a given collection, whereas a thesaurus grouping may be expected to have a more general validity.

Many possible procedures exist for the generation of statistical word associations, leading to the identification of varying numbers of associated term pairs. Two main parameters are the cut-off value K in the association coefficient, below which a statistical association is not recognized, and the frequency of occurrence of the terms being correlated. When all terms are correlated, no matter how low their frequency in the document collection, a great many spurious associations may be found; on the other hand, some correct associations will not be observable under any stricter conditions. The spurious associations result initially in low precision, but the few important associations will eventually produce improved recall in the high recall region. This is reflected in the curve for the "null concon all" process (concept-concept associations performed for all word stems regardless of frequency) of Fig. 10. Increasingly more restrictive association procedures, applied first only to concepts in the frequency range 3 to 50, and then in the frequency range 6 to 100, eliminate many spurious associations, but also some correct ones. This results in a smaller initial loss in precision, but also in a poorer recall performance for high values. The output of Fig. 10 then confirms the following general rule:

Rule 6: Deep indexing procedures which supply new information identifiers, of which some are useful but many are not, usually improve recall but depress precision.

Fig. 11 exhibits the comparison between word-word association procedures