ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Given such a term-document matrix or graph, it is now possible, by
well-known statistical association methods, to compute similarity
coefficients between terms, based on co-occurrence characteristics of
the terms in the documents of the collection. The similarity coefficient
between each pair of terms can then be made to depend on the frequency
with which the terms are jointly assigned to the documents of a collection.
In Fig. 15, for example, it may be noted that terms T1 and T6 are
both assigned to documents D1 and D[?] (although with differing weights),
while neither is assigned to documents D2 or D3. As a result,
the term association process may assign these two terms to a common
thesaurus category.
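
As a minimal sketch of this computation, the following Python fragment derives pairwise term similarities from a small weighted term-document matrix. The matrix weights are invented for illustration and are not those of Fig. 15. The cosine coefficient is assumed here as one of several possible association measures, since the text does not fix a particular one, and the names term_doc, cosine and term_similarities are hypothetical.

```python
import math

# Hypothetical weighted term-document matrix (rows: terms, columns: D1..D4).
# These weights are invented for illustration; they are NOT the values of Fig. 15.
term_doc = {
    "T1": [2, 0, 0, 1],
    "T2": [0, 0, 3, 1],
    "T6": [1, 0, 0, 2],
}

def cosine(u, v):
    """Cosine coefficient between two term vectors (an assumed choice of measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a term assigned to no document is treated as similar to nothing
    return dot / (norm_u * norm_v)

def term_similarities(matrix):
    """Return the similarity coefficient for every unordered pair of terms."""
    terms = sorted(matrix)
    return {
        (s, t): cosine(matrix[s], matrix[t])
        for i, s in enumerate(terms)
        for t in terms[i + 1:]
    }

similarities = term_similarities(term_doc)
```

With these invented weights, T1 and T6 are jointly assigned to two documents with differing weights and obtain a coefficient of approximately 0.8, while T1 and T2, which share only one document, score much lower.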
For the example of Fig. 15 an associative procedure might result in
the formation of three term (thesaurus) groups, consisting respectively
of terms T1 and T6 (because of joint assignment to documents D1
and D[?]), terms T7 and T[?] (because of joint assignment to D1 and
D2), and finally terms T2, T3 and T5 (because of joint assignment
to D3 and D[?]). The result of a term association process may then be
displayed as an association map, in which branches between terms represent
term relations, or, alternatively, thesaurus groupings. An excerpt from
a typical term association map is shown in Fig. 16. [?,7,8] The thesaurus
groupings suggested by the map of Fig. 16 can be found by inspection.
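
The step from similarity coefficients to an association map and then to term groups can be sketched as follows. This is only one possible reading: the coefficients are invented, the threshold of 0.5 is arbitrary, and taking the connected components of the map as thesaurus groups is an assumed grouping rule rather than the procedure of the report. The names association_map and thesaurus_groups are hypothetical.

```python
# Hypothetical similarity coefficients between term pairs (invented values).
similarities = {
    ("T1", "T6"): 0.8,
    ("T2", "T3"): 0.7,
    ("T3", "T5"): 0.6,
    ("T1", "T2"): 0.1,
    ("T5", "T6"): 0.2,
}

def association_map(sims, threshold):
    """Keep a branch between two terms when their coefficient meets the threshold."""
    graph = {}
    for (s, t), value in sims.items():
        graph.setdefault(s, set())
        graph.setdefault(t, set())
        if value >= threshold:
            graph[s].add(t)
            graph[t].add(s)
    return graph

def thesaurus_groups(graph):
    """Group terms by connected component of the map (an assumed grouping rule)."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            term = stack.pop()
            if term in group:
                continue
            group.add(term)
            stack.extend(graph[term] - group)  # follow branches to unvisited terms
        seen |= group
        groups.append(sorted(group))
    return groups

groups = thesaurus_groups(association_map(similarities, threshold=0.5))
```

A stricter rule, such as requiring every pair of terms in a group to be directly linked, would generally yield smaller groups than the connected components used here.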
B) Semi-Automatic Methods
The methods outlined in the preceding part are based on the assumption
that term co-occurrences in documents, or joint assignment of terms to
documents, are indicative of term similarity or relatedness. This assumption