ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
grouped by judiciously eliminating certain properties (certain columns
in the property matrix); alternatively, terms which have already been
grouped may be split apart by introducing new properties to differentiate
them. For example, if property P3 is removed from the property matrix
of Fig. 18, terms T[OCRerr] and T3 will both exhibit the same set of
assigned properties (although with differing weights), and may therefore
be grouped. Similarly, the removal of property P1 results in the
grouping of terms T[OCRerr] and T[OCRerr].
In practice, it may be useful to consider more formal methods first
for comparing the rows of the property matrix (that is, for computing a
similarity coefficient between each pair of terms), and then for generating
term clusters.
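
The two steps just named, computing a similarity coefficient between rows of the property matrix and then forming term clusters, can be illustrated by a minimal sketch. The text does not fix a particular coefficient or clustering rule, so the cosine coefficient, the single-link grouping, the threshold value, and the matrix entries below are all assumptions chosen for illustration; the matrix is not that of Fig. 18.

```python
from math import sqrt

# Hypothetical weighted property matrix: rows are terms, columns are
# properties. The weights are invented and are not those of Fig. 18.
PROPERTY_MATRIX = {
    "T1": [2, 0, 1, 0],
    "T2": [4, 0, 2, 0],  # same properties as T1, differing weights
    "T3": [0, 3, 0, 1],
    "T4": [0, 1, 0, 2],
}


def cosine(u, v):
    """Cosine coefficient between two property vectors (an assumed choice)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_terms(matrix, threshold):
    """Group terms whose pairwise similarity reaches the threshold.

    Single-link grouping (assumed): a term joins a cluster when it is
    similar enough to at least one member of that cluster.
    """
    terms = sorted(matrix)
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the representative of t's cluster.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            if cosine(matrix[a], matrix[b]) >= threshold:
                parent[find(b)] = find(a)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return sorted(clusters.values())


CLUSTERS = cluster_terms(PROPERTY_MATRIX, threshold=0.9)
```

With these invented weights, T1 and T2 carry the same properties with differing weights and fall into one cluster at a threshold of 0.9, while T3 and T4 remain separate; lowering the threshold to 0.7 merges T3 and T4 as well.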
a) Sample Thesaurus Generation
The procedure previously outlined may now be summarized as follows:
automatic methods are used to prepare a word frequency list, as well as
a concordance, for the principal words included in a sample document
collection. A decision is then made concerning the number of word-uses to
be included in the thesaurus for each distinct word, and discriminating
questions are prepared to serve for purposes of word classification. The
property lists which result are then compared, and word-uses which are
identified by similar property lists are assigned to the same thesaurus
category. [9,11[OCRerr]]
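
The first step of the procedure summarized above, the automatic preparation of a word frequency list and a concordance, might be sketched as follows. This is only an illustration under stated assumptions: the stop list of common words, the tokenizing rule, the context width, and the two sample sentences are hypothetical and do not come from the collection discussed in the report.

```python
import re
from collections import Counter, defaultdict

# Assumed stop list of common words; a real list would be far longer.
COMMON_WORDS = {"the", "of", "a", "is", "in", "and", "for", "to"}


def tokenize(text):
    """Split text into lower-case alphabetic words (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_list(documents):
    """Return (word, count) pairs for non-common words, most frequent first."""
    counts = Counter()
    for doc in documents:
        counts.update(w for w in tokenize(doc) if w not in COMMON_WORDS)
    return counts.most_common()


def concordance(documents, width=2):
    """Map each non-common word to a list of (document index, context)."""
    entries = defaultdict(list)
    for doc_id, doc in enumerate(documents):
        words = tokenize(doc)
        for i, w in enumerate(words):
            if w in COMMON_WORDS:
                continue
            context = " ".join(words[max(0, i - width): i + width + 1])
            entries[w].append((doc_id, context))
    return dict(entries)


# Two invented sentences standing in for a sample document collection.
DOCS = [
    "The retrieval of stored information is the base of the system.",
    "A number base of two is used in the storage system.",
]
FREQ = frequency_list(DOCS)
CONC = concordance(DOCS)
```

In this sample the concordance entry for "base" shows two contexts, "is the base of the" and "a number base of two", which is the kind of evidence on which a decision about the number of word-uses per distinct word could be made.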
Consider a typical example in which P non-common word-uses are
identified by M properties, and it is desired to create a thesaurus
with N concept groups.[14] If the P word-uses have a total frequency
of occurrence of n in the collection, each thesaurus class should account