Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
for a total frequency of n/N, assuming that classes of approximately
equal frequency are wanted. The process of generating N classes from
P initial property sets may now be carried out as follows:
1) a P × M word-use versus property matrix (similar to that
shown in Fig. 18) is constructed;
2) the property vectors are sorted into numeric order, and the
set of P property vectors is reduced to only the distinct
property vectors, say Q1 < P;
3) since each of the distinct vectors is to account for a
word-use frequency of n/N, each vector is examined to see
whether the total frequency represented by that vector is
approximately n/N;
4) if a given concept vector occurs with a frequency smaller than
n/N, it represents too small a class and should be combined
with other vectors; this is done by deleting a sufficient
number of questions (columns of the property matrix) to obtain
a resulting combined concept class of frequency approximately
equal to n/N; let the number of distinct property vectors
which result be equal to Q2 < Q1;
5) some property vectors account for too large a frequency count,
and ought to be broken up by using the concordance to formulate
additional questions so as to resolve finer shades of meaning;
this eventually produces Q3 distinct vectors (Q3 > Q2);
6) by alternately using the procedures of parts 4) and 5), the
frequency count of each of the resulting N vectors eventually may
approach n/N, at which point the process terminates.
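
The mechanical parts of this procedure (grouping identical property
vectors, totalling their frequencies, comparing each total with n/N, and
deleting a question to merge small classes) can be illustrated with a
minimal Python sketch. It is not the program used in the report. The
function names, the 25 percent tolerance standing in for "approximately",
and the five word-uses with their frequencies are all hypothetical, chosen
only for illustration. Step 5, which requires formulating new questions
from the concordance, is a manual operation and is not modelled.

```python
from collections import defaultdict


def group_by_vector(word_uses, dropped=()):
    """Steps 1-2: total the frequency of each distinct property vector.

    word_uses: list of (name, frequency, property_vector) tuples, where
    property_vector is a tuple of 0/1 answers to the property questions.
    dropped: indices of questions (matrix columns) to delete, as in step 4.
    """
    totals = defaultdict(int)
    for _name, freq, vector in word_uses:
        key = tuple(v for i, v in enumerate(vector) if i not in dropped)
        totals[key] += freq
    # Sorting the keys puts the distinct vectors into numeric order.
    return dict(sorted(totals.items()))


def classify(totals, n_classes, tolerance=0.25):
    """Step 3: compare each vector's total frequency with the target n/N.

    tolerance is an assumed reading of "approximately"; the source gives
    no numeric threshold.
    """
    n = sum(totals.values())
    target = n / n_classes
    report = {}
    for vector, freq in totals.items():
        if freq < (1 - tolerance) * target:
            report[vector] = "too small"  # candidate for merging (step 4)
        elif freq > (1 + tolerance) * target:
            report[vector] = "too large"  # candidate for splitting (step 5)
        else:
            report[vector] = "ok"
    return target, report


# Hypothetical data: five word-uses, three yes/no questions, n = 1000.
word_uses = [
    ("u1", 120, (0, 0, 0)),
    ("u2", 130, (0, 0, 1)),
    ("u3", 250, (0, 1, 0)),
    ("u4", 240, (1, 0, 0)),
    ("u5", 260, (1, 1, 0)),
]

target, before = classify(group_by_vector(word_uses), n_classes=4)
# Step 4: delete the third question (column index 2) and regroup.
merged = group_by_vector(word_uses, dropped={2})
_, after = classify(merged, n_classes=4)

print(target)
print(before)
print(after)
```

In this made-up data the two under-sized vectors differ only in the third
question, so deleting that column combines them into a single class of
frequency 120 + 130 = 250, which matches the target n/N = 1000/4.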
Consider, as an example, the list of word-uses shown in Fig. 19 (a),
accounting for a total frequency count of 2198 word instances, and
assume that it is desired to form a thesaurus with 5 concept classes.
Each concept vector should then cover approximately 2200/5 = 440 word