ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
It has been remarked in this connection, that when words, or word-
uses, of unequal frequency are included in a thesaurus, or represented
on an association map of the type shown in Fig. 16, a hierarchical
arrangement results almost inevitably, since frequent words can be made
into categories, and words of lesser frequency into subcategories.
Hierarchical association maps have in fact been constructed, using the
frequency characteristics of the words as a criterion.[15] In any case,
no matter what procedure is actually adopted, it would seem that a useful
hierarchy which places general concepts near the top of the tree, and
specific ones near the bottom, must exhibit the expected frequency
characteristics which generally hold between broad and specific terms.
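
The frequency condition described above can be checked mechanically. The following is a minimal sketch, assuming that "expected frequency characteristics" means that no term occurs more often than the broader term directly above it. The function name, the nested-dictionary representation of the hierarchy, and the sample terms and counts are all hypothetical and are not taken from the report.

```python
def frequency_violations(tree, freq, parent=None):
    """Return (parent, child) pairs in which the child is more frequent
    than its parent, i.e. where the assumed broad-over-specific
    frequency ordering fails."""
    violations = []
    for term, children in tree.items():
        if parent is not None and freq[term] > freq[parent]:
            violations.append((parent, term))
        # Descend into the subtree rooted at this term.
        violations.extend(frequency_violations(children, freq, term))
    return violations


# Hypothetical hierarchy (nested dicts) and occurrence counts,
# invented purely for illustration.
tree = {"computer": {"memory": {"core": {}, "drum": {}}, "processor": {}}}
freq = {"computer": 120, "memory": 45, "core": 12, "drum": 60, "processor": 30}

violations = frequency_violations(tree, freq)
print(violations)
```

In this invented example, "drum" is more frequent than its parent "memory", so that pair is reported as a candidate for rearrangement.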
Since the construction of a complete hierarchy without any guidelines
is at the least a thankless task, and at worst an impossible one, methods
must be investigated to generate hierarchical arrangements semi-automatically.
Three different procedures are outlined, all of which are based on a term-
property matrix of the type shown in Fig. 18, or a term-document matrix
as shown in Fig. 15 (a).
The first process directly uses the questions also used for thesaurus
construction, and breaks down the initial vocabulary as a function of the
responses elicited. An initial question is asked first, and classes of
word-uses are formed based on the responses to this question; the next
question is then applied to each of the resulting word classes which are
thereby broken down again, and so on, until the subdivision is sufficiently
fine.
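
The successive subdivision just described can be sketched as a short recursive routine. This is an illustration under stated assumptions, not the procedure as implemented for the report: the term-property matrix is assumed to be a dictionary mapping each word-use to its response for each question, "sufficiently fine" is assumed to mean that a class holds at most `max_leaf_size` word-uses, and the vocabulary, questions, and responses are invented rather than taken from Fig. 19.

```python
from collections import defaultdict


def build_hierarchy(word_uses, questions, responses, max_leaf_size=1):
    """Split word_uses by the response to the first question, then apply
    the remaining questions to each resulting class in turn."""
    # Stop when no questions remain or the class is small enough
    # (the size threshold is an assumed stopping rule).
    if not questions or len(word_uses) <= max_leaf_size:
        return sorted(word_uses)
    first, rest = questions[0], questions[1:]
    classes = defaultdict(list)
    for w in word_uses:
        classes[responses[w][first]].append(w)
    return {
        f"{first}={answer}": build_hierarchy(members, rest, responses, max_leaf_size)
        for answer, members in sorted(classes.items())
    }


# Hypothetical questions and responses, invented for illustration.
questions = ["concrete?", "animate?"]
responses = {
    "dog":   {"concrete?": "yes", "animate?": "yes"},
    "stone": {"concrete?": "yes", "animate?": "no"},
    "idea":  {"concrete?": "no",  "animate?": "no"},
    "cat":   {"concrete?": "yes", "animate?": "yes"},
}

hierarchy = build_hierarchy(list(responses), questions, responses)
print(hierarchy)
```

Each dictionary key records the question and response that define a class, and each list is a class that was not subdivided further. In this sketch the questions are applied in a fixed order, so a different ordering would generally yield a different tree.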
The process is applied to the vocabulary of Fig. 19 (a) in conjunction
with the questions of Fig. 19 (b). The resulting hierarchy is shown in
Fig. 20, with the word-use frequency attached to each node.