ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
On the negative side, dictionaries are often difficult to construct,
particularly if the environment within which they are expected to operate
is subject to change; furthermore, most dictionaries are useless unless
their mode of usage is consistent for all operations. Obviously, if a
dictionary is used in one way for information classification and in
another for information searching, an effective result cannot be guaranteed.
Various thesaurus types are examined in more detail in the next few
paragraphs.
3. Dictionary Construction
A) The Synonym Dictionary (Thesaurus)
As previously explained, a thesaurus is a grouping of words, or word
stems, into certain subject categories, hereafter called concept classes.
A typical example is shown in Fig. 1, where the concept classes are
represented by three-digit numbers, and the individual entries are shown
under each concept number. In Fig. 2, a similar thesaurus arrangement is
shown in alphabetical order of the words included. The concept numbers
appear in the middle column of Fig. 2 (concept numbers over 32,000 are
attached to "common" words which are not accepted as information
identifiers); the last column consists of one or more three-digit syntax codes
attached to the words to be used for purposes of syntactic analysis.
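
The two arrangements described above (entries listed under each concept
number, and words listed alphabetically with their concept numbers and
syntax codes) can be illustrated by a minimal Python sketch. The sample
words, concept numbers, and syntax codes below are invented for illustration
and are not taken from Fig. 1 or Fig. 2; the names ThesaurusEntry,
concept_classes, and normalize are hypothetical, and the possibility of a
word carrying more than one concept number is an assumption. Only the
32,000 cutoff for common words comes from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# From the text: concept numbers over 32,000 mark "common" words
# that are not accepted as information identifiers.
COMMON_WORD_THRESHOLD = 32000


@dataclass(frozen=True)
class ThesaurusEntry:
    concepts: Tuple[int, ...]      # concept numbers (assumed: possibly several)
    syntax_codes: Tuple[str, ...]  # three-digit syntax codes


# Alphabetical arrangement (in the manner of Fig. 2).
# All entries are hypothetical sample data.
THESAURUS: Dict[str, ThesaurusEntry] = {
    "document": ThesaurusEntry((245,), ("070",)),
    "retrieve": ThesaurusEntry((101,), ("043",)),
    "search": ThesaurusEntry((101,), ("043", "070")),
    "the": ThesaurusEntry((32001,), ("001",)),
}


def is_identifier(concept: int) -> bool:
    """True if the concept number denotes an accepted information identifier."""
    return concept <= COMMON_WORD_THRESHOLD


def concept_classes(thesaurus: Dict[str, ThesaurusEntry]) -> Dict[int, List[str]]:
    """Invert the alphabetical listing into a listing by concept number
    (in the manner of Fig. 1), leaving out common words."""
    classes: Dict[int, List[str]] = {}
    for word in sorted(thesaurus):
        for concept in thesaurus[word].concepts:
            if is_identifier(concept):
                classes.setdefault(concept, []).append(word)
    return classes


def normalize(words: List[str], thesaurus: Dict[str, ThesaurusEntry]) -> List[int]:
    """Replace each word by its concept numbers, dropping common words
    and words that do not appear in the thesaurus."""
    result: List[int] = []
    for word in words:
        entry = thesaurus.get(word.lower())
        if entry is None:
            continue
        result.extend(c for c in entry.concepts if is_identifier(c))
    return result
```

With this sample data, "retrieve" and "search" fall into the same concept
class, so a text containing either word is assigned the same concept number;
this is the sense in which a thesaurus normalizes the vocabulary.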
When constructing a thesaurus to be used for vocabulary normalization,
one immediately faces three types of problems: first, what words should
one include in the thesaurus; second, what type of synonym categories
should one use (that is, should one aim for broad, inclusive concept
classes, or should the classes be narrow and specific); finally, where