IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Thesaurus, Phrase and Hierarchy Dictionaries
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Ignoring the semi-automatic "Hastie" ADI Thesaurus-SAl and the Cran-1
Thesaurus-i (made without use of the construction rules), the dictionaries
average 594 concepts each, with 10.1 text words grouped into each concept.
Sample excerpts from three dictionaries, illustrating how similar terms
are grouped in the context of the three collections used, are given in
Fig. 2. It may be noted that a topic such as "Algebra" or
"Calculate" is grouped only with almost synonymous terms (if any exist)
when these topics are central to the collection in use, but a broader
grouping is used when these topics are more peripheral to the subject
field of the collection. Hyphenated word pairs are normally treated as a
single word and usually put with the group most closely associated; for
example "computing-machine" is put in the group which includes "computer"
rather than the group including "machine". The need to group single words
creates problems of ambiguity that are only partially solved by putting
such words into more than one group. The word "factor", for example, may
need to be grouped with "coefficient" as well as with "parameter" and
"variable", but an incoming request containing "factor" then maps into
several thesaurus groups, and only a decrease in weight resulting from the
multiple mapping is then available to attempt to minimize the effect of the
unwanted association. Some suggestions for further studies on dictionary
construction are given in part 8.
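
The effect of multiple mapping can be illustrated with a minimal sketch. The
group numbers, the toy dictionary contents, and the function name
concept_vector are hypothetical, and the rule that a word's unit weight is
shared equally among its groups is an assumption made for illustration; the
text above states only that the weight is decreased.

```python
from collections import defaultdict

# Hypothetical toy thesaurus: word -> list of concept-group numbers.
# Group numbers and memberships are invented for illustration only.
THESAURUS = {
    "computer": [101],
    "computing-machine": [101],  # hyphenated pair treated as a single word
    "machine": [102],
    "coefficient": [201],
    "parameter": [202],
    "variable": [202],
    "factor": [201, 202],        # ambiguous word placed in two groups
}


def concept_vector(words, thesaurus=THESAURUS):
    """Map text words to weighted concept groups.

    Assumption: each word occurrence carries weight 1.0, shared equally
    among all groups the word belongs to. Words not in the thesaurus
    are ignored.
    """
    vector = defaultdict(float)
    for word in words:
        groups = thesaurus.get(word.lower(), [])
        for group in groups:
            vector[group] += 1.0 / len(groups)
    return dict(vector)


if __name__ == "__main__":
    # A request containing "factor" maps into both groups at reduced weight.
    print(concept_vector(["factor", "computing-machine"]))
```

Under these assumptions, "factor" contributes only half its weight to each of
its two groups, so the unwanted association is weakened but not removed,
while "computing-machine" contributes its full weight to the "computer"
group and nothing to the "machine" group.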
3. Description of Phrase Dictionaries
Since the thesaurus dictionaries contain single words only, some kind
of phrase processing is a reasonable alternative for dictionary construction.