IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Thesaurus, Phrase and Hierarchy Dictionaries
chapter
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
the quasi-synonym list used in the Aslib Cranfield Project [1].
4. CRAN-1 Thesaurus-2. Known also as the "New Quasi-Synonym"
dictionary. This dictionary was constructed by rearranging
the word groups and incorporating additional words into the
old quasi-synonym dictionary, using five specified rules for
dictionary construction [2].
5. CRAN-1 Thesaurus-3. Known also as the "Revised New Quasi-
Synonym" dictionary, this revision was made primarily to
permit processing of the larger CRAN-2 collection, and also
involved some small changes in the grouping of the words.
6. ADI Thesaurus-1. Known also as a "regular thesaurus", this
handmade dictionary was constructed for use with the full
text ADI collection.
7. ADI Thesaurus-SA1. Known also as the "Hastie" dictionary,
this represents an attempt to use the semi-automatic
procedures suggested in [2].
Some discussion of the construction expertise that has been gained
by experience is contained in a number of previous reports [2,3,4,5,6,7,8].
Synonyms and other less closely related words are grouped subjectively in
the case of manually constructed dictionaries, and the effectiveness of a
particular dictionary can be determined by comparing the retrieval
performance it yields for a set of search requests with the performance
obtained with a stem dictionary. The main objective data that can be derived
from a thesaurus construction algorithm are the amount of word grouping,
measured by the average number of distinct natural language text words that
are grouped into a thesaurus concept, and the amount of overlap or
ambiguity, measured by the number of words that appear in more than one
concept group. These data are given for the seven dictionaries in Fig. 1.
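
The two objective measures just described can be illustrated with a minimal sketch. It assumes a thesaurus is represented as a mapping from concept number to a list of words. That representation, the function name grouping_stats, and the toy entries are hypothetical and are not taken from any of the seven dictionaries. The average is computed here as mean group size, which is only one plausible reading of "average number of distinct words per concept".

```python
from collections import Counter


def grouping_stats(thesaurus):
    """Summarize word grouping and overlap for a thesaurus.

    `thesaurus` maps a concept number to an iterable of words
    (a hypothetical representation chosen for this sketch).
    """
    # Deduplicate words inside each concept group.
    groups = {concept: set(words) for concept, words in thesaurus.items()}

    # Count how many concept groups each word appears in.
    membership = Counter(word for words in groups.values() for word in words)

    concepts = len(groups)
    total_entries = sum(len(words) for words in groups.values())

    return {
        "concepts": concepts,
        "distinct_words": len(membership),
        # Amount of grouping: mean group size. A word that belongs to
        # several groups is counted once in each of them.
        "avg_words_per_concept": total_entries / concepts if concepts else 0.0,
        # Overlap or ambiguity: words appearing in more than one group.
        "ambiguous_words": sorted(w for w, n in membership.items() if n > 1),
    }


# Toy data, invented for illustration only.
toy = {
    101: ["wing", "airfoil", "aerofoil"],
    102: ["flow", "stream"],
    103: ["stream", "jet", "wake"],
}

stats = grouping_stats(toy)
print(stats)
```

In this toy case "stream" falls into two concept groups, so it is the only word counted toward overlap. A stem dictionary, by contrast, would place each word stem in exactly one group and show no overlap under this measure.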