Scientific Report No. IRS-13 Information Storage and Retrieval

Chapter: Thesaurus, Phrase and Hierarchy Dictionaries
E. M. Keen
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

the quasi-synonym list used in the Aslib Cranfield Project [1].

4. CRAN-1 Thesaurus-2. Known also as the "New Quasi-Synonym" dictionary. This dictionary was constructed by rearranging the word groups and incorporating additional words into the old quasi-synonym dictionary, using five specified rules for dictionary construction [2].

5. CRAN-1 Thesaurus-3. Known also as the "Revised New Quasi-Synonym" dictionary, this revision was made primarily to permit processing of the larger CRAN-2 collection, and involved also some small changes in grouping of the words.

6. ADI Thesaurus-1. Known also as a "regular thesaurus", this handmade dictionary was constructed for use with the full text ADI collection.

7. ADI Thesaurus-SA1. Known also as the "Hastie" dictionary, this represents an attempt to use the semi-automatic procedures suggested in [2].

Some discussion of the construction expertise that has been gained by experience is contained in a number of previous reports [2,3,4,5,6,7,8]. Synonyms and other less closely related words are grouped subjectively in the case of manually constructed dictionaries, and the effectiveness of a particular dictionary can be determined by comparing the resulting retrieval performance for a set of search requests with the performance obtained with a stem dictionary.

The main objective data that can be derived from a thesaurus construction algorithm is the amount of word grouping, measured by the average number of distinct natural language text words that are grouped into a thesaurus concept, and also the amount of overlap or ambiguity, measured by the number of words that appear in more than one concept group.
This data is given for the seven dictionaries in Fig. 1.
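
The two measures described above can be illustrated with a minimal sketch. It assumes a thesaurus is held as a mapping from a concept number to the words grouped under it; the function name grouping_stats, the concept numbers and the words in the toy dictionary are all hypothetical and are not taken from any of the seven dictionaries. The average is interpreted here as the mean number of distinct words per concept group, which is one plausible reading of the definition.

```python
from collections import Counter


def grouping_stats(thesaurus):
    """Compute word-grouping and overlap measures for a thesaurus.

    thesaurus: dict mapping a concept number to an iterable of words.
    This layout is an assumption made for illustration only.
    """
    # Treat each concept group as a set so repeated words count once.
    groups = {concept: set(words) for concept, words in thesaurus.items()}

    # For every word, count how many concept groups contain it.
    memberships = Counter(word for words in groups.values() for word in words)

    n_concepts = len(groups)
    total_entries = sum(len(words) for words in groups.values())

    return {
        "concepts": n_concepts,
        "distinct_words": len(memberships),
        # Amount of word grouping: mean number of words per concept.
        "avg_words_per_concept": total_entries / n_concepts if n_concepts else 0.0,
        # Amount of overlap: words that appear in more than one concept.
        "overlapping_words": sum(1 for n in memberships.values() if n > 1),
    }


# Hypothetical toy dictionary, invented for this example.
toy = {
    101: ["lift", "drag", "force"],
    102: ["wing", "airfoil"],
    103: ["force", "load", "stress", "drag"],
}

stats = grouping_stats(toy)
print(stats)
```

For this toy input the three groups hold 3, 2 and 4 words, so the mean is 3.0 words per concept, and two words ("drag" and "force") fall in more than one group.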