ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
a word and a multiplicity of concepts, since the former presumably
implies a unique meaning for that word while the latter implies ambiguity.
Even if almost all terms used in a given context are inherently
ambiguous, the juxtaposition of many multiple mappings can often identify
the appropriate concept classes with reasonable accuracy. The relevant
categories will normally be reinforced, since they apply to many terms,
while the extraneous categories will be randomly distributed.
Consider, for example, the set of terms: "base", "bat", [illegible],
"hit". Each term is ambiguous, and a given multiple thesaurus mapping
may specify the correspondences shown in Table I. In that table, three
categories are shown for the word "base", and two categories for each of
the other terms. Despite the apparent ambiguities, a document identified
by the four original terms can nevertheless be assigned to the "baseball"
class with reasonable expectation of success, since the other categories
occur more or less at random for the given terms, whereas the "baseball"
class is always present.
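
The reinforcement argument above can be illustrated by a short sketch. It counts, for each category, how many of the document's terms map to it, and picks the category with the highest count. The mapping below is a hypothetical stand-in and not the contents of Table I: the category names other than "baseball", the fourth term ("glove"), and the function name `rank_categories` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical multiple thesaurus mapping (NOT the actual Table I).
# "base" has three categories, every other term has two, and
# "baseball" is the only category shared by all terms.
THESAURUS = {
    "base": ["baseball", "military", "chemistry"],
    "bat": ["baseball", "animal"],
    "glove": ["baseball", "clothing"],   # assumed fourth term
    "hit": ["baseball", "collision"],
}


def rank_categories(terms, thesaurus):
    """Count how many of the given terms map to each category."""
    counts = Counter()
    for term in terms:
        # set() ensures a category is counted at most once per term.
        for category in set(thesaurus.get(term, [])):
            counts[category] += 1
    # Categories sorted from most to least reinforced.
    return counts.most_common()


ranking = rank_categories(["base", "bat", "glove", "hit"], THESAURUS)
best_category, best_count = ranking[0]
print(best_category, best_count)
```

In this sketch the shared category is reinforced by all four terms, while each extraneous category is supported by a single term only. A simple majority count is one possible selection rule; the source does not specify how the reinforcement is to be computed.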
The principal advantages of synonym and phrase dictionaries for
purposes of content identification may then be summarized as follows:
1) they permit a consistent assignment of concept classes to items
of information, thereby replacing either keywords and index
terms assigned to documents and search requests, or the words
occurring in them;
2) they can often be used to resolve ambiguities by looking at
the pattern of occurrence of the concepts;
3) they can serve for the analysis of many different subject
fields and for different types of usage, since it is possible
to adapt the dictionary to the particular search environment.
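
The first advantage listed above, the replacement of words by concept classes, may be sketched as follows. The dictionary entries, the numeric class identifiers, and the function name `concept_classes` are hypothetical and chosen only for illustration; they are not taken from any dictionary described in this report.

```python
# Hypothetical synonym dictionary: word -> concept class identifier.
SYNONYMS = {
    "retrieval": 101, "search": 101, "lookup": 101,
    "document": 102, "text": 102, "article": 102,
    "automatic": 103, "mechanized": 103,
}


def concept_classes(text, dictionary):
    """Return the set of concept classes for the words found in the dictionary."""
    words = text.lower().split()
    return {dictionary[w] for w in words if w in dictionary}


doc_text = "Mechanized search of article collections"
query_text = "automatic document retrieval"

doc = concept_classes(doc_text, SYNONYMS)
query = concept_classes(query_text, SYNONYMS)

# The raw word sets are disjoint, yet the concept classes coincide.
shared_words = set(doc_text.lower().split()) & set(query_text.lower().split())
shared_classes = doc & query
print(shared_words, shared_classes)
```

Under these assumptions the document and the request have no word in common but are assigned identical concept classes. Adapting to a different search environment, as in the third point, would amount to passing a different dictionary to the same function.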