ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
must often be eliminated (but some words, such as "can", may
occur both as significant and non-significant words);
2) many distinct words may be used to supply the same or related
meanings; such synonymous words or expressions must be recognized
if an accurate content analysis of documents and search requests
is to be undertaken;
3) many words can be used in several different senses depending
on the context (for example, a word like "base" may variously
represent military bases, lamp bases, bases in baseball, and
so on); it is important to identify such homographs, and if
possible to recognize the proper meaning in a given context;
4) many types of syntactic equivalences occur in the language,
where completely different constructions are used to represent
the same general idea; as an extension of the overall synonym
problem, it is important to recognize at least the principal
types of syntactic paraphrasing;
5) the use of indirect references is prevalent in the natural
language, where pronouns, collective names, and other particles
are used to refer to entities presumably known by the context;
the identification of the proper antecedents of such pronouns
is difficult, particularly for cases where many different
words can operate as antecedents;
6) relations may exist between words which are not explicitly
contained in the text, but which can be deduced from the context,
or from other texts previously analyzed; the identification of
such relations requires deductive capabilities of considerable
power;
7) the meaning of many words may change with time, or contrariwise,
new words may be created to refer to entities previously referred
to in different terms (for example, the unit of time previously
known as "millimicrosecond" is now generally known as "nanosecond").
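
Three of the problems listed above (common-word elimination, synonym
recognition, and homograph detection) can be illustrated with a small
dictionary lookup. The following is a minimal sketch only, not the
procedure of this report: the word lists, concept labels, and the
function name analyze are hypothetical and were invented for the
example, and the context test for homographs is a deliberately crude
assumption.

```python
# All tables below are hypothetical examples, not taken from the report.

# Words treated as non-significant and dropped from the analysis.
COMMON_WORDS = {"the", "of", "a", "is", "on", "in", "as", "now", "known"}

# Synonym dictionary: several distinct words map to one concept label.
THESAURUS = {
    "millimicrosecond": "NANOSECOND",
    "nanosecond": "NANOSECOND",
}

# Homograph table: word -> {context cue word: concept label}.
HOMOGRAPHS = {
    "base": {
        "military": "MILITARY_BASE",
        "lamp": "LAMP_BASE",
        "baseball": "BASEBALL_BASE",
    },
}


def analyze(text):
    """Return the concept labels assigned to the significant words of text."""
    # Split on whitespace, strip surrounding punctuation, fold to lower case.
    words = [w.strip('.,;()"').lower() for w in text.split()]
    words = [w for w in words if w]

    concepts = []
    for word in words:
        if word in COMMON_WORDS:
            continue  # eliminate common words
        if word in HOMOGRAPHS:
            senses = HOMOGRAPHS[word]
            # Assumed heuristic: pick a sense only if exactly one cue word
            # for it occurs somewhere in the same text.
            cues = [cue for cue in senses if cue in words]
            if len(cues) == 1:
                concepts.append(senses[cues[0]])
            else:
                concepts.append("AMBIGUOUS:" + word)
        else:
            # Replace synonyms by their concept; keep unknown words as-is.
            concepts.append(THESAURUS.get(word, word))
    return concepts


print(analyze("the millimicrosecond is now known as the nanosecond"))
print(analyze("the base of a lamp"))
print(analyze("the base"))
```

The sketch does not address the remaining problems in the list
(syntactic paraphrase, indirect reference, implied relations), which
cannot be handled by word-by-word lookup alone.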
If the natural language is used as primary input to an information