ISR11 Scientific Report No. ISR-11 Information Storage and Retrieval Information Analysis and Dictionary Construction chapter G. Salton M. E. Lesk Harvard University Gerard Salton Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

must often be eliminated (but some words, such as "can", may occur both as significant and non-significant words);
2) many distinct words may be used to supply the same or related meanings; such synonymous words or expressions must be recognized if an accurate content analysis of documents and search requests is to be undertaken;
3) many words can be used in several different senses depending on the context (for example, a word like "base" may variously represent military bases, lamp bases, bases in baseball, and so on); it is important to identify such homographs, and if possible to recognize the proper meaning in a given context;
4) many types of syntactic equivalences occur in the language, where completely different constructions are used to represent the same general idea; as an extension of the overall synonym problem, it is important to recognize at least the principal types of syntactic paraphrasing;
5) the use of indirect references is prevalent in the natural language, where pronouns, collective names, and other particles are used to refer to entities presumably known from the context; the identification of the proper antecedents of such pronouns is difficult, particularly for cases where many different words can operate as antecedents;
6) relations may exist between words which are not explicitly contained in the text, but which can be deduced from the context, or from other texts previously analyzed; the identification of such relations requires deductive capabilities of considerable power;
7) the meaning of many words may change with time, or contrariwise, new words may be created to refer to entities previously referred to in different
terms (for example, the unit of time previously known as "millimicrosecond" is now generally known as "nanosecond"). If the natural language is used as primary input to an information