ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
dictionaries may also exist, containing terms or categories which should not
be used for purposes of information identification.
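
As a minimal sketch of how such an exclusion dictionary might be applied, the following Python fragment drops excluded terms from a text before content identification. The word list, the function name `content_terms`, and the simple punctuation handling are all hypothetical illustrations and are not taken from the report.

```python
# Hypothetical exclusion dictionary; the report does not list its contents.
EXCLUDED_TERMS = {"the", "of", "and", "a", "in", "to", "is"}


def content_terms(text, excluded=EXCLUDED_TERMS):
    """Return the lowercased words of `text` that are not in the exclusion dictionary."""
    # Split on whitespace and strip common punctuation from each word (an assumption;
    # a real system would need a more careful treatment of word boundaries).
    words = (w.strip(".,;:()[]\"'").lower() for w in text.split())
    return [w for w in words if w and w not in excluded]


if __name__ == "__main__":
    print(content_terms("The construction of dictionaries is examined."))
```

Only the words that survive this filter would be passed on to the later stages of content identification.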
In view of the importance of the initial information analysis and
classification - all later search and retrieval operations are of course
of no avail in the absence of a careful and consistent determination of
information content - it is appropriate to examine in detail the problems
connected with the generation and use of dictionaries. Accordingly, the
present study specifies the form of a variety of dictionaries which have been
found useful in information analysis, and examines some of the principles
of dictionary construction. Emphasis is placed on those dictionaries which
can be used for natural language analysis, since many of the information
items and of the search requests to be stored may be expected to be expressed
by words or word strings in the natural language. Performance characteristics
are given, based on search results obtained with various dictionaries, and
several methods are suggested for the construction of dictionaries by
semi-automatic means.
2. Language Analysis
Consider the problem of taking a document or search request in the
natural language, and of attempting to use some automatic procedure to
generate content identifications for the input texts. Such a task
immediately raises many difficulties brought about by the complexity of
the language, and by the irregularities which govern the syntactic and
semantic structure. The following principal problems must be dealt with [1]:
1) words which carry out syntactic functions but which do not
contribute directly to the specification of information content