ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
chapter
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
included in any of the dictionaries. Experiments were conducted with the
SMART system, using both unrestricted vocabularies (full null thesaurus)
and frequency-restricted entries (partial null). A sample set of
document abstracts of some 50,000 total running words would typically
produce a full null thesaurus of about 2,800 distinct word stems and a
partial null dictionary of about 900 stems (assuming a frequency of at least
four occurrences for each entry listed).
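The distinction between the full and the partial null dictionary can be illustrated by a minimal sketch. The function name, the parameter names, and the sample stems below are hypothetical and are not taken from the SMART programs; only the threshold of four occurrences comes from the text.

```python
from collections import Counter


def build_null_dictionaries(stems, min_frequency=4):
    """Return (full_null, partial_null) lists of distinct stems.

    full_null    : every distinct stem found in the running text
    partial_null : only stems occurring at least `min_frequency` times
    (Hypothetical helper; the threshold of 4 follows the text above.)
    """
    counts = Counter(stems)
    full_null = sorted(counts)
    partial_null = sorted(s for s, n in counts.items() if n >= min_frequency)
    return full_null, partial_null


# Made-up running text, already reduced to word stems.
stems = ["retriev"] * 5 + ["index"] * 4 + ["thesaur"] * 2 + ["suffix"]
full_null, partial_null = build_null_dictionaries(stems)
print(len(full_null), len(partial_null))
```

On a real collection the same counting step would be applied to the stems of all running words in the sample abstracts.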
If it is desired to list word stems, rather than full words, these
must of course first be generated by a suffix cut-off system. To this
effect, a suffix dictionary is built, a typical example of which is shown
in Fig. [OCRerr]. The lookup procedure in this suffix dictionary is described
in the next chapter, together with the lookup procedures for the other
dictionaries. The structure of the suffix dictionary may, however, be
examined immediately. It may be seen from Fig. 1* that each suffix is listed
with a sequence number and with one or more syntactic codes. The latter may
be used if it later becomes necessary to recombine stems and suffixes into
complete, acceptable words, as may be required, for example, to carry out
a syntactic analysis.
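The structure of a suffix dictionary entry, as just described, might be sketched as follows. The entries, the code strings "A" and "B", the longest-match rule, and the minimum stem length are all assumptions made for illustration; the actual lookup procedure is the one described in the next chapter and may differ.

```python
from typing import NamedTuple, Optional, Tuple


class SuffixEntry(NamedTuple):
    sequence_number: int               # position of the suffix in the list
    suffix: str
    syntactic_codes: Tuple[str, ...]   # one or more codes per suffix


# Made-up entries; "A" and "B" stand in for real syntactic codes.
SUFFIX_DICTIONARY = [
    SuffixEntry(1, "s", ("A", "B")),
    SuffixEntry(2, "ed", ("B",)),
    SuffixEntry(3, "ing", ("A", "B")),
    SuffixEntry(4, "ings", ("A",)),
]


def cut_suffix(word, dictionary=SUFFIX_DICTIONARY, min_stem_length=3):
    """Remove the longest listed suffix, keeping a stem of a minimum length.

    Returns (stem, entry); entry is None when no suffix applies.
    Longest-match with a minimum stem length is an assumed rule.
    """
    best: Optional[SuffixEntry] = None
    for entry in dictionary:
        stem_length = len(word) - len(entry.suffix)
        if word.endswith(entry.suffix) and stem_length >= min_stem_length:
            if best is None or len(entry.suffix) > len(best.suffix):
                best = entry
    if best is None:
        return word, None
    return word[: len(word) - len(best.suffix)], best


stem, entry = cut_suffix("indexing")
print(stem, entry.sequence_number, entry.syntactic_codes)
```

Keeping the matched entry alongside the stem preserves the syntactic codes, so that stem and suffix could later be recombined into a complete word.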
The syntactic codes included in the suffix dictionary represent only
partial homographs which must be combined with complementing codes attached
to the word stems in order to determine which suffixes match which stems.
(The syntactic codes attached to the word stems included in the null thesaurus
are not shown in the output of Fig. 3.) For example, a partial homograph
such as OTIO from the null dictionary will combine with a partial homograph
code from the suffix list, such as VOOSO, to form a complete homograph. In
this case the complete code is VTISO, indicating a single object transitive
verb in the third person singular.
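One way to read the example above is as a position-by-position merge in which the character O, as printed here, marks a position left open for the complementing code. The sketch below works under that assumption only; the function name, the right-padding of the shorter code, the rule that two conflicting characters mean "no match", and the code NOOOO used in the second call are all hypothetical and are not taken from the report.

```python
from itertools import zip_longest

PLACEHOLDER = "O"  # assumed marker for an open position, as printed in the text


def combine_partial_codes(stem_code, suffix_code):
    """Merge two partial homograph codes position by position.

    Returns the complete code, or None when both codes fill the same
    position with different characters (taken here to mean that the
    suffix does not match the stem). The shorter code is padded on the
    right with the placeholder; this padding is an assumption.
    """
    merged = []
    for a, b in zip_longest(stem_code, suffix_code, fillvalue=PLACEHOLDER):
        if a == PLACEHOLDER:
            merged.append(b)
        elif b == PLACEHOLDER or a == b:
            merged.append(a)
        else:
            return None
    return "".join(merged)


print(combine_partial_codes("OTIO", "VOOSO"))   # the example from the text
print(combine_partial_codes("NOOOO", "VOOSO"))  # hypothetical conflicting code
```

Under these assumptions the first call reproduces the complete code VTISO given in the text, and the second call reports no match because both codes fill the first position differently.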