ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
chapter
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
a) The Phrase Dictionaries
Both the regular and the null thesauruses are based on entries
corresponding either to single words or to single word stems. In attempting
to perform a subject analysis of written text, it is possible, however, to
go further by trying to locate "phrases" consisting of sets of words which
are judged to be important in a given subject area. For example, in the
field of computer science, the concepts of "program" and "language" may
mean many things to many people. On the other hand, the phrase concept
which results from a combination of these individual words, that is,
"programming language" has a much more specific connotation. Such phrases
can be used for subject identification by building phrase dictionaries to
be used in locating combinations of concepts, rather than individual concepts
alone. Such phrase dictionaries would then normally include pairs, or triples,
or quadruples of words or concepts, corresponding in written texts to the
more likely noun and prepositional phrases which may be expected to be
indicative of subject content in a given topic area.
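
The idea of a phrase dictionary can be illustrated with a minimal sketch in
Python. Everything in it is an assumption made for illustration and is not
taken from the report: the dictionary entries, the stem table STEMS, the
phrase labels, and the function names are hypothetical, and a phrase is
counted as present whenever all of its component stems occur somewhere in
the text, which is only one of several possible recognition criteria.

```python
import re

# Hypothetical phrase dictionary: each entry maps a set of component
# word stems (a pair or a triple here) to a phrase concept label.
PHRASE_DICTIONARY = {
    frozenset({"program", "language"}): "PROGRAMMING_LANGUAGE",
    frozenset({"inform", "retriev"}): "INFORMATION_RETRIEVAL",
    frozenset({"digit", "comput", "system"}): "DIGITAL_COMPUTER_SYSTEM",
}

# Hypothetical stem table standing in for a real stemming procedure.
STEMS = {
    "programming": "program",
    "programs": "program",
    "languages": "language",
    "information": "inform",
    "retrieval": "retriev",
}


def stems_in(text):
    """Return the set of stems occurring in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {STEMS.get(word, word) for word in words}


def find_phrases(text, dictionary=PHRASE_DICTIONARY):
    """Return labels of phrases whose components all occur in the text."""
    present = stems_in(text)
    return sorted(
        label
        for components, label in dictionary.items()
        if components <= present  # every component stem is present
    )


print(find_phrases("A programming language for information retrieval"))
```

Under these assumptions a text mentioning only "program" yields no phrase
concept, while a text containing both "programming" and "language" is
assigned the more specific phrase concept.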
Many different strategies can be used in the construction of phrase
dictionaries. For example, it is possible to base phrase dictionaries on
combinations of high-frequency words or word stems occurring in documents and
search requests; alternatively, one may want to use a thesaurus before appeal
is made to a phrase dictionary. Under those circumstances, the phrase
dictionary would then be based on combinations of concept categories included
in the thesaurus, rather than on combinations of words.
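
The two construction strategies just described can be contrasted in a short
Python sketch. It is a simplified illustration under stated assumptions, not
the procedure used in the report: the frequency threshold min_doc_freq, the
concept numbers in THESAURUS, the entries of CONCEPT_PHRASES, and all
function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_stem_pairs(documents, min_doc_freq=2):
    """Strategy 1: propose phrase candidates from pairs of stems that are
    each frequent and that co-occur in at least min_doc_freq documents.
    Each document is given as a list of stems."""
    doc_freq = Counter(stem for doc in documents for stem in set(doc))
    frequent = {stem for stem, n in doc_freq.items() if n >= min_doc_freq}
    pair_counts = Counter()
    for doc in documents:
        present = sorted(set(doc) & frequent)
        pair_counts.update(combinations(present, 2))
    return {pair for pair, n in pair_counts.items() if n >= min_doc_freq}


# Strategy 2: consult a thesaurus first, then look up phrases defined
# over concept categories. The concept numbers are hypothetical.
THESAURUS = {"program": 101, "routine": 101, "language": 202, "compiler": 203}
CONCEPT_PHRASES = {frozenset({101, 202}): "PROGRAMMING_LANGUAGE"}


def concept_phrases(stems, thesaurus=THESAURUS, phrases=CONCEPT_PHRASES):
    """Map stems to concept categories, then match concept combinations."""
    concepts = {thesaurus[stem] for stem in stems if stem in thesaurus}
    return sorted(
        label for components, label in phrases.items() if components <= concepts
    )


docs = [
    ["program", "language", "design"],
    ["program", "language"],
    ["design", "cost"],
]
print(frequent_stem_pairs(docs))
print(concept_phrases(["routine", "language"]))
```

Because "program" and "routine" share a concept category in this
hypothetical thesaurus, the concept-based dictionary matches either word
with a single entry, whereas a word-based dictionary would need a separate
entry for each combination of words.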
Furthermore, given the availability of a phrase dictionary, one can
recognize the presence of phrases in a given text under a variety of
circumstances: for example, the existence of a phrase may be recognized
whenever the phrase components are present within a given document, regard-