IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Thesaurus, Phrase and Hierarchy Dictionaries
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
VI 1-6
Descriptions of the methods used by SMART have previously appeared in
[2,3,5,6,8,9,10,11]. No studies have yet been made of full-scale phrase
recognition, and the "statistical phrase" technique used is intended only
to remove cases of single word ambiguity. For example, a hypothetical
medical request on "swine fever in New Guinea" will be quite strongly
matched, using a thesaurus, with a document dealing with "diseases of
the guinea pig". The use of a phrase dictionary containing "New Guinea"
would give strong weight to the occurrence of both "New" and "Guinea"
in a sentence, and thus the spurious match with "Guinea" in the sense
of "guinea pig" would receive less weight by comparison.
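The New Guinea example can be made concrete with a small numerical sketch. Everything in it is an assumption chosen for illustration and is not taken from the report: the concept names, the unit weights, the phrase weight of 2, the hypothetical "ISLAND" concept, and the use of a cosine correlation as the matching function.

```python
import math

def cosine(a, b):
    """Cosine correlation between two sparse concept-weight vectors."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical thesaurus concepts: "swine"/"pig" map to PIG,
# "fever"/"diseases" map to DISEASE.
request = {"PIG": 1.0, "DISEASE": 1.0, "NEW": 1.0, "GUINEA": 1.0}
guinea_pig_doc = {"DISEASE": 1.0, "GUINEA": 1.0, "PIG": 1.0}
new_guinea_doc = {"PIG": 1.0, "DISEASE": 1.0, "NEW": 1.0,
                  "GUINEA": 1.0, "ISLAND": 1.0}

# Assumed phrase concept, added only where "New" and "Guinea" co-occur.
request_p = {**request, "NEW_GUINEA": 2.0}
new_guinea_doc_p = {**new_guinea_doc, "NEW_GUINEA": 2.0}
guinea_pig_doc_p = dict(guinea_pig_doc)  # no "New", so no phrase concept

before_spurious = cosine(request, guinea_pig_doc)
before_relevant = cosine(request, new_guinea_doc)
after_spurious = cosine(request_p, guinea_pig_doc_p)
after_relevant = cosine(request_p, new_guinea_doc_p)

print(f"thesaurus only : guinea pig {before_spurious:.3f}, "
      f"New Guinea {before_relevant:.3f}")
print(f"with phrase    : guinea pig {after_spurious:.3f}, "
      f"New Guinea {after_relevant:.3f}")
```

Under these assumed weights the two documents score almost equally with the thesaurus alone, while the phrase concept lowers the correlation of the "guinea pig" document and raises that of the "New Guinea" document. The spurious match is not eliminated, only given less weight by comparison.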
The phrase dictionaries tested are handmade, and are based on the
thesaurus groups. Phrase recognition takes place if the two or more
component words (thesaurus concept numbers) appear in the same sentence;
no specific word order position or syntactical relation is demanded.
Phrases are used in retrieval as an addition to the thesaurus dictionary;
thus, when a phrase occurs, a new concept identifier is added to the
thesaurus concepts already assigned to the request or document, or the
weight of an existing concept identifier is increased.
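The recognition rule just described may be sketched as follows. This is a minimal illustration under stated assumptions, not the SMART implementation: the thesaurus entries, the concept numbers, the phrase concept number 900, the phrase weight, and the naive sentence splitting are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical thesaurus: word -> thesaurus concept number.
THESAURUS = {"new": 101, "guinea": 102, "pig": 103, "swine": 103,
             "fever": 104, "diseases": 104}

# Hypothetical phrase dictionary: component concept numbers -> phrase concept.
PHRASES = {frozenset({101, 102}): 900}
PHRASE_WEIGHT = 1.0  # assumed increment per recognized phrase

def sentence_concepts(sentence):
    """Map the words of one sentence to thesaurus concept numbers."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [THESAURUS[w] for w in words if w in THESAURUS]

def index_text(text):
    """Return a concept-weight vector with phrase concepts added."""
    vector = Counter()
    for sentence in re.split(r"[.!?]+", text):
        concepts = sentence_concepts(sentence)
        vector.update(concepts)  # ordinary thesaurus concepts
        present = set(concepts)
        for components, phrase_id in PHRASES.items():
            # All components in the same sentence; word order is ignored.
            if components <= present:
                # Adds a new identifier, or increases the weight of an
                # identifier that is already present.
                vector[phrase_id] += PHRASE_WEIGHT
    return dict(vector)

print(index_text("Swine fever in New Guinea."))
print(index_text("Diseases of the guinea pig."))
```

In this sketch the first text receives the phrase concept in addition to its four thesaurus concepts, and the second does not. A text in which "new" and "Guinea" fall in different sentences would not receive it either, since co-occurrence is tested sentence by sentence.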
These procedures may be clarified by the excerpt from a thesaurus
and phrase dictionary given in Fig. 3. The phrase made up from the
thesaurus groups containing "axial" and "symmetry" is of value because
the word "axial" is more commonly found in conjunction with "compressor";
thus, without phrase processing, any document dealing with
"axial compressors" that also contains a concept identifier such as
"regular" or "uniform" could be matched with a request for "axial symmetry".
The addition of phrase processing in this example does not prevent such a