Scientific Report No. ISR-11 Information Storage and Retrieval
Operating Instructions for the SMART Text Processing and Document Retrieval System
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
3.2. Specifications Affecting Phrase Searching
Two basic phrase searching methods are available. These are called
the statistical and syntactic phrase searching procedures. Both use
pre-designed dictionaries of phrases which are searched against sentences
in the texts. Statistical phrase searching is the more common of these
methods. The phrase dictionary consists of a set of pre-assigned word
pairs or word n-tuples (where n can be 2, 3, 4, 5 or 6). Each word n-tuple
has a "phrase concept number" associated with it, that is used as the
concept number representing the whole phrase (as distinguished from the
individual concept numbers which give the components of the phrase).
Every sentence is scanned for occurrences of phrase components, and when
all of the n phrase components have been found to occur at least m times
in the sentence, the phrase is considered to have occurred m times, and
a weight is entered accordingly for the "phrase concept number". The
method of writing the phrase dictionary on the library tape is given in
part 5.1.2.
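The counting rule for statistical phrases can be illustrated with a short
sketch. It is not the SMART implementation: the concept numbers (101, 102,
103, 900), the word-to-concept table, and the function names are all
hypothetical, chosen only for illustration. The sketch assumes that related
word forms such as "approximately" and "approximation" share one concept
number, and it takes m to be the smallest occurrence count among the
components of a phrase.

```python
import re
from collections import Counter

# Hypothetical word -> concept number table (an assumption standing in for
# the dictionary lookup; related word forms share one concept number).
WORD_CONCEPTS = {
    "second": 101,
    "order": 102,
    "approximately": 103,
    "approximation": 103,
}

# Hypothetical phrase dictionary: phrase concept number -> component concepts.
PHRASES = {900: (101, 102, 103)}  # "second order approximation"


def sentence_concepts(sentence):
    """Map the words of one sentence to concept numbers, skipping unknown words."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [WORD_CONCEPTS[w] for w in words if w in WORD_CONCEPTS]


def statistical_phrases(concepts, phrases):
    """Return {phrase concept number: m} for each phrase found in the sentence.

    A phrase occurs m times when every one of its components occurs at least
    m times; position and order within the sentence are ignored.
    """
    counts = Counter(concepts)
    found = {}
    for phrase_concept, components in phrases.items():
        m = min(counts[c] for c in components)  # Counter gives 0 if absent
        if m > 0:
            found[phrase_concept] = m
    return found


sentence = ("Despite a second, larger order of textbooks, approximately "
            "twenty-five percent of the students are still without them")
hits = statistical_phrases(sentence_concepts(sentence), PHRASES)
print(hits)  # {900: 1}
```

Because only the counts of the components are compared, the sketch reports
the phrase for any sentence that contains all three components, however far
apart or in whatever order they appear.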
It is noted that this is a rather imprecise type of search, since
the words searched for may occur anywhere in the sentence in any order.
For example, the sentence "Despite a second, larger order of textbooks,
approximately twenty-five percent of the students are still without them"
will be considered as containing the phrase "second order approximation".
A more precise type of search is available through the "syntactic phrase
searching" procedure. This uses the "criterion tree" dictionary (see
5.[OCRerr].4) in which a complete syntactic/semantic/structural specification
of each phrase is given. If a syntactic phrase is to be detected,
1) the components of the phrase must have the proper semantic