ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Design Criteria for Automatic Information Systems
chapter
M. E. Lesk
G. Salton
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
to identify documents and search requests, rather than individual
concepts alone. Thus if a given document contains the notion of "program"
and the notion of "language", it might be tagged with the phrase "programming
language". Phrases can be generated using a variety of strategies: for
example, a phrase can be assigned any time the specified components
co-occur in a given document, or in a given sentence of a document;
alternatively, more restrictive phrase generation methods can be used by
incorporating into the phrase generation process a syntactic recognition
routine to check the syntactic compatibility between the phrase components
before
a phrase is actually accepted. [i[OCRerr]]
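The difference between the two co-occurrence strategies can be illustrated with a minimal Python sketch. It is not the SMART implementation: the function names, the tokenizer, and the sentence splitter are assumptions made for illustration, and plain lowercase words stand in for the concepts that a real system would identify.

```python
import re

def tokens(text):
    """Lowercase word tokens; a crude stand-in for concept identification."""
    return set(re.findall(r"[a-z]+", text.lower()))

def cooccur_in_document(text, components):
    """True if every phrase component appears somewhere in the document."""
    return set(components) <= tokens(text)

def cooccur_in_sentence(text, components):
    """True if every phrase component appears within one single sentence."""
    sentences = re.split(r"[.!?]+", text)  # naive sentence splitting
    return any(set(components) <= tokens(s) for s in sentences)

# Hypothetical two-sentence document; the components fall in different sentences.
doc = "The program was slow. A new language fixed that."
components = ("program", "language")

doc_level = cooccur_in_document(doc, components)
sent_level = cooccur_in_sentence(doc, components)
```

In this toy document the phrase would be assigned under document-level co-occurrence but not under the sentence-level criterion, which shows why the latter is the more restrictive of the two.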
In the SMART system, the normal phrase process uses a preconstructed
dictionary of important phrases, and simple co-occurrence of phrase
components, rather than syntactic criteria, is used to assign phrases to
documents.* Phrases seem to be particularly useful as a means of
incorporating into a document representation terms whose individual
components are not always meaningful by themselves. For example, "computer"
and "control" are reasonably nonspecific, while "computer control" has a much
more definite meaning in a computer science collection.
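A dictionary-driven phrase assignment of this kind might be sketched as follows. The dictionary entries, the name assign_phrases, and the word-level matching are all hypothetical; the actual SMART dictionary format is not described here, and no stemming is attempted, so "programs" would not match "program" in this simplified version.

```python
import re

# Hypothetical phrase dictionary: phrase name -> components that must co-occur.
PHRASE_DICTIONARY = {
    "computer control": ("computer", "control"),
    "programming language": ("program", "language"),
}

def assign_phrases(text, dictionary=PHRASE_DICTIONARY):
    """Return every dictionary phrase whose components all occur in the text.

    Only simple co-occurrence is tested; word order, adjacency, and syntax
    are ignored.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        phrase
        for phrase, components in dictionary.items()
        if set(components) <= words
    )

phrases = assign_phrases("Control of the process is handed to a computer.")
```

Because only co-occurrence is checked, the sample sentence receives the phrase "computer control" even though the two words are neither adjacent nor in that order.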
The output of Fig. 9 shows that phrases tend to improve recall at
some expense in initial precision. This same effect was previously noted
when the abstract processing was compared with full text in Fig. 5(b); it
results from the fact that the simple process is good enough to retrieve
the first few relevant documents (that is, in the high precision region),
while the more sophisticated procedure is important if additional relevant
documents are also wanted (that is, for high recall).
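The trade-off between the two regions can be made concrete with the standard definitions of precision and recall at a rank cutoff. The rankings below are invented purely for illustration and are not taken from Fig. 9 or from any SMART run; the function name is likewise an assumption.

```python
def precision_recall_at(ranking, relevant, k):
    """Precision and recall over the top k documents of a ranked list."""
    hits = sum(1 for doc in ranking[:k] if doc in relevant)
    return hits / k, hits / len(relevant)

# Invented toy data: four relevant documents (d*) among nonrelevant ones (x*).
relevant = {"d1", "d2", "d3", "d4"}
simple_ranking = ["d1", "d2", "x1", "x2", "x3", "d3", "x4", "x5"]
phrase_ranking = ["d1", "x1", "d2", "d3", "x2", "d4", "x3", "x4"]

for k in (2, 6):
    print(k,
          precision_recall_at(simple_ranking, relevant, k),
          precision_recall_at(phrase_ranking, relevant, k))
```

With these made-up rankings, the simple process has the higher precision at the top of the list, while the phrase process has retrieved more of the relevant documents by the deeper cutoff, which is the pattern described above.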
* Syntactic methods have, however, been used experimentally and sample
results are published elsewhere. [6]