Scientific Report No. ISR-11 Information Storage and Retrieval
Design Criteria for Automatic Information Systems
M. E. Lesk
G. Salton
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
A second principal difference between manual and automatic infor-
mation analysis systems is the relative difficulty in manual systems of
discriminating among keywords by weights assigned to reflect their
relative importance. This results in the "all or nothing" situation
where a given identifier is either present or not, and each identifier is
considered to be equally important. In an automatic system, on the other
hand, it is easy to assign weights to individual identifiers, as shown in
Fig. 2. These weights can be derived in part by using the frequency of
occurrence of the original text words, and in part as a function of the
various dictionary mapping procedures. Thus, ambiguous terms which in a
synonym dictionary correspond to many different concept classes, can be
weighted less than unambiguous terms.
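
The weighting idea above can be illustrated with a minimal sketch. It is
not the report's actual procedure: the dictionary contents, the concept
class numbers, the function name concept_weights, and the rule of dividing
each stem's frequency by the number of concept classes it maps to are all
assumptions chosen only to show how an ambiguous term can end up with a
smaller weight than an unambiguous one.

```python
from collections import Counter, defaultdict

# Hypothetical synonym dictionary: word stem -> concept classes it maps to.
# Stems and class numbers are invented for illustration only.
SYNONYM_DICTIONARY = {
    "retriev": [101],
    "index": [102],
    "field": [201, 202, 203],  # ambiguous: three possible concept classes
}


def concept_weights(stems, dictionary):
    """Return {concept_class: weight} for a list of word stems.

    Assumed rule (not taken from the report): each stem contributes
    frequency / k to each of the k concept classes it maps to, so an
    ambiguous stem counts for less than an unambiguous one.
    """
    weights = defaultdict(float)
    for stem, freq in Counter(stems).items():
        classes = dictionary.get(stem, [])  # unknown stems contribute nothing
        for concept in classes:
            weights[concept] += freq / len(classes)
    return dict(weights)


weights = concept_weights(
    ["retriev", "retriev", "index", "field"], SYNONYM_DICTIONARY
)
```

Under these assumptions the stem occurring twice receives the largest
weight, and each of the three classes reached through the ambiguous stem
receives only a third of the weight of the class reached through an
unambiguous stem occurring once.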
The relative usefulness of analyzing document sections of varying
lengths, and of utilizing weighted terms, is reflected in the output of Figs.
5 and 6. These recall-precision graphs exhibit output averaged over 17
search requests for the IRE-2 collection and over 35 requests for the ADI
material. Since it is in general desirable to get both high recall (that
is, to retrieve most of what is relevant) and high precision (that is, to
retrieve very little that is irrelevant), the region of importance is
the upper right-hand corner of each graph. The more effective a given
retrieval algorithm, the smaller will be the distance between the
corresponding recall-precision curve and the point at which recall and
precision both equal 1.
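
The two measures, and the notion of distance from the ideal corner of the
graph, can be made concrete with a short sketch. The function names, the
sample document numbers, and the choice of Euclidean distance are
assumptions made for illustration; the text does not specify how the
distance is to be measured.

```python
import math


def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search request.

    recall    = relevant documents retrieved / all relevant documents
    precision = relevant documents retrieved / all retrieved documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


def distance_to_ideal(recall, precision):
    """Euclidean distance (an assumed metric) from the point where
    recall and precision both equal 1."""
    return math.hypot(1.0 - recall, 1.0 - precision)


# Invented example: 4 documents retrieved, 5 relevant, 2 in common.
recall, precision = recall_precision(
    retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10]
)
distance = distance_to_ideal(recall, precision)
```

In this invented example recall is 2/5 and precision is 2/4; a search that
retrieved exactly the relevant documents would give a distance of zero.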
Fig. 5(a) shows a comparison of a "title only" option, where only
the titles of documents are used in the analysis, with a "full abstract"
option. In both cases, the word stems originally extracted from document
titles and document abstracts were first looked up in a synonym dictionary