ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Design Criteria for Automatic Information Systems
chapter
M. E. Lesk
G. Salton
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
(called "Harris 3" in Fig. 5), to provide vocabulary normalization before
the actual word matching operation. The curve of Fig. 5(a) makes clear
how superior the full abstract process is compared with the title procedure.
If the text words had been matched directly, without a thesaurus intermediary,
the discrepancy between the two procedures would be even larger.
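The role of the thesaurus as an intermediary can be illustrated with a minimal sketch. The toy thesaurus, the concept-class numbers, and the function names below are hypothetical and are not taken from the system described in this report; the sketch only shows how mapping words to concept classes before matching lets a short text match a query that shares no literal words with it.

```python
# Hypothetical toy thesaurus: each word maps to a concept-class number.
# The entries are illustrative only and are not taken from any real thesaurus.
THESAURUS = {
    "computer": 1, "machine": 1, "processor": 1,
    "retrieval": 2, "search": 2, "lookup": 2,
    "document": 3, "text": 3, "abstract": 3,
}


def to_concepts(text, thesaurus=THESAURUS):
    """Replace each known word by its concept class; keep unknown words as-is."""
    return {thesaurus.get(word, word) for word in text.lower().split()}


def overlap(query, document, normalize=True):
    """Count identifiers shared by query and document.

    With normalize=True the comparison is made on concept classes;
    with normalize=False it is made on the literal words.
    """
    if normalize:
        q, d = to_concepts(query), to_concepts(document)
    else:
        q, d = set(query.lower().split()), set(document.lower().split())
    return len(q & d)


query = "machine search"
title = "computer lookup"

direct = overlap(query, title, normalize=False)      # literal word matching
normalized = overlap(query, title, normalize=True)   # thesaurus-normalized matching
print(direct, normalized)
```

In this toy case the literal comparison finds no common words, while the normalized comparison finds two shared concept classes. This is the kind of effect that would widen the gap between titles and abstracts if no thesaurus were used, since a short title offers fewer chances for a literal match.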
The output of Fig. 5(b) shows that a further improvement is obtainable
if full text is used, rather than only abstracts, particularly for the high
recall region. However, the improvement is much smaller here, and in actual
practice it would seem that the additional problems arising from a full
text process can be avoided by restricting the procedure to abstracts and
summaries, unless a clear requirement exists for a high recall performance.
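For readers unfamiliar with the measures behind curves such as those of Fig. 5, the following sketch shows how recall and precision can be computed at successive cutoffs of a ranked output list. The document identifiers and the relevance judgments are invented for illustration, and the averaging over queries used to produce the actual figures is not reproduced here.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one set of retrieved documents.

    recall    = relevant retrieved / total relevant
    precision = relevant retrieved / total retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical relevance judgments and ranked output for a single query.
relevant = {"d1", "d2", "d3", "d4"}
ranked = ["d1", "d7", "d2", "d9", "d3", "d4"]

# One (recall, precision) point per cutoff k = 1 .. len(ranked).
curve = [recall_precision(ranked[:k], relevant) for k in range(1, len(ranked) + 1)]
for k, (r, p) in enumerate(curve, start=1):
    print(f"cutoff {k}: recall={r:.2f} precision={p:.2f}")
```

The high recall region corresponds to the later cutoffs, where most of the relevant documents have been retrieved. A procedure that performs better there places the last few relevant documents earlier in the ranking.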
The output of Fig. 5 then leads to the following rule:
Rule 1: The use of document titles alone for purposes of
information analysis results in poor retrieval
performance compared with the use of abstracts or
full text.
Rule 1 is of particular interest because of the widespread advocacy of
permuted title indexes (also known as KWIC indexes) for information search
and retrieval purposes.
Fig. 6 shows the improvement obtainable by using weighted word stems,
compared with unweighted stems. It is clear from the figure that term
weights are essential for retrieval purposes, and it can be inferred that
one of the main drawbacks of presently operating keyword search systems is
the lack of discrimination between terms of varying importance. Rule 2 can
then be stated as follows: