IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Document Length
chapter
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
documents, but the abstracts give a much better performance. The ADI titles
are also short, but analysis has shown that where the title does not
give a good performance, a search of the full text frequently gives a
poor performance as well. The cause of this is probably a combination
of the relevance decisions used and synonym recognition problems,
since the subject terminology in documentation is thought to be less precise
than in the other collections.
Where requestor needs are covered by whole documents treating the
topic of the request, titles alone may frequently be adequate, and KWIC
title indexes have proved to be useful tools for such needs. It was noted
that in a subset of the requests used in the Cranfield Project tests, 31%
of the relevant document titles in a set of 35 requests had a strong match
with the search request [2, pages 36-39]. But requestor needs are not always
for whole documents, since relevant portions of a document frequently answer
a need as completely as a whole document. In these cases, titles are quite
inadequate, and a more exhaustive selection from the text is essential for
good retrieval.
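
The contrast between title-only matching and a more exhaustive selection
from the text can be illustrated with a minimal sketch. It assumes a simple
word-frequency weighting and a cosine correlation between request and
document; the function names and the toy document and request are invented
for illustration and are not taken from the test collections.

```python
import math
from collections import Counter


def term_weights(text):
    """Weight each word by its frequency in the text (assumed weighting)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine correlation between two weighted term vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented toy document: only one portion of the body answers the request.
title = "automatic indexing of technical reports"
body = ("automatic indexing of technical reports is described "
        "one section covers journals and citation counts for journals")
request = "citation counts for journals"

req = term_weights(request)
title_score = cosine(req, term_weights(title))
text_score = cosine(req, term_weights(title + " " + body))

print("title only:", title_score)
print("title plus text:", text_score)
```

In this toy case the title shares no words with the request, so only the
longer text produces a match. A real system would also apply a thesaurus
and remove common words, both of which are omitted here.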
Four examples are given comparing abstracts to full text using
the ADI Collection, in Figures 35 to 39. The request statements are given,
together with the words matching the documents, with matching aided by a
thesaurus dictionary. Figure 35 shows a case where the increased matching
on full text improves performance, whereas Figure 36 shows how an increased
matching on full text can worsen performance. Figure 37 shows a case where
matching and performance were unchanged by use of full text, since the weight
of the important term "journals" was increased from 14 on abstracts to 30 on
full text. Figure 38 shows a case where the increased weights provided by
full text fail to prevent a non-relevant document from receiving a rank