Scientific Report No. IRS-13 Information Storage and Retrieval
Document Length
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
In the experimental tests, documents with zero correlation are also given rank
positions (although very low ones), in order that the normalized measures
may be calculated, and also so that precision/recall curves may be drawn
right up to 1.0 recall. It was suggested earlier that short documents
present a barrier to perfect recall, because a short document
identification is likely in some cases to miss completely something
important from the original full text of the document, resulting in
a zero match between the search request and a relevant document. Such
an occurrence causes recall loss, and the resulting recall ceiling will
therefore be lower for short documents than for long ones. The data of Figure 8
give results comparing abstracts and titles in six tests. The average recall
ceiling is computed by counting only those relevant documents that have some
positive correlation with the search request; with titles, the recall ceiling
falls as low as 0.66 in one case. The recall ceiling for both abstracts
and titles would in practice be lower than the values given, since many users
would not be willing to examine all the documents with positive correlations
(in the ADI Abstracts Collection this would involve examining, on average,
70% of the total collection). Comparing the results of Figure 8 with the
data in Figure 1, the greater length of the abstracts and titles on
aerodynamics, compared with those on documentation, produces very slightly
higher recall ceiling results, but the average-length abstracts and titles on
computer science give quite superior recall ceiling results. The conclusion is that for users needing
high recall, titles alone will not usually be adequate, and something nearer
abstract length is required.
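
The recall ceiling calculation described above can be illustrated with a minimal sketch. It assumes that each request is represented by a mapping from document identifiers to correlation values together with the set of documents judged relevant. The function names and the toy data are hypothetical and are not taken from the report or from any of the test collections.

```python
from statistics import mean


def recall_ceiling(correlations, relevant):
    """Fraction of relevant documents that have a positive correlation
    with the request; the remainder can never be reached by matching."""
    if not relevant:
        raise ValueError("request has no relevant documents")
    # A document missing from the mapping is treated as a zero match.
    reachable = [d for d in relevant if correlations.get(d, 0.0) > 0.0]
    return len(reachable) / len(relevant)


def average_recall_ceiling(requests):
    """Arithmetic mean of the per-request recall ceilings."""
    return mean(recall_ceiling(corr, rel) for corr, rel in requests)


# Hypothetical toy data: two requests over a four-document collection.
requests = [
    ({"d1": 0.42, "d2": 0.0, "d3": 0.17, "d4": 0.05}, {"d1", "d2"}),
    ({"d1": 0.0, "d2": 0.31, "d3": 0.0, "d4": 0.22}, {"d2", "d4"}),
]

print(average_recall_ceiling(requests))  # 0.75
```

In the first toy request one of the two relevant documents has zero correlation, so its ceiling is 0.5; the second request reaches both relevant documents, giving 1.0, and the average over the request set is 0.75.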
Since the results presented so far are all averages, and use the
arithmetic means over the request sets, data are given in Figures 9, 10,
11, and 12 that are based on the individual requests and individual relevant