Scientific Report No. IRS-13 Information Storage and Retrieval

Document Length (chapter)
E. M. Keen
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

In the experimental tests, documents with zero correlation are also given rank positions (although very low ones), in order that the normalized measures may be calculated and so that precision/recall curves may be drawn right up to 1.0 recall.

A statement was made earlier suggesting that short documents will present a barrier to perfect recall, because such short document identifications are likely in some cases to miss completely something important from the original full text of the document, resulting in a zero match between the search request and a relevant document. Such an occurrence will cause recall loss, and the resulting recall ceiling will be lower for short documents than for long ones.

The data of Figure 8 give results comparing abstracts and titles in six tests. The average recall ceiling is computed by accepting only those relevant documents with some positive correlation with the search requests; the recall ceiling with titles is seen to go down to .66 in one case. The recall ceiling for both abstracts and titles would in practice be lower than the values given, since many users would not be willing to examine all the documents with positive correlations (in the ADI Abstracts Collection this would involve examining, on average, 70% of the total collection). Comparing the results of Figure 8 with the data in Figure 1, the greater length of the abstracts and titles on aerodynamics compared with those on Documentation produces very slightly higher recall ceiling results, but the average-length abstracts and titles on computer science give quite superior recall ceiling results.
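The recall ceiling computation described above can be illustrated with a minimal sketch. It assumes that each request is represented by a mapping from document identifiers to correlation values together with the set of documents judged relevant; the function names, the data layout, and the sample values are hypothetical and are not taken from the report or from the system used in the tests.

```python
from statistics import mean


def recall_ceiling(correlations, relevant):
    """Fraction of the relevant documents that have a positive
    correlation with the request (assumed per-request definition)."""
    if not relevant:
        raise ValueError("request has no relevant documents")
    # A relevant document missing from the mapping is treated as a zero match.
    matched = sum(1 for doc in relevant if correlations.get(doc, 0.0) > 0.0)
    return matched / len(relevant)


def average_recall_ceiling(requests):
    """Arithmetic mean of the per-request recall ceilings."""
    return mean([recall_ceiling(corr, rel) for corr, rel in requests])


# Made-up illustration data: two requests, each given as
# (correlations by document, set of relevant documents).
requests = [
    ({"d1": 0.42, "d2": 0.0, "d3": 0.17}, {"d1", "d2"}),   # d2 is a zero match
    ({"d1": 0.0, "d4": 0.30, "d5": 0.08}, {"d4", "d5"}),   # both relevant match
]

print(average_recall_ceiling(requests))  # 0.75
```

In this made-up example the first request reaches a ceiling of 0.5, because one of its two relevant documents has zero correlation, and the second reaches 1.0, so the average is 0.75. Whether the report averages per request in exactly this way, or pools relevant documents over the request set, is an assumption of the sketch.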
The conclusion is that for users needing high recall, titles alone will not usually be adequate, and something nearer abstract length is required. Since the results presented so far are all averages, using arithmetic means over the request sets, data are given in Figures 9, 10, 11, and 12 that are based on the individual requests and individual relevant