Scientific Report No. IRS-13 Information Storage and Retrieval
Document Length
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Figures 21 and 22 show, respectively, the number of requests favoring
abstract and text with each of two dictionaries, and magnitude difference
plots for the stem dictionary, using normalized precision. The stem dictionary
is chosen for the plots because in Figure 21 stem favors abstracts more than text.
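
The per-request comparison described above can be sketched in Python. This is a minimal illustration only: it assumes the usual rank-based definitions of normalized recall and normalized precision (perfect ranking scores 1, worst possible ranking scores 0), and the function names, the tolerance, and the sample scores are hypothetical rather than taken from the report.

```python
import math


def _check(relevant_ranks, collection_size):
    # Both measures need at least one relevant and one non-relevant document.
    n = len(relevant_ranks)
    if not 0 < n < collection_size:
        raise ValueError("need 0 < number of relevant documents < collection size")
    return n


def normalized_recall(relevant_ranks, collection_size):
    """1 minus the rank-sum excess over the ideal ranking, scaled to [0, 1]."""
    n = _check(relevant_ranks, collection_size)
    ideal = sum(range(1, n + 1))
    return 1.0 - (sum(relevant_ranks) - ideal) / (n * (collection_size - n))


def normalized_precision(relevant_ranks, collection_size):
    """Same idea as normalized recall, but on log ranks."""
    n = _check(relevant_ranks, collection_size)
    actual = sum(math.log(r) for r in relevant_ranks)
    ideal = sum(math.log(i) for i in range(1, n + 1))
    # log(N! / ((N - n)! n!)), computed with lgamma to avoid huge factorials.
    worst_gap = (math.lgamma(collection_size + 1)
                 - math.lgamma(collection_size - n + 1)
                 - math.lgamma(n + 1))
    return 1.0 - (actual - ideal) / worst_gap


def compare_per_request(scores_a, scores_b, tolerance=1e-9):
    """Count requests favoring each method and return the sorted differences.

    scores_a and scores_b map a request id to a score for methods A and B.
    The tolerance used to declare a tie is an assumption of this sketch.
    """
    differences = sorted((scores_a[q] - scores_b[q] for q in scores_a),
                         reverse=True)
    favoring_a = sum(1 for d in differences if d > tolerance)
    favoring_b = sum(1 for d in differences if d < -tolerance)
    ties = len(differences) - favoring_a - favoring_b
    return favoring_a, favoring_b, ties, differences


if __name__ == "__main__":
    # Hypothetical ranks of the relevant documents for two requests
    # in a 10-document collection, under two document lengths.
    abstract = {"q1": normalized_precision([1, 4], 10),
                "q2": normalized_precision([2, 3, 9], 10)}
    text = {"q1": normalized_precision([1, 3], 10),
            "q2": normalized_precision([2, 5, 9], 10)}
    print(compare_per_request(abstract, text))
```

The sorted list of differences is the raw material for a magnitude difference plot; the counts correspond to the "number of requests favoring" each document length.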
The differences between text and abstract are always small, and
usually in favor of text. The precision/recall curves for titles only are
added to those for abstract and text in Figure 23; the data on individual requests
in Figure 24, comparing the three document lengths, again show the expected
order of merit. Data for the 170 relevant documents concerned are given in
Figure 25. Taking results of the six possible orders of merit for the three
document lengths, it is interesting to note that merit orders "A" and "F"
are observed for more documents than any of the other orders of merit.
Documents in A are clearly matched poorly with the request using titles, and
the two increases in length improve the match and rank positions of the [OCRerr]7
documents concerned. Documents in F probably match the requests quite well
on titles, and increases in document length only serve to increase the matches
with non-relevant documents, thus worsening the ranks of these 36 relevant
documents. The abstracts came off worst by this evaluation, but text is
best for many relevant documents.
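
The tally of merit orders per relevant document can be sketched as follows. This is an illustration under stated assumptions: the rank data are invented, the function names are hypothetical, and documents with tied ranks are set aside in a separate bucket, a case the text does not discuss. Orders are written best-first as tuples; the letter labels are not reproduced because the text describes only two of them.

```python
from collections import Counter

LENGTHS = ("title", "abstract", "text")


def merit_order(ranks):
    """Return the document lengths ordered best-first for one relevant document.

    ranks maps each length to the rank position of the document (1 = best).
    Returns None when two lengths give the same rank (an assumption of this
    sketch), so that only the six strict orders are counted.
    """
    values = [ranks[name] for name in LENGTHS]
    if len(set(values)) < len(values):
        return None
    return tuple(sorted(LENGTHS, key=lambda name: ranks[name]))


def tally_merit_orders(documents):
    """Count how many relevant documents fall under each order of merit."""
    return Counter(merit_order(ranks) for ranks in documents)


if __name__ == "__main__":
    # Invented rank positions for four relevant documents.
    documents = [
        {"title": 40, "abstract": 12, "text": 3},   # improves with length
        {"title": 35, "abstract": 20, "text": 8},   # improves with length
        {"title": 2, "abstract": 6, "text": 15},    # worsens with length
        {"title": 5, "abstract": 5, "text": 9},     # tie, set aside
    ]
    for order, count in tally_merit_orders(documents).items():
        print(order, count)
```

Under this encoding, the pattern described for the documents that gain from each increase in length is `("text", "abstract", "title")`, and the pattern for documents whose ranks worsen with length is `("title", "abstract", "text")`.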
Retrieval runs using full text were also made without the abstracts,
although the title was always included. In the results presented here, text
includes the abstract, and this inclusion does provide a slight improvement in
performance, as the normalized measures in Figure 26 show.
Despite this outcome, the ADI abstracts are thought to be rather poor;
some are rather short, and do not seem to cover the text adequately for
document retrieval purposes. It is suggested that if better abstracts were
available they might have a superior performance (apart from recall ceiling)
to full text.