IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Document Length
chapter
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
It is important to realize that this discussion of the effect of
changes in document length on the correlation coefficient applies to
correlations with either relevant or non-relevant documents, and a change in
correlation resulting from a change in rank gives no indication of the change
in retrieval performance. Retrieval performance is always a trade-off between
relevant and non-relevant documents, and increases in document length can
just as easily worsen performance as improve it, since longer documents
produce greater opportunity for incorrect matches with non-relevant
documents. An illustration of the effect of a change in document length on a
non-relevant document is given in Figure 3, where document 41 is presented
for both title and abstract searches in relation to request QA8, used earlier
in Figure 2. Again the cosine correlation decreases as document length
increases, but in this case, since the change from titles to abstracts does
not even improve the weights of the two matching concepts, a severe drop
in correlation takes place. Retrieval performance for this request is given
in Figure [OCRerr], which considers only documents 61 (relevant, see Figure 2) and 41
(non-relevant). It is seen that document 41 is more highly correlated (and
therefore better ranked) than document 61 with titles; the reverse result is
obtained with the abstracts. In this one example, the increase in document
length improves performance, but many individual cases of the reverse trend
have been observed.
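
The rank reversal described above can be reproduced with a small numerical
sketch. The concept names and weights below are hypothetical and are not
taken from Figures 2 or 3; they are chosen only so that both documents lose
correlation when abstracts replace titles, while the non-relevant document,
whose matching weights do not improve, loses far more.

```python
import math


def cosine(query, doc):
    """Cosine correlation between two sparse concept-weight vectors (dicts)."""
    dot = sum(w * doc.get(concept, 0) for concept, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)


def with_filler(matching, n_filler):
    """Add n_filler non-matching concepts of weight 1 to a document vector."""
    doc = dict(matching)
    doc.update({f"filler{i}": 1 for i in range(n_filler)})
    return doc


# Hypothetical request with three equally weighted concepts.
request = {"a": 1, "b": 1, "c": 1}

# Hypothetical non-relevant document: two matching concepts whose weights
# do not improve when the abstract replaces the title.
nonrel_title = with_filler({"a": 1, "b": 1}, 1)
nonrel_abstract = with_filler({"a": 1, "b": 1}, 30)

# Hypothetical relevant document: the abstract raises two matching weights
# and adds a third match, alongside the same amount of non-matching material.
rel_title = with_filler({"a": 1, "b": 1}, 3)
rel_abstract = with_filler({"a": 2, "b": 2, "c": 1}, 30)

scores = {
    "nonrel_title": cosine(request, nonrel_title),
    "rel_title": cosine(request, rel_title),
    "nonrel_abstract": cosine(request, nonrel_abstract),
    "rel_abstract": cosine(request, rel_abstract),
}

for name, value in scores.items():
    print(f"{name:16s} {value:.4f}")
```

With these assumed vectors the non-relevant document outranks the relevant
one on titles, and the order is reversed on abstracts, although both
correlations fall. Different weights could just as easily produce the
opposite outcome, which is the point made in the text.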
It seems logical to postulate that, for a given set of search requests,
relevance judgments, and document collection, there must exist an optimum
document length that gives the best retrieval performance. However, in general,
this is too simple a statement, and does not allow for the fact that
performance requirements in terms of either high recall or high precision may
demand different document lengths under different circumstances for optimum