Scientific Report No. IRS-13, Information Storage and Retrieval
Document Length chapter, E. M. Keen
Harvard University, Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

It is important to realize that this discussion of the effect of changes in document length on the correlation coefficient applies to correlations with either relevant or non-relevant documents, and a change in correlation resulting from a change in rank gives no indication of the change in retrieval performance. Retrieval performance is always a trade-off between relevant and non-relevant documents, and increases in document length can just as easily worsen performance as improve it, since longer documents produce greater opportunity for incorrect matches with non-relevant documents.

An illustration of the effect of change in document length on a non-relevant document is given in Figure 3, where document [OCRerr]l is presented for both title and abstract searches in relation to request QA8, used earlier in Figure 2. Again the cosine correlation shows a decrease with increase in document length, but in this case, since the change from titles to abstracts does not even improve the weights in the two matching concepts, a severe drop in correlation takes place. Retrieval performance for this request is given in Figure [OCRerr], which considers only documents 61 (relevant, see Figure 2) and 41 (non-relevant). It is seen that document 41 is more highly correlated (and therefore better ranked) than document 61 with titles; the reverse result is obtained with the abstracts. In this one example, the increase in document length improves performance, but many individual cases of the reverse trend have been observed.

It seems logical to postulate that, for a given set of search requests, relevance judgments, and document collection, there must exist an optimum document length that gives the best retrieval performance.
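The interaction between document length and cosine correlation described above can be sketched numerically. The following is a minimal illustration only, assuming the standard cosine formula (dot product divided by the product of the vector lengths) and entirely hypothetical concept names and weights; the vectors are not the entries of Figures 2 and 3, and the names `cosine`, `request`, and `documents` are invented for this sketch.

```python
import math


def cosine(request, document):
    """Cosine correlation between two sparse concept-weight vectors (dicts)."""
    dot = sum(w * document.get(c, 0.0) for c, w in request.items())
    norm_r = math.sqrt(sum(w * w for w in request.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_r == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_r * norm_d)


# Hypothetical request with four equally weighted concepts.
request = {"a": 12, "b": 12, "c": 12, "d": 12}

# Hypothetical documents, each in a short (title) and long (abstract) form.
documents = {
    # Non-relevant: the two matching concepts (a, b) keep the same weights
    # in the abstract, while several non-matching concepts are added.
    "nonrel_title": {"a": 12, "b": 12, "x": 12},
    "nonrel_abstract": {"a": 12, "b": 12, "x": 24, "y": 24, "z": 12, "w": 12},
    # Relevant: the abstract strengthens the matching concepts and adds a
    # third match (c), but also adds many non-matching concepts.
    "rel_title": {"a": 12, "b": 12, "p": 12, "q": 12},
    "rel_abstract": {
        "a": 24, "b": 24, "c": 12,
        "p": 24, "q": 24, "r": 24, "s": 12, "t": 12, "u": 24, "v": 24,
    },
}

scores = {name: cosine(request, vec) for name, vec in documents.items()}

for name, score in scores.items():
    print(f"{name:16s} {score:.4f}")
```

With these invented weights both documents lose correlation when abstracts replace titles, but the non-relevant document loses far more, so the ranking of the pair reverses in favor of the relevant document. Different hypothetical weights would just as readily produce the opposite outcome.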
However, in general, this is too simple a statement, and does not allow for the fact that performance requirements in terms of either high recall or high precision may demand different document lengths under different circumstances for optimum