Scientific Report No. IRS-13 Information Storage and Retrieval
Document Length
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
and the matching algorithms are held unaltered while, say, abstracts and titles
are compared. Considering the cosine correlation coefficient for just one
document in relation to one request, it is clear that a change from titles
to abstracts will not affect Rw in the equation. Factor Dw, however, will
increase directly with document length. Factor Mw will
either increase or remain constant, depending on whether the use of the
abstract compared with title only achieves a match with more of the request
concepts, and/or increases the weights of the concepts that already match
on titles. The resulting difference in correlation coefficient between the
title and abstract input cannot be predicted: if the abstract provides
more matching concepts (Mw), and does not increase document length (Dw)
too drastically, the abstract result will give a higher correlation coefficient
than the title. If the abstract provides no additional matching
concepts or increased weights, then the correlation with abstracts will be
less than that on titles.
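
The behaviour described above can be illustrated with a small sketch. It assumes the cosine correlation takes the usual form Mw / sqrt(Rw x Dw), where Rw and Dw are the sums of squared request and document weights and Mw is the sum of weight products over the concepts the two share. The equation itself is not reproduced in this passage, so that reading is an assumption; the function name, concept labels and weights below are hypothetical.

```python
import math

def cosine_correlation(request, document):
    """Cosine between two concept->weight dicts (assumed form: Mw / sqrt(Rw * Dw))."""
    # Mw: sum of weight products over concepts present in both vectors
    mw = sum(w * document[c] for c, w in request.items() if c in document)
    # Rw and Dw: sums of squared weights for the request and the document
    rw = sum(w * w for w in request.values())
    dw = sum(w * w for w in document.values())
    if rw == 0 or dw == 0:
        return 0.0
    return mw / math.sqrt(rw * dw)

# Hypothetical request and three hypothetical versions of one document
request = {"a": 1, "b": 1, "c": 1}
title = {"a": 1, "x": 1}
abstract_no_new_match = {"a": 1, "x": 1, "y": 1, "z": 1}  # longer, same match
abstract_new_match = {"a": 1, "b": 1, "x": 1, "y": 1}     # longer, one more match

for name, doc in [("title", title),
                  ("abstract, no new match", abstract_no_new_match),
                  ("abstract, new match", abstract_new_match)]:
    print(name, round(cosine_correlation(request, doc), 4))
```

In this toy case Rw is the same for all three inputs. The longer abstract that adds no matching concept scores below the title, while the abstract that adds a second matching concept scores above it.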
An example of what happens in one particular case is given in
Figure 2. Details of the request and relevant document are given, as well
as portions of the document as looked up in a thesaurus dictionary using
first the title only, then the whole abstract, then the full text. Document
length increases sharply, from five concepts in the title to 12 in the
abstract and 109 in the full text. The match between the request and document
starts at two out of the six possible concepts with titles; the use of abstracts
increases the weight of these two matching concepts, and full text increases
the matching concepts to all six, as well as improving weights. However,
the cosine correlation coefficients show that in this example the increases
in document length exert more influence on the coefficient than the increases
in matching concepts, so that the correlation coefficient drops from 0.3651
with titles to 0.3608 with abstracts, and further still to 0.2034 with full text.
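
As a rough check on the title figure, the sketch below assumes that every concept carries the same weight within the request and within the title. That assumption is not stated in the passage. Under it, the cosine reduces to the number of matching concepts divided by the square root of the product of the two concept counts. The helper name is hypothetical. The abstract and full-text figures depend on the actual concept weights, which this passage does not give, so they are not reproduced here.

```python
import math

def equal_weight_cosine(n_matching, n_request, n_document):
    """Cosine when all concepts are equally weighted (an assumption, not the report's data)."""
    return n_matching / math.sqrt(n_request * n_document)

# Title case from the text: 2 matching concepts, 6 request concepts, 5 title concepts
title_cosine = equal_weight_cosine(n_matching=2, n_request=6, n_document=5)
print(round(title_cosine, 4))  # 0.3651
```

The result, 2 / sqrt(30), agrees with the 0.3651 quoted for titles, which suggests the title concepts were effectively equally weighted in this example.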