IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Document Length
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
It may be expected that, commencing with documents short in length,
any increase in length will increase the number of concepts that match between
the requests and documents. In the type of test environment used by SMART,
namely a simulated real-life situation using requests and relevance judgments
that are inevitably subjective in nature, it is quite rare for any short
length documents to completely match with all the request concepts. In
cases where a complete match does occur, it is naturally not necessary to
increase the document length to improve the request/document match, except
that in the numeric vectors scheme, the matching concepts are often increased
in the longer documents.
The effect of the use of the cosine correlation with numeric vectors
is complex, because this matching scheme includes the length of both the
request and document, as well as the matching concepts in the algorithm,
as follows:
Cosine Correlation Coefficient = $\dfrac{M_w}{\sqrt{R_w \times D_w}}$
where
$M_w$ = The concepts that Match between a Request and a
Document, using the sums of products of the weights
assigned to the matching concepts;
$R_w$ = The total concepts in the Request, using the sums of
the squares of the weights assigned to the concepts;
$D_w$ = The total concepts in the Document, using the sums
of the squares of the weights assigned to the concepts.
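As an illustration only, the three quantities defined above can be computed with a short Python sketch. It is not the SMART implementation: the function names, the dictionary representation of a numeric vector (concept mapped to weight), and the sample weights are all assumptions made for this example.

```python
import math


def cosine_correlation(request, document):
    """Cosine correlation of two numeric vectors.

    Both arguments are dicts mapping a concept to its weight
    (an assumed representation, chosen for this sketch).
    """
    # Mw: sum of products of the weights of the matching concepts
    mw = sum(w * document[c] for c, w in request.items() if c in document)
    # Rw: sum of squares of the request weights
    rw = sum(w * w for w in request.values())
    # Dw: sum of squares of the document weights
    dw = sum(w * w for w in document.values())
    if rw == 0 or dw == 0:
        return 0.0  # an empty vector cannot match anything
    return mw / math.sqrt(rw * dw)


def rank_documents(request, collection):
    """Return (coefficient, doc_id) pairs, highest coefficient first."""
    scored = [(cosine_correlation(request, vec), doc_id)
              for doc_id, vec in collection.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical data: one request, one short and one longer document.
request = {"retrieval": 1, "indexing": 1}
collection = {
    "short": {"retrieval": 1},
    "long": {"retrieval": 2, "indexing": 1, "thesaurus": 1, "evaluation": 1},
}

for score, doc_id in rank_documents(request, collection):
    print(f"{doc_id}: {score:.4f}")
```

In this made-up case the longer document matches both request concepts and carries a higher weight on one of them, so Mw rises from 1 to 3. Its Dw also rises, from 1 to 7, which partly offsets the gain. This is the interaction between document length and matching concepts that makes the effect of the cosine correlation complex.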
The resulting coefficient is obtained for each request in relation to every
document in the collection, so that the output of the search may be an ordered
list of documents. In tests investigating document length, all other
variables such as the request set, the document collection, the word dictionary