Scientific Report No. IRS-13 Information Storage and Retrieval
Document Length
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
compared to the manual indexing available for that collection. This comparison
is made because the indexing takes up about half the length of the abstracts,
and constitutes a valid comparison because of the unusual nature of the
indexing, which is "a base list of words, selected directly from the
title and text of a document ... presented without any reference whatsoever
to a control list for synonyms, related terms, etc." [1, page [OCRerr]l, see also
pages [OCRerr]8, 52]. The controls used in indexing permitted the confounding of
singular and plural word forms, as well as variant spellings, but the index
terms were otherwise culled from the documents in natural language. The
indexing used is then, in effect, another abstract of the documents, shorter
in length than the author abstract, and produced by trained indexers. It
is expected that the choice of subject ideas from the whole document by the
indexers will be very similar on average to the choice of ideas made by the
abstractors, although the area of overlap has not been determined.
Retrieval runs of the above comparisons are presented using the stem
and thesaurus dictionaries, and all results use the cosine correlation and
numeric vectors, unless otherwise stated.
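
As an illustration of the matching function named above, the following is a minimal sketch of a cosine correlation between a request and a document, each represented as a numeric vector of concept weights. The function name, the dictionary representation, and the sample weights are assumptions made for this example and are not taken from the report.

```python
import math


def cosine_correlation(request, document):
    """Cosine of the angle between two sparse concept-weight vectors.

    Each vector is a dict mapping a concept identifier to a numeric weight.
    This representation is an assumption chosen for illustration.
    """
    # Only concepts present in both vectors contribute to the inner product.
    shared = set(request) & set(document)
    dot = sum(request[c] * document[c] for c in shared)

    norm_request = math.sqrt(sum(w * w for w in request.values()))
    norm_document = math.sqrt(sum(w * w for w in document.values()))

    # An empty vector matches nothing.
    if norm_request == 0 or norm_document == 0:
        return 0.0
    return dot / (norm_request * norm_document)


# Hypothetical weights: one request matched against a longer and a
# shorter description of the same document.
request = {"boundary": 1, "layer": 1, "flow": 2}
abstract = {"boundary": 2, "layer": 2, "flow": 1, "heat": 1, "transfer": 1}
indexing = {"boundary": 1, "layer": 1, "flow": 1}

score_abstract = cosine_correlation(request, abstract)
score_indexing = cosine_correlation(request, indexing)
```

Because the denominator grows with every concept a document contains, a longer description that adds concepts absent from the request lowers the correlation unless it also adds matching concepts.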
The comparative lengths of the documents in these comparisons are
given in Figure 1. Although the lengths given in the figure are based on
the concepts resulting from the documents being looked up in the suffix
's' dictionary, relative lengths will remain the same using the stem and
thesaurus dictionaries.
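
The idea of measuring document length in concepts after a dictionary look-up can be sketched as follows. This is only an illustrative approximation: the rule of dropping a single final "s", the tokenization, and the choice to count distinct concepts are all assumptions for this example, not the dictionary procedure actually used for the figures in the report.

```python
import re


def to_concept(word):
    """Map a word to a concept by merging singular and plural forms.

    Hypothetical rule: lower-case the word and drop one final "s",
    leaving short words and words ending in "ss" unchanged.
    """
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def concept_vector(text):
    """Count how often each concept occurs in a piece of text."""
    counts = {}
    for word in re.findall(r"[A-Za-z]+", text):
        concept = to_concept(word)
        counts[concept] = counts.get(concept, 0) + 1
    return counts


def document_length(text):
    """Length of a document, taken here as its number of distinct concepts."""
    return len(concept_vector(text))
```

Under these assumptions, an abstract and a shorter set of index terms for the same document can be compared by calling `document_length` on each text.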
3. Effect of Changes in Document Length
In this part, the effect of changes in document length on the match
between requests and documents is considered, followed by the expected
differences in retrieval performance.