Scientific Report No. IRS-13 Information Storage and Retrieval
Document Length
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
V. Document Length
E. M. Keen
1. Introduction
A major advantage in the design of automatic document retrieval systems
is the ability to add new documents to the collection without the necessity
for an individual manual content analysis. This is done by using the natural
language text of the documents as input, together with automatic analysis
procedures based on pre-stored dictionaries to achieve vocabulary
normalization. Such an automatic procedure is not necessarily straightforward,
however, and various possible alternatives must be considered. This study
deals with the influence of document length as used in a SMART-type system.
One of the main elements of a manual document analysis or indexing
procedure that has been in use for many years is the process of term
selection, whereby the indexer makes a choice of subject ideas from the
document being indexed. This selection process always requires a difficult
management decision because some of the users will benefit from highly exhaustive
indexing (the selection of many subject ideas); on the other hand, factors
such as cost and search time often limit the indexing process to one of
low exhaustivity. As a first approximation, an automatic method using natural
language text provides the answer to this problem, since the whole document
text can now be used without any pre-selection activity at all. Although
the use of full text is possible in theory, various practical limitations
must be taken into account. For example, there exists the input problem,