Scientific Report No. IRS-13
Information Storage and Retrieval
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

V. Document Length

E. M. Keen

1. Introduction

A major advantage in the design of automatic document retrieval systems is the ability to add new documents to the collection without the necessity for an individual manual content analysis. This is done by using the natural language text of the documents as input, together with automatic analysis procedures based on pre-stored dictionaries to achieve vocabulary normalization. Such an automatic procedure is not necessarily straightforward, however, and various possible alternatives must be considered. This study will deal with the influence of document length as used in a SMART-type system.

One of the main elements of a manual document analysis or indexing procedure that has been in use for many years is the process of term selection, whereby the indexer makes a choice of subject ideas from the document being indexed. This selection process always requires a difficult management decision because some of the users will benefit from highly exhaustive indexing (the selection of many subject ideas); on the other hand, factors such as cost and search time often limit the indexing process to one of low exhaustivity. As a first approximation, an automatic method using natural language text provides the answer to this problem, since the whole document text can now be used, without any pre-selection activity at all.

Although use of full text is possible in theory, in practice various limitations must be taken into account. For example, there exists the input problem,