Scientific Report No. IRS-13
Information Storage and Retrieval
Document Length
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

namely the effort and cost associated with the transformation of whole document texts into machine readable form. Several possible solutions are suggested to this problem, such as the development of a universal print reader, or the use of some by-product of the typesetting stage. Then there arises the problem of coding and searching documents which contain many mathematical equations, complex diagrams, or other essential non-textual material. Then again, for the user, the search response time is likely to be long when full text is stored even with small document collections, although faster search procedures may be possible in the future. Lastly, the use of full text may not serve all users well with regard to retrieval performance, since the requestor may be swamped with many documents that are strictly relevant but rather trivial in relation to the topic of the search request.

For these reasons, automated systems of the first generation will need to consider selections of the document text, rather than the whole text. Many documents contain suitable selections of text made by the author of the document, such as the title itself, or probably better still, an abstract or summary of the paper. Like the product of manual indexing, an abstract or summary of a document is a precis of the document which distills the essential subject ideas into a few hundred words.
The presence of bias or slant in both indexing and abstract preparation may not favor the use of natural language input, however, since in such a system there often exists no possibility of picking out only those topics of interest to the users of the system (as is possible in manual indexing). In addition, for some documents the natural language abstract may be a poorly written precis of the document. When an automatic system using abstracts is implemented it may be necessary to make up these deficiencies by manual effort; procedures are also