IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Document Length
chapter
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
namely the effort and cost associated with the transformation of whole docu-
ment texts into machine readable form. Several possible solutions are sug-
gested to this problem, such as the development of a universal print reader,
or the use of some by-product of the typesetting stage.
Then there arises the problem of coding and searching documents which
contain many mathematical equations, complex diagrams, or other essential
non-textual material. Then again, for the user, the search response time
is likely to be long when full text is stored even with small document col-
lections, although faster search procedures may be possible in the future.
Lastly, the use of full text may not serve all users well with regard to
retrieval performance, since the requestor may be swamped with many documents
that are strictly relevant but rather trivial in relation to the topic of
the search request.
For these reasons, automated systems of the first generation will
need to consider selections of the document text, rather than the whole text.
Many documents contain suitable selections of text made by the author of the
document, such as the title itself, or probably better still, an abstract
or summary of the paper. Like the product of manual indexing, an abstract
or summary of a document is a precis of the document which distills the essen-
tial subject ideas into a few hundred words. The presence of bias or slant
in both indexing and abstract preparation may not favor the use of natural
language input, however, since in such a system there often exists no possi-
bility of picking out only those topics of interest to the users of the sys-
tem (as is possible in manual indexing). In addition, for some documents
the natural language abstract may be a poorly written precis of the docu-
ment. When an automatic system using abstracts is implemented it may be neces-
sary to make up these deficiencies by manual effort; procedures are also