ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Design Consideration for Time Shared Automatic Documentation Centers
chapter
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
group, AIP (the American Institute of Physics) exists, and it is currently
interested in documentation problems. Many other groups, such as Euratom,
NASA, and the AEC are also interested in any documentation efforts in
physics. A file of 25,000 articles a year (about 25-50 journals of 500-1000
articles per year) kept up for ten years should be very attractive to many
users. Certain difficulties would arise with physics, of course: there
exists a large technical report literature which should be included, but
which is largely unabstracted and inaccessible. Also, much strange and
inconvenient symbolism is used in writing papers. But these problems are
not insuperable, and physics could thus easily serve as the basic collection.
We may then assume that the basic collection would contain about
250,000 100-word abstracts, or a total of 2.5 x 10^7 English words. This
represents a total data input of about 10^9 bits and will require about
ten to twenty reels of magnetic tape to store. It may be expected to
contain on the order of 10^5 different English words, and the most frequently
occurring few thousand words will likely include 90% of the total number of
word occurrences.
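
The size estimates above can be checked with a short calculation. The sketch below is illustrative only: the article count, time span, and abstract length come from the text, while the figures of six characters per word (including the separating blank) and six bits per character are assumptions introduced here and are not stated in the report. The function names are likewise hypothetical.

```python
# Back-of-the-envelope sizing of the basic collection.

# Figures taken from the text.
ARTICLES_PER_YEAR = 25_000
YEARS = 10
WORDS_PER_ABSTRACT = 100

# Assumptions, NOT from the text.
CHARS_PER_WORD = 6   # average word length plus one blank
BITS_PER_CHAR = 6    # six-bit character code


def collection_size():
    """Return (abstracts, words, bits) for the basic collection."""
    abstracts = ARTICLES_PER_YEAR * YEARS
    words = abstracts * WORDS_PER_ABSTRACT
    bits = words * CHARS_PER_WORD * BITS_PER_CHAR
    return abstracts, words, bits


def bits_per_reel(total_bits, reels):
    """Tape capacity per reel implied by storing total_bits on `reels` reels."""
    return total_bits / reels


if __name__ == "__main__":
    abstracts, words, bits = collection_size()
    print(f"abstracts: {abstracts:,}")
    print(f"words:     {words:.1e}")
    print(f"bits:      {bits:.1e}")
    # The text quotes ten to twenty reels; show what each end implies.
    for reels in (10, 20):
        print(f"{reels} reels -> {bits_per_reel(bits, reels):.1e} bits per reel")
```

Under these assumptions the collection comes to 250,000 abstracts, 2.5 x 10^7 words, and 9 x 10^8 bits, which is of the order of 10^9 bits. A different character code or average word length would shift the total proportionally.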
This fact can be used in the construction of an efficient dictionary
lookup. When the SMART programs are loaded into memory, as part of the
user sign-in procedure, the programs will be accompanied by a short
dictionary of 1000 or 2000 words. The user requests will probably be
fairly short, about 25 words. They can be looked up in the special high-
frequency list in a few milliseconds. Perhaps a few words will remain
which were not included in this special list. Based on the first few
letters of the word, a computation of its approximate position in the
backup dictionary is made, and the appropriate section of the complete