Scientific Report No. ISR-11
Information Storage and Retrieval
Design Consideration for Time Shared Automatic Documentation Centers
M. E. Lesk
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

group, AIP (the American Institute of Physics), exists, and it is currently interested in documentation problems. Many other groups, such as Euratom, NASA, and the AEC, are also interested in any documentation efforts in physics. A file of 25,000 articles a year (about 25-50 journals of 500-1000 articles per year) kept up for ten years should be very attractive to many users.

Certain difficulties would arise with physics, of course: there exists a large technical report literature which should be included, but which is largely unabstracted and inaccessible. Also, much strange and inconvenient symbolism is used in writing papers. But these problems are not insuperable, and physics could thus easily serve as the basic collection.

We may then assume that the basic collection would contain about 250,000 100-word abstracts, or a total of 2.5 x 10^7 English words. This represents a total data input of about 10^9 bits and will require about ten to twenty reels of magnetic tape to store. It may be expected to contain on the order of 10^5 different English words, and the most frequently occurring few thousand words will likely include 90% of the total number of word occurrences. This fact can be used in the construction of an efficient dictionary lookup.

When the SMART programs are loaded into memory, as part of the user sign-in procedure, the programs will be accompanied by a short dictionary of 1000 or 2000 words. The user requests will probably be fairly short, about 25 words. They can be looked up in the special high-frequency list in a few milliseconds.
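The first stage of this lookup can be illustrated with a minimal sketch. It is not the SMART implementation: the table `HIGH_FREQUENCY`, its numeric codes, and the function `split_request` are hypothetical names chosen for illustration, and a Python dictionary stands in for whatever in-memory table the programs would actually use. The sketch only shows the idea of resolving most request words against a small resident list and setting the rest aside.

```python
# Hypothetical stand-in for the short high-frequency dictionary. The report
# describes 1000 or 2000 entries; a handful are enough to show the idea.
# The integer codes are assumptions, not values taken from SMART.
HIGH_FREQUENCY = {
    "the": 1, "of": 2, "in": 3, "and": 4,
    "a": 5, "for": 6, "scattering": 7, "energy": 8,
}


def split_request(request, high_frequency=HIGH_FREQUENCY):
    """Look up each word of a short request in the resident list.

    Returns (found, remaining): `found` maps each matched word to its code,
    `remaining` lists the unmatched words in order of first appearance.
    """
    found = {}
    remaining = []
    for token in request.lower().split():
        word = token.strip(".,;:?!()\"'")  # drop surrounding punctuation
        if not word:
            continue
        if word in high_frequency:
            found[word] = high_frequency[word]
        elif word not in remaining:
            remaining.append(word)
    return found, remaining


found, remaining = split_request("The scattering of neutrons in liquid helium")
```

In this example four of the seven request words are resolved in memory, and the three left in `remaining` are the ones that would need a further lookup outside the resident list.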
Perhaps a few words will remain which were not included in this special list. Based on the first few letters of the word, a computation of its approximate position in the backup dictionary is made, and the appropriate section of the complete