Scientific Report No. ISR-11 Information Storage and Retrieval
Design Consideration for Time Shared Automatic Documentation Centers
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
3. Methods
We can draw on the experience obtained with the SMART project
to select the processing methods which should be used in the planned
system. The results of the SMART project on the relative values of
various retrieval methods are developed elsewhere [1], and only a brief
summary of some of the relevant points is given here.
For input purposes, the best compromise between economy of space
and quantity of information is probably the document abstract. Since most
scientific journals require author abstracts, it should not be difficult
to obtain a set of abstracts for the document collection being searched.
The search procedure should be based on the use of a thesaurus with
phrases. In past experiments with the SMART system, this method has
been found to offer the best performance of any method tested on most
collections. This method exhibits the additional advantages of simplicity
and flexibility. Specialized thesauri can be constructed for individual
needs. Isolated errors are easily corrected. Extensions to different
languages and adaptations to different subject areas are possible. On
the other hand, statistical procedures for automatic synonym detection
are relatively fixed procedures for which adjustments are more difficult to
make. It is not clear how such methods can be extended to different
languages. Finally, automatic synonym detection is found experimentally
to produce results inferior to those obtained by proper thesauri.
Hierarchies also produce inferior results.
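
As an illustration only, the thesaurus-with-phrases idea described above can be sketched in a few lines of Python. This is not the SMART implementation. The thesaurus entries, concept numbers, phrase dictionary, and the function name `concept_vector` are all hypothetical, and a "loose" phrase is assumed here to mean that its component concepts co-occur anywhere in the text, regardless of order or adjacency.

```python
from collections import Counter

# Hypothetical toy thesaurus: word -> concept number (all entries invented).
# Synonyms share a concept number, so they match each other in a search.
THESAURUS = {
    "retrieval": 101,
    "search": 101,
    "document": 102,
    "abstract": 103,
    "summary": 103,
    "information": 104,
}

# Hypothetical phrase dictionary: set of component concepts -> phrase concept.
PHRASES = {
    frozenset({104, 101}): 501,  # e.g. "information retrieval"
}


def concept_vector(text):
    """Return concept counts for a text, including loose phrase concepts."""
    # Crude tokenization: split on whitespace, trim punctuation, lowercase.
    words = [w.strip(".,;:()").lower() for w in text.split()]
    # Thesaurus lookup: words not in the thesaurus are simply ignored.
    concepts = Counter(THESAURUS[w] for w in words if w in THESAURUS)
    # Loose phrase lookup (assumed rule): a phrase is assigned when all of
    # its component concepts occur somewhere in the text.
    present = set(concepts)
    for components, phrase_concept in PHRASES.items():
        if components <= present:
            concepts[phrase_concept] += 1
    return concepts


if __name__ == "__main__":
    print(concept_vector("Search of information in a document abstract."))
```

Because the thesaurus and the phrase dictionary are plain tables, correcting an isolated error or adapting to another subject area amounts to editing entries, which is the flexibility argued for above.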
Based on past SMART experience, we accept as our basic content analysis
procedure a thesaurus lookup, and a loose phrase lookup of the type studied
there. The entire document collection is passed through this lookup at