ISR10
Scientific Report No. ISR-10 Information Storage and Retrieval
Appendix A: The Smart System
Joseph John Rocchio
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
APPENDIX A
SMART SYSTEM
The SMART automatic document retrieval system currently
running on the IBM 7094 digital computer at Harvard University was
used both as a simulation environment and as a data base generator for the
experimental results presented in this thesis. As the SMART system
has been thoroughly documented (references i-[OCRerr]), only a brief summary
of its main features is outlined here.
A. Content Analysis Techniques
The indexing function of the SMART system is capable of
incorporating a number of automatic content analysis techniques. Docu-
ments are entered into the system in the natural language (with a
minimal number of keypunching conventions) and passed through a
dictionary lookup phase. The lookup operates with a stem-suffix
splitting algorithm (which incorporates spelling rules), and word
stems are matched against entries of a stored dictionary. A variety
of dictionaries may be used in the system, ranging from a simple one-to-one
encoding (keyword dictionary) to a dictionary which produces a
many-to-many thesaurus-type mapping. In addition to providing a
semantic encoding for the detected stems, the lookup process has
provisions for providing syntactic stem codes based on both the stem
and suffix dictionaries. After the initial lookup phase, a coded