Scientific Report No. ISR-11
Information Storage and Retrieval

Operating Instructions for the SMART Text Processing and Document Retrieval System

M. E. Lesk
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

It is assumed that the reader has a general understanding of the purpose of the SMART system. Reference to [2] or [5] provides this background. The main purpose of the present section is to replace the obsolete operating instructions in [6].

1.1. Processing Summary

SMART accepts as input natural language English text in a form close to normal typing. Input is basically of two forms: requests, or queries for information; and documents, or individual units of a collection of English articles, or abstracts, which are compared with the requests.

The search procedure makes use of a flexible data representation system in which each document is analyzed into a "concept vector". The concept vector consists of a list of "concepts" (each with a weight) that are associated with the document. If each possible concept is imagined as a dimension, or a direction in a very-high dimensional space, the "concept vector" representation of a document is equivalent to an n-dimensional vector; a concept vector is thus readily compared with other, similar representations by simple correlation procedures.

The meaning of a "concept" can differ greatly from run to run. A concept may be simply an English word; or it may be a set of English words; or it may be a phrase or set of phrases; or it may be a node label in a hierarchical classification system; or any combination of the above. Provision is also made for future inclusion of other types of concepts; e.g., concepts derived from an author's name or from a journal citation.
In short, a concept may represent virtually any feature of the text of a document which reflects the document content.
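
As an illustration of the concept vector idea described above, the following minimal sketch stores each vector as a mapping from concept to weight and compares a request with two documents. It is not the SMART implementation: the text here does not say which correlation procedure is used, so the cosine of the angle between the vectors is assumed, and the function name, concept labels, and weights are all hypothetical.

```python
import math


def cosine_correlation(a, b):
    """Cosine correlation of two sparse concept vectors.

    Each vector is a dict mapping a concept label to its weight.
    The cosine measure is an assumption made for this sketch.
    """
    # Only concepts present in both vectors contribute to the dot product.
    shared = set(a) & set(b)
    dot = sum(a[c] * b[c] for c in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector correlates with nothing
    return dot / (norm_a * norm_b)


# Hypothetical request and documents; labels and weights are made up.
request = {"retrieval": 2.0, "document": 1.0}
documents = {
    "doc_a": {"retrieval": 1.0, "document": 1.0, "vector": 1.0},
    "doc_b": {"hierarchy": 3.0, "phrase": 1.0},
}

# Rank documents by decreasing correlation with the request.
ranked = sorted(
    ((name, cosine_correlation(request, vec)) for name, vec in documents.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name}: {score:.4f}")
```

Because a concept is only a label in this representation, the same comparison applies whether the labels stand for words, phrases, or nodes in a hierarchy.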