Scientific Report No. ISR-11 Information Storage and Retrieval
Operating Instructions for the SMART Text Processing and Document Retrieval System
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
It is assumed that the reader has a general understanding of the
purpose of the SMART system. References [2] and [5] provide this
background. The main purpose of the present section is to replace the
obsolete operating instructions in [6].
1.1. Processing Summary
SMART accepts as input natural language English text in a form
close to normal typing. Input is basically of two forms: requests, or
queries for information; and documents, or individual units of a
collection of English articles or abstracts, which are compared with the
requests.
The search procedure makes use of a flexible data representation
system in which each document is analyzed into a "concept vector". The
concept vector consists of a list of "concepts" (each with a weight) that
are associated with the document. If each possible concept is imagined as
a dimension, or a direction in a very-high dimensional space, the "concept
vector" representation of a document is now equivalent to an n-dimensional
vector; a concept vector is thus readily compared with other, similar
representations by simple correlation procedures.
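
As an illustration only, the comparison of two concept vectors might be
sketched as follows. The text does not say which correlation is used; the
cosine measure, the concept numbers, and the function name below are
assumptions chosen for the example, not a description of the SMART code.

```python
import math

def cosine_correlation(a, b):
    """Cosine of the angle between two sparse concept vectors.

    Each vector is a dict mapping a concept identifier to its weight;
    concepts absent from a dict are treated as having weight zero.
    """
    dot = sum(w * b[c] for c, w in a.items() if c in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical concept numbers and weights, for illustration only.
document = {101: 2.0, 205: 1.0, 317: 1.0}
request = {101: 1.0, 317: 1.0}

score = cosine_correlation(request, document)
print(round(score, 4))
```

Storing only the concepts actually present keeps each vector short even
when the number of possible concepts (the dimension n) is very large.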
The meaning of a "concept" can differ greatly from run to run. A
concept may be simply an English word; or it may be a set of English words;
or it may be a phrase or set of phrases; or it may be a node label in a
hierarchical classification system; or any combination of the above.
Provision is also made for future inclusion of other types of concepts;
e.g., concepts derived from an author's name or from a journal citation.
In short, a concept may represent virtually any feature of the text of a
document which reflects the document content.
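
The case in which a concept stands for a set of English words can be
pictured with a small sketch. The word list, the concept labels, and the
use of occurrence counts as weights are all hypothetical; they are not
taken from any SMART dictionary or weighting scheme.

```python
from collections import Counter

# Hypothetical table: several English words may share one concept label.
WORD_TO_CONCEPT = {
    "retrieval": "C1",
    "search": "C1",
    "document": "C2",
    "abstract": "C2",
    "vector": "C3",
}

def concept_vector(text):
    """Map text to {concept: weight}, using occurrence counts as weights."""
    words = text.lower().split()
    concepts = [WORD_TO_CONCEPT[w] for w in words if w in WORD_TO_CONCEPT]
    return dict(Counter(concepts))

vec = concept_vector("Document retrieval and search of each abstract")
print(vec)
```

Words missing from the table are simply ignored here; a different run
could substitute another table (phrases, hierarchy nodes) without
changing the vector form.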