Scientific Report No. ISR-11 Information Storage and Retrieval
Operating Instructions for the SMART Text Processing and Document Retrieval System
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
The analysis of English language text for the detection of concepts
is performed without human intervention. Text is analyzed by a set of
programs which operate in conjunction with two types of inputs:
a) A set of dictionaries, grammars, and hierarchies specifying the
relations between properties of the input English text and of
the concepts. The simplest possible dictionary, for example,
associates with each English word a distinct concept. A more
sophisticated dictionary, consisting in fact of a word thesaurus,
will define a many-to-many mapping of English word stems into
the concept vector space in which some words are isolated as
common words, some are considered ambiguous and resolved into
their various possible meanings, and some are combined with
synonymous terms.
b) A set of specifications which describe in what way the various
content analysis programs are to be applied and which dictionaries
are to be used. Specifications are used by the program to decide
what documents are to be processed, what dictionaries should be
used, what algorithms for associating concepts with documents
should be employed, how much weight each contribution to the
document representation should be given, what procedures should
be utilized to compare requests with documents, and what output
is desired from the run.
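
The thesaurus lookup described in item a) can be illustrated with a minimal
sketch. It is not the SMART implementation: the stems, the concept numbers,
the name concept_vector, and the rule of splitting one unit of weight evenly
among the meanings of an ambiguous stem are all assumptions made for this
example. Stems missing from the dictionary are simply ignored here.

```python
from collections import Counter

# Hypothetical thesaurus: word stem -> list of concept numbers.
# An empty list marks a common word (contributes nothing), several
# numbers mark an ambiguous stem, and stems that share a number are
# treated as synonyms.
THESAURUS = {
    "the": [],
    "of": [],
    "comput": [101],
    "calcul": [101],      # synonym of "comput" in this toy dictionary
    "retriev": [205],
    "base": [310, 412],   # ambiguous: two possible meanings
}

def concept_vector(stems, thesaurus):
    """Map a list of word stems to a concept vector (concept -> weight)."""
    vector = Counter()
    for stem in stems:
        concepts = thesaurus.get(stem, [])  # unknown stems are ignored (assumption)
        for concept in concepts:
            # Assumed weighting: one unit per stem, split evenly
            # across its possible meanings.
            vector[concept] += 1.0 / len(concepts)
    return vector

doc = ["the", "comput", "of", "calcul", "base"]
vec = concept_vector(doc, THESAURUS)
print(dict(vec))
```

In this sketch the two synonymous stems reinforce a single concept, the
common words drop out, and the ambiguous stem contributes a reduced weight
to each of its meanings. The weights themselves would, in practice, be
governed by the specifications of item b).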
Once the text analysis has been performed, and concept vectors are
available for all documents and all requests that have entered into the
system, the programs proceed by comparing the requests with the documents
to determine which documents are to be identified as "answers" to the
requests. This list may then be compared with a "correct" list, obtained
as a result of a manual operation and supplied by the programmer, and
various evaluation measures may be computed automatically.
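
The matching and evaluation steps may be sketched as follows. The passage
above does not name the comparison procedure or the evaluation measures, so
the cosine correlation, the fixed cutoff, and the recall and precision
figures used here are assumptions chosen for illustration, as are the
function names and the sample vectors.

```python
import math

def cosine(a, b):
    """Cosine correlation between two sparse concept vectors (dicts)."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(request, documents, cutoff):
    """Return up to `cutoff` document ids, best match first."""
    scored = [(doc_id, cosine(request, vec)) for doc_id, vec in documents.items()]
    # Sort by descending score; ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    # Documents with no concept in common with the request are dropped.
    return [doc_id for doc_id, score in scored[:cutoff] if score > 0.0]

def recall_precision(retrieved, relevant):
    """Compare the retrieved list with a manually supplied 'correct' list."""
    hits = len(set(retrieved) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical concept vectors: concept number -> weight.
documents = {
    "D1": {101: 2.0, 205: 1.0},
    "D2": {310: 1.0},
    "D3": {101: 1.0, 412: 1.0},
}
request = {101: 1.0, 205: 1.0}

answers = retrieve(request, documents, cutoff=2)
recall, precision = recall_precision(answers, relevant={"D1", "D2"})
print(answers, recall, precision)
```

Here the "correct" list plays the role of the manually prepared relevance
judgments: recall is the fraction of the correct documents that were
retrieved, and precision is the fraction of the retrieved documents that
are correct.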