ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Operating Instructions for the SMART Text Processing and Document Retrieval System
chapter
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
An additional phrase searching method can be envisioned in which
statistical properties of words are analyzed to determine which words
appear to occur as phrases. SMART has in the past contained such
procedures, but because the algorithms tried gave poor results, they
have not been implemented in the present version. New algorithms are
under consideration, however, and specification CLUS[OCRerr] has been
maintained in the supervisor, indicating a call to a lookup routine
using statistically detected phrases. This specification is currently
inoperative.
3.3. Vector Expansions by Means of Concept-Concept Correlation
Concept vectors obtained from the words and phrases in one document
may be expanded based on statistical data obtained from an entire
document collection. In this way, local variations in individual
document vocabularies can be corrected. Procedures for these
statistical expansions are available in SMART in the form of
concept-concept correlation. This option involves the formation of a
complete concept-document occurrence matrix from the concept vectors of
all the text submitted in the run. Different concepts are then
correlated to find out which concepts appear to exhibit a similar
occurrence pattern in the documents. This comparison is made on the
basis of the rows of the transposed concept-document matrix, in which
each concept is represented by a row of elements consisting of the
numeric occurrences of the concept in the successive documents.
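
The procedure described above can be illustrated with a minimal sketch.
It is not SMART code: the function names are hypothetical, concept
vectors are assumed to be lists of concept numbers with repeats, and the
cosine measure is used only as a stand-in, since the report's own
definition of the correlation coefficient is not reproduced at this
point. The threshold-based expansion rule is likewise an assumption made
for illustration.

```python
from collections import Counter
from math import sqrt


def concept_document_matrix(doc_vectors):
    """Return {concept: row}, where row[j] is the number of occurrences
    of the concept in document j (one row per concept)."""
    concepts = sorted({c for vec in doc_vectors for c in vec})
    counts = [Counter(vec) for vec in doc_vectors]
    return {c: [cnt[c] for cnt in counts] for c in concepts}


def cosine(row_p, row_q):
    """Cosine similarity of two concept rows (assumed measure)."""
    dot = sum(a * b for a, b in zip(row_p, row_q))
    norm = sqrt(sum(a * a for a in row_p)) * sqrt(sum(b * b for b in row_q))
    return dot / norm if norm else 0.0


def expand(vector, matrix, threshold):
    """Add each concept whose row correlates at or above `threshold`
    with the row of some concept already present in `vector`."""
    present = set(vector)
    added = set()
    for p in present:
        row_p = matrix.get(p)
        if row_p is None:  # concept never seen in the collection
            continue
        for q, row_q in matrix.items():
            if q not in present and cosine(row_p, row_q) >= threshold:
                added.add(q)
    return sorted(present | added)


# Four toy documents; the concept numbers are invented for the example.
docs = [[101, 101, 205], [101, 205, 205], [309], [101, 205, 309]]
matrix = concept_document_matrix(docs)
expanded = expand([101], matrix, threshold=0.8)
```

In this toy collection, concepts 101 and 205 occur in the same
documents with similar frequencies, so a vector containing only 101 is
expanded to include 205, while 309 stays out at this threshold.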
The correlation algorithms are as follows. Let the concept-document
matrix be called $A$, so that $A_{ij}$ represents the number of
occurrences of concept $i$ in document $j$. Then, the correlation
coefficient $r_{pq}$ is a