Scientific Report No. ISR-11
Information Storage and Retrieval

Operating Instructions for the SMART Text Processing and Document Retrieval System

M. E. Lesk
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

An additional phrase searching method can be envisioned in which statistical properties of words are analyzed to determine which words appear to be occurring as phrases. SMART has in the past contained such procedures, but due to the poor results obtained with the algorithms tried, these have not been implemented in the present version. New algorithms are under consideration, however, and specification CLUS[OCRerr] has been maintained in the supervisor, indicating a call to a lookup routine using statistically detected phrases. However, this specification is currently inoperative.

3.3. Vector Expansions by Means of Concept-Concept Correlation

Concept vectors obtained from the words and phrases in one document may be expanded based on statistical data obtained from an entire document collection. In this way, local variations in individual document vocabularies can be corrected. Procedures for these statistical expansions are available in SMART in the form of concept-concept correlation.

This option involves the formation of a complete concept-document occurrence matrix from the concept vectors of all the text submitted in the run. Different concepts are now correlated, to find out which concepts appear to exhibit a similar occurrence pattern in the documents. This comparison is made on the basis of the rows of the transposed concept-document matrix, in which each concept is represented by a row of elements consisting of the numeric occurrences of the concept in the successive documents.

The correlation algorithms are as follows. Let the concept-document matrix be called $A$, so that $A_{ij}$ represents the number of occurrences of concept $i$ in document $j$. Then, the correlation coefficient $r_{pq}$ is a