ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
chapter
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Clearly, the operation which consists in using the sequence numbers
obtained from a null thesaurus for purposes of document and request
identification leads effectively to a word matching technique for document
retrieval, since sequence numbers and text words are in effect isomorphic.
The main virtues of the null thesaurus per se result from the fact that
the dictionary look-up routine programmed for the regular thesaurus will
serve also for the null thesaurus (because the structure of the two
thesauruses is the same), and that the null thesaurus permits the word
matching operation to be confined to only those words actually included
in the thesaurus (since the others will not have an assigned sequence number).
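The equivalence between sequence numbers and text words can be illustrated with a minimal sketch. It is not the SMART look-up routine; the function names (`build_null_thesaurus`, `look_up`, `match`), the sample words, and the whitespace tokenization are all assumptions made for illustration. Each entry receives its own sequence number, and words without an entry are simply dropped from the match.

```python
def build_null_thesaurus(words):
    """Assign each distinct word its own sequence number (1, 2, 3, ...)."""
    thesaurus = {}
    for word in words:
        key = word.lower()
        if key not in thesaurus:
            thesaurus[key] = len(thesaurus) + 1
    return thesaurus


def look_up(text, thesaurus):
    """Replace words by sequence numbers; words without an entry are dropped."""
    return {thesaurus[w] for w in text.lower().split() if w in thesaurus}


def match(document, request, thesaurus):
    """Count the sequence numbers shared by a document and a request."""
    return len(look_up(document, thesaurus) & look_up(request, thesaurus))


# Hypothetical entries and texts, chosen only to show the mechanism.
thesaurus = build_null_thesaurus(["retrieval", "document", "thesaurus", "dictionary"])
doc = "the dictionary supports document retrieval"
req = "retrieval of a document"
print(match(doc, req, thesaurus))  # 2: "retrieval" and "document" are shared
```

Because every sequence number stands for exactly one word, comparing sequence numbers gives the same result as comparing the words themselves, restricted to the words that have an entry.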
This raises a question about the type of null thesaurus which should
be used as a standard for the word matching operations. The following
alternatives appear of principal importance in this connection:
1) the null thesaurus can include complete English words, or can
alternatively be made up from word stems, obtained from the
original words by a suffix cut-off;
2) an entry can be included in the null thesaurus for each text word
included in a certain document collection, or expected to be
important in a given topic area; or, alternatively, function words
and other words not easily used for content identification may be
excluded, or marked with a special identifying code;
3) all non-common words or word stems may be used, or only those
words which have certain predetermined frequency characteristics
(for example, words occurring more than 5 times but less than 100
times in a given document collection).
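The three alternatives above can be read as switches on a single entry-selection step, sketched below under stated assumptions. The suffix list, the exclusion list, the function names (`stem`, `select_entries`) and the parameter names are hypothetical; in particular, the crude suffix cut-off shown here is a stand-in and not the SMART stemming procedure.

```python
from collections import Counter

# Hypothetical lists for illustration only; not taken from the SMART system.
SUFFIXES = ("ing", "ed", "es", "s")
COMMON_WORDS = {"the", "of", "a", "and", "in", "is"}


def stem(word):
    """Crude suffix cut-off: strip the first matching suffix, keep >= 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def select_entries(texts, use_stems=True, exclude_common=True,
                   more_than=None, less_than=None):
    """Build a null thesaurus (term -> sequence number) from a collection."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            if exclude_common and word in COMMON_WORDS:
                continue  # alternative 2: drop function words
            counts[stem(word) if use_stems else word] += 1  # alternative 1
    kept = sorted(
        term for term, n in counts.items()
        # alternative 3: keep only terms inside the frequency band
        if (more_than is None or n > more_than)
        and (less_than is None or n < less_than)
    )
    return {term: number for number, term in enumerate(kept, start=1)}


texts = ["the indexing of documents", "indexed documents and the document index"]
print(select_entries(texts))                    # {'document': 1, 'index': 2}
print(len(select_entries(texts, use_stems=False)))  # 5 distinct full words
```

Calling `select_entries(texts, more_than=5, less_than=100)` would apply the frequency band quoted in the example of alternative 3; on a collection this small it leaves no entries.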
In the SMART system, all dictionaries (including regular and null
thesauruses) are based on word stems rather than original words;
furthermore, common words appear on an exclusion list, and are thus not