Scientific Report No. ISR-11, Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton, M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

Clearly, the operation which consists in using the sequence numbers obtained from a null thesaurus for purposes of document and request identification leads effectively to a word matching technique for document retrieval, since sequence numbers and text words are in effect isomorphic. The main virtues of the null thesaurus per se result from the fact that the dictionary look-up routine programmed for the regular thesaurus will serve also for the null thesaurus (because the structure of the two thesauruses is the same), and that the null thesaurus permits the word matching operation to be confined to only those words actually included in the thesaurus (since the others will not have an assigned sequence number). This raises a question about the type of null thesaurus which should be used as a standard for the word matching operations.
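The behavior described above can be illustrated with a minimal sketch. It assumes a null thesaurus is simply a table giving each entry its own sequence number, and it scores a request against a document by counting shared sequence numbers. The function names, the whitespace tokenization, and the overlap score are hypothetical choices made for illustration; they are not taken from the SMART programs.

```python
from collections import Counter

def build_null_thesaurus(vocabulary):
    """Give each distinct entry its own sequence number (1-based).

    Hypothetical helper: in a null thesaurus no two entries share a
    number, so numbers and words correspond one to one.
    """
    thesaurus = {}
    for word in vocabulary:
        if word not in thesaurus:
            thesaurus[word] = len(thesaurus) + 1
    return thesaurus

def look_up(text, thesaurus):
    """Replace each text word by its sequence number.

    Words without an entry have no sequence number and are skipped,
    which confines matching to words included in the thesaurus.
    """
    words = text.lower().split()  # assumed tokenization: split on whitespace
    return Counter(thesaurus[w] for w in words if w in thesaurus)

def match_score(request, document, thesaurus):
    """Count the sequence numbers shared by request and document."""
    shared = look_up(request, thesaurus) & look_up(document, thesaurus)
    return sum(shared.values())

thesaurus = build_null_thesaurus(["retrieval", "document", "thesaurus"])
score = match_score("document retrieval",
                    "retrieval of a document from the document file",
                    thesaurus)
```

Because every entry has a distinct number, matching on numbers gives the same result as matching on the words themselves, and a word such as "file", which has no entry here, cannot contribute to the score.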
The following alternatives appear of principal importance in this connection:
1) the null thesaurus can include complete English words, or can alternatively be made up from word stems, obtained from the original words by a suffix cut-off;
2) an entry can be included in the null thesaurus for each text word included in a certain document collection, or expected to be important in a given topic area; or, alternatively, function words and other words not easily used for content identification may be excluded, or marked with a special identifying code;
3) all non-common words or word stems may be used, or only those words which have certain predetermined frequency characteristics (for example, words occurring more than 5 times but less than 100 times in a given document collection).
In the SMART system, all dictionaries (including regular and null thesauruses) are based on word stems rather than original words; furthermore, common words appear on an exclusion list, and are thus not