Scientific Report No. ISR-11, Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton and M. E. Lesk, Harvard University
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

... automatic thesaurus construction using aids in the form of frequency lists and word concordances are also described.

B) The Null Thesaurus and Suffix List

One of the earliest ideas in automatic information retrieval was the suggested use of words contained in documents and search requests for purposes of content identification. No elaborate content analysis is then required, and the similarity between different items can be measured simply by the amount of overlap between the respective vocabularies. While one should not expect that word matching techniques alone will normally provide adequate retrieval performance, it is nevertheless useful to consider a word matching technique as part of a retrieval system, since this provides a standard against which various types of dictionary procedures may be measured. This was one of the reasons for including the so-called null thesaurus in the SMART system.[2,3]

The null thesaurus consists simply of a list of word stems, constructed by using the words included in a typical document collection, each distinct word stem being furnished with a different sequence number. The sequence numbers in the null thesaurus are then equivalent to the concept numbers included in the regular thesaurus, with the exception that each sequence number has only a single correspondent (a word or word stem) in the null thesaurus, compared with the possible multiple correspondences in the regular thesaurus. A typical sample from a null thesaurus is shown in Fig. 3, where the word stems are listed in order of increasing frequency of occurrence within a document collection, rather than in the usual alphabetic order.
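The construction described above can be illustrated with a minimal Python sketch. It is not the SMART implementation. The suffix list, the crude stemming rule, the tokenizer, and all function names (`stem`, `tokenize`, `build_null_thesaurus`, `concept_set`, `overlap`) are assumptions made for illustration. The sketch assigns one sequence number to each distinct stem, numbering the stems in order of increasing collection frequency, and measures the similarity of two texts by counting the sequence numbers they share.

```python
from collections import Counter

# Hypothetical suffix list, for illustration only; not the SMART suffix list.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Split on whitespace and drop surrounding punctuation."""
    tokens = (t.strip(".,;:()[]\"'") for t in text.split())
    return [t for t in tokens if t]


def build_null_thesaurus(documents):
    """Map each distinct stem to its own sequence number.

    Stems are numbered in order of increasing collection frequency;
    ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(stem(t) for doc in documents for t in tokenize(doc))
    ordered = sorted(counts, key=lambda s: (counts[s], s))
    return {s: number for number, s in enumerate(ordered, start=1)}


def concept_set(text, thesaurus):
    """Sequence numbers of the stems in `text` that the thesaurus contains."""
    stems = (stem(t) for t in tokenize(text))
    return {thesaurus[s] for s in stems if s in thesaurus}


def overlap(text_a, text_b, thesaurus):
    """Number of sequence numbers shared by the two texts."""
    return len(concept_set(text_a, thesaurus) & concept_set(text_b, thesaurus))


# Toy collection, invented for this example.
docs = [
    "Retrieval of stored documents",
    "Storing documents for retrieval",
    "Dictionary construction",
]
null_thesaurus = build_null_thesaurus(docs)

for word_stem, number in sorted(null_thesaurus.items(), key=lambda kv: kv[1]):
    print(number, word_stem)
print(overlap(docs[0], docs[1], null_thesaurus))
```

Because every sequence number has exactly one correspondent, the overlap count here is plain word-stem matching. It serves as the baseline against which a regular thesaurus, where several stems may share one concept number, can be compared.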