ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
chapter
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
automatic thesaurus construction using aids in the form of frequency
lists and word concordances are also described.
B) The Null Thesaurus and Suffix List
One of the earliest ideas in automatic information retrieval was
the suggested use of words contained in documents and search requests for
purposes of content identification. No elaborate content analysis is
then required, and the similarity between different items can be measured
simply by the amount of overlap between the respective vocabularies. While
one should not expect that word matching techniques alone will normally
provide adequate retrieval performance, it is nevertheless useful to
consider a word matching technique as part of a retrieval system, since
this provides a standard against which various types of dictionary procedures
may be measured. This was one of the reasons for including in the SMART
system the so-called null thesaurus.[2,3]
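As a rough illustration of the word matching idea, the sketch below counts the words that a document and a search request have in common. It is a minimal sketch under stated assumptions and not the SMART implementation: the names vocabulary and overlap are hypothetical, and splitting on whitespace stands in for whatever word isolation a real system would use.

```python
def vocabulary(text):
    """Return the set of distinct lowercase words in a text.

    Assumption: words are separated by whitespace; a real system would
    also handle punctuation and common function words.
    """
    return set(text.lower().split())


def overlap(document, request):
    """Measure similarity as the number of words shared by both vocabularies."""
    return len(vocabulary(document) & vocabulary(request))


if __name__ == "__main__":
    doc = "automatic information retrieval"
    req = "information retrieval systems"
    print(overlap(doc, req))
```

A raw overlap count favors long documents; it is used here only because the text describes similarity as the amount of vocabulary overlap, without specifying any normalization.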
The null thesaurus consists simply of a list of word stems,
constructed by using the words included in a typical document collection,
each distinct word stem being furnished with a different sequence number.
The sequence numbers in the null thesaurus are then equivalent to the
concept numbers included in the regular thesaurus, with the exception
that each sequence number, of course, has only a single correspondent
(a word or word stem) in the null thesaurus, compared to the possibly
multiple correspondents in the regular thesaurus. A typical sample
from a null thesaurus is shown in Fig. 3, where the word stems are
listed in the order of increasing frequency of occurrence within a
document collection, rather than in the usual alphabetic order.
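The construction just described can be sketched as follows: every distinct word stem in a collection receives its own sequence number, and the stems can then be listed in order of increasing frequency. This is a minimal sketch under stated assumptions. The three-entry suffix list, the minimum stem length, and the function names stem, build_null_thesaurus, and by_increasing_frequency are hypothetical and chosen for illustration only; they are not the SMART suffix list or its stemming rules.

```python
from collections import Counter

# Hypothetical suffix list, for illustration only (not the SMART suffix list).
SUFFIXES = ("ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_null_thesaurus(documents):
    """Assign one sequence number to each distinct stem in the collection.

    Returns (numbers, counts): numbers maps stem -> sequence number in
    order of first appearance; counts maps stem -> number of occurrences.
    """
    numbers = {}
    counts = Counter()
    for text in documents:
        for word in text.lower().split():
            s = stem(word)
            counts[s] += 1
            if s not in numbers:
                numbers[s] = len(numbers) + 1
    return numbers, counts


def by_increasing_frequency(numbers, counts):
    """List stems from least to most frequent, ties broken alphabetically."""
    return sorted(numbers, key=lambda s: (counts[s], s))
```

Because each sequence number has exactly one correspondent, the mapping is a plain one-to-one dictionary; a regular thesaurus would instead map several words or stems to the same concept number.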