Scientific Report No. ISR-11, Information Storage and Retrieval. Information Analysis and Dictionary Construction (chapter). G. Salton, M. E. Lesk. Harvard University. Gerard Salton. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

included in the regular thesaurus cannot be generated by the null process. Eventually, as more documents are retrieved, the performance of the null thesaurus, which offers no synonym detection at all, becomes less attractive. The Harris 3 dictionary is competitive with the null dictionary for precision, but also maintains the recall advantage by careful isolation of high frequency words, and by the corresponding promotion of important low frequency words. As an example of the performance of synonym dictionaries, consider the search result obtained with a collection on aeronautical engineering for a request whose text reads "how does scale height vary with altitude in an atmosphere". The ranked output, in decreasing correlation order with the search request, shown in Table II indicates that more relevant documents have low ranks (and therefore high correlation with the request) for the regular thesaurus procedure than for the null thesaurus. Moreover, the regular thesaurus has succeeded in promoting a number of relevant documents, such as documents number 61[illegible], 621, 15[illegible] and 302. One of the promoted documents, number 621, is found to contain the sentence "variations in air density between day and night in the region 190 to 280 km are found to be small". This sentence contains no matching words with the request, and is therefore useless for a word matching procedure.
The regular thesaurus, however, contains both "air" and "atmosphere" in the same concept class, thus explaining in part why the rank of document 621 improves from 1[illegible]th for the null thesaurus to [illegible]th for the regular synonym dictionary. The same type of analysis reveals that the relevant document 15[illegible] contains a sentence reading "density data are given for the altitude range of 370 to [illegible]00 km", which is again used by the thesaurus since "altitude" and "height" are grouped in a common class.
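
The effect described for document 621 can be illustrated with a minimal sketch. It is not the report's implementation: the concept class labels, the stop word list, and the use of a cosine correlation over raw term counts are all assumptions made for illustration. Only the request text, the quoted sentence, and the two synonym groupings ("air"/"atmosphere" and "altitude"/"height") come from the passage above.

```python
import math
import re
from collections import Counter

# Hypothetical concept classes; only these two groupings are named in the text.
CONCEPT_CLASSES = {
    "air": "C_ATMOSPHERE",
    "atmosphere": "C_ATMOSPHERE",
    "altitude": "C_HEIGHT",
    "height": "C_HEIGHT",
}

# Assumed stop word list (common function words are ignored before matching).
STOP_WORDS = {"how", "does", "with", "in", "an", "the", "to",
              "are", "be", "and", "between"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def null_thesaurus(word):
    """Null thesaurus: every word stands for itself (plain word matching)."""
    return word


def regular_thesaurus(word):
    """Regular thesaurus: replace a word by its concept class when it has one."""
    return CONCEPT_CLASSES.get(word, word)


def concept_vector(text, thesaurus):
    """Count concept identifiers occurring in the text under a thesaurus."""
    return Counter(thesaurus(w) for w in tokenize(text))


def cosine(a, b):
    """Cosine correlation between two sparse count vectors (an assumption)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


REQUEST = "how does scale height vary with altitude in an atmosphere"
SENTENCE_621 = ("variations in air density between day and night in the "
                "region 190 to 280 km are found to be small")

score_null = cosine(concept_vector(REQUEST, null_thesaurus),
                    concept_vector(SENTENCE_621, null_thesaurus))
score_regular = cosine(concept_vector(REQUEST, regular_thesaurus),
                       concept_vector(SENTENCE_621, regular_thesaurus))

print("null thesaurus correlation:   ", score_null)
print("regular thesaurus correlation:", score_regular)
```

Under these assumptions the null thesaurus yields a correlation of zero for the sentence, because no content word is shared with the request, while the regular thesaurus yields a positive correlation through the shared "air"/"atmosphere" class. A full ranking would apply the same comparison to every document in the collection and sort by decreasing correlation.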