Scientific Report No. ISR-11, Information Storage and Retrieval
Information Analysis and Dictionary Construction
G. Salton, M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

well be eliminated in favor of a term such as "computer-control", since the former are clearly ambiguous in many contexts whereas the latter is much more specific); 3) non-significant words should be studied carefully before any are included in the list of words to be eliminated (for example, a term such as "hand" should be included in a thesaurus dealing with biology, but it should not be included if its high frequency count is due to expressions such as "on the other hand"); 4) ambiguous terms should be coded only for those senses which are likely to be present in the document collections to be treated (for example, at least two category numbers must be shown for the term "field", corresponding on the one hand to the notion of subject area, and on the other hand to its technical sense in algebra; however, no category number need be shown to cover the notion of "a patch of land" if the dictionary deals with the mathematical sciences or related technical fields); 5) each concept class should only include terms of roughly equal frequency so that the matching characteristics are approximately the same for each term within a category.

Consider as an example some of the synonym dictionaries constructed for use with the SMART retrieval system. In that system it was found useful to operate with a reasonably large number of concept classes (of the order of 700 for a given restricted subject field), and to use also a large list of non-significant words to be excluded from the content indications.
This list includes in particular verbs such as "begin", "contain", "indicate", "call", "designate", etc., which could not be depended upon to provide safe content indication. It was also found useful to isolate high-frequency terms into separate categories so that these terms would not impair the retrieval effectiveness of other more specific terms.
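
The construction principles above can be illustrated with a minimal sketch of a synonym-dictionary lookup. This is not the SMART implementation. The word list, the category numbers, and the names `NON_SIGNIFICANT`, `THESAURUS`, `tokenize`, and `concept_vector` are all assumptions introduced for illustration. The sketch shows two of the rules: non-significant words are dropped before lookup, and an ambiguous term such as "field" is coded only for the senses expected in a mathematical collection.

```python
from collections import Counter

# Hypothetical exclusion list: a tiny stand-in for the large list of
# non-significant words described in the text. "hand" is excluded here
# because, in a mathematical collection, its frequency would come mostly
# from phrases such as "on the other hand".
NON_SIGNIFICANT = {
    "begin", "contain", "indicate", "call", "designate",
    "the", "a", "of", "in", "on", "other", "hand",
}

# Hypothetical thesaurus mapping each term to its concept category numbers.
# "field" carries two categories (subject area, algebraic sense); the
# "patch of land" sense is deliberately left uncoded.
THESAURUS = {
    "field": [101, 245],
    "algebra": [245],
    "ring": [245],
    "computer-control": [310],
    "retrieval": [412],
}


def tokenize(text):
    """Lowercase the text and split it into words, keeping internal hyphens."""
    cleaned = "".join(
        ch if ch.isalnum() or ch == "-" else " " for ch in text.lower()
    )
    return cleaned.split()


def concept_vector(text):
    """Count concept categories in a text, skipping non-significant words."""
    counts = Counter()
    for word in tokenize(text):
        if word in NON_SIGNIFICANT:
            continue
        # Words absent from the thesaurus contribute nothing.
        for category in THESAURUS.get(word, []):
            counts[category] += 1
    return counts
```

Under these assumed entries, `concept_vector("On the other hand, the field of algebra may contain a ring.")` counts category 245 three times and category 101 once, while "hand" and "contain" contribute nothing. Isolating high-frequency terms would amount to giving such terms category numbers of their own rather than sharing a class with more specific terms.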