ISR11 Scientific Report No. ISR-11 Information Storage and Retrieval Information Analysis and Dictionary Construction chapter G. Salton M. E. Lesk Harvard University Gerard Salton Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

for "Morse Code" could also produce many other documents dealing with coding systems, but not with the specific system wanted.

Once the words to be included in the dictionary are chosen, the second main problem which arises is the one dealing with the type of synonym categories to be created. It is clear that if very broad and somewhat fuzzy categories are wanted, such that a given category includes both somewhat specific terms and also somewhat broader ones, then the resulting dictionary will in general interpret a question in a reasonably broad sense, and as a result the recall, that is, the proportion of relevant documents retrieved, will likely be rather high. At the same time the precision, that is, the proportion of retrieved documents which are relevant, may be low, since it must be expected that much irrelevant material will also be produced in the process. If on the other hand the categories are very specific, the chance of picking up irrelevancies is much smaller and therefore the precision is increased; the recall may suffer, however, since relevant matter is likely to be missed at the same time.

In either case, that is, whether the categories used are broad or specific, problems will arise if words with very different frequency characteristics are included in the same category. Obviously the effectiveness of the specific terms is much smaller if these terms are in fact considered equivalent to broader terms of higher frequency by the applicable thesaurus mapping. This discussion then raises the possibility of providing different thesauruses for different types of questions.
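The trade-off between broad and specific categories described above can be illustrated with a small sketch. The five documents, their index words, the two thesaurus mappings, and the relevance judgments below are all invented for illustration and do not come from the report; the matching rule (retrieve any document sharing at least one concept with the query) is likewise an assumed simplification, not the procedure used by the system under discussion.

```python
# Toy illustration only: documents, terms, category names, and relevance
# judgments are hypothetical and were made up for this sketch.

def to_concepts(words, thesaurus):
    """Replace each word by its thesaurus category; unlisted words stay as-is."""
    return {thesaurus.get(word, word) for word in words}

def retrieve(query, docs, thesaurus):
    """Return ids of documents sharing at least one concept with the query."""
    query_concepts = to_concepts(query, thesaurus)
    return {doc_id for doc_id, words in docs.items()
            if query_concepts & to_concepts(words, thesaurus)}

def recall_precision(retrieved, relevant):
    """Recall = hits / relevant documents; precision = hits / retrieved documents."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

DOCS = {
    "d1": ["morse", "signal"],         # relevant, uses the query word
    "d2": ["telegraph", "alphabet"],   # relevant, but worded differently
    "d3": ["huffman", "compression"],  # another coding system
    "d4": ["cipher", "encryption"],    # another coding system
    "d5": ["library", "catalog"],      # unrelated
}
QUERY = ["morse"]
RELEVANT = {"d1", "d2"}

# Specific category: only the exact term is grouped.
SPECIFIC = {"morse": "MORSE_CODE"}

# Broad category: specific and general coding terms are treated as equivalent.
BROAD = {term: "CODING_SYSTEMS"
         for term in ("morse", "telegraph", "huffman", "cipher")}

for name, thesaurus in (("specific", SPECIFIC), ("broad", BROAD)):
    found = retrieve(QUERY, DOCS, thesaurus)
    recall, precision = recall_precision(found, RELEVANT)
    print(f"{name}: recall={recall:.2f} precision={precision:.2f}")
```

On this toy collection the specific mapping retrieves only d1 (recall 0.50, precision 1.00), while the broad mapping retrieves d1 through d4 (recall 1.00, precision 0.50), which mirrors the behavior expected of narrow and broad synonym categories.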
Specifically, if it is expected that the user is interested in reasonably complete retrieval, including almost everything that is likely to be useful, then the thesaurus with broad categories, which provides high recall and low precision, should