ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
chapter
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
for "Morse Code" could also produce many other documents dealing with
coding systems, but not with the specific system wanted.
Once the words to be included in the dictionary are chosen, the
second main problem which arises is the one dealing with the type of
synonym categories to be created. It is clear that if very broad and
somewhat fuzzy categories are wanted, such that a given category includes
both somewhat specific terms and also somewhat broader ones, then the
resulting dictionary will in general interpret a question in a reasonably
broad sense, and as a result the recall, that is the proportion of
relevant documents retrieved, will likely be rather high. At the same
time the precision may be low, since it must be expected that much irrelevant
material will also be produced in the process. If on the other hand the
categories are very specific, the chance of picking up irrelevancies is
much smaller and therefore the precision is increased; the recall may
suffer, however, since relevant matter is likely to be missed at the same
time. In either case, that is whether the categories used are broad or
specific, problems will arise if words with very different frequency
characteristics are included in the same category. Obviously the
effectiveness of the specific terms is much smaller, if these terms are
in fact considered equivalent to broader terms of higher frequency by the
applicable thesaurus mapping.
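
The trade-off described above can be illustrated with a small sketch. It is not taken from the report: the documents, terms, category names, and relevance judgments below are all hypothetical, and retrieval is reduced to "share at least one thesaurus category with the question". Recall is computed as defined above; precision is taken in its usual sense, the proportion of retrieved documents that are relevant.

```python
# Hypothetical toy data: none of these terms, categories, or documents
# come from the report. Retrieval is simplified to category overlap.

def expand(terms, thesaurus):
    """Map each term to its thesaurus category; unknown terms map to themselves."""
    return {thesaurus.get(t, t) for t in terms}

def retrieve(query_terms, docs, thesaurus):
    """Return ids of documents sharing at least one category with the query."""
    q = expand(query_terms, thesaurus)
    return {doc_id for doc_id, terms in docs.items() if q & expand(terms, thesaurus)}

def recall_precision(retrieved, relevant):
    """Recall: relevant retrieved / all relevant. Precision: relevant retrieved / all retrieved."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

DOCS = {
    "d1": {"morse"},
    "d2": {"telegraph_code"},  # assumed relevant: same subject, different term
    "d3": {"huffman"},
    "d4": {"cipher"},
}
RELEVANT = {"d1", "d2"}
QUERY = {"morse"}

# Broad thesaurus: one fuzzy category covering specific and general terms alike.
BROAD = {"morse": "CODING", "telegraph_code": "CODING",
         "huffman": "CODING", "cipher": "CODING"}
# Specific thesaurus: each term in its own narrow category.
NARROW = {"morse": "MORSE", "telegraph_code": "TELEGRAPH",
          "huffman": "HUFFMAN", "cipher": "CIPHER"}

broad_result = recall_precision(retrieve(QUERY, DOCS, BROAD), RELEVANT)
narrow_result = recall_precision(retrieve(QUERY, DOCS, NARROW), RELEVANT)

print("broad  (recall, precision):", broad_result)   # (1.0, 0.5)
print("narrow (recall, precision):", narrow_result)  # (0.5, 1.0)
```

With these assumed data the broad categories retrieve every relevant document along with two irrelevant ones, whereas the specific categories retrieve nothing irrelevant but miss one relevant document.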
This discussion then raises the possibility of providing different
thesauruses for different types of questions. Specifically, if it is
expected that the user is interested in reasonably complete retrieval,
including most everything that is likely to be useful, then the thesaurus
with broad categories which provides high recall and low precision should