Scientific Report No. IRS-13 Information Storage and Retrieval

An Experiment in Automatic Thesaurus Construction

R. T. Dattola, D. M. Murray
Harvard University

Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

class 70, query 5 might be expected to retrieve document 46, an irrelevant document. This is exactly what happens: document 46 is ranked 23rd using the original thesaurus and 4th using THS 1.

D) Dividing Weights

A common objection to automatic thesauruses is that they contain too much overlap between concept classes. Thus, the concepts which occur in several concept classes (which are in fact the most common concepts) do not contribute much to the thesaurus, as their weights are divided by the number of concept classes in which they occur. The evaluation results using THS 1 and THS 2 indicate that manual thesauruses do not contain enough overlap, rather than that automatic thesauruses contain too much. However, an argument might be raised against dividing the weights. To settle this argument, THS 2 was also evaluated without dividing the weights during the lookup. The results were much worse: N.R. = .74, down from .83, and N.P. = .52, down from .65.

E) Cranfield Collection

Although the results from the ADI text are encouraging, the goal is to produce an automatic thesaurus starting from a word stem thesaurus rather than a regular manual thesaurus. Since the original concepts in a stem thesaurus are not themselves manually constructed concept classes, it can be expected that many more connections exist between the original concepts than in a regular thesaurus. Thus, THS 1 constructed from the Cranfield stem thesaurus contains too much overlap between concept classes. In fact, over