Scientific Report No. IRS-13 Information Storage and Retrieval
Document Length
chapter
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
(Computer Science), with the stem dictionary on the ADI Collection
(Documentation), and with both stem and thesaurus dictionaries
employing weighting and the cosine correlation on the Cran-1 Collection
(Aerodynamics). Titles perform better than abstracts on
ADI using the thesaurus, which is probably due to poor abstracts
rather than good titles. Titles also perform well on Cran-1 when
simple matching (overlap correlation) and no weights (Logical
Vectors) are used; this is due to the very good length and quality
of titling in aerodynamics.
c) The use of abstracts in the ADI collection was only slightly
inferior to full text at high precision using the stem dictionary,
and at high recall using the stem and thesaurus dictionaries. It
is suggested that the increase in recall/precision performance and
the increase in recall ceiling from 0.92 to 1.00 are unlikely to be
worth the increased input and storage costs and the extended search
time, and that the use of slightly longer abstracts would show the
full text to have no advantages at all. Further work on full text processing of a more
typical set of technical documents in another subject area is required.
d) The use of abstracts in the Cran-1 Collection gave a somewhat
inferior performance to the shorter precis made by the manual indexers
on the Cranfield Project. Further work is required to determine
whether the apparently good quality abstracts suffer either from
excessive length or failure to include some vital subject notions
that the indexers included. The abstract performance is, however,
sufficiently good to question the need for indexing for high
performance, particularly since the indexing was more exhaustive than is
practiced in many operational situations.
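
The conclusions above contrast weighted vectors matched by the cosine
correlation with logical (unweighted) vectors matched by the overlap
correlation, and refer to a recall ceiling. The sketch below illustrates
those three quantities on toy data. It is not the SMART implementation:
the function names, the dictionary representation of term vectors, and
the sample terms are assumptions made for illustration. The overlap
formula used here (shared terms divided by the size of the smaller
vector) and the reading of recall ceiling as the fraction of relevant
documents sharing at least one term with the query are likewise assumed
definitions, not ones quoted from this report.

```python
import math


def cosine_correlation(query, doc):
    """Cosine of the angle between two weighted term vectors.

    Vectors are dicts mapping term -> weight (an assumed representation).
    """
    dot = sum(w * doc.get(term, 0.0) for term, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


def overlap_correlation(query, doc):
    """Overlap on logical vectors: weights are ignored, only term presence counts.

    Assumed form: number of shared terms / size of the smaller term set.
    """
    q_terms, d_terms = set(query), set(doc)
    smaller = min(len(q_terms), len(d_terms))
    if smaller == 0:
        return 0.0
    return len(q_terms & d_terms) / smaller


def recall_ceiling(query, relevant_docs):
    """Fraction of relevant documents that share at least one term with the query.

    Documents with no shared term cannot be retrieved by matching, so this
    bounds the recall any ranking can reach (assumed definition).
    """
    if not relevant_docs:
        return 0.0
    reachable = sum(1 for doc in relevant_docs if set(query) & set(doc))
    return reachable / len(relevant_docs)


# Hypothetical toy data: one query, a short "title" vector, a longer "abstract" vector.
query = {"flow": 2.0, "wing": 1.0}
title = {"wing": 1.0}
abstract = {"wing": 1.0, "flow": 1.0, "pressure": 1.0}
unmatched = {"boundary": 1.0}

title_cos = cosine_correlation(query, title)
abstract_cos = cosine_correlation(query, abstract)
title_overlap = overlap_correlation(query, title)
abstract_overlap = overlap_correlation(query, abstract)
ceiling = recall_ceiling(query, [title, abstract, unmatched])
```

On this toy data the overlap correlation cannot separate the title from
the abstract, whereas the weighted cosine correlation scores the longer
abstract higher. The example only shows how document length and the
choice of matching function interact; it carries no evidence about the
ADI or Cran-1 results themselves.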