IRS13
Scientific Report No. IRS-13 Information Storage and Retrieval
Summary
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
The effects of varying the length of the documents to be analyzed
automatically, and of using a variety of stored dictionaries in the course
of the analysis are examined in detail in sections V, VI, and VII by E. M.
Keen. Variations in document length are covered in section V, including a
comparison for all collections of the use of titles only with the use of
abstracts, and of abstracts with full text for the ADI collection. Titles
are found to be unsatisfactory for high recall searches with all collections.
At the high precision end of the spectrum, good titles are sometimes
almost equivalent to poor abstracts. In general, however, abstracts are
superior to titles as a source for generating content identifiers. The
abstracts are found only slightly inferior to full text for the ADI
collection, suggesting that the increased expenditure of entering full
text is probably not warranted.
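
The recall and precision terms used above can be illustrated with a minimal sketch. The function name, document identifiers, and relevance judgments below are hypothetical and are not taken from the report; the sketch assumes the usual definitions (recall is the fraction of relevant documents retrieved, precision is the fraction of retrieved documents that are relevant).

```python
def recall_precision(ranked_ids, relevant_ids, cutoff):
    """Return (recall, precision) for the top `cutoff` documents of a ranked list."""
    retrieved = ranked_ids[:cutoff]
    # Count retrieved documents that were judged relevant.
    hits = sum(1 for doc_id in retrieved if doc_id in relevant_ids)
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical ranked output for one query and its relevance judgments.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

# A short cutoff corresponds to the high precision end of the spectrum,
# a long cutoff to the high recall end.
for cutoff in (2, 5):
    print(cutoff, recall_precision(ranked, relevant, cutoff))
```

Evaluating the same ranking at several cutoffs is one simple way to compare two sources of content identifiers, such as titles and abstracts, across the recall range.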
Two types of suffix cut-off procedures designed to reduce language
variability by transforming full word paradigms to word stems are evaluated
in section VI. The suffix `s' dictionary produces common word forms for
items which differ only in a final `s', whereas the word stem dictionary
reduces a complete family of related words to a common word stem. The stem
dictionary is found to be superior as a retrieval tool to the suffix `s'
dictionary for the IRE and ADI collections; the suffix `s' dictionary is
slightly superior to the stem dictionary for the Cranfield collection,
probably because the more specialized aerodynamics vocabulary provides fewer
opportunities for word reduction. The suffix dictionaries offer a con-
venient mark against which the effectiveness of thesaurus-type dictionaries
can be measured.
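
The difference between the two suffix cut-off procedures can be sketched as follows. This is only an illustration under stated assumptions: the suffix list, the minimum stem length, and the function names are hypothetical, and the rules actually used in the report's dictionaries are not reproduced here.

```python
def suffix_s(word):
    """Merge forms that differ only in a final 's' (assumed rule, not the report's)."""
    w = word.lower()
    if len(w) > 3 and w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w


# Hypothetical suffix list, longest suffixes first so they are tried before shorter ones.
SUFFIXES = ("ations", "ation", "ings", "ing", "ers", "er", "ed", "es", "s")


def stem(word, min_stem=3):
    """Cut the first matching suffix, keeping at least `min_stem` characters."""
    w = word.lower()
    for suf in SUFFIXES:
        if w.endswith(suf) and len(w) - len(suf) >= min_stem:
            return w[: -len(suf)]
    return w


words = ["document", "documents", "documented", "documenting"]
print(sorted({suffix_s(w) for w in words}))  # three distinct forms remain
print(sorted({stem(w) for w in words}))      # the family collapses to one stem
```

With a specialized vocabulary that offers few such word families, the two procedures would produce nearly the same identifiers, which is consistent with the small difference reported for the Cranfield collection.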