Scientific Report No. IRS-13
Information Storage and Retrieval
Summary
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

The effects of varying the length of the documents to be analyzed automatically, and of using a variety of stored dictionaries in the course of the analysis, are examined in detail in sections V, VI, and VII by E. M. Keen.

Variations in document length are covered in section V, including a comparison for all collections of the use of titles only with the use of abstracts, and of abstracts with full text for the ADI collection. Titles are found to be unsatisfactory for high recall searches with all collections. At the high precision end of the spectrum, good titles are sometimes almost equivalent to poor abstracts. In general, however, abstracts are superior to titles as a source for generating content identifiers. The abstracts are found only slightly inferior to full text for the ADI collection, suggesting that the increased expenditure of entering full text is probably not warranted.

Two types of suffix cut-off procedures designed to reduce language variability by transforming full word paradigms to word stems are evaluated in section VI. The suffix 's' dictionary produces common word forms for items which differ only in a final 's', whereas the word stem dictionary reduces a complete family of related words to a common word stem. The stem dictionary is found to be superior as a retrieval tool to the suffix 's' dictionary for the IRE and ADI collections; the suffix 's' dictionary is slightly superior to the stem dictionary for the Cranfield collection, probably because the more specialized aerodynamics vocabulary provides fewer opportunities for word reduction. The suffix dictionaries offer a convenient mark against which the effectiveness of thesaurus-type dictionaries can be measured.
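
The difference between the two suffix cut-off procedures can be illustrated with a minimal Python sketch. This is not the procedure evaluated in section VI: the suffix list `SUFFIXES`, the minimum stem length `MIN_STEM`, and the function names `suffix_s`, `stem`, and `index_terms` are all assumptions made for illustration, and the real dictionaries presumably apply more careful rules.

```python
# Hypothetical suffix list, longest first. NOT the list used in the report.
SUFFIXES = ("ations", "ation", "ings", "ing", "ers", "er", "ed", "es", "s")
MIN_STEM = 3  # assumed minimum stem length, to avoid cutting short words


def suffix_s(word):
    """Conflate word forms that differ only in a final 's'."""
    w = word.lower()
    if w.endswith("s") and not w.endswith("ss") and len(w) > MIN_STEM:
        return w[:-1]
    return w


def stem(word):
    """Cut the first (longest) matching suffix from the hypothetical list."""
    w = word.lower()
    for suf in SUFFIXES:
        if w.endswith(suf) and len(w) - len(suf) >= MIN_STEM:
            return w[: -len(suf)]
    return w


def index_terms(text, normalize):
    """Return the set of distinct content identifiers under a normalizer."""
    return {normalize(tok) for tok in text.split()}


words = "document documents documented documenting"
print(sorted(index_terms(words, suffix_s)))
# ['document', 'documented', 'documenting']
print(sorted(index_terms(words, stem)))
# ['document']
```

Under these assumptions the suffix 's' normalizer merges only the singular and plural forms, while the stem normalizer collapses the whole word family to one identifier. In a specialized vocabulary with few such families, the two would behave almost identically, which is consistent with the Cranfield observation above.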