Scientific Report No. IRS-13 Information Storage and Retrieval

Document Length chapter
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

documents, but the abstracts give a much better performance. The ADI titles are short also, but analysis has shown that in cases where the title does not give a good performance, a search of the full text frequently results in a poor performance for these cases also. The cause of this is probably a combination of the relevance decisions used, and synonym recognition problems, since the subject terminology in documentation is thought to be less precise than in the other collections.

Where requestor needs are covered by whole documents treating the topic of the request, titles alone may frequently be adequate, and KWIC title indexes have proved to be useful tools for such needs. It was noted that in a subset of the requests used in the Cranfield Project tests, 31% of the relevant document titles in a set of 35 requests had a strong match with the search request [2, pages 36-39]. But requestor needs are not always for whole documents, since relevant portions of a document frequently answer a need as completely as a whole document. In these cases, titles are quite inadequate, and a more exhaustive selection from the text is essential for good retrieval.

Four examples are given comparing abstracts to full text using the ADI Collection, in Figures 35 to 39. The request statements are given, together with the words matching the documents, with matching aided by a thesaurus dictionary. Figure 35 shows a case where the increased matching on full text improves performance, whereas Figure 36 shows how an increased matching on full text can worsen performance.
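
As a rough illustration of how the extra matches obtained from full text can move a relevant document either up or down the ranking, the following Python sketch scores toy documents against a request. The cosine correlation, the term weights, and the document names are all assumptions made for illustration; none of them is taken from the ADI Collection or from Figures 35 and 36.

```python
import math


def cosine(query, doc):
    """Cosine correlation between two sparse term-weight vectors (dicts)."""
    dot = sum(w * doc.get(term, 0) for term, w in query.items())
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    d_norm = math.sqrt(sum(w * w for w in doc.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)


def rank_of(target, query, collection):
    """1-based rank of `target` when documents are sorted by descending score."""
    ordered = sorted(
        collection,
        key=lambda name: cosine(query, collection[name]),
        reverse=True,
    )
    return ordered.index(target) + 1


# Hypothetical request concepts, each with weight 1.
query = {"journals": 1, "automation": 1}

# Case A (hypothetical): full text adds matches to the relevant document.
abstracts_a = {
    "relevant": {"journals": 1, "indexing": 2},
    "other": {"automation": 1, "library": 1},
}
full_text_a = {
    "relevant": {"journals": 4, "automation": 2, "indexing": 3},
    "other": {"automation": 2, "library": 5},
}

# Case B (hypothetical): full text adds matches to a non-relevant document.
abstracts_b = {
    "relevant": {"journals": 1, "indexing": 1},
    "other": {"automation": 1, "library": 3},
}
full_text_b = {
    "relevant": {"journals": 2, "indexing": 6},
    "other": {"journals": 3, "automation": 3, "library": 2},
}

# Rank of the relevant document on abstracts, then on full text.
print(rank_of("relevant", query, abstracts_a), rank_of("relevant", query, full_text_a))  # prints: 2 1
print(rank_of("relevant", query, abstracts_b), rank_of("relevant", query, full_text_b))  # prints: 1 2
```

In case A the relevant document rises from rank 2 to rank 1 on full text; in case B it falls from rank 1 to rank 2 because the non-relevant document gains more matching weight.
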
Figure 37 shows a case where matching and performance were unchanged by use of full text, since the weight of the important term "journals" was increased from 14 on abstracts to 30 on full text. Figure 38 shows a case where the increased weights provided by full text fail to prevent a non-relevant document from receiving a rank