Scientific Report No. IRS-13 Information Storage and Retrieval

An Analysis of the Documentation Requests (chapter)
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

Definitions of good and bad performance are arbitrary, but it is thought that good performance requires a relevant document to be ranked at position 15 or higher (that is, among the first 15 documents retrieved); anything ranked lower than this is a poor result. Any requests which fall into groups a) and c) were thought to be particularly useful for analysis; in practice, however, all 35 requests fall into group b). Requests [illegible] and B14 perform well on nearly all options, but occasionally one of the relevant documents falls below rank position 10.

There is a surprisingly large amount of change in the ranks of the relevant documents when options are tested; Figure 2 gives an example for one request and two relevant documents. For this request, the options that are found on average to be the poorest, such as titles only, the use of cosine logical, and the "Hastie" Thesaurus, give the best results.

Since the division into groups by performance achieved does not assist in the analysis, another method of analysis is suggested: this is to look for strong correlation between measurable request characteristics and the use of particular performance options. A summary of possible request characteristics is given in Figure 3, some of which have been described previously; these can now be used to look for direct correlation between characteristics and performance, as attempted in sections 5B, 5C, and 5D.
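The search for correlation between a request characteristic and performance can be illustrated with a small sketch. The report does not say which correlation statistic is intended; Spearman rank correlation is used here purely as an assumption, since it depends only on the ordering of the values. The function names and the sample values are hypothetical and are not taken from the report.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # sorted positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical illustration only: invented values, not data from the report.
request_length = [3, 5, 8, 12, 20]                       # e.g. concepts per request
gain_from_option = [0.10, 0.05, 0.02, -0.01, -0.04]      # e.g. change in a performance measure
rho = spearman(request_length, gain_from_option)
```

A value of `rho` near +1 or -1 would indicate that the characteristic orders the requests in nearly the same (or the opposite) way as the performance change does; values near 0 would indicate no such relationship.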
B) Variation in Generality, Length and Concept Frequency

Request generality refers to the number of documents in the collection that are relevant; using this principle, the request set may be divided into specific and general requests. With the 35 requests divided into sets of 17 and 18, request generality data is given in Figure 4 together with evaluation results of normalized recall and precision, comparing the stem and thesaurus dictionaries. As has been observed previously [2], the specific requests give