Scientific Report No. IRS-13 Information Storage and Retrieval
An Analysis of the Documentation Requests
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
as a proportion of the total non-relevant in the collection is 0.[?]5 with
documentation, 0.23 with aerodynamics and 0.17 with computer science. At
the other end of the scale, examining the output until 0.03 of the non-
relevant is encountered gives recall values of .16 with documentation, .[?]3
with aerodynamics and .51 with computer science. Factors in the text en-
vironments differ between the collections, and matters such as the quality
of terminology in the subject language, as well as the testing of techniques
used for collection gathering, request preparation and relevance decisions,
all contribute to the differences observed in unknown proportions.
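The fall-out-based cut-off described above can be sketched as a short routine: scan the ranked output until a given fraction of all the non-relevant documents in the collection has been encountered, and report the recall at that point. This is a hypothetical illustration, not part of the SMART system itself; the document identifiers and the function name `recall_at_fallout` are assumptions made for the example.

```python
def recall_at_fallout(ranked, relevant, fallout_limit):
    """Recall at the cut-off where `fallout_limit` (a fraction of all
    non-relevant documents in the collection) has been encountered
    in the ranked output."""
    total_rel = len(relevant)
    total_nonrel = len(ranked) - total_rel
    seen_rel = seen_nonrel = 0
    for doc in ranked:
        if doc in relevant:
            seen_rel += 1
        else:
            seen_nonrel += 1
            # Stop once the fall-out ceiling is reached.
            if seen_nonrel >= fallout_limit * total_nonrel:
                break
    return seen_rel / total_rel

# A toy collection: 2 relevant and 4 non-relevant documents.
ranked = ["r1", "n1", "r2", "n2", "n3", "n4"]
relevant = {"r1", "r2"}
print(recall_at_fallout(ranked, relevant, 0.5))   # → 1.0
```

Sweeping `fallout_limit` over a range of values traces out a recall/fall-out curve of the kind compared across the three collections.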
It seems likely that the imprecise terminology encountered in documentation,
which appears in both the documents and requests, is a major cause of the
poor performance, and in order to overcome these problems extra human
intellect may be needed in the system. It may not be possible to build
synonym dictionaries that will entirely provide for this, but good dic-
tionaries together with a good choice of search strategy are likely to
improve performance considerably. Some proof of the value of carefully
chosen search words is given in Figure 21, where a hand search of the
KWIC-type concordance to the abstracts is compared with a SMART abstracts
Thesaurus result.
The hand searcher chose up to five keywords for each request and was allowed
to use any words that might be considered useful as suggested by the request
statement. A comparison with SMART in Figure 21 a) is made after fitting
the SMART results to the hand searches by making cut-offs in the SMART ranked
output in such a way that the number of documents retrieved for each request
is identical to the hand-searched results. Figure 21 b) shows two fuller
SMART curves obtained by making a series of cut-offs after one document,
two documents, and so on, up to the last document in the collection. It is
not surprising that the hand searches work better, since the free choice of