Scientific Report No. IRS-13 Information Storage and Retrieval
An Analysis of the Documentation Requests
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
as a proportion of the total non-relevant in the collection is 0.[?]5 with
documentation, 0.23 with aerodynamics and 0.17 with computer science. At
the other end of the scale, examining the output until 0.03 of the non-
relevant is encountered gives recall values of .16 with documentation, .[?]3
with aerodynamics and .51 with computer science. Factors in the text en-
vironments differ between the collections, and matters such as the quality
of terminology in the subject language, as well as the testing of techniques
used for collection gathering, request preparation and relevance decisions,
all contribute to the differences observed in unknown proportions.
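The fall-out-based cut-off described above can be sketched as a short routine: scan the ranked output until a given fraction of all the non-relevant documents in the collection has been encountered, and report the recall at that point. This is a hypothetical illustration, not part of the SMART system itself; the document identifiers and the function name `recall_at_fallout` are assumptions made for the example.

```python
def recall_at_fallout(ranked, relevant, fallout_limit):
    """Recall at the cut-off where `fallout_limit` (a fraction of all
    non-relevant documents in the collection) has been encountered
    in the ranked output."""
    total_rel = len(relevant)
    total_nonrel = len(ranked) - total_rel
    seen_rel = seen_nonrel = 0
    for doc in ranked:
        if doc in relevant:
            seen_rel += 1
        else:
            seen_nonrel += 1
            # Stop once the fall-out ceiling is reached.
            if seen_nonrel >= fallout_limit * total_nonrel:
                break
    return seen_rel / total_rel

# A toy collection: 2 relevant and 4 non-relevant documents.
ranked = ["r1", "n1", "r2", "n2", "n3", "n4"]
relevant = {"r1", "r2"}
print(recall_at_fallout(ranked, relevant, 0.5))   # → 1.0
```

Sweeping `fallout_limit` over a range of values traces out a recall/fall-out curve of the kind compared across the three collections.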
It seems likely that the imprecise terminology encountered in documentation,
which appears in both the documents and requests, is a major cause of the
poor performance, and in order to overcome these problems extra human
intellect may be needed in the system. It may not be possible to build
synonym dictionaries that will entirely provide for this, but good dic-
tionaries together with a good choice of search strategy are likely to
improve performance considerably. Some proof of the value of carefully
chosen search words is given in Figure 21, where a hand search of the
KWIC-type concordance to the abstracts is compared with a SMART abstracts
Thesaurus result.
The hand searcher chose up to five keywords for each request and was allowed
to use any words that might be considered useful as suggested by the request
statement. A comparison with SMART in Figure 21 a) is made after fitting
the SMART results to the hand searches by making cut-offs in the SMART ranked
output in such a way that the number of documents retrieved for each request
is identical to the hand-searched results. Figure 21 b) shows two fuller
SMART curves obtained by making a series of cut-offs after one document,
two documents, and so on, up to the last document in the collection. It is
not surprising that the hand searches work better, since the free choice of