Scientific Report No. IRS-13 Information Storage and Retrieval

Test Environment chapter
E. M. Keen
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

that is valid outside the particular test. Other parameters such as request length and request concept frequency are used in the study in Section X.

C) Collection Comparisons

The data which describe the test environments in Figs. 1, 3, 4, and 5 reveal many points at which the environments differ, such as collection and request sizes, collection and request average lengths, request generality, request preparation and relevance decisions, and so on. It is recognized that, at present, these variables cannot be controlled well enough for comparisons between collections to be made on the assumption that their effects have been adequately accounted for. Suitable control of these and other so far unrecognized variables would permit comparisons between collections of documents in different subject areas.

This might be of interest since the terminology of different subject areas might be regarded as lying on a continuum ranging from "hard" or "firm" subject areas to "soft" or "mushy" ones, as suggested by Cleverdon [16]. This may be a valid hypothesis, since in data retrieval situations in some areas of chemistry the firm language permits high recall and high precision simultaneously, whereas in other areas, such as parts of the social sciences, the imprecise language often produces very much poorer precision-recall curves. Alternatively, it may be the case that subject fields contain sub-areas of soft and firm terminology: in aerodynamics, for example, descriptions of wing shapes and aspect ratios seem to be fairly unambiguous, whereas treatment of gas and fluid flow phenomena seems to abound with ambiguities.
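The contrast between "firm" and "soft" terminology drawn above is stated in terms of recall and precision. As a minimal sketch of how such points might be computed from one ranked output, the following uses the standard definitions (recall is the fraction of relevant documents retrieved, precision is the fraction of retrieved documents that are relevant). The function name, document identifiers, and both rankings are hypothetical and chosen only for illustration; they are not data from the collections described in this report.

```python
def precision_recall_points(ranked_ids, relevant_ids):
    """Return a (recall, precision) pair after each document in rank order."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("at least one relevant document is required")
    points = []
    hits = 0  # relevant documents seen so far
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    return points


# Hypothetical request with two relevant documents, d1 and d2.
# "firm": both relevant documents are ranked first.
firm = precision_recall_points(["d1", "d2", "d3", "d4"], {"d1", "d2"})
# "soft": relevant documents are interleaved with non-relevant ones.
soft = precision_recall_points(["d3", "d1", "d4", "d2"], {"d1", "d2"})

for label, points in (("firm", firm), ("soft", soft)):
    print(label, [(round(r, 2), round(p, 2)) for r, p in points])
```

In this invented case the "firm" ranking reaches full recall while precision is still 1.0, whereas the "soft" ranking reaches full recall only at precision 0.5, which is the kind of difference between curves that the passage describes.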