IRE: Information Retrieval Experiment. Retrieval system tests 1958-1978, chapter by Karen Sparck Jones. Butterworth & Company.

The decade 1958-1968

of a range of subtests, the figures are an illustrative selection; moreover, when ranges of performance figures are given, these may, for projects with rather heterogeneous subtests, be only for the strictly comparable alternatives of a single subtest. The variations in individual project findings are well illustrated by Montague's different experiments, where precision ranged from 4-9 per cent in one case, with relative recall 83-31 per cent, to 4-74 per cent and 93-31 per cent respectively in another. Considering the tests comparable in type, i.e. in objective and form, the tests conducted by Schuller, Altmann, and Shaw and Rothman, with Cranfield 1, can be considered as a group, along with those of Sinnett, Cohen et al., Montague, and van Oot et al. on links and roles. The individual projects report differences in precision ranging from 12.5-41.7 per cent (Schuller), 51.4-96.5 per cent (Cohen et al., for variable sets), 4-74 per cent (Montague) or 67.3-88.7 per cent (Altmann), and in relative recall from 31-93 per cent (Montague) or 80.7-100 per cent (Cohen et al.).
Taking precision and (relative) recall together for comparable data runs in multi-test projects, we get such variations as 12.5 per cent precision with 73.1 per cent recall to 41.7 per cent precision with 77.4 per cent recall (Schuller); 42 per cent precision and 57 per cent recall to 55 and 66 per cent (Shaw and Rothman); 57.4 per cent precision and 100 per cent recall to 94.0 and 80.7 per cent (Cohen et al., with variable request sets); and 70 per cent precision and 84 per cent recall to 90 and 77 per cent (van Oot et al.). The operational non-comparative studies of Herner et al., Melton, and Lancaster were broadly in the 50-60 per cent range for both recall and precision. For recall (sensitivity) alone, CWRU results ranged from 16 to 98 per cent, while the normalized recall results for Cranfield 2 ranged from 44.6 to 65.8 per cent. It may be noted that the specificity results for CWRU ranged from 12 to 98 per cent. Over all these tests taken together, results range from a low of 4 per cent (Montague) to a high of 96.5 per cent (Cohen et al.) in precision, and, as far as the comparison is proper, from 31 to 100 per cent in recall (Montague and Cohen et al. respectively). These variations can or could, as noted, be accounted for partly by methodological differences and partly, of course, by the real properties of the languages being investigated, or by the values of dependent variables like indexing exhaustivity; they might also be attributable to environmental factors like collection subject area. These points are more fully discussed later. It must nevertheless be emphasized that the variations are not wholly explicable: if they were, we would know how to design information retrieval systems; and the sheer scale of observed performance variation is worth noticing. The interpretations of the findings are equally varied, though there is a natural tendency, for the more limited tests, for their authors to conclude that whatever was to be demonstrated has been demonstrated.
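For reference, the measures cited above can be stated in their standard forms (these definitions are supplied here for clarity and are not part of the original text; note that relative recall, as used in several of these projects, substitutes the pooled relevant documents found across the compared runs for the true relevant set, which is one reason cross-project comparison is hazardous):

```latex
\[
\text{precision} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{retrieved}\,|},
\qquad
\text{recall (sensitivity)} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{relevant}\,|},
\]
\[
\text{specificity} = \frac{|\,\text{non-relevant not retrieved}\,|}{|\,\text{non-relevant}\,|}.
\]
```

Normalized recall, the ranked measure reported for Cranfield 2, is commonly defined (following Rocchio) from the ranks $r_1, \dots, r_n$ of the $n$ relevant documents in a ranking of $N$ documents, as $1 - \bigl(\sum_i r_i - \sum_i i\bigr)/\bigl(n(N-n)\bigr)$, so that 100 per cent means all relevant documents are ranked first.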
For example, Shaw and Rothman conclude that roles and links are not needed, while Schuller, testing novel Uniterms against UDC, finds Uniterms superior, though he concedes the complementary utility of UDC. However, in some cases the results, like those of Cranfield 1, were contrary to expectation, and in the more broadly ranging comparative tests, like Cranfield 2 and CWRU, the results were surprising: in the first, that natural language is competitive; in the second, that the indexing language is not very important.