IRE Information Retrieval Experiment
The Cranfield tests chapter
Karen Sparck Jones
Butterworth & Company

Criticisms of Cranfield 2

while Rees24 remarks that `the problem of a criterion measure remains in that Cleverdon's measure reflects the overall or ultimate performance of the system or subsystem tested. The sources of variation affecting performance are not adequately pinpointed, and small indication is given as to how to optimize performance.' (p.68) In Rees' view the basic assumption about relevance underlying Cranfield 2 had not been seriously questioned by 1967, while the methodology of the test was not blatantly defective; he notes that the project was not, unlike that at Case Western Reserve, regarded as having the explicit aim of developing test methodologies. He implies that the results are not seriously suspect, but at the same time argues that `the generalisability of these findings, and the problem of optimising system performance, remain' (p.68). He also comments on the difficulty of replicating the results.

Assessing these criticisms of Cranfield 2, it is apparent both that, as Cranfield 2 was methodologically superior to Cranfield 1, the scope for criticism was reduced, and that greater familiarity with the requirements and constraints of testing meant that some criticisms were more usefully pointed. As before, some criticisms seem to have been fundamentally mistaken, like Sharp's condemnation of the Report's careful statement of the recall/precision relationship. The more plausible criticisms again fall into three groups.
Vickery's remark that the test did not reflect an ordinary operating system situation, like Mote's earlier, is inappropriate to an explicitly laboratory test. Swanson's and Harter's claims about the existence of many more relevant documents than were used are themselves open to a good deal of doubt; they fall into the class of speculative criticisms. On the other hand, their point about the assessment procedure is more substantial, though while the procedure could have affected the test results, there is no evidence that it actually did so.

Both Cranfield 1 and Cranfield 2 were comparative tests, and it is therefore necessary, in reviewing criticisms of the two experiments, to distinguish features of the design and conduct of the tests which could conceivably have affected comparative performance from those which were most unlikely in fact to have done so. Many of the criticisms of both tests failed to take this distinction into account. At the same time, the possibility that hidden factors may affect performance has to be raised in relation to every test. The real defects of Cranfield 2 were the lack of statistical tests, noted by Vickery, and the failure to develop criterion measures, pointed out by Rees. Overall, among the comments on Cranfield 2, Rees' display the most insight, and correctly point the way forward for future tests building on both Cranfield 1 and Cranfield 2.

The reaction to Cranfield 2 seems to have been rather less hostile than that to Cranfield 1. There were probably several reasons for this. First, the test was not manifestly open to major methodological criticisms like Cranfield 1 (Swanson's and Harter's papers were not published till five years later). In this connection it is worth noting that a subsequent test by Cleverdon with