example output form. The laboratory mode of testing imposed strict controls on the environmental and operational factors, and for the stated object of the test, hardware aspects could be ignored. However, the controls imposed on the environmental and operational factors would naturally play their part in determining performance and hence must be taken into account in assessing the test. `In the artificial environment created for the test it was found that a limited set of changes [to secondary variables] could be investigated; these included several sets of questions picked by different criteria, relevance judgements made in four different grades, collections of three different sizes and tests in two related but different subject fields.' (p. 6) The search strategies tested were all levels of `blind' co-ordination, and six types of selection/combination of the query terms representing more `intelligent' search rules appropriate to the basic type of language.

Overall, the range of variable value combinations tested was very large, and the Report authors rightly comment on the `volume, variety and complexity of the tests' (p. 16). Thus for a particular query and document set there are search results for the different co-ordination levels and other search rules applied to a range of language descriptions representing particular combinations of recall and precision devices, for different initial indexing exhaustivity levels, and taking account of several relevance grades. Unfortunately, the considerable clerical and intellectual effort involved in doing the tests, combined with a methodological interest in question sets with specific properties, meant that most tests were not carried out with the full 279 questions and 1400 documents, or even with the largest subset of 221 questions and the 1400 documents. In fact, once the investigators had convinced themselves that the smaller sets gave results comparable with the large ones, many of the tests were done with 42 questions and 200 documents. A consequence of the various selections was that tests were done not only with collections having different numbers of requests or documents, but also with collections of different generality, i.e. relevance density.

The scope of the experiments made the details of performance measurement very tricky, and these are discussed at length in the Report. The problems involved are both the higher-level ones of the choice of measure, and the lower-level ones of the application of individual measures to particular data, with special reference to averaging. At the higher level the project was naturally inclined to use recall and precision; at the lower, the particular problem tackled by the project was that presented by averaging searches conducted at different co-ordination levels, representing one case of the general problem of dealing with search output not supplied as simple retrieved document sets.
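To make these notions concrete, the following is a minimal sketch, in Python, of a `blind' co-ordination search scored by recall and precision on the cumulative output at each co-ordination level. It is not taken from the Report: the function names, the toy index, the query and the relevance judgements are all invented for illustration.

```python
# Illustrative sketch only: a 'blind' co-ordination search scored by recall
# and precision at each co-ordination level. All data below are invented.

def coordination_search(query_terms, index):
    """Group documents by co-ordination level, i.e. by the number of
    query terms each document matches."""
    matches = {}
    for doc_id, doc_terms in index.items():
        level = len(query_terms & doc_terms)
        if level > 0:
            matches.setdefault(level, set()).add(doc_id)
    return matches

def recall_precision_by_level(matches, relevant):
    """Recall and precision for the cumulative set retrieved at each level
    (level n = all documents matching at least n query terms)."""
    results = {}
    retrieved = set()
    for level in sorted(matches, reverse=True):   # highest co-ordination first
        retrieved |= matches[level]
        relevant_retrieved = retrieved & relevant
        recall = len(relevant_retrieved) / len(relevant)
        precision = len(relevant_retrieved) / len(retrieved)
        results[level] = (recall, precision)
    return results

# Toy collection: each document is represented by its set of index terms.
index = {
    "d1": {"aerofoil", "boundary", "layer"},
    "d2": {"boundary", "layer", "transition"},
    "d3": {"aerofoil", "flow"},
}
query = {"aerofoil", "boundary", "layer"}
relevant = {"d1", "d2"}

print(recall_precision_by_level(coordination_search(query, index), relevant))
# {3: (0.5, 1.0), 2: (1.0, 1.0), 1: (1.0, 0.6666666666666666)}
```

Each query thus yields a recall and a precision value at each of its co-ordination levels, and different queries yield different numbers of such points; it is the averaging of figures like these over many queries that raises the difficulty referred to above.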
An additional problem for the manually conducted Cranfield 2 was that some methods of representing performance implied an effort of calculation that simply could not be undertaken. However, after trials comparing the performance figures for a pair of languages as given by different methods, it was concluded that the simplest direct averaging, totalling documents retrieved across co-ordination levels and then deriving recall and precision, was adequate, and this method was used for the great mass of