IRE Information Retrieval Experiment
The Cranfield tests
chapter
Karen Sparck Jones
Butterworth & Company

that the more complex a language is in terms of recall and precision devices the greater its range of performance; and that maximum recall depends on indexing exhaustivity, precision on language specificity. The Cranfield 2 test was based on the more general propositions from this set not limited to specific documents and questions, namely Swanson's fourth, fifth and seventh, and the first one and last three of those just given. Cranfield 2 thus focused on an analysis of the behaviour of recall and precision devices, and further, to ensure control, on an analysis of these devices in laboratory experiments.

The Report authors argue robustly for the emphasis on recall and precision as important to users and difficult to measure; for the resolution, following Vickery, of indexing languages into their component devices so the contribution of these to language performance can be assessed; and `to make advances in knowledge regarding index languages', for `a laboratory-type situation, where, freed from the contamination of operational variables, the performance of index languages could be studied in isolation.' (p.8)

Thus, putting these points more fully, the authors summarise the actual test objective as follows: `we started from the belief that all index languages are amalgams of different kinds of devices.
Such devices fall into the two groups of those which are intended to improve the recall ratio and those which are intended to improve the precision ratio.... The purpose of the test was to investigate the effect which each of these devices, alone or in any possible combination, would have on recall and precision.' (p.17) Further, `to enable this to be done, it was essential that it should be possible to hold everything constant except the one variable being investigated.' (p.17)

The critical factors in the test design were therefore the method of providing questions, the method of providing relevance judgements, and the method of providing index descriptions of documents; and what is most significant about the test design is that the methods of obtaining relevance information, designed to provide a firm foundation for recall and precision performance figures, effectively determined other properties of the test data. Thus, as the authors note, while relevance assessments of output for precision calculation can be both reliably and readily obtained, adequate recall figures require exhaustive document assessment, leading to the use of a relatively small collection. Their view, however, is that the WRU test had shown that a small document set could provide sufficient data for analysis. The test therefore used 1400 documents, along with 279 requests, providing a larger sample than previous tests.

It should be noted that the composition of the document set was determined by the method of obtaining the questions. The project aim in obtaining the questions and assessments was that these should be as realistic as possible, though, as the Report authors point out, no actual set of operational questions was available. The approach adopted was therefore to ask the authors of research papers to characterize the problem to