The Cranfield tests

there is no doubt a very definite tendency, and an obvious one, for recall to decline as relevance [i.e. precision] improves and vice versa' (p. 17), at least under the conditions that `as the average number of index terms per document is increased the recall ability of the system will also increase (inevitably), but relevance, averaged over many search questions, will tend to decline', and, as a search is broadened (e.g. by moving upward in a classification hierarchy), then `of course, recall will improve (inevitably); and there is a tendency for relevance to decline' (p. 17); and he concludes by saying that his criticisms have been mainly `directed at the inaccurate interpretations and generalisations of the Cranfield data. The value of the project as a whole has been unquestionably great' (p. 18).

Other reviewers agreed, for example in condemning the use of source documents, and made additional criticisms. For instance Mote7 commented on design failures, such as the interference between the four systems in indexing, on the lack of realism represented by the absence of user/searcher feedback, and on defects in the presentation of the results, such as multiple entries for individual queries under different sources of failure.

Some of these points were also made by Richmond8, who comments on the consequences of the primary/subsidiary indexing strategy for the comparability of the systems, namely that, strictly, only the 4000-5000 primarily indexed documents for each system matter. She argues, too, that because indexing times were averaged they are relatively useless. However, Richmond's main attack is on the extremely poor presentation of the detailed figures: she points out that `so few of the tables are comparable' (p. 308) and that `so many of the factors were not equalised ... that one wonders how valid the results are' (p. 209), and provides many examples of the consequent difficulties of interpretation. She also notes that the main test and subsidiary relevance test results do not match up as the Report text suggests, since the supplementary results in fact show differential loss of performance. Richmond notes that the general conclusions are not dogmatic, the four systems performing somewhat similarly and better than expected, with Uniterm most efficient and facet least: this difference is attributable, according to Cleverdon, to the depth of indexing allowed by the different languages, or, according to Richmond, to the effects of timing. Richmond concludes that the house is built on rock rather than sand, but that it is not well built: the test is important, but future tests need to be more careful and much better reported.

In reviewing Cranfield 1, Sharp9 roundly condemns the source document approach and comments that as the environment becomes more natural, i.e.