evidence to support their own views. The reviews of the final Report, focusing
on Cranfield 1 as an experiment and raising methodological questions, are
thus of more interest. Some of the points made were sound, but others were
not. The more conspicuous critiques are therefore presented here as they
were originally published, for the historical record, and are then briefly
assessed.
The main critique of Cranfield 1 came from Swanson6, who wrote:
`the significance of these data, in my opinion, should be fully understood
by every student of indexing and classification. Such understanding can be
reached only after discriminating study, however, for the experimental
design of the project, particularly in its early stages, provides a rather
unsteady foundation for the superstructure of conclusions subsequently
erected. . . . The early Cranfield results have by now been extensively
cited, widely quoted out of context, and usually misinterpreted.' (p.1)
Swanson's particular concern, given
`the snowballing tendency to cite [the] results out of context of the
experimental conditions' (p.2),
is with the implications of the project's experimental design for its `findings'.
Thus he says that
`the design itself seems to have guaranteed many of the results that were
found, so that the evidence which supports such results is questionable.
The experimental design in fact was such that certain phenomena no
doubt would have been detected whether they existed in real systems or
not. The "phenomena" to which I refer are the following widely claimed,
widely quoted (and, I think, widely accepted) "findings" of the Cranfield
project' (p.2),
namely that:
(1) indexing times over 4 minutes give no real improvement in
performance;
(2) a high quality of indexing is obtainable from non-technical indexers;
(3) reliable figures for recall and precision have been obtained;
(4) systems operate at recall 70-90 per cent and precision 8-20 per cent;
(5) maximum recall has effectively been reached;
(6) a 1 per cent improvement in precision costs a 3 per cent loss in recall;
(7) no indexing finesses can substantially improve recall without a loss of
precision;
(8) low precision in the WRU index is due to bad searching, though
WRU perfectionism pays no dividends;
(9) there is an inverse relationship between recall and precision;
(10) all four indexing methods give similar performance.
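Recall and precision here carry their usual definitions: if a search retrieves n documents, of which r are relevant, and the collection contains R relevant documents in all, then recall = r/R and precision = r/n. As an illustrative example (the figures are ours, not the Report's), a search retrieving 50 documents, 10 of them relevant, from a collection holding 12 relevant documents operates at about 83 per cent recall and 20 per cent precision, of the order referred to in (4); retrieving more documents to capture the remaining 2 relevant ones would typically raise recall only at the cost of precision, the trade-off asserted in (9).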
Swanson argues that these statements, whether or not they are true, are not
established by the test, but are products of its design. He singles out in
particular the use of source documents as probably accounting for all of (1)-(8),
the lack of control on relevance, and the influence of human memory on
indexing and searching. In Swanson's view, the artificiality of the questions
derived from specific documents is less important as bearing on the questions