IRE Information Retrieval Experiment The Cranfield tests chapter Karen Sparck Jones Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

with the proviso that extra searching may be needed, though he acknowledges that the main test results could be taken as indicating that the test questions were slanted towards the source documents.

In commenting on the test results as a whole Cleverdon notes that, when allowance is made for standard error, the systems could differ by as little as 3.2 per cent, or as much as 13.7 per cent. However, taking the results at their face value, and considering those for the final subprogramme, Uniterm appears 3.8 per cent (actually 3.9 per cent) superior to alphabetical, the latter 5.3 per cent superior to the UDC, and UDC 3.5 per cent superior to facet.

The four languages also had some individual advantages and disadvantages, particular points of interest being `the great value and importance of the alphabetical index for the [UDC] schedules' (p.90), the fact that the alphabetical system `was far more effective than had been expected' (p.91), and that `Uniterm, as a descriptor language, can be given a high rating on many counts. It achieved the best overall figures in the test, ... it appears to have as good a relevance figure as any other system, and ... it did not compare unfavourably in the recall of non-source relevant documents.' (p.92) With respect to times, longer times raised the success rate from 72.9 to 84.3 per cent over all systems.
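
The recall and relevance figures quoted in this discussion are ratios over counts of documents. As a minimal sketch, assuming the usual definitions (recall as the share of all relevant documents that a search retrieves, and the relevance ratio, later called precision, as the share of retrieved documents that are relevant), they could be computed as follows. The function names and the counts in the example are hypothetical illustrations, not data from the Cranfield tests.

```python
def recall_ratio(relevant_retrieved: int, total_relevant: int) -> float:
    """Share of all relevant documents in the collection that were retrieved."""
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    return relevant_retrieved / total_relevant


def relevance_ratio(relevant_retrieved: int, total_retrieved: int) -> float:
    """Share of retrieved documents that are relevant (later termed precision)."""
    if total_retrieved <= 0:
        raise ValueError("total_retrieved must be positive")
    return relevant_retrieved / total_retrieved


# Hypothetical search: 40 documents retrieved, 8 of them relevant,
# with 10 relevant documents in the whole collection.
example_recall = recall_ratio(relevant_retrieved=8, total_relevant=10)
example_relevance = relevance_ratio(relevant_retrieved=8, total_retrieved=40)
print(f"recall = {example_recall:.0%}, relevance = {example_relevance:.0%}")
```

In this made-up case a broad search finds most of the relevant documents but at the cost of retrieving many irrelevant ones, which is the kind of trade-off between the two ratios that the test reports discuss.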
The results did not show marked differences between indexers or subjects, nor improvements due to learning in indexing, except for Uniterm, or to learning in searching, except for UDC.

In discussing the design of search programmes Cleverdon says that `we were not prepared to spend the length of time in physical searching which some organisations appear willing to do' (p.87); further, if results are to be produced quickly, `the formulation of [search] programmes must be a reasonably straightforward matter, and this was the position with the project searches.' (p.88) This is emphasized by the fact that most failures were due to indexing.

Finally, Cleverdon maintains that the main and other tests together `have shown that the general working level of I.R. systems appears to be in the general area of 60%-90% recall and 10%-25% of relevance, ... This is a considerable distance away from the oft-made assertion that systems are operating in the general area of [both high recall and high precision].' (p.89) Further, `it can now be said that the inverse relationship between recall and relevance has been conclusively shown, and it should now be possible to design and operate systems that will satisfy, in the most economic way, stated requirements. There will be situations where the emphasis must be on the highest possible recall level, and the resulting penalty of the low