like Cranfield 1 or Cranfield 2 could be described as neutral with respect to the specific type of language, while assuming that some sophisticated language was required. (In Cranfield 2 simple natural language was taken as a base for improvement in sophistication by the application of various devices.)

In form, these tests had much in common, perhaps not altogether surprisingly in view of the influence of Cranfield both organizationally, as in the link with CWRU, and intellectually, as in the application of Cranfield methods in Herner et al.'s Bureau of Ships test. The data for the tests generally consisted of fewer than 100 requests, and some hundreds or a few thousand documents. Lancaster's Medlars test was quite exceptional in scale, with over 300 requests and over half a million documents. Otherwise only Cranfield 1, Sinnett, and Cohen et al. used more than 5000 documents (though several experimental reports do not indicate how many documents were used). Queries varied, in some cases being genuine user queries, in others pseudo-queries, and in yet others, following Cranfield 1, ones specifically based on a source document. On the whole there does not seem to have been much negotiation about the query with the user. Relevance assessments were usually made by requesters, and were normally of search output, perhaps pooled from several alternative searches; one or perhaps two grades of relevance were typical. Evaluation of any particular match output was commonly by precision and by recall (or sensitivity, as CWRU called it), though the CWRU tests substituted specificity for precision, and Sinnett substituted noise. Blagden used noise alone, while Lancaster added novelty to recall and precision. In some cases simple numbers of relevant and non-relevant documents retrieved were used. CWRU combined sensitivity and specificity in a single measure of effectiveness. Recall was normally calculated relative to some subset of the possible relevant documents, say those identified by assessing some or all of the pooled output of alternative searches, or by assessing an independently obtained collection subset. With very few exceptions, like most of the Cranfield 2 performance characterizations, performance was calculated for simple sets of retrieved documents, giving one figure for each measure, and so, for example, a pair of precision and recall values for each particular test option.

With respect to the test results, again looking at the tests substantively rather than methodologically, the most striking feature of the actual findings for the comparative tests was the very wide variation in performance. This is true both of individual studies and, insofar as such cross comparisons are legitimate, of groups of similar tests.
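To illustrate the set-based measures listed above, the following sketch in Python uses invented document identifiers, relevance judgements and collection size, and the usual contingency-table formulations rather than the exact forms any individual project may have adopted. It computes precision, recall (sensitivity), specificity and noise for a single search output, together with recall calculated relative to a pooled subset of the relevant documents.

    # Hypothetical data, purely for illustration
    retrieved = {"d02", "d05", "d11", "d17", "d23"}      # output of one search
    relevant  = {"d02", "d05", "d31", "d40"}             # all documents judged relevant
    collection_size = 1000                               # size of the test collection

    hits = retrieved & relevant                          # relevant documents retrieved
    precision = len(hits) / len(retrieved)               # 2/5: proportion of output that is relevant
    recall    = len(hits) / len(relevant)                # 2/4: 'sensitivity' in the CWRU terminology
    noise     = 1 - precision                            # non-relevant proportion of the output
    non_relevant = collection_size - len(relevant)
    rejected  = non_relevant - len(retrieved - relevant)
    specificity = rejected / non_relevant                # non-relevant documents correctly not retrieved

    # Relative recall: the denominator is not the full set of relevant documents
    # (usually unknowable) but those identified in, say, the pooled output of
    # several alternative searches.
    pooled_relevant = {"d02", "d05", "d31"}              # relevant items found in the pool; d40 missed
    relative_recall = len(retrieved & pooled_relevant) / len(pooled_relevant)   # 2/3

    print(precision, recall, specificity, noise, relative_recall)

Because the pool misses relevant documents the collection actually contains, the relative figure (2/3 here) overstates recall measured against the full relevant set (1/2), which is one reason such recall values are only comparable within a single project.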
Variations in the findings obtained by different projects have to be treated with reserve, since they may be attributable as much to specific measurement procedures or data statistics as to the system factors being studied, especially languages and their application, or to their environment. In particular, variations in relative recall across different projects, especially those using pooled output, are only of real significance within the context of individual projects. The fact that even the more plausibly grouped tests may differ in detail, for example in using averages of numbers rather than averages of ratios, or an external sample rather than pooled output for relative recall, and that in addition I have worked out some figures myself, might suggest that there is no point in giving specific findings. But this is worth doing, to give the real flavour of the tests. It should however be noted that as many tests consisted