Vickery notes that normalized recall, used for the overall language ranking, is not very realistic as it does not depend on the cutoff point: it is computed from the ranks of the relevant documents in the complete ordering, not from the output above any particular cutoff. At the same time he remarks, with respect to the overall observations based on the ranking, namely that normalized recall is lower for concepts than for single terms, that it is low for generic single terms, and that it rises for generic concepts, that the third observation is widely accepted, that the second is not unexpected, but that the first is unexpected and so requires specific refutation by the advocates of controlled languages. In Vickery's view, `the volumes of this report are an impressive account of a complex piece of research, undertaken with care and diligence. They give no final answers, and their conclusions must be treated with caution, but they are a valuable exploration of the retrieval process.' (p.340)

Following the line of his attack on Cranfield 1, Swanson21 suggests that the relation between questions and relevant documents is much too close, owing to some features of the assessment procedure, especially its second step: the students screening for additional documents relevant to a question used documents already known to be relevant to it, while the user was allowed to modify his initial query after assessing the extra documents. Swanson indeed maintains that even in the initial provision of questions the documents cited by the source paper, having already been read, could have influenced how the question was verbalized. A more serious criticism, in Swanson's view, is implied by some facts about the additional relevant documents: bibliographic coupling, which identifies documents through the references they share with papers already known to be relevant, yielded a good many relevant documents not identified by the student screeners. This suggests that a large number of relevant documents were in fact missed altogether, the explanation being that the students were poor screeners, since they worked only from titles, while the bibliographic coupling was applied only at a high coupling level. The implication is that the results for the whole set of experiments may be unreliable.

This point is considered in detail by Harter22, who seeks to show, by formal arguments applied to real data, first, that a good many relevant documents were missed; second, that changes in the relative proportions of missed to non-missed documents can affect recall/precision point values (as well as values over a cutoff range), with the important consequence that relative performance ratings can change (the sketch below illustrates this second point with invented figures); and third, using additional relevance data for a sample of languages, that the picture of relative language merit given by Cranfield 2 for these languages changes when the additional relevance information is utilized. Unfortunately, while Harter's general point is sound, he indulges in some wild statistical extrapolation and in very speculative global statements about the Cranfield results.
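As a formal aside not part of the original discussion: the first formula below is Salton's normalized recall, the measure Vickery is commenting on, and the calculation after it is a minimal worked instance of Harter's second point, using invented figures rather than Harter's or Cranfield's data.

```latex
% Normalized recall for a collection of N documents of which n are relevant,
% the relevant ones being retrieved at ranks r_1 < r_2 < ... < r_n.  The sum
% runs over the whole ranking, so the value is independent of any cutoff:
\[
  R_{\mathrm{norm}} \;=\; 1 \;-\;
  \frac{\sum_{i=1}^{n} r_i \;-\; \sum_{i=1}^{n} i}{n\,(N - n)} .
\]

% Harter's second point in miniature, with invented figures: suppose that at
% some cutoff a language retrieves 8 of the 10 documents judged relevant,
\[
  \text{recall} \;=\; \frac{8}{10} \;=\; 0.80 ,
\]
% but that screening in fact missed 5 further relevant documents, of which
% only 1 lies in the retrieved set.  The corrected value is then
\[
  \text{recall} \;=\; \frac{8 + 1}{10 + 5} \;=\; \frac{9}{15} \;=\; 0.60 ,
\]
% while precision at the same cutoff rises, since the retrieved set now
% contains 9 rather than 8 relevant documents.
```

Since different index languages need not retrieve the missed documents in the same proportion, such corrections can move each language's point values by different amounts, which is why relative performance ratings, and not just absolute figures, can change.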
Harter makes a good case for the principle that there may well be missed relevant documents unless assessment is truly exhaustive, and that such omissions can affect performance figures; but his actual investigation of the data suggests that in practice it was mostly the less important result for the title language, for which test biases are evident in any case, that was really affected.

Other, more general comments were made by, for example, Sharp and Rees. Sharp23 notes that `Cleverdon et al. . . . have qualified the basic recall/relevance thesis so that its application now seems so limited as to be confined to those conditions where. . . its truth is obvious.' (p.92)