IRE Information Retrieval Experiment
The Cranfield tests chapter
Karen Sparck Jones
Butterworth & Company

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

which the paper was addressed as a question, adding supplementary questions to the base one if appropriate, and to indicate which citations were relevant. The document collection was then made up of the pooled references of all the question source papers, and further assessments of non-cited papers were made by the question providers, using the output of a crude screening of the whole collection, done by students on titles, and also some of the output of a bibliographic coupling run based on the source papers. The authors emphasize that the test design was not perfect, and allow that, while the test was concerned with the ability of different indexing systems to retrieve judged relevant documents, the vagueness of the notion of relevance itself could have some hidden influence on the test results. The Report describes the procedures for obtaining questions and assessments in detail: one main and one secondary subject area were chosen for the collection; the questions (and documents) represented a wide range of author types, etc., and the question texts were annotated for more and less important and additional terms; assessments were made using four grades of relevance and one grade of non-relevance. The account of this stage of the test in Volume 1 of the Report is a salutary reminder of the great difficulty and labour of collecting raw test data, and of obtaining good data.
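The collection-building procedure just described can be sketched in a few lines. This is an illustrative reconstruction, not the project's actual code: the papers, references, and scoring here are invented, and the Report's bibliographic coupling run was of course done by hand and batch processing rather than anything like this.

```python
# Invented data: each question source paper maps to its set of cited references.
source_papers = {
    "S1": {"D1", "D2", "D3"},
    "S2": {"D2", "D4"},
}

# The document collection is the pooled set of references of all source papers.
collection = set().union(*source_papers.values())

def coupling(refs_a, refs_b):
    """Bibliographic coupling strength: the number of shared references."""
    return len(refs_a & refs_b)

# A candidate paper (invented) is coupled to each source paper in proportion
# to the references they share; strongly coupled candidates were passed to
# the question providers for relevance assessment.
candidate_refs = {"D2", "D3", "D9"}
scores = {p: coupling(refs, candidate_refs) for p, refs in source_papers.items()}

print(sorted(collection))  # ['D1', 'D2', 'D3', 'D4']
print(scores)              # {'S1': 2, 'S2': 1}
```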
Altogether the question and assessment sets were plausibly heterogeneous, though it is difficult to know if there were specific biases in them; the Report describes the various aspects of the sets in considerable detail, with particular emphasis on the status of the questions. In specific response to the criticisms of the Cranfield 1 procedures it is argued that the connection of source document and question is much less narrow than in Cranfield 1, and that the searching results are in any case not affected by the source documents, as these were removed from the collection before searching for each query. The treatment of indexing in the test reflected the desire to study devices in a controlled way: thus documents were initially indexed `conceptually', and the common conceptual description was then taken as input to indexing by different languages. In this test, unlike the previous ones, these languages were constructed to embody combinations of devices, and were not simply off-the-shelf. The language devices were characterized as different ways of modifying a simple list of single terms to promote precision or recall. Precision-promoting devices included co-ordination, weighting, and links and roles; recall-promoting devices included synonym confounding, word form variant confounding, generic hierarchical linkage, and non-generic hierarchical linkage. Bibliographic coupling, statistical associations, and superimposed coding are also regarded as precision devices, though only the first of these, bibliographic coupling, was tested, and that outside the main project. As the Report authors say, `we have tried to distinguish the basic device itself, as a method of class definition, from the different ways in which it might be implemented in different index languages. The latter may be regarded as different amalgams of the various devices, with further differences resulting from the various methods of file organisation.'
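The contrast between precision- and recall-promoting devices can be made concrete with a toy inverted index. This is a minimal sketch under invented data, and not how the Cranfield languages were actually implemented: co-ordination narrows the answer set by requiring all query terms, while synonym confounding widens it by searching a whole synonym class.

```python
# Toy inverted index (invented data): term -> documents containing it.
index = {
    "aircraft": {"D1", "D2", "D3"},
    "aeroplane": {"D4"},          # treated as a synonym of "aircraft"
    "noise": {"D2", "D4", "D5"},
}

# Precision device: co-ordination -- intersect the postings of all query
# terms, so only documents matching every term are retrieved.
def coordinate(terms):
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Recall device: synonym confounding -- take the union over a term's
# synonym class, so documents using any variant are retrieved.
synonyms = {"aircraft": {"aircraft", "aeroplane"}}

def confound(term):
    hits = set()
    for t in synonyms.get(term, {term}):
        hits |= index.get(t, set())
    return hits

print(coordinate(["aircraft", "noise"]))  # {'D2'}
print(sorted(confound("aircraft")))       # ['D1', 'D2', 'D3', 'D4']
```

In the same spirit, the other devices can be read as further ways of tightening (weighting, links, roles) or loosening (word form confounding, hierarchical linkage) the class of documents a term description retrieves.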
(p. 47) The strategy adopted was therefore to take simple natural language as the base indexing, with the document description carefully controlled for exhaustivity and specificity. The controls in fact implied `maximum'