IRE
Information Retrieval Experiment
The Cranfield tests
chapter
Karen Sparck Jones
Butterworth & Company
which the paper was addressed as a question, adding supplementary
questions to the base one if appropriate, and to indicate which citations were
relevant. The document collection was then made up of the pooled references
of all the question source papers, and further assessments of non-cited papers
were made by the question providers, using the output of a crude screening,
done by students on titles, of the whole collection, and also some of the output
of a bibliographical coupling run based on the source papers.
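Bibliographic coupling, used here as one source of candidate documents for assessment, scores a pair of papers by the number of references they share. A minimal sketch of the idea (the document names and reference identifiers are invented for illustration, not taken from the Cranfield collection):

```python
# Bibliographic coupling: two documents are "coupled" when their
# reference lists overlap; the coupling strength is the size of
# that overlap. Larger overlap suggests related subject matter.
def coupling_strength(refs_a, refs_b):
    """Number of references shared by two documents."""
    return len(set(refs_a) & set(refs_b))

# Hypothetical reference lists for a question source paper and a
# candidate paper from the collection.
source_paper = ["r1", "r2", "r3", "r4"]
candidate = ["r2", "r4", "r5"]

print(coupling_strength(source_paper, candidate))
```

A run based on the source papers, as described above, would rank the rest of the collection by this strength and pass the top-scoring documents to the question providers for relevance assessment.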
The authors emphasize that the test design was not perfect, and allow that
while the test was concerned with the ability of different indexing systems to
retrieve judged relevant documents, the vagueness of the notion of relevance
itself could have some hidden influence on the test results. The Report
describes the procedures for obtaining questions and assessments in detail:
one main and one secondary subject area were chosen for the collection; the
questions (and documents) represented a wide range of author types, etc.,
and the question texts were annotated for more and less important and
additional terms; assessments were made using four grades of relevance and
one non-relevant. The account of this stage of the test in Volume 1 of the
Report is a salutary reminder of the great difficulty and labour of collecting
raw test data, and of obtaining good data.
Altogether the question and assessment sets were plausibly heterogeneous,
though it is difficult to know if there were specific biases in them; the Report
describes the various aspects of the sets in considerable detail, with particular
emphasis on the status of the questions. In specific response to the criticisms
of the Cranfield 1 procedures it is argued that the connection of source
document and question is much less narrow than in Cranfield 1, and that the
searching results are in any case not affected by the source documents as
these were removed from the collection before searching for each query.
The treatment of indexing in the test reflected the desire to study devices
in a controlled way: thus documents were initially indexed `conceptually',
and the common conceptual description was then taken as input to indexing
by different languages. In this test, unlike the previous ones, these languages
were constructed to embody combinations of devices, and were not simply
off-the-shelf. The language devices were characterized as different ways of
modifying a simple list of single terms to promote precision or recall.
Precision-promoting devices include co-ordination, weighting, and links and
roles; recall devices include synonym confounding, word form variant
confounding, generic hierarchical linkage, and non-generic hierarchical
linkage. Bibliographic coupling, statistical associations, and superimposed
coding are also regarded as precision devices, though only the first of these,
bibliographic coupling, was tested, in fact outside the main project. As the
Report authors
say,
`we have tried to distinguish the basic device itself, as a method of class
definition, from the different ways in which it might be implemented in
different index languages. The latter may be regarded as different
amalgams of the various devices, with further differences resulting from
the various methods of file organisation.' (p.47)
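The recall devices listed above all work by widening the class of terms a query term can match, while co-ordination tightens matching by requiring query terms to co-occur. A toy sketch of this interplay, assuming an invented synonym table and a crude suffix-stripping rule standing in for word-form variant confounding:

```python
# Toy illustration of two recall devices, synonym confounding and
# word form variant confounding, combined with co-ordination as a
# precision device. The vocabulary and suffix list are invented.
SYNONYMS = {"aircraft": "plane", "aeroplane": "plane"}
SUFFIXES = ["ing", "ed", "s"]

def confound(term):
    # Crude suffix stripping stands in for word-form confounding.
    for suf in SUFFIXES:
        if term.endswith(suf) and len(term) > len(suf) + 2:
            term = term[: -len(suf)]
            break
    # Map the stem to its synonym-class representative, if any.
    return SYNONYMS.get(term, term)

def matches(query_terms, doc_terms):
    q = {confound(t) for t in query_terms}
    d = {confound(t) for t in doc_terms}
    # Co-ordination: every query term class must occur in the
    # document description for the document to be retrieved.
    return q <= d

print(matches(["aircraft", "wings"], ["aeroplane", "wing", "lift"]))
```

The confounding steps enlarge the term classes (promoting recall), while the subset test requiring all query classes to be present narrows the retrieved set (promoting precision), mirroring the trade-off the devices were designed to control.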
The strategy adopted was therefore to take as the base indexing simple
natural language, with the document description carefully controlled for
exhaustivity and specificity. The controls in fact implied `maximum'