evidence to support their own views. The reviews of the final Report, focusing
on Cranfield 1 as an experiment and raising methodological questions, are
thus of more interest. Some of the points made were sound, but others were
not. The more conspicuous critiques are therefore presented here as they
were originally published, for the historical record, and are then briefly
assessed.
The main critique of Cranfield 1 came from Swanson6, who wrote:
`the significance of these data, in my opinion, should be fully understood
by every student of indexing and classification. Such understanding can be
reached only after discriminating study, however, for the experimental
design of the project, particularly in its early stages, provides a rather
unsteady foundation for the superstructure of conclusions subsequently
erected. . . . The early Cranfield results have by now been extensively
cited, widely quoted out of context, and usually misinterpreted.' (p.1)
Swanson's particular concern, given
`the snowballing tendency to cite [the] results out of context of the
experimental conditions' (p.2),
is with the implications of the project's experimental design for its `findings'.
Thus he says that
`the design itself seems to have guaranteed many of the results that were
found, so that the evidence which supports such results is questionable.
The experimental design in fact was such that certain phenomena no
doubt would have been detected whether they existed in real systems or
not. The "phenomena" to which I refer are the following widely claimed,
widely quoted (and, I think, widely accepted) "findings" of the Cranfield
project' (p.2),
namely that:
(1) indexing times over 4 minutes give no real improvement in
performance;
(2) a high quality of indexing is obtainable from non-technical indexers;
(3) reliable figures for recall and precision have been obtained;
(4) systems operate at recall 70-90 per cent and precision 8-20 per cent;
(5) maximum recall has effectively been reached;
(6) a 1 per cent improvement in precision costs a 3 per cent loss in recall;
(7) no indexing finesses can substantially improve recall without a loss of
precision;
(8) low precision in the WRU index is due to bad searching, though
WRU perfectionism pays no dividends;
(9) there is an inverse relationship between recall and precision;
(10) all four indexing methods give similar performance.
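Recall and precision here carry their usual definitions: if a search retrieves n documents, of which r are relevant, and the collection contains R relevant documents in all, then recall = r/R and precision = r/n. As an illustrative example (the figures are ours, not the Report's), a search retrieving 50 documents, 10 of them relevant, from a collection holding 12 relevant documents operates at about 83 per cent recall and 20 per cent precision, of the order referred to in (4); retrieving more documents to capture the remaining 2 relevant ones would typically raise recall only at the cost of precision, the trade-off asserted in (9).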
Swanson argues that these statements, whether or not they are true, are not
established by the test, but are products of its design. He singles out in
particular the use of source documents as probably accounting for all of (1)-(8),
the lack of control on relevance, and the influence of human memory on
indexing and searching. In Swanson's view, the artificiality of the questions
derived from specific documents is less important as bearing on the questions