Information Retrieval Experiment
The Cranfield tests
Karen Sparck Jones
Butterworth & Company
example output form. The laboratory mode of testing imposed strict controls
on the environmental and operational factors, and for the stated object of the
test, hardware aspects could be ignored. However, the controls imposed on
the environmental and operational factors would naturally play their part in
determining performance and hence must be taken into account in assessing
the test.
`In the artificial environment created for the test it was found that a limited
set of changes [to secondary variables] could be investigated; these
included several sets of questions picked by different criteria, relevance
judgements made in four different grades, collections of three different
sizes and tests in two related but different subject fields.' (p.6)
The search strategies tested were all levels of `blind' co-ordination, and six
types of selection/combination of the query terms, representing more
`intelligent' search rules appropriate to the basic type of language.
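As a rough illustration only (not the project's own procedure), `blind' co-ordination can be pictured as grouping the documents of a collection by the number of request terms their index descriptions share with the request, i.e. the co-ordination level. A minimal sketch, with invented documents and terms:

```python
def coordination_levels(query_terms, doc_index):
    """Group documents by co-ordination level, i.e. the number of
    query terms their index description shares with the request."""
    query = set(query_terms)
    levels = {}
    for doc_id, terms in doc_index.items():
        level = len(query & set(terms))
        if level > 0:
            levels.setdefault(level, []).append(doc_id)
    return levels

# Hypothetical miniature collection, purely for illustration
docs = {
    'd1': {'boundary', 'layer', 'flow'},
    'd2': {'boundary', 'layer'},
    'd3': {'heat', 'transfer'},
}
print(coordination_levels(['boundary', 'layer', 'flow'], docs))
# -> {3: ['d1'], 2: ['d2']}
```

Searching at a given level then means retrieving every document at that co-ordination level or above, so lowering the level trades precision for recall.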
Overall, the range of variable value combinations tested was very large,
and the Report authors rightly comment on the
`volume, variety and complexity of the tests' (p. 16).
Thus for a particular query and document set there are search results for the
different co-ordination levels and other search rules applied to a range of
language descriptions representing particular combinations of recall and
precision devices, for different initial indexing exhaustivity levels, and taking
account of several relevance grades.
Unfortunately, the considerable clerical and intellectual effort involved in
doing the tests, combined with a methodological interest in question sets
with specific properties, meant that most tests were not carried out with the
full 279 questions and 1400 documents, or even with the largest subset of 221
questions and the 1400 documents. In fact, once the investigators had
convinced themselves that the smaller sets gave results comparable with the
large ones, many of the tests were done with 42 questions and 200 documents.
A consequence of the various selections was that tests were done not only
with collections having different numbers of requests or documents, but also
with collections of different generality, i.e. relevance density.
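For concreteness (this is a reading of the text, not a figure from the Report), generality or relevance density can be taken as the average proportion of the collection relevant to a query, often quoted per 1000 documents; the invented figures below simply show how the quantity rises when the same relevant documents sit in a smaller collection:

```python
def generality(relevant_counts, collection_size):
    """Mean proportion of the collection relevant to a query,
    expressed here per 1000 documents."""
    densities = [n / collection_size for n in relevant_counts]
    return 1000 * sum(densities) / len(densities)

# Illustrative counts only, not Cranfield data: queries with 4-6
# relevant documents each.
print(generality([4, 5, 6], 200))     # 25.0 relevant per 1000 documents
print(generality([4, 5, 6], 1400))    # about 3.6 relevant per 1000 documents
```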
The scope of the experiments made the details of performance measurement
very tricky, and these are discussed at length in the Report. The problems
involved are both the higher-level ones of the choice of measure, and the
lower-level ones of the application of individual measures to particular data,
with special reference to averaging. At the higher level the project was
naturally inclined to use recall and precision; at the lower, the particular
problem tackled by the project was that presented by averaging searches
conducted at different co-ordination levels, representing one case of the
general problem of dealing with search output not supplied as simple
retrieved document sets. An additional problem for the manually-conducted
Cranfield 2 was that the sheer effort of calculation implied by some
performance representation methods could not be undertaken. However,
after trials comparing the performance for a pair of languages as given by different
methods, it was concluded that the simplest direct averaging, totalling
documents retrieved across co-ordination levels and then deriving recall and
precision, was adequate, and this method was used for the great mass of
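Read as a computation, this direct averaging amounts to totalling, at each co-ordination level, the documents retrieved and the relevant documents retrieved over all queries, and deriving a single recall and precision figure from the totals. The sketch below is one plausible rendering of that procedure, not the Report's own calculation; all document and query data are invented.

```python
def averaged_recall_precision(per_query_results):
    """Direct averaging over queries: at each co-ordination level cutoff,
    total retrieved and relevant-retrieved documents across queries and
    derive recall and precision from the totals.

    per_query_results: list of (levels, relevant) pairs, where `levels`
    maps a co-ordination level to the set of documents retrieved at
    exactly that level, and `relevant` is the query's relevant set.
    """
    cutoffs = sorted({lv for levels, _ in per_query_results for lv in levels},
                     reverse=True)
    points = []
    for cutoff in cutoffs:
        retrieved = rel_retrieved = relevant = 0
        for levels, rel in per_query_results:
            # Searching down to this cutoff retrieves everything found
            # at this co-ordination level or higher.
            got = set()
            for lv, doc_ids in levels.items():
                if lv >= cutoff:
                    got |= doc_ids
            retrieved += len(got)
            rel_retrieved += len(got & rel)
            relevant += len(rel)
        points.append((cutoff,
                       rel_retrieved / relevant,     # recall
                       rel_retrieved / retrieved))   # precision
    return points

# Hypothetical two-query example
results = [
    ({3: {'d1'}, 2: {'d2'}, 1: {'d4'}}, {'d1', 'd4'}),
    ({2: {'d7'}, 1: {'d8', 'd9'}},      {'d7', 'd9'}),
]
for cutoff, recall, precision in averaged_recall_precision(results):
    print(cutoff, round(recall, 2), round(precision, 2))
```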