IRE
Information Retrieval Experiment
The Cranfield tests
chapter
Karen Sparck Jones
Butterworth & Company
Karen Sparck Jones
that the more complex a language is in terms of recall and precision devices
the greater its range of performance; and that maximum recall depends on
indexing exhaustivity, precision on language specificity. The Cranfield 2 test
was based on the more general propositions from this set not limited to
specific documents and questions, namely Swanson's fourth, fifth and
seventh, and the first one and last three of those just given. Cranfield 2 thus
focused on an analysis of the behaviour of recall and precision devices, and
further, to ensure control, on an analysis of these devices in laboratory
experiments. The Report authors argue robustly for the emphasis on recall
and precision as important to users and difficult to measure; for the resolution,
following Vickery, of indexing languages into their component devices so the
contribution of these to language performance can be assessed; and
`to make advances in knowledge regarding index languages',
for
`a laboratory-type situation, where, freed from the contamination of
operational variables, the performance of index languages could be studied
in isolation.' (p.8)
Thus, putting these points more fully, the authors summarise the actual test
objective as follows:
`we started from the belief that all index languages are amalgams of
different kinds of devices. Such devices fall into the two groups of those
which are intended to improve the recall ratio and those which are
intended to improve the precision ratio.... The purpose of the test was to
investigate the effect which each of these devices, alone or in any possible
combination, would have on recall and precision.' (p.17)
Further,
`to enable this to be done, it was essential that it should be possible to hold
everything constant except the one variable being investigated.' (p.17)
The critical factors in the test design were therefore the method of
providing questions, the method of providing relevance judgements, and the
method of providing index descriptions of documents; and what is most
significant about the test design is that the methods of obtaining relevance
information, designed to provide a firm foundation for recall and precision
performance figures, effectively determined other properties of the test data.
Thus, as the authors note, while relevance assessments of output for precision
calculation can be both reliably and readily obtained, adequate recall figures
require exhaustive document assessment, leading to the use of a relatively
small collection. Their view, however, is that the WRU test had shown that a
small document set could provide sufficient data for analysis. The test
therefore used 1400 documents, along with 279 requests, providing a larger
sample than previous tests. It should be noted that the composition of
the document set was determined by the method of obtaining the questions.
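
The asymmetry between the two measures can be made concrete with a minimal
sketch. It is an illustration only, not code or data from the Report: the
function name, the document identifiers and the judgements are all
hypothetical. It assumes the usual set-based definitions, recall as the
proportion of relevant documents retrieved and precision as the proportion of
retrieved documents that are relevant.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a single request.

    retrieved: identifiers of the documents output by the search.
    relevant:  identifiers of ALL documents judged relevant in the collection.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents actually retrieved

    # Precision needs judgements only on the retrieved output.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall needs the complete relevant set, hence exhaustive assessment.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return recall, precision


# Hypothetical judgements for one request (not Cranfield data).
relevant = {"d3", "d7", "d12", "d40"}
retrieved = ["d3", "d7", "d9", "d12", "d15"]

recall, precision = recall_precision(retrieved, relevant)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

The denominator of precision is simply the output of the search, whereas the
denominator of recall is known only once every document in the collection has
been assessed against the request, which is why a small collection makes
exhaustive assessment practicable.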
The project aim in obtaining the questions and assessments was that these
should be as realistic as possible, though, as the Report authors point out, no
actual set of operational questions was available. The approach adopted was
therefore to ask the authors of research papers to characterize the problem to