Information Retrieval Experiment
The Cranfield tests
Karen Sparck Jones
Butterworth & Company
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
Criticisms of Cranfield 2
while Rees24 remarks that
`the problem of a criterion measure remains in that Cleverdon's measure
reflects the overall or ultimate performance of the system or subsystem
tested. The sources of variation affecting performance are not adequately
pinpointed, and small indication is given as to how to optimize
performance.' (p.68)
In Rees' view the basic assumption about relevance underlying Cranfield 2
had not been seriously questioned by 1967, while the methodology of the test
was not blatantly defective; he notes that the project was not, unlike that at
Case Western Reserve, regarded as having the explicit aim of developing test
methodologies. He implies that the results are not seriously suspect, but at
the same time argues that
`the generalisability of these findings, and the problem of optimising
system performance, remain.' (p.68)
He also comments on the difficulty of replicating the results.
Assessing these criticisms of Cranfield 2, it is apparent both that, as
Cranfield 2 was methodologically superior to Cranfield 1, the scope for
criticism was reduced, and that greater familiarity with the requirements and
constraints of testing meant that some criticisms were more usefully pointed.
As before, some criticisms seem to have been fundamentally mistaken, like
Sharp's condemnation of the Report's careful statement of the recall/precision
relationship. The more plausible criticisms again fall into three groups.
Vickery's remark that the test did not reflect an ordinary operating system
situation, like Mote's earlier, is inappropriate to an explicitly laboratory test.
Swanson's and Harter's claims about the existence of many more relevant
documents than were used are themselves open to a good deal of doubt; they
fall into the class of speculative criticisms. On the other hand, their point
about the assessment procedure is more substantial, though there is no
evidence that the procedure, while it could have affected the test results,
actually did so. Both Cranfield 1 and Cranfield 2 were comparative tests and
it is therefore necessary, in reviewing criticisms of the two experiments, to
distinguish features of the design and conduct of the tests which could
conceivably have affected comparative performance from those which were
most unlikely in fact to have done so. Many of the criticisms of both tests
failed to take this distinction into account. At the same time, the possibility
that hidden factors may affect performance has to be raised in relation to
every test.
The real defects of Cranfield 2 were the lack of statistical tests, noted by
Vickery, and the failure to develop criterion measures pointed out by Rees.
Overall, among the comments on Cranfield 2, Rees' display the most insight,
and correctly point the way forward for future tests building on both
Cranfield 1 and Cranfield 2.
The reaction to Cranfield 2 seems to have been rather less hostile than that
to Cranfield 1. There were probably several reasons for this. First, the test
was not manifestly open to major methodological criticisms like Cranfield 1
(Swanson's and Harter's papers were not published till five years later). In
this connection it is worth noting that a subsequent test by Cleverdon with