The Cranfield tests
Karen Sparck Jones
`there is no doubt a very definite tendency, and an obvious one, for recall to
decline as relevance [i.e. precision] improves and vice versa' (p.17),
at least under the conditions that
`as the average number of index terms per document is increased the recall
ability of the system will also increase (inevitably), but relevance, averaged
over many search questions, will tend to decline',
and
`as a search is broadened (e.g. by moving upward in a classification
hierarchy), then, of course, recall will improve (inevitably); and there is a
tendency for relevance to decline' (p.17);
and he concludes by saying that his criticisms have been mainly
`directed at the inaccurate interpretations and generalisations of the
Cranfield data. The value of the project as a whole has been unquestionably
great'. (p.18)
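To make the recall/precision tradeoff quoted above concrete, the following
short Python sketch, which is entirely illustrative (the query, the relevance
judgements and the document numbers are invented, not taken from the
Cranfield data), computes both measures for one query under a narrow and
a broadened search:

    # Illustrative only: invented relevance judgements and retrieval sets.
    def precision_recall(retrieved, relevant):
        """Precision = relevant retrieved / total retrieved;
        recall = relevant retrieved / total relevant."""
        hits = len(retrieved & relevant)
        return hits / len(retrieved), hits / len(relevant)

    relevant = {1, 2, 3, 4, 5}                    # documents judged relevant
    narrow = {1, 2, 6}                            # narrow search: 3 retrieved
    broad = {1, 2, 3, 4, 6, 7, 8, 9, 10, 11}      # broadened search: 10 retrieved

    for label, retrieved in (("narrow", narrow), ("broad", broad)):
        p, r = precision_recall(retrieved, relevant)
        print(f"{label}: precision={p:.2f} recall={r:.2f}")
    # narrow: precision=0.67 recall=0.40
    # broad: precision=0.40 recall=0.80

Broadening the search retrieves more of the relevant set, so recall rises, but
it also sweeps in more non-relevant documents, so precision (the `relevance'
ratio of the period) falls, which is exactly the pattern the reviewer describes.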
Other reviewers agreed, for example in condemning the use of source
documents, and made additional criticisms. For instance Mote7 commented
on design failures like the fact that there was interference between the four
systems in indexing, on the lack of realism represented by the absence of
user/searcher feedback, and on defects in the presentation of the results such
as multiple entries for individual queries under different sources of failure.
Some of these points were also made by Richmond8, who comments on the
consequences of the primary/subsidiary indexing strategy for comparability
of the systems, namely that strictly only the 4000-5000 primarily indexed
documents for each system matter. She argues, too, that as indexing times
were averaged they are relatively useless. However Richmond's main attack
is on the extremely poor presentation of the detailed figures; she points out
that
`so few of the tables are comparable' (p.308)
and that
`so many of the factors were not equalised ... that one wonders how valid
the results are' (p.209),
and provides many examples of the consequent difficulties of interpretation.
She also notes that the main test and subsidiary relevance test results do not
match up as the Report text suggests, since the supplementary results in fact
show differential loss of performance. Richmond notes that the general
conclusions are not dogmatic, the four systems performing somewhat
similarly and better than expected, with Uniterm most efficient and facet
least: this difference is attributable, according to Cleverdon, to the depth of
indexing allowed by the different languages, or, according to Richmond, to
the effects of timing. Richmond concludes that the house is built on rock,
rather than sand, but it is not well built: the test is important, but future tests
need to be more carefully conducted and much better reported.
In reviewing Cranfield 1, Sharp9 roundly condemns the source document
approach and comments that as the environment becomes more natural, i.e.