IRE
Information Retrieval Experiment
The Cranfield tests
chapter
Karen Sparck Jones
Butterworth & Company
Karen Sparck Jones
for these. The assessments allowed two grades of relevance, for documents as
relevant as the source, and for less relevant documents. The results, for both
grades and including the source documents, were 75.8 per cent recall for the
WRU index, with 17.7 per cent precision, compared with recall of 69.5 per
cent and precision of 33.7 per cent for the Cranfield facet index. (Removing
the source documents from the calculations reduces recall to 70.6 per cent
and 59.1 per cent respectively, with 13.0 per cent and 24.0 per cent precision
(my figures using Ref. 4, pp. 12-15 and Appendix 3C; there appear to be some
discrepancies in the various figures in Ref. 4).) A detailed analysis of the
failures showed the searching was responsible for most, namely 67.1 per cent,
with indexing 18.4 per cent. In considering the results, Aitchison and
Cleverdon say that
`before the test started, we were convinced that W.R.U. would be able to
achieve one of three results.
(a) obtain a high recall figure;
(b) obtain a high relevance [i.e. precision] figure;
(c) obtain a recall figure and a relevance figure which would both be
somewhat higher than that achieved with the Cranfield facet index.
The high level of exhaustivity of the indexing and the complex semantic
factoring in the index language gave them the ability to achieve (a); the
specific index language, with the added controls would allow them to
achieve (b); the combination of these factors could bring about (c).' (p.47)
The poor WRU precision figure was then explained by two factors:
`the main factor in W.R.U. failures to retrieve relevant documents was the
relatively poor standard of many of their search programmes,' (p.48)
while an investigation of non-relevant documents showed that
`the high level of exhaustive indexing was partly to blame' (p.48)
They comment on various features of the test, emphasizing the inverse
recall/precision relationship, and note that the test influenced the work on
Cranfield 2, then under way, in sharpening the idea of index language device.
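
The recall and precision percentages quoted above, and the effect of removing the source documents from the calculation, follow from simple ratios over pooled counts. The sketch below is illustrative only: the names `SearchTotals` and `without_sources` are hypothetical, and the counts are invented round numbers, not the Cranfield-WRU data. It assumes that every source document counts as relevant and that a retrieved source document counts as a relevant retrieved item.

```python
from dataclasses import dataclass


@dataclass
class SearchTotals:
    """Counts pooled over all test questions (hypothetical structure)."""
    relevant_total: int      # relevant documents in the collection
    retrieved_total: int     # documents retrieved by the searches
    relevant_retrieved: int  # retrieved documents judged relevant


def recall(t: SearchTotals) -> float:
    """Percentage of the relevant documents that were retrieved."""
    return 100.0 * t.relevant_retrieved / t.relevant_total


def precision(t: SearchTotals) -> float:
    """Percentage of the retrieved documents that were relevant."""
    return 100.0 * t.relevant_retrieved / t.retrieved_total


def without_sources(t: SearchTotals, sources_total: int,
                    sources_retrieved: int) -> SearchTotals:
    """Drop the source documents from every count.

    Assumes each source document is relevant, so it is subtracted from
    the relevant total, and each retrieved source is subtracted from
    both the retrieved total and the relevant-retrieved count.
    """
    return SearchTotals(
        relevant_total=t.relevant_total - sources_total,
        retrieved_total=t.retrieved_total - sources_retrieved,
        relevant_retrieved=t.relevant_retrieved - sources_retrieved,
    )


# Invented counts for illustration; NOT the figures from the test.
with_sources = SearchTotals(relevant_total=200, retrieved_total=800,
                            relevant_retrieved=150)
no_sources = without_sources(with_sources, sources_total=100,
                             sources_retrieved=90)

print(f"with sources:    recall {recall(with_sources):.1f}%, "
      f"precision {precision(with_sources):.1f}%")
print(f"without sources: recall {recall(no_sources):.1f}%, "
      f"precision {precision(no_sources):.1f}%")
```

With these invented counts both measures fall once the source documents are excluded, because the sources are retrieved at a higher rate than the other relevant documents. The figures reported above move in the same direction.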
Cranfield 1 summarized
The main summary account of Cranfield 1 was the Lancaster and Mills (Ref. 5)
account of 1964, which also discussed the English Electric and Cranfield 1½
tests. The account emphasizes the need for the study of indexing itself rather
than the manipulation of its products in searches, and comments on the
critical role of the Aslib Cranfield project in this. In Lancaster and Mills'
view the most significant results were related to recall, showing its inverse
relation to precision and comparable performance for the languages studied
(including facets when implemented rather differently from the main test);
for indexing times, showing a short time is good enough; for indexers,
showing technical knowledge of the subject is not necessary; and for failures,
showing inevitable human error to be important. Lancaster and Mills accept
Cleverdon's conclusion that the `artificial' questions did not invalidate the
results, and themselves conclude that they were not affected by other factors
like the stopwatch indexing. In Lancaster and Mills' view the English