IRE
Information Retrieval Experiment
The Cranfield tests
chapter
Karen Sparck Jones
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
with the proviso that extra searching may be needed, though he acknowledges
that the main test results could be taken as indicating that the test questions
were slanted towards the source documents.
In commenting on the test results as a whole Cleverdon notes that, when
allowance is made for standard error, the systems could differ by as little as
3.2 per cent, or as much as 13.7 per cent. However, taking the results at their
face value, and considering those for the final subprogramme, Uniterm
appears 3.8 per cent (actually 3.9 per cent) superior to alphabetical, the latter
5.3 per cent superior to the UDC, and UDC 3.5 per cent superior to facet.
The four languages also had some individual advantages and disadvantages,
particular points of interest being
`the great value and importance of the alphabetical index for the [UDC]
schedules' (p.90),
the fact that the alphabetical system
`was far more effective than had been expected' (p.91),
and that
`Uniterm, as a descriptor language, can be given a high rating on many
counts. It achieved the best overall figures in the test, ... it appears to have
as good a relevance figure as any other system, and ... it did not compare
unfavourably in the recall of non-source relevant documents.' (p.92)
With respect to times, longer times raised the success rate from 72.9 to 84.3
per cent over all systems. The results did not show marked differences
between indexers or subjects, nor improvements due to learning in indexing,
except for Uniterm, or to learning in searching, except for UDC. In
discussing the design of search programmes Cleverdon says that
`we were not prepared to spend the length of time in physical searching
which some organisations appear willing to do' (p.87);
further, if results are to be produced quickly,
`the formulation of [search] programmes must be a reasonably straightforward
matter, and this was the position with the project searches.' (p.88)
This is emphasized by the fact that most failures were due to indexing.
Finally, Cleverdon maintains that the main and other tests together
`have shown that the general working level of I.R. systems appears to be
in the general area of 60%-90% recall and 10%-25% of relevance, ... This
is a considerable distance away from the oft-made assertion that systems
are operating in the general area of [both high recall and high precision].'
(p.89)
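The two measures quoted here can be made concrete with a small sketch. It assumes the usual definitions: recall is the proportion of all relevant documents that a search retrieves, and relevance (the ratio later generally called precision) is the proportion of retrieved documents that are relevant. The function name and the document numbers are hypothetical, chosen only so that the result falls inside the working range Cleverdon describes; they are not data from the Cranfield tests.

```python
def recall_and_relevance(retrieved, relevant):
    """Return (recall, relevance ratio) as percentages for a single search."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents actually retrieved
    recall = 100.0 * len(hits) / len(relevant) if relevant else 0.0
    relevance = 100.0 * len(hits) / len(retrieved) if retrieved else 0.0
    return recall, relevance


# Hypothetical search: 10 relevant documents exist in the collection;
# 40 documents are retrieved, of which 8 are relevant.
relevant_docs = set(range(10))
retrieved_docs = set(range(8)) | set(range(100, 132))

recall, relevance = recall_and_relevance(retrieved_docs, relevant_docs)
print(f"recall {recall:.0f}%, relevance {relevance:.0f}%")
```

In this illustration most of the relevant documents are found, but only at the cost of retrieving many non-relevant ones, which is the kind of operating level the quotation describes.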
Further,
`it can now be said that the inverse relationship between recall and
relevance has been conclusively shown, and it should now be possible to
design and operate systems that will satisfy, in the most economic way,
stated requirements. There will be situations where the emphasis must be
on the highest possible recall level, and the resulting penalty of the low