based on simulated ranking was to provide single normalized recall figures
for each language, following the Smart model. This supplies the overall merit
ordering of the languages given in the key synoptic table, Figure 8.1. This
shows a best normalized value of 65.82 for the single term language with
word forms conflated, and a worst of 44.64 for the simple concepts. More
globally, all but one of the single term languages are placed first in the list,
with normalized recall ranging from 65.82 down to 63.05, followed by two
concept languages and then all the controlled languages, these with
normalized recall from 61.76 to 59.17, followed by all the remaining concept
languages. Abstracts and titles are comparable with controlled terms. A
variety of subsidiary analyses show, for example, that absolute performance
for different relevance grades varies, but that the inverse recall/precision
relationship is maintained. Furthermore, it appeared that better performance
was obtained for lower generality questions; that the basic questions
performed better than the supplementary; and that different subject areas
probably affect absolute if not relative performance.
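For reference, the normalized recall measure invoked here is, in Salton's Smart formulation (the Report's exact variant may differ in minor detail), computed from the ranks r_1, . . ., r_n at which the n documents relevant to a question appear in the simulated ranking of the N documents of the collection:

    R_{norm} = 1 - \frac{\sum_{i=1}^{n} r_i - \sum_{i=1}^{n} i}{n(N - n)}

A perfect ranking, with all the relevant documents at the top, gives 1; the worst possible ranking gives 0; expressed as percentages these values yield figures of the kind quoted above.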
From this mass of detailed results the Report authors draw two main
conclusions. First, that
`every set of figures supports the original hypothesis of an inverse
relationship between recall and precision. It is immaterial which variable
is changed to give a new system; it may be the coordination level . . ., the
exhaustivity of indexing . . ., the recall devices . . ., the precision
devices . . ., the search programmes . . ., or the relevance decisions . . .;
it has been impossible to find any exception to what can be claimed as a
basic rule.'
(p.252)
Second, that
`quite the most astonishing and seemingly inexplicable conclusion that
arises from the project is that the single term index languages are superior
to any other type.' (p.252)
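Recall and precision here are the usual ratios: recall is the proportion of the relevant documents that are retrieved, and precision the proportion of the retrieved documents that are relevant,

    recall = \frac{\text{relevant retrieved}}{\text{total relevant}}, \qquad precision = \frac{\text{relevant retrieved}}{\text{total retrieved}}

so devices that widen a search tend to raise the first ratio while lowering the second.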
With respect to the different language groups the authors conclude that
there was an optimum level of specificity: the initial simple concepts were
over-specific, so performance improved as the terms were broadened; the
single terms were about right, so broadening degraded performance; and the
controlled language came between the two, so broadening depressed
performance, but only moderately. Thus, more specifically, the authors
concluded that:
`(1) In the environment of this test, it was shown that the best performance
was obtained by the use of Single Term index languages.
(2) With these Single Term index languages, the formation of groups of
terms or classes beyond the stage of true synonyms or word forms
resulted in a drop of performance.
(3) The use of precision devices such as partitioning and interfixing was
not as effective as the basic precision device of coordination.' (p.255)
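The coordination referred to in (3) is simply the requirement that a retrieved document match the question on at least a given number of index terms. The following minimal sketch, a modern illustration rather than the project's actual search procedure (the function and document names are invented), shows how raising the coordination level acts as a precision device:

    def coordination_level_search(query_terms, documents, k):
        # Illustrative sketch only, not the Cranfield implementation.
        # A document is retrieved at coordination level k if it shares
        # at least k index terms with the question.
        query = set(query_terms)
        return [doc_id for doc_id, terms in documents.items()
                if len(query & set(terms)) >= k]

    # Hypothetical example: at level 2 only doc2 qualifies; lowering the
    # level to 1 also retrieves doc1, raising recall at the cost of precision.
    docs = {"doc1": ["boundary", "layer"],
            "doc2": ["boundary", "layer", "transition"]}
    print(coordination_level_search(["boundary", "transition"], docs, 2))  # ['doc2']
    print(coordination_level_search(["boundary", "transition"], docs, 1))  # ['doc1', 'doc2']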
The authors then consider whether the test environment was responsible
in some specific way for the results. For as they say, their conclusion that the
single term languages are superior