based on simulated ranking was to provide single normalized recall figures for each language, following the Smart model. This supplies the overall merit ordering of the languages given in the key synoptic table, Figure 8. This shows a best normalized value of 65.82 for the single term language with word forms conflated, and a worst of 44.64 for the simple concepts. More globally, all but one of the single term languages are placed first in the list, with normalized recall ranging from 65.82 down to 63.05, followed by two concept languages and then all the controlled languages, these with normalized recall from 61.76 to 59.17, followed by all the remaining concept languages. Abstracts and titles are comparable with controlled terms.

A variety of subsidiary analyses show, for example, that absolute performance for different relevance grades varies, but that the inverse recall/precision relationship is maintained. Furthermore, it appeared that better performance was obtained for lower generality questions; that the basic questions performed better than the supplementary ones; and that different subject areas probably affect absolute if not relative performance.

From this mass of detailed results the Report authors draw two main conclusions. First, that every set of figures supports the original hypothesis of an inverse relationship between recall and precision: 'It is immaterial which variable is changed to give a new system; it may be the coordination level ..., the exhaustivity of indexing ..., the recall devices ..., the precision devices ..., the search programmes ..., or the relevance decisions ...; it has been impossible to find any exception to what can be claimed as a basic rule.' (p.252) Second, that 'quite the most astonishing and seemingly inexplicable conclusion that arises from the project is that the single term index languages are superior to any other type.' (p.252)

With respect to the different language groups the authors conclude that there was an optimum level of specificity: the initial simple concepts were over-specific, so performance improved as the terms were broadened; the single terms were about right, so broadening degraded performance; and the controlled language came between the two, so broadening depressed performance, but only moderately. Thus, more specifically, the authors concluded that:

'(1) In the environment of this test, it was shown that the best performance was obtained by the use of Single Term index languages.
(2) With these Single Term index languages, the formation of groups of terms or classes beyond the stage of true synonyms or word forms resulted in a drop of performance.
(3) The use of precision devices such as partitioning and interfixing was not as effective as the basic precision device of coordination.' (p.255)

The authors then consider whether the test environment was responsible in some specific way for the results. For, as they say, their conclusion that the single term languages are superior