8.3 The conclusions of information retrieval testing

Has all this testing activity led to a set of general conclusions about the design and operation of information retrieval systems, especially manual ones? It is perhaps no surprise that even the answer to such a question is a matter for opinion and debate. Both the content and the status of general findings are viewed in various ways.

Laws, rules or principles?

If information retrieval is a behavioural science it is unlikely that inviolate laws await discovery. Researchers have therefore more wisely spoken of hypotheses, rules or fundamental principles. Cyril Cleverdon23 specified 13 hypotheses arising from Cranfield 1; Gerard Salton set out a set of rules governing automatic text analysis24; and Michael Keen and Jeremy Digger gave ten findings in the form of principles9. Cyril Cleverdon has referred to three principles he regards as fundamental25, which may be stated in the present writer's terms as follows:

(1) As a search proceeds and retrieves an increasingly large number of documents, the numbers of relevant and irrelevant documents retrieved increase monotonically, as do the measures of recall and fallout. Because the precision ratio is related to both these measures, there is a high probability of an inverse relationship between recall and precision.

(2) If indexing exhaustivity is increased, so is the recall ceiling. For a given desired level of recall there is an optimum level of indexing exhaustivity: below this level recall will suffer, and above it precision will deteriorate. However, the optimum level may have a quite wide range of acceptable values26.

(3) If indexing specificity is increased, the precision ratio rises. Specificity may be adjusted either by the semantic specificity of the index terms or by the levels of term combination usable in searching. For a given desired level of precision there is an optimum level of specificity, though the range of values is not well understood.

The first of these three principles incorporates some of the important qualifications that safeguard against a naive view of the recall/precision trade-off, as spelled out by Cyril Cleverdon27, but misunderstandings and disagreements break out from time to time. The sketch below illustrates the monotonic behaviour described in principle (1).
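As a worked illustration of principle (1), the following minimal Python sketch computes recall, precision and fallout at successive search cut-offs. The collection size, relevance judgements and ranked output are invented for the purpose and drawn from no actual test.

# Principle (1) in miniature: as the search cut-off widens, recall and
# fallout rise monotonically while precision tends to fall. All figures
# below are hypothetical.
# recall    = relevant retrieved / all relevant in the collection
# precision = relevant retrieved / all retrieved
# fallout   = irrelevant retrieved / all irrelevant in the collection

collection_size = 20
relevant = {1, 4, 7, 9}                      # hypothetical relevant documents
ranking = [1, 4, 12, 7, 3, 15, 9, 2, 18, 6]  # hypothetical search output order

for cutoff in range(1, len(ranking) + 1):
    retrieved = set(ranking[:cutoff])
    hits = len(retrieved & relevant)
    recall = hits / len(relevant)
    precision = hits / len(retrieved)
    fallout = (len(retrieved) - hits) / (collection_size - len(relevant))
    print(f"cutoff={cutoff:2d}  recall={recall:.2f}  "
          f"precision={precision:.2f}  fallout={fallout:.2f}")

Run over these invented figures, recall climbs from 0.25 to 1.00 and fallout from 0.00 to 0.38, while precision drifts down from 1.00 to 0.40: the inverse relationship the principle predicts.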
The writer's view of the more detailed practical findings of manual laboratory tests adds the following ten matters:

(4) Different types of classificatory index language do not differ substantially in performance merit (Cranfield 1).

(5) Controlled index languages, such as classification, alphabetical headings and multiple entry systems (e.g. Uniterm, thesauri, etc.), differ little in performance (Cranfield 1, Cranfield 2, ISILT, Off-shelf).

(6) Index languages uncontrolled at the indexing stage do not perform worse than controlled ones (Cranfield 2, ISILT).

(7) Extensive cross-references are not needed for high recall, and there is an optimum level above which precision suffers (Cranfield 1, Cranfield 2, ISILT).

(8) Syntactical devices used explicitly in searching (e.g. links, roles,