Information Retrieval Experiment
Laboratory tests of manual systems, by E. Michael Keen
Edited by Karen Sparck Jones. Butterworth & Company.

... only require measures of hits and waste, as Cranfield 2, so this criterion for test validity is to be related to the objectives of a given test. A second desideratum applies to tests in which performance is compared under different circumstances: the comparison must be so controlled that the cause of any performance difference can be determined. This may not always demand that no more than a single variable be altered at any one time, but in the present state of information retrieval theory that is usually the safest procedure. Another aspect of this requirement is that, where possible, the test environment factors should be held constant; for example, a comparison of manual Boolean logic search results with those of a search path from a ranked search output would best be made on one test collection, using the same set of search requests and relevance decisions. It is recognized that some comparisons cannot be made without some change in environment factors, such as the comparison of general and specific requests, which requires different request sets, but even then there is an advantage in keeping the document collection unchanged.

A third criterion for acceptability is the practical matter of the availability of a full report describing the test in sufficient detail. Minimum lists of matters to be included when reporting evaluation tests have been suggested, but none has gained acceptance. If the method used for some vital part of an experiment cannot be determined, then its results are really as suspect as those from tests known to be inadequate.

8.2 Test types

The history of laboratory manual testing seems to consist of only a few large studies, each one looking at a number of the basic parameters that govern the behaviour of information retrieval systems. Few hypotheses have been clearly formulated, but these tests constitute a host of quite tight experiments that have given us most of the light we have on index languages, indexing and searching. Examples of tests will now be given, categorized for convenience into index language comparisons, indexing and searching experiments, and printed index comparisons. The writer's own work will often be used as the main illustration of these distinctive test types, so other studies would need to be added for a comprehensive picture. Some of the findings and conclusions of this testing activity will be given in the next section.

Index language comparisons

Cranfield 1 remains a classic set of experiments in objectives, details and procedures1,2, which provided all the necessary and sufficient circumstances for testing. All subsequent tests have, knowingly or unknowingly, faced the same problems, but rarely with the common sense and ingenuity of Cyril Cleverdon. What was tested has already been briefly described.
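The "hits and waste" measures referred to above correspond, in later terminology, to recall and precision. As a minimal sketch of the controlled, same-collection comparison the chapter advocates, the following Python fragment evaluates a Boolean run and a ranked run against one shared set of requests and relevance decisions, so that any performance difference can be attributed to the search strategy alone. All data, identifiers and names here are invented for illustration; nothing in the sketch comes from the tests described in the chapter.

```python
# Illustrative sketch only: "hits and waste" (recall and precision) for two
# search strategies run over the same test collection, as the controlled-
# comparison desideratum requires. All document ids and requests are made up.

def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one request."""
    hits = len(set(retrieved) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# One shared set of requests and relevance decisions (sets of document ids).
relevance = {
    "q1": {3, 17, 42},
    "q2": {5, 8},
}

# Output of a manual Boolean search: an unordered retrieved set per request.
boolean_output = {"q1": [3, 17, 99], "q2": [5, 8, 21, 30]}

# Output of a ranked search, cut off at a fixed depth so that only the
# search strategy, not the amount retrieved, differs between the two runs.
ranked_output = {"q1": [17, 3, 42, 7], "q2": [8, 11]}

for name, run in [("Boolean", boolean_output), ("Ranked", ranked_output)]:
    for q, retrieved in run.items():
        r, p = recall_precision(retrieved, relevance[q])
        print(f"{name} {q}: recall={r:.2f} precision={p:.2f}")
```

Because both runs are scored against the same requests and relevance decisions, holding the environment factors constant in this way is exactly what allows a single-variable interpretation of any difference observed.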
Overlapping it in time was a test of a faceted classification used manually, versus a complex semantic code and role operator system with machine searching, known as the Western Reserve University test8. Here Cyril Cleverdon and Jean Aitchison showed that a small test collection in laboratory search