Information Retrieval Experiment. Laboratory tests: automatic systems, by Robert N. Oddy. Edited by Karen Sparck Jones. Butterworth & Company. All rights reserved. theoretical considerations which motivated the system design. (I am thinking, particularly, of relevance feedback techniques8,9.) (2) Conceptual simplification. Very often, extremely complex phenomena, to which the behaviour of real-life systems is closely linked, enter the laboratory test as simple abstractions: for example, an information need may become a list of index terms, and a relevance judgement a truth-value. The significance that we place upon test results will depend upon the appropriateness of these abstractions. This problem is not specific to automated laboratory testing, but I wish to mention it here because it is more easily brushed aside when users are not present to remind one of the simplifications that have been made. (3) Extrapolation problems. It is usually necessary to work with small samples of document collections and queries in the laboratory, so that it is feasible to assemble complete relevance judgements, and so that the demands on computer resources are not excessive. The statistical problem of extrapolating from laboratory results to real-life systems is severe. This chapter is not directly concerned with the problem (see Chapter 2); but it should be borne in mind in the present context. (4) Technical faults.
It is remarkably difficult to ensure the correctness of a computer program of any substantial size, and often it is not at all obvious from its behaviour that a program is faulty. In principle, therefore, we have the problem of demonstrating the credibility of test results. (5) Communication difficulties. A detailed account of the workings of a program makes for very heavy reading, and is often unnecessary in research papers: a summary, or a description by analogy, is usually more illuminating. There is, however, a danger of ambiguity if programs so described contain unobvious interpretations of the theory, or ad hoc procedural `theories'. Before I attempt an assessment of the methodology, I shall give a fairly detailed description of the nature of the data and programs used in the mainstream of automatic laboratory information retrieval testing.

9.2 The test collection

The data used for information retrieval laboratory testing has been the subject of some discussion recently20-22. Retrieval experiments make use of what are commonly referred to as `test collections'. Such data consists of a static collection of document descriptions, queries and relevance judgements. The numbers of documents and queries are usually small, so that the labour required for setting up the test collection (particularly for obtaining complete relevance judgements) is kept within reasonable bounds. This has had serious consequences for the acceptability of the results obtained in laboratory situations, and it is consideration of this problem that has led Sparck Jones and van Rijsbergen23 to propose that a large `ideal' test collection be designed for use by a wide range of experimenters. I shall return to this idea presently, but should like first to deal with the small test collections traditionally used in tests of automatic information retrieval systems. The raw data for a test collection can take a number of forms. The whole