IRE
Information Retrieval Experiment
The methodology of information retrieval experiment
chapter
Stephen E. Robertson
Butterworth & Company
Karen Sparck Jones

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Statistical ideas and questions

hypothesis on the basis of a single test, one test is certainly an insufficient basis for acceptance: one must look for a number of different ways to test a hypothesis before accepting it, even if only provisionally. In information retrieval, this has generally meant testing on several different test collections (of documents, queries and relevance judgements). The reason for this form of multiple testing is that the most obvious variable (which could cause a hypothesis which works under some conditions to fail under others) is subject: the different test collections are usually in different subject areas. But little attention has as yet been paid to other variables which might cause problems, such as document or query type, or heterogeneity of the document collection in terms of subject matter or date. This lack is partly a function of availability of resources: as discussed above, test facilities which would allow such tests to be made do not exist at present and would be expensive to set up.

As I have indicated, this scarcity of results from laboratory tests on the various variables associated with document and query collections which might influence the results of retrieval tests is also unfortunate from the point of view of operational system testers.
It is to be hoped that more work will be done on these problems.

Experimental design

So far, I have assumed the problem to be: 'Given the results of this test, what can we infer?'. But one can also approach the statistical aspects from the opposite direction: 'Given the sort of inferences I am looking for, how should I design my test to ensure that I get suitable results?'.

The obvious and commonest application of this idea is to sample size. Suppose that we want to ensure (at least to a certain level of confidence) that, if system A really performs so much better than system B, then the test results will lead to the correct inference. Assuming we know in advance which significance test we are going to use, and something about the distributions of the variables we are measuring, then it is possible to specify a minimum sample size to achieve this aim. Because of the difficulties of finding suitable methods, few testers actually do statistical significance tests, let alone define the minimum sample size in advance. So this kind of procedure is not yet common in retrieval tests, though it should become more so.

A second procedure common in experimental design generally is concerned with the control of variables. Suppose that we are to do a test involving a small number of searchers (intermediaries) on a number of different systems. The object of the exercise is to compare the systems, but it may be that the choice of searcher will have a strong influence on the results for an individual query. Further, this influence may depend on the combination of searcher and system, rather than just the searcher. So we must devise a method for ensuring that the variations between searchers do not in any way distort the comparison between the systems. There are well established methods, such as Latin square designs, for coping with this kind of problem; some such methods have been used to good effect in retrieval tests.
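The two design ideas just described, fixing a minimum sample size in advance and balancing searchers across systems with a Latin square, might be sketched as follows. This is only an illustration under stated assumptions, not a method taken from the text. The sample-size function assumes a two-sided paired z-test on per-query differences that are roughly normal with a known standard deviation, which is a strong assumption for retrieval measures. The function names (min_sample_size, latin_square) and parameters (delta, sigma_d, alpha, power) are hypothetical.

```python
import math
from statistics import NormalDist


def min_sample_size(delta, sigma_d, alpha=0.05, power=0.8):
    """Smallest number of queries for a two-sided paired z-test.

    Assumption (not from the text): per-query differences between
    system A and system B are roughly normal with known standard
    deviation sigma_d; delta is the true mean difference we want
    the test to detect with probability `power`.
    """
    if delta <= 0 or sigma_d <= 0:
        raise ValueError("delta and sigma_d must be positive")
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return math.ceil(((z_alpha + z_power) * sigma_d / delta) ** 2)


def latin_square(systems):
    """Cyclic Latin square for n systems, n searchers, n query groups.

    Row r is a searcher, column c is a group of queries, and the entry
    is the system that searcher uses for that group. Each system occurs
    exactly once in every row and every column, so searcher effects and
    query-group effects are balanced across the systems.
    """
    n = len(systems)
    if n == 0 or len(set(systems)) != n:
        raise ValueError("systems must be a non-empty list of distinct names")
    return [[systems[(r + c) % n] for c in range(n)] for r in range(n)]


if __name__ == "__main__":
    # Illustrative numbers only: detect a mean difference of 0.05
    # when the per-query differences have standard deviation 0.2.
    print(min_sample_size(delta=0.05, sigma_d=0.2))
    for row in latin_square(["A", "B", "C"]):
        print(row)
```

In a real test the rows and columns of the square would normally be assigned to searchers and query groups at random, and the normal approximation would need to be checked against the actual distribution of the performance measure.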
Again, suppose we are testing alternative relevance feedback procedures. The problem is to isolate, in some way, the effect of the relevance feedback from the performance of the system without feedback. This is not an entirely