IRE Information Retrieval Experiment
The methodology of information retrieval experiment
Stephen E. Robertson
Karen Sparck Jones
Butterworth & Company

requests, and the entire existing document collection, then inferences can be made by standard statistical techniques (such as significance tests). Indeed, we can to some extent reverse this procedure, and calculate what sample size is required in order to establish a certain difference between the two systems at a given level of confidence.

Unfortunately, the situation is rarely so simple. The complications, as can be guessed from the specification of the simple case, are many and various. To a large extent, the problems are as yet unsolved; some of them admit (in principle, at least) of a statistical solution; some of them would certainly require other ideas to be combined with the statistical ones, ideas which might for example be described as linguistic, psychological, epistemological or even simply 'retrieval-theoretic'.

Two populations

I assumed in the simple case that, in moving from the situation we are measuring to the situation about which we wish to make inferences, the set of queries changes but the set of documents remains the same. It is possible to imagine an experiment in which the two roles are reversed: an experiment concerned with certain specified SDI queries, with the document collection being completely new each month. In such a case, we would regard the document collection as a sample and make statistical inferences accordingly.
But far more commonly, we have the situation in which neither the query set nor the document collection remains the same. Even in most straightforward tests on operational systems, the document collection changes more or less gradually with time; and one is seldom in a position where one wants to know only about existing queries. So the normal case is one in which we have to consider both the test set of queries and the test collection of documents as (in some sense) samples from a population.

Suppose, then, that we can regard both samples as random: that is, in both cases, the sample is representative of the population, with no systematic differences or biases. In these circumstances, can we call in standard statistical techniques in order to make inferences about the two populations and their interactions from the measurements that we make on the samples? Even for this (still comparatively simple) case, the answer is no: although in principle the problem remains a purely statistical one, very little exists in the way of standard methods which are formally valid under such conditions.

As a result, many testers have tried to apply statistical methods which assume only one sampling process, and have simply ignored the second. Early work on these lines tended to use the document as the critical unit: that is, to regard the test collection of documents as a random sample from a population, and to ignore the problem in connection with requests. However, more recent work has tended to follow the reverse view. There are two reasons for this change. The first is that some of the measurements that have been used are query-oriented, and in order to make any inferences at all with such measures one must consider the queries as a sample (whatever one does about the documents).
The second is that in general, the number of queries tends to be a much more critical quantity than the number of documents: for reasons which will be clear from earlier discussions, the tester usually has access to many more documents than requests.
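The query-oriented view described above, in which the queries are treated as the sample, lends itself to standard paired significance tests over per-query measurements. As a minimal sketch of one such test (the chapter does not prescribe a particular test; the choice of a paired sign test, the function name, and the scores used are illustrative assumptions):

```python
from math import comb

def sign_test_p(scores_a, scores_b):
    """Two-sided paired sign test over a sample of queries.

    For each query, note which system scored higher (ties are dropped),
    then compute the exact binomial probability of a split at least this
    uneven under the null hypothesis that neither system is better.
    Illustrative only: not a procedure prescribed by the chapter.
    """
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b
    if n == 0:          # every query tied: no evidence either way
        return 1.0
    k = max(wins_a, wins_b)
    # P(X >= k) for X ~ Binomial(n, 1/2), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Because the test conditions only on per-query comparisons, it makes no distributional assumption about the documents; it treats the queries, and only the queries, as the random sample, which is exactly the one-sampling-process simplification described in the text.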
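The sample-size calculation mentioned for the simple case, finding how many queries are needed to establish a given difference between two systems at a given level of confidence, can likewise be sketched. This assumes the standard normal approximation for comparing two proportions; the function name, the use of a success proportion as the measure, and the default significance and power levels are assumptions for illustration, not the chapter's prescriptions.

```python
import math
from statistics import NormalDist

def required_sample_size(p1, p2, alpha=0.05, power=0.80):
    """Approximate number of queries needed per system to detect a
    difference between success proportions p1 and p2 (e.g. the fraction
    of queries for which each system retrieves a relevant document),
    by the usual normal approximation for two proportions.
    Illustrative sketch only.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    pbar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pbar * (1 - pbar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)
```

A calculation of this kind makes concrete why the number of queries is the critical quantity: small differences between systems demand query samples running into the hundreds, whereas the document collection enters only through the per-query measurements.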