IRE: Information Retrieval Experiment
Chapter: The methodology of information retrieval experiment
Stephen E. Robertson
Butterworth & Company
Karen Sparck Jones

articles) as our test collection of documents, but suppose that five years hence the proportion of journal articles will be more like 50 per cent. System A, as it happens, is based on the title of documents, whereas system B involves some intellectual indexing. Because research reports are on the whole longer and more substantial documents than journal articles, they are represented (on the whole) by more index terms in system B; but their titles are of very similar length, so in A the two types of documents tend to have similar size representations.

Under these conditions, we might surmise, system B is a good deal more expensive than system A, and works considerably better on reports but roughly the same on articles. Thus our test will show a marginal performance advantage to B, but at greatly increased cost; on a cost-effectiveness basis, we might well feel justified in choosing A. But as the proportion of research reports rises in the future, the average performance difference between the systems will increase. So we may have made a mistake, as far as the situation in five years' time is concerned.

The questions that arise from this example are: how could we detect this change in the makeup of the collection; how could we assess its importance; and how could we make appropriate adjustments to our results.
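The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the effectiveness scores are invented (the text gives no figures), the function names are hypothetical, and the current share of 80 per cent journal articles is an assumed value. Only the 50 per cent future share comes from the example.

```python
def blended_score(score_by_type, share_by_type):
    """Average a per-document-type score, weighted by each type's share of the collection."""
    total = sum(share_by_type.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(score_by_type[t] * share_by_type[t] for t in share_by_type)


# Hypothetical effectiveness scores (any single-number measure would do).
# A performs the same on both document types; B is better on reports only.
SCORES = {
    "A": {"article": 0.50, "report": 0.50},
    "B": {"article": 0.50, "report": 0.70},
}


def advantage_of_b(article_share):
    """Average performance of B minus that of A for a given share of journal articles."""
    shares = {"article": article_share, "report": 1.0 - article_share}
    return blended_score(SCORES["B"], shares) - blended_score(SCORES["A"], shares)


today = advantage_of_b(0.80)   # assumed current mix: 80 per cent articles
later = advantage_of_b(0.50)   # mix suggested in the example for five years hence

print(f"advantage of B today: {today:.2f}")
print(f"advantage of B later: {later:.2f}")
```

With these invented figures the advantage of B grows as reports become more common, which is the effect the example describes. The size of the growth depends entirely on the assumed scores and shares.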
These questions are closely connected because we are only interested in looking for changes that may be important. The problem is, we have little idea of which variables may have major effects. Below, I discuss the paucity of results from laboratory tests that might help in this situation.

So, for the tester of operational systems, the only way ahead is to make a guess at any variables that may be important. The question of how to detect changes in these variables is clearly one of observation and further guesswork. In the example discussed above, suppose that we guess, at the time of the test, that the type of document (or the proportion of different types) might be a source of problems. Then we could examine current input to the system (as against the existing cumulated collection) to see whether such a change might already be happening. We could also look at the sources of documents and any changes that may be happening in the publication process.

Having detected a change in some variable, we want to find out whether it may have important effects. We could, in principle, include this question in our experimental design: in the example, we may have to divide the collection into journal articles and research reports, and make separate measurements on the two collections.

Finally, we want to make appropriate predictions. This would involve guesstimating the possible proportion of journal articles in five years' time (or at different times over the expected lifetime of the system), and weighting the results of our test appropriately.

Artificial queries

The foregoing discussion of sample adequacy assumes that the samples are taken from a situation X, we wish to make inferences about a situation Y, and we can make some reasonable guesses about the relation between X and Y. Earlier, I suggested that there are sometimes strong reasons for constructing artificial queries rather than acquiring real ones.
Obviously, a set of artificial queries is in no sense a sample of any real population, either