circumstances, one should try to avoid any bias that may be introduced by the ordering: one should not (if this is compatible with other aspects of the experimental design) present the output of one system and then the output of another; instead, the two should be mixed together.

Many of these recommendations may in fact conflict with other aspects of the experimental design. Thus in testing highly interactive systems, one may need to obtain relevance judgements from the requester in the course of the search. In such circumstances, it may be necessary to introduce new experimental techniques in order to avoid some of the problems mentioned above.

Finally, what instructions should be given to the judges, and in what form should the assessments be obtained? The usual method is to describe a small number of categories, such as 'Answers the question completely', 'Is of major importance in answering the question', 'Is of marginal importance', and 'Is of no help at all'. Normally more than two categories are provided, although they are usually conflated into just two (relevant/non-relevant) at the analysis stage. This may seem a strange procedure, but it may be easier for a judge to use more than two categories, even if there is no experimental reason for obtaining the additional information. There also remains a feeling that we should have methods of analysis that take account of degrees of relevance; but, on the whole, no such methods exist. Some experiments on relevance have included attempts to get the judges to rank the documents rather than assign them to categories, and there is indeed some evidence that different judges are more consistent in their rankings than in their assignments to categories. But again, no suitable methods of analysis exist for using ranked relevance judgements in retrieval tests. The problem of relevance is discussed in a wider context by Belkin in Chapter 4.

Analysis

Having obtained the basic measurements, one then has to analyse the data in such a way as to answer the questions which were the raison d'être of the project. Such analysis may involve several stages. For example, we might successively:

(1) calculate, for each request and system, an appropriate measure of the effectiveness or efficiency of the system's response to the request;
(2) average this measure over the request set, for each system;
(3) compare the averages for the different systems; and
(4) perform a statistical significance test on the difference.

In fact, the subject of how to analyse retrieval test data has been, with the problem of relevance, one of the two most highly debated topics in the field. The debate was originally simply about the choice of an appropriate measure (of effectiveness, cost-effectiveness, benefit or whatever), but it has lately come to include the three other aspects as well.
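By way of illustration only (none of the following appears in the original text), the four stages might be sketched in code. The sketch assumes binary relevance judgements, takes precision at a fixed cutoff as the effectiveness measure, and invents toy data for two hypothetical systems; stage (4), the significance test, is taken up after the next paragraph.

    # Hypothetical toy data: relevant documents per request, and the
    # ranked output of two systems (all invented for illustration).
    judgements = {"q1": {"d1", "d4"}, "q2": {"d2"}, "q3": {"d3", "d5"}}
    runs_a = {"q1": ["d1", "d2", "d4"], "q2": ["d2", "d1", "d3"], "q3": ["d5", "d3", "d1"]}
    runs_b = {"q1": ["d2", "d3", "d1"], "q2": ["d1", "d2", "d4"], "q3": ["d3", "d1", "d2"]}

    def precision_at_k(ranked, relevant, k=3):
        """Stage (1): effectiveness of one system's response to one request."""
        return sum(1 for d in ranked[:k] if d in relevant) / k

    def evaluate(runs, judgements, k=3):
        """Per-request scores for one system: {request_id: precision@k}."""
        return {req: precision_at_k(ranked, judgements[req], k)
                for req, ranked in runs.items()}

    def mean(scores):
        """Stage (2): average the measure over the request set."""
        return sum(scores.values()) / len(scores)

    # Stage (3): compare the averages for the different systems.
    scores_a = evaluate(runs_a, judgements)
    scores_b = evaluate(runs_b, judgements)
    print(f"System A: {mean(scores_a):.3f}  System B: {mean(scores_b):.3f}")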
Indeed, it is difficult to separate the four: for example, one statistical significance test that has been used in this context requires that the comparison between different systems be made at the individual request level, rather than after averaging.
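The text does not name the particular test; one common test of this paired kind is the sign test, applied to the per-request score differences. The following sketch (again an illustration under that assumption, not the author's stated method) continues from the code above:

    from math import comb

    def sign_test(scores_a, scores_b):
        """Two-sided sign test on paired per-request scores; ties drop out."""
        wins_a = sum(1 for q in scores_a if scores_a[q] > scores_b[q])
        wins_b = sum(1 for q in scores_a if scores_a[q] < scores_b[q])
        n = wins_a + wins_b
        if n == 0:
            return 1.0
        k = min(wins_a, wins_b)
        # Two-sided binomial tail probability under the null (p = 0.5).
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2 * tail)

    print("p =", sign_test(scores_a, scores_b))

Because the test operates on the paired per-request scores, it compares the systems before any averaging, which is precisely the constraint described above; with only three toy requests, of course, no difference could approach significance.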