aspects, and it is a little difficult to generalize about the design of retrieval tests in this sense. There is, however, one fairly clear-cut example which can be discussed here. Most experiments to date have involved using just one set of requests, and trying each request on both or all the systems to be compared (i.e. 'replicating' the searches). There are clear statistical reasons for doing this, if possible: since requests are difficult to obtain for the reasons discussed above, one is usually working with relatively small numbers of them; and any statistical significance testing to be done on the results can be made much more efficient by a 'matched pairs' procedure, whereby the performance of the two (or more) systems on any one request is compared.

However, there are some circumstances under which this is not possible. If one wishes to compare highly interactive systems, for example, where the user is encouraged by the system to provide additional information about his/her need, then one cannot put the same 'request' (i.e. user need) to two different systems, since the user will have learnt too much from the first system. Statistical aspects of retrieval testing are discussed further below, and by Tague in Chapter 5.

Measurement: performance

What are the basic measurements with which a retrieval test is likely to be concerned? Most information retrieval tests are ultimately concerned with the effectiveness or performance of each system, or the benefits which derive from its use, or cost-effectiveness or -benefit. Central to all of these questions is the question of how well the system responds to each query presented to it. This 'how well' can be looked at in many different ways: how closely each document output by the system matches the user's need; how useful each document is in satisfying the need; how satisfied the user is with the output as a whole; and so on.

It may seem strange, to anyone more familiar with the harder sciences, that I refer to such an obviously subjective matter under the heading of 'measurement'. However, it is clearly a direct consequence of my definition above of the function of an information retrieval system that some such subjective notion must enter into any assessment of information retrieval system performance.

Most commonly, documents output by the system are individually assessed for relevance to the user's need. The word 'relevance' has been used in many different ways, but broadly it corresponds to the first of the three questions above: that is, how well does the document match the user's need. Both the notion itself and its appropriateness to retrieval tests are the subject of much debate and also some experiment.
Generally speaking, the assessment of relevance allows of a 'harder' form of analysis than any other assessment in this category of subjective responses to system output, since for example it allows one to ask the question: Why did the system fail on such-and-such a document? On the other hand, utility or user satisfaction may be regarded as being closer to the true objective of an information retrieval system, and therefore better or more valid measurements to make when trying to assess system performance. The debate continues.
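To make the matched-pairs idea discussed earlier concrete, the following Python sketch compares two hypothetical systems on the same set of requests using an exact sign test on per-request effectiveness scores. Everything in it is illustrative: the request numbers and scores are invented, the choice of a sign test is only one of several tests that could be used, and the suggestion of precision at a fixed cutoff as the per-request score is an assumption, not something prescribed in this chapter.

# Illustrative sketch only: a matched-pairs sign test comparing two retrieval
# systems on the same requests. Scores and requests are invented; in a real
# test each score would come from relevance assessments of retrieved documents.
from math import comb

# Hypothetical per-request effectiveness scores (e.g. precision at a fixed
# cutoff) for the same ten requests run on two systems.
system_a = {1: 0.50, 2: 0.30, 3: 0.80, 4: 0.20, 5: 0.60,
            6: 0.40, 7: 0.70, 8: 0.10, 9: 0.90, 10: 0.50}
system_b = {1: 0.40, 2: 0.30, 3: 0.60, 4: 0.10, 5: 0.50,
            6: 0.50, 7: 0.60, 8: 0.10, 9: 0.70, 10: 0.40}

def sign_test(scores_a, scores_b):
    """Exact two-sided sign test on matched pairs (ties are dropped)."""
    wins = sum(1 for q in scores_a if scores_a[q] > scores_b[q])
    losses = sum(1 for q in scores_a if scores_a[q] < scores_b[q])
    n = wins + losses
    if n == 0:
        return wins, losses, 1.0
    k = min(wins, losses)
    # Probability of a split at least as one-sided as the one observed, under
    # the null hypothesis that either system is equally likely to do better
    # on any given request.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins, losses, min(p, 1.0)

wins, losses, p = sign_test(system_a, system_b)
print(f"A better on {wins} requests, B better on {losses}, p = {p:.3f}")

On these invented figures the test finds system A better on 7 of the 8 untied requests (p is roughly 0.07). The point of the example is that the comparison is made request by request, which is why replicating the searches lets a small, hard-won set of requests support a more sensitive significance test than two independent samples of requests would.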