IRE Information Retrieval Experiment The methodology of information retrieval experiment chapter Stephen E. Robertson Butterworth & Company Karen Sparck Jones

Collections are normally communicated in machine-readable form (on tape); documents are usually available as the texts of abstracts, and/or some form of index representation. The existence of these collections has had a considerable influence on the direction of research in the field, for the simple reason that some processes (such as automatic indexing from full text) are not possible on these collections as they currently exist.

In these circumstances, it is at least arguable that the research community should set up one or more genuinely portable test collections: collections that are designed as general-purpose research tools, rather than taking on that role by accident. Although some work has been done in the last few years on the desirable characteristics of a portable test collection, no such collection has been built. But this is clearly a direction in which future laboratory work in document retrieval might move.

2.4 Statistical ideas and questions

Why statistics?

A test of a retrieval system necessarily involves, as we have seen, some kind of measurement (in a general sense of the word) of certain aspects of the way the system works. But this information about the system is of necessity historical: it concerns acts of retrieval which have already happened.
The only ultimate reason for testing a retrieval system must be to discover or infer something about future acts of retrieval, either in the sense of future requests put to the same system, or in the sense of general principles (from which particular deductions about the future might be made). Such inferences are the subject-matter of statistics.

More particularly, having performed a comparison of two systems on specific samples of documents and requests, we may be interested in the statistical significance of the difference, that is in whether the difference we observe could be simply an accidental property of the sample or can be assumed to represent a genuine characteristic of the populations. Further, we may want to enlist the aid of statistical methods in discovering the underlying reasons for what we observe.

We can illustrate the peculiar difficulty of applying statistical methods to information retrieval test data by first describing an unrealistically simple situation. The rest of this chapter is devoted to an examination of the underlying problems that emerge as we try to deal with reality. More concrete recommendations and suggestions are provided by Tague in Chapter 5.

A simple case

Consider the case of an operational test which is designed to decide between two existing alternative systems, for a particular collection of documents and a particular clientele. Assume further that (a) the collection of documents is complete, and will not be added to or changed in the future, and (b) the characteristics of the clientele, and of the kinds of requests that they make, will not change in the future. Then we have a reasonably good case from the point of view of statistics; if we use a random sample of the incoming