need, rather than simply to present documents in response to a formal request.

The documents and requests

Almost all information retrieval system tests use genuine documents. Are there alternatives? It is possible to do some simulation experiments using pseudo-documents which are generated in some fashion (perhaps involving Monte Carlo techniques) so as to imitate real ones in some specific sense (they may, for example, imitate only the index sets associated with documents, rather than the documents themselves). A sketch of one such generator is given at the end of this section. There is certainly a role for such experiments, to answer specific research questions; but for most purposes it is both better and easier to use the real thing. There is certainly no shortage of documents around! The only remaining question, then, is: Which documents? Here we run into the vexed question of sampling, about which more below.

With regard to requests, the situation is considerably more problematic. To be sure, there is (in principle, at least) no shortage of requests; the problems with obtaining such real requests are several. First, we have to catch them! Requests (in the sense of acts of requesting information) exist only for a short space of time, and have to be trapped at that time. Second, the actual locations of these request-acts are usually dispersed; a mechanism for trapping them that is located at one place may take a long time to trap a reasonable number or range of requests. Third, most test designs require, to a greater or lesser extent, the co-operation of the requester. This co-operation may be needed in connection with the operation of the system itself; it may also be needed for the measurement of system output (as discussed below). Fourth, all the three previous difficulties combine to exacerbate the problem of obtaining a sample of requests that is representative of anything.

These problems have prompted many testers to construct artificial queries. Such artificial queries may vary in their degree of realism. Some examples are discussed below.

Experimental design

Broadly, experimental design is concerned with arranging matters so that the experiment does answer the question(s) it is intended to answer. Obviously all the other components discussed above also come under this broad heading. But the phrase is used in a somewhat narrower sense, referring to those aspects of the design that determine whether, from a logical or statistical point of view, appropriate inferences can be drawn. The simplest question one can ask in this context is: How large a sample of documents (say) do I need for this experiment? A more detailed example would be: How should I use the different (human) searchers available, with the different requests and alternative systems, so as to separate the effects I want to measure (the difference between the systems) from those that I don't (any differences between searchers or between requests, or interactions between searchers and/or requests and/or systems)?
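Returning for a moment to the pseudo-documents mentioned earlier, the sketch promised there follows. It is a minimal, hypothetical illustration in Python, assuming a Zipf-like rank-frequency distribution over an invented vocabulary and an exponentially distributed index-set length; none of the numbers or names comes from any actual test collection.

    import random

    # Hypothetical sketch: generate pseudo-documents that imitate only the
    # index sets associated with documents, not the documents themselves.
    # Vocabulary size, weights and lengths are illustrative assumptions.
    VOCABULARY = [f"term{i}" for i in range(1, 5001)]

    # Zipf-like weights: the term of rank r is drawn with probability
    # proportional to 1/r, a crude imitation of real term statistics.
    WEIGHTS = [1.0 / rank for rank in range(1, len(VOCABULARY) + 1)]

    def pseudo_document(mean_length=12):
        """Return one pseudo-document as a set of index terms."""
        length = max(1, int(random.expovariate(1.0 / mean_length)))
        # Sampling with replacement and then deduplicating keeps the sketch
        # short; the resulting set may be slightly smaller than `length`.
        return set(random.choices(VOCABULARY, weights=WEIGHTS, k=length))

    # A Monte Carlo "collection" of 1000 index sets.
    collection = [pseudo_document() for _ in range(1000)]

The attraction of such a generator is that one property at a time (say, indexing exhaustivity, via mean_length here) can be varied while everything else is held constant, which is exactly the kind of specific research question these simulations suit.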
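As to the last question above, one classical device is a Latin-square allocation: with equal numbers of searchers, requests and systems, the systems are rotated so that every searcher uses every system once and every request is searched on every system once, preventing searcher or request differences from masking the system comparison. The sketch below, with invented searcher, request and system names, shows the allocation only, not the measurement.

    # Hypothetical sketch of a 3 x 3 Latin-square allocation.
    searchers = ["S1", "S2", "S3"]
    requests = ["Q1", "Q2", "Q3"]
    systems = ["sysA", "sysB", "sysC"]

    n = len(systems)
    assignment = {}
    for i, searcher in enumerate(searchers):
        for j, request in enumerate(requests):
            # Cyclic shift: the cell in row i, column j gets system
            # (i + j) mod n, so each system appears exactly once in
            # every row (searcher) and every column (request).
            assignment[(searcher, request)] = systems[(i + j) % n]

    for (searcher, request), system in sorted(assignment.items()):
        print(f"{searcher} searches {request} on {system}")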
Even within this narrower scope of experimental design, it has many