may be subject to direct experimental control (such as a threshold in a clustering experiment); or they may be intermediate variables which need to be measured (such as the number of terms in an index language, or inter-indexer consistency). Some of the latter (e.g. number of terms again) can easily be measured and do not affect the way the test is conducted; others (e.g. inter-indexer consistency) would impose some requirements on the design or conduct of the test.

Under some circumstances, such intermediate variables might be regarded as alternatives to performance variables. Thus if we assume that high inter-indexer consistency goes with good performance, we can test some aspects of the system by measuring the former instead of the latter (in contradiction of my earlier assertion of the necessity of testing the entire system). The important (and in fact unsolved) problem here is of course the validity of the assumption; certainly it would seem dangerous to rely on such an untested hypothesis. However, good use has been made of similar intermediate variables in explanatory experiments or investigations.

Measurement: performance limits and failure analysis

I have said that we would normally be interested in how well the system responds to each query presented to it. But the answer to this question may well beg answers to other questions, such as: What is the best possible response to this query? How well does the system's response measure up against this ideal? What are the reasons for falling short?

If we have measured performance in terms of relevant documents retrieved, this suggests two ways in which the response of the system may have fallen short of ideal: by retrieving non-relevant documents, and by failing to retrieve relevant ones. The former kind of failure will be apparent immediately if all the documents retrieved by the system are assessed for relevance. The second is more problematic; indeed, it is one of the major headaches of information retrieval system testing. How do we find out about those relevant documents which the system fails to retrieve?

In a laboratory experiment, with a small collection of documents, it might just be feasible for the requester or a substitute to scan the entire collection. But if there are more than a few hundred documents, this will be out of the question. An obvious alternative would be to sample the collection and scan the sample; but if one takes a typical operational collection and extracts a sample that is small enough for a requester to scan comfortably, it is unlikely to contain any relevant documents at all (since relevant documents are generally very sparse in such a collection). Most tests rely on methods that are not so satisfactory in a formal sense, but are dictated by pragmatic considerations.
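The sparseness argument can be made concrete with a rough calculation; the figures below are purely illustrative and not drawn from any particular collection or test. Suppose a collection of 100 000 documents contains 20 documents relevant to a given query, and that 200 documents is about as many as a requester could be asked to scan. A minimal sketch, in Python:

    # Purely illustrative figures: not taken from any actual collection or test.
    from math import comb

    collection_size = 100_000      # documents in a typical operational collection
    relevant_in_collection = 20    # relevant documents for one query (unknown in practice)
    sample_size = 200              # roughly what a requester could scan comfortably

    # Expected number of relevant documents in a simple random sample
    expected_relevant = sample_size * relevant_in_collection / collection_size
    print(f"expected relevant documents in sample: {expected_relevant:.3f}")   # 0.040

    # Probability that the sample contains no relevant document at all
    # (hypergeometric: sampling without replacement)
    p_none = (comb(collection_size - relevant_in_collection, sample_size)
              / comb(collection_size, sample_size))
    print(f"probability of no relevant document in sample: {p_none:.3f}")      # about 0.96

On these assumptions the requester would scan two hundred documents and, more often than not, find nothing relevant at all; the sample would have to be very much larger before it told us anything useful about the unretrieved relevant documents.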
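Returning to the two kinds of failure distinguished above, the following minimal sketch shows how they might be counted for a single query once relevance assessments are available. The document identifiers and judgements are of course hypothetical, and the familiar recall and precision ratios are included only to show how such counts are usually summarized.

    # Hypothetical data for a single query: identifiers and judgements are illustrative only.
    retrieved = {"d01", "d07", "d13", "d21", "d34"}    # documents the system returned
    judged_relevant = {"d07", "d13", "d44", "d58"}     # all relevant documents, if we could know them

    relevant_retrieved = retrieved & judged_relevant       # the successes
    non_relevant_retrieved = retrieved - judged_relevant   # first kind of failure: noise retrieved
    relevant_missed = judged_relevant - retrieved          # second kind of failure: the hard one to find

    recall = len(relevant_retrieved) / len(judged_relevant)
    precision = len(relevant_retrieved) / len(retrieved)

    print("relevant retrieved:     ", sorted(relevant_retrieved))
    print("non-relevant retrieved: ", sorted(non_relevant_retrieved))
    print("relevant but missed:    ", sorted(relevant_missed))
    print(f"recall = {recall:.2f}, precision = {precision:.2f}")

The catch, as argued above, lies in relevant_missed: it can only be computed if the complete set of relevant documents is known, which is precisely what a realistic collection denies us.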
In fact, if the object of the test is simply to make a decision between two (or more) existing systems, then there is no need to find these unretrieved relevant documents: one need only compare the relevant documents retrieved by one system with those retrieved by the other. If, on the other hand, we are testing more than one system with a view to analysing failures or assessing absolute performance, we might use the relevant documents retrieved by system B but not system A to investigate the failures of system A, and vice versa. This procedure may suffer from a form of bias: those relevant documents retrieved by B but not A may well not