now or in the future. Is there any way of ensuring that this set of artificial queries is representative of a real situation?

Some ways of obtaining artificial queries are:
(1) to ask some actual or potential users for examples of queries they have put to any system in the recent past;
(2) to ask intermediaries (librarians or information officers) for examples of the sorts of queries put to them;
(3) to construct queries by some random choice of index terms, in such a way as to duplicate known statistical characteristics of real queries.

Any such method obviously has at its heart an intention to produce artificial queries which in some sense 'look like' real ones. The problem is that we do not really know which characteristics of real queries are the important ones to reproduce. To a limited extent, one may test for the representativeness of artificial queries, provided there are some real queries available in some form, by looking at various measurable characteristics of the real and artificial sets (any characteristics that one can think of). If a bias is detected in this way, it may be possible to allow for it in the analysis of results. But such procedures are of only limited value.

One possible statistical justification for using artificial queries is that we could, in principle, generate to order queries of a range of different types (that is, we could directly control some of the variables associated with queries). This would be a stronger justification if we possessed a reasonable typology of queries; at the moment, no such typology exists.

Laboratory tests

Again, in the foregoing discussion I have assumed that there is a specific situation about which we wish to make inferences (even if it is a postulated future situation). In laboratory tests, where we wish to make generalizations about system design, this is not the case. How, then, can we begin to make inferences?

If we have two alternative general hypotheses, then we can test them against each other by the usual scientific methods. That is, we have to devise an experiment from which the two hypotheses would predict different results. Because of the vagaries of individual documents and queries, almost any general hypothesis is bound to include some (explicit or implicit) statistical element in its specific predictions: that is, no general hypothesis in information retrieval (of any importance, at least) can be expected to make deterministic predictions. (To take an extreme example, we would not expect to be able to support or disprove a general hypothesis on the basis of a test on two documents and one query.) Thus we can expect statistical considerations to play a part in such hypothesis testing. The problem is, how should we think about the experimental set-up in statistical terms? We have to regard the documents and queries as a sample from something (whether or not they actually are).
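To make this statistical element concrete, the following is a minimal sketch, not taken from this chapter, of how such a comparison might be set up: two rival strategies A and B, standing in for two general hypotheses, are scored on the same set of queries, and an exact paired sign test asks whether the observed pattern of per-query wins could plausibly be chance. The per-query scores, the choice of precision at 10 documents as the measure, and the sign test itself are all assumptions made purely for illustration.

```python
# A minimal sketch (not from the chapter): two retrieval strategies, A and B,
# are scored on the same queries, and a paired sign test asks whether B beats
# A more often than chance alone would allow. All numbers are invented.

from math import comb

def sign_test_p_value(scores_a, scores_b):
    """Two-sided exact sign test on paired per-query scores (ties dropped)."""
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if b > a)
    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    n = wins_a + wins_b            # tied queries are discarded
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    # P(at least k wins out of n) under the chance hypothesis p = 0.5,
    # doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-query effectiveness scores (e.g. precision at 10 documents)
strategy_a = [0.30, 0.50, 0.20, 0.60, 0.40, 0.10, 0.70, 0.30, 0.50, 0.20]
strategy_b = [0.40, 0.50, 0.30, 0.70, 0.50, 0.20, 0.60, 0.40, 0.60, 0.30]

print("p =", round(sign_test_p_value(strategy_a, strategy_b), 3))
```

The sign test is used here only because it makes minimal assumptions about the distribution of the per-query differences; with so few queries, only quite large effects would reach significance, which is precisely why the vagaries of individual documents and queries force a statistical treatment.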
The only way we can regard them as such a sample, in general, is to consider the criteria by which they were selected (or constructed), and to define notional populations of all the (actual or potential) documents or queries satisfying these criteria. Then we can hope to make inferences about whether or not any general hypothesis holds for these notional populations. As in any scientific field, although we might possibly reject a general