IRE Information Retrieval Experiment The methodology of information retrieval experiment chapter Stephen E. Robertson Butterworth & Company Karen Sparck Jones

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

sample, and if that is not feasible (or not desirable for other reasons) whether any particular method that we might use to gather the test sets is likely to introduce biases of any sort.

Consider first the case of a test on an operational service which is designed to answer questions about the service itself, not generalizations. How might we take samples, and what biases might be present in them?

Time and related variables

Perhaps the most obvious problem relates to time. The tester must necessarily use documents that already exist, and queries that either occur during the course of the test, or exist in some archive at the start (or perhaps are manufactured in some way for the purpose of the test). But he/she will be concerned with the future: probably with new documents which will enter the system at a later date, almost certainly with queries that are put to the system in the future. Thus in one strict sense, the samples cannot be representative of the situation about which the inferences are to be made.

How much of this is likely to matter is an open question (and is certainly outside the realm of formal statistical inference). It is a question which has scarcely been investigated in the past. One could, however, think of ways to investigate it: for example, one could study the absolute and relative performance of different systems over a period of time.
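As an illustration only, a study of this kind might be sketched as follows. The sketch assumes that each system has been scored once per test period with some single effectiveness measure. The function name, the measure and the figures are all hypothetical and are not taken from any actual test.

```python
def compare_over_time(scores_a, scores_b):
    """Compare two systems period by period.

    scores_a, scores_b: per-period effectiveness scores (for example a
    mean precision figure), one entry per test period, in chronological
    order. Both the measure and the data are assumptions of this sketch.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need equal-length, non-empty score lists")

    # Relative performance: difference between the systems in each period.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]

    return {
        # Absolute performance: change from the first period to the last.
        "drift_a": scores_a[-1] - scores_a[0],
        "drift_b": scores_b[-1] - scores_b[0],
        "diffs": diffs,
        # True only if the same system is ahead in every period.
        "ranking_stable": all(d > 0 for d in diffs) or all(d < 0 for d in diffs),
    }


# Hypothetical scores for systems A and B over three successive periods.
result = compare_over_time([0.50, 0.48, 0.45], [0.40, 0.44, 0.47])
print(result["ranking_stable"])
```

With these invented figures the ranking of the two systems is not stable: A is ahead in the first two periods and B in the third, which is exactly the kind of change that a test at a single point in time could not reveal.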
Such tests would help later researchers to assess the dangers of predicting from the past to the future, but would provide only indirect evidence on this score.

It seems likely that many possible biases introduced by time will be not so much direct consequences as indirect effects relating to other variables which are themselves time-dependent. Two examples follow.

The samples that are used for a test may not be representative of a future situation because the type of subject covered by the service may change with time. To some extent this may be a matter of deliberate planning, but it might also be because the nature of some subject that is already covered, or of the queries concerning it, changes as the subject develops. Such changes may be reflected in the language of the subject, or in the internal organization of the documents about it, in a way which may have a direct bearing on retrieval.

Another change which may happen to a document collection over a period of time is that the proportions of different types of documents (books, journal articles, research reports, conference proceedings, etc.) may vary. It seems likely (although this has never been tested) that different types of documents have different retrieval characteristics, so again such a change could affect retrieval performance.

Effects of biases

It is worth looking at the last example in a little more detail, so as to see why such a bias might be important and what we might do about it. Suppose that our document collection consists entirely of journal articles and research reports, and suppose that we are testing alternative systems A and B. We will take the existing collection (which is 90 per cent journal