IRE
Information Retrieval Experiment
The methodology of information retrieval experiment
chapter
Stephen E. Robertson
Butterworth & Company
Karen Sparck Jones
sample, and if that is not feasible (or not desirable for other reasons) whether
any particular method that we might use to gather the test sets is likely to
introduce biases of any sort. Consider first the case of a test on an operational
service which is designed to answer questions about the service itself, not
generalizations. How might we take samples, and what biases might be
present in them?
Time and related variables
Perhaps the most obvious problem relates to time. The tester must necessarily
use documents that already exist, and queries that either occur during the
course of the test, or exist in some archive at the start (or perhaps are
manufactured in some way for the purpose of the test). But he/she will be
concerned with the future: probably with new documents which will enter
the system at a later date, almost certainly with queries that are put to the
system in the future.
Thus in one strict sense, the samples cannot be representative of the
situation about which the inferences are to be made. How much of this is
likely to matter is an open question (and is certainly outside the realm of
formal statistical inference). It is a question which has scarcely been
investigated in the past. One could, however, think of ways to investigate it:
for example, one could study the absolute and relative performance of
different systems over a period of time. Such tests would help later researchers
to assess the dangers of predicting from the past to the future, but would
provide only indirect evidence on this score.
It seems likely that many possible biases introduced by time will be not so
much direct consequences as indirect effects relating to other variables which
are themselves time-dependent. Two examples follow.
The samples that are used for a test may not be representative of a future
situation because the type of subject covered by the service may change with
time. To some extent this may be a matter of deliberate planning, but it
might also be because the nature of some subject that is already covered, or
of the queries concerning it, changes as the subject develops. Such changes
may be reflected in the language of the subject, or in the internal organization
of the documents about it, in a way which may have a direct bearing on
retrieval.
Another change which may happen to a document collection over a period
of time is that the proportions of different types of documents (books, journal
articles, research reports, conference proceedings, etc.) may vary. It seems
likely (although this has never been tested) that different types of documents
have different retrieval characteristics: so again such a change could affect
retrieval performance.
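
To make the reasoning about a shifting document mix concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that overall performance can be treated as an average of per-type performance weighted by the proportion of each document type in the collection. The function name, the two systems, their per-type scores and the two mixes are all hypothetical; none of these figures comes from any actual test.

```python
def overall_performance(per_type_score, mix):
    """Average the per-type scores, weighted by each type's share of the collection.

    Assumption (for illustration only): overall performance is a simple
    weighted average over document types.
    """
    total = sum(mix.values())
    return sum(per_type_score[doc_type] * share / total
               for doc_type, share in mix.items())


# Hypothetical per-type scores on some effectiveness measure between 0 and 1.
system_a = {"journal": 0.60, "report": 0.30}
system_b = {"journal": 0.50, "report": 0.50}

# Hypothetical mixes: the collection as tested, and a possible future collection.
test_mix = {"journal": 0.9, "report": 0.1}
future_mix = {"journal": 0.5, "report": 0.5}

for label, mix in (("test", test_mix), ("future", future_mix)):
    a = overall_performance(system_a, mix)
    b = overall_performance(system_b, mix)
    better = "A" if a > b else "B"
    print(f"{label}: A={a:.2f} B={b:.2f} better={better}")
```

With these invented figures the system that looks better on the tested mix is not the one that looks better on the future mix, which is the kind of effect a change in the proportions of document types could in principle produce.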
Effects of biases
It is worth looking at the last example in a little more detail, so as to see why
such a bias might be important and what we might do about it.
Suppose that our document collection consists entirely of journal articles
and research reports, and suppose that we are testing alternative systems A
and B. We will take the existing collection (which is 90 per cent journal