Information Retrieval Experiment
Edited by Karen Sparck Jones
Butterworth & Company

The methodology of information retrieval experiment
Stephen E. Robertson
requests, and the entire existing document collection, then inferences can be
made by standard statistical techniques (such as significance tests). Indeed,
we can to some extent reverse this procedure, and calculate what sample size
is required in order to establish a certain difference between the two systems
at a given level of confidence.
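The reverse calculation can be sketched numerically. The sketch below assumes a normal approximation, two independent query samples of equal size, and a known common per-query standard deviation; the concrete numbers (a difference of 0.05 in mean precision, a standard deviation of 0.1) are illustrative assumptions, not figures from the text.

```python
import math
from statistics import NormalDist  # standard-normal quantiles (Python 3.8+)

def required_sample_size(delta, sigma, alpha=0.05, power=0.8):
    """Queries needed per system to detect a true difference `delta`
    in mean performance, given per-query standard deviation `sigma`,
    at two-sided significance level `alpha` with the stated power.
    Normal-approximation formula: n = 2 * ((z_a + z_b) * sigma / delta)**2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# Example: how many queries to establish a 0.05 difference in mean
# precision when the per-query standard deviation is 0.1?
n = required_sample_size(delta=0.05, sigma=0.1)
```

As the formula makes plain, halving the difference to be detected quadruples the required sample size, which is one reason the simple case is attractive when it applies.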
Unfortunately, the situation is rarely so simple. The complications, as can
be guessed from the specification of the simple case, are many and various.
To a large extent, the problems are as yet unsolved; some of them admit (in
principle, at least) of a statistical solution; some of them would certainly
require other ideas to be combined with the statistical ones, ideas which
might for example be described as linguistic, psychological, epistemological
or even simply 'retrieval-theoretic'.
Two populations
I assumed in the simple case that, in moving from the situation we are
measuring to the situation about which we wish to make inferences, the set
of queries changes but the set of documents remains the same. It is possible
to imagine an experiment in which the two roles are reversed: an experiment
concerned with certain specified SDI queries, with the document collection
being completely new each month. In such a case, we would regard the
document collection as a sample and make statistical inferences accordingly.
But far more commonly, we have the situation in which neither the query
set nor the document collection remains the same. Even in most straightfor-
ward tests on operational systems, the document collection changes more or
less gradually with time; and one is seldom in a position where one wants to
know only about existing queries. So the normal case is one in which we have
to consider both the test set of queries and the test collection of documents as
(in some sense) samples from a population.
Suppose, then, that we can regard both samples as random: that is, in both
cases, the sample is representative of the population, with no systematic
differences or biases. In these circumstances, can we call in standard
statistical techniques in order to make inferences about the two populations
and their interactions from the measurements that we make on the samples?
Even for this (still comparatively simple) case, the answer is no: although
in principle the problem remains a purely statistical one, very little exists in
the way of standard methods which are formally valid under such conditions.
As a result, many testers have tried to apply statistical methods which assume
only one sampling process, and have simply ignored the second. Early work
on these lines tended to use the document as the critical unit: that is, to
regard the test collection of documents as a random sample from a population,
and to ignore the problem in connection with requests. However, more
recent work has tended to follow the reverse view. There are two reasons for
this change. The first is that some of the measurements that have been used
are query-oriented, and in order to make any inferences at all with such
measures one must consider the queries as a sample (whatever one does
about the documents). The second is that in general, the number of queries
tends to be a much more critical quantity than the number of documents: for
reasons which will be clear from earlier discussions, the tester usually has
access to many more documents than requests.
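Treating the queries as the sampled unit typically means a paired significance test over per-query scores. The sketch below uses the sign test purely for illustration (the chapter does not prescribe a particular test, and the scores shown are invented):

```python
from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided sign test over paired per-query scores for two systems.
    Each query contributes one comparison; ties are discarded. Returns
    the exact binomial p-value under the null hypothesis that either
    system is equally likely to score higher on any given query."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b                     # ties excluded
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided under Binomial(n, 1/2),
    # doubled for a two-sided test and capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Ten queries: system A outscores system B on nine of them.
a = [0.9, 0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.8, 0.4]
b = [0.5, 0.6, 0.5, 0.7, 0.5, 0.6, 0.5, 0.8, 0.6, 0.5]
p_value = sign_test(a, b)
```

Note that a test of this form makes inferences only over the query population; the document collection is held fixed, which is exactly the one-sided treatment of the two-sample problem described above.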