Information Retrieval Experiment
The methodology of information retrieval experiment
Stephen E. Robertson
Edited by Karen Sparck Jones
Butterworth & Company
may be subject to direct experimental control (such as a threshold in a
clustering experiment); or they may be intermediate variables which need to
be measured (such as the number of terms in an index language, or inter-
indexer consistency). Some of the latter (e.g. number of terms again) can
easily be measured and do not affect the way the test is conducted; others
(e.g. inter-indexer consistency) would impose some requirements on the
design or conduct of the test.
Under some circumstances, such intermediate variables might be regarded
as alternatives to performance variables. Thus if we assume that high inter-
indexer consistency goes with good performance, we can test some aspects of
the system by measuring the former instead of the latter (in contradiction of
my earlier assertion of the necessity of testing the entire system). The
important (and in fact unsolved) problem here is of course the validity of the
assumption; certainly it would seem dangerous to rely on such an untested
hypothesis. However, good use has been made of similar intermediate
variables in explanatory experiments or investigations.
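To give a purely illustrative sketch of such an intermediate variable: inter-indexer consistency for a single document is commonly expressed as the overlap between the two indexers' sets of assigned terms. The term sets below are invented, and the Jaccard-style overlap used is only one of several measures in use.

    # Illustrative sketch only: inter-indexer consistency for one document,
    # computed as the overlap between two indexers' term assignments.
    # The term sets and the choice of measure are assumptions.

    def consistency(terms_a, terms_b):
        """|A & B| / |A | B|: 1.0 means identical indexing, 0.0 no terms shared."""
        a, b = set(terms_a), set(terms_b)
        return len(a & b) / len(a | b) if (a or b) else 1.0

    indexer_1 = {"information retrieval", "evaluation", "indexing", "consistency"}
    indexer_2 = {"information retrieval", "evaluation", "relevance"}

    print(consistency(indexer_1, indexer_2))   # 0.4

Averaging such figures over a set of documents gives the kind of intermediate variable that might, under the assumption discussed above, stand in for a direct performance measure.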
Measurement: performance limits and failure analysis
I have said that we would normally be interested in how well the system
responds to each query presented to it. But the answer to this question may
well beg answers to other questions, such as: What is the best possible
response to this query? How well does the system's response measure up
against this ideal? What are the reasons for falling short?
If we have measured performance in terms of relevant documents
retrieved, this suggests two ways in which the response of the system may
have fallen short of ideal: by retrieving non-relevant documents and by
failing to retrieve relevant ones. The former kind of failure will be apparent
immediately if all the documents retrieved by the system are assessed for
relevance. The second is more problematic: indeed, it is one of the major
headaches of information retrieval system testing.
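In set terms, the two kinds of failure are simply the two differences between the set of documents retrieved and the set of documents relevant. The first difference is visible as soon as the retrieved documents have been assessed; the second can only be computed if the full relevant set is known, which is precisely the difficulty. A minimal sketch, with invented document identifiers and relevance judgements:

    # Illustrative sketch only: the two kinds of failure for a single query,
    # using hypothetical document identifiers and relevance judgements.

    retrieved = {"d01", "d03", "d07", "d09"}   # documents the system retrieved
    relevant  = {"d03", "d04", "d07", "d12"}   # documents judged relevant to the query

    false_drops = retrieved - relevant   # non-relevant documents retrieved:
                                         # visible once the output is assessed
    misses      = relevant - retrieved   # relevant documents not retrieved:
                                         # discoverable only if 'relevant' is known in full

    precision = len(retrieved & relevant) / len(retrieved)   # 2/4 = 0.5
    recall    = len(retrieved & relevant) / len(relevant)    # 2/4 = 0.5, but only
                                                             # if 'relevant' is complete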
How do we find out about those relevant documents which the system fails
to retrieve? In a laboratory experiment, with a small collection of documents,
it might just be feasible for the requester or a substitute to scan the entire
collection. But if there are more than a few hundred documents, this will be
out of the question. An obvious alternative would be to sample the collection
and scan the sample, but if one takes a typical operational collection and
extracts a sample that is small enough for a requester to scan comfortably, it
is unlikely to contain any relevant documents at all (since relevant documents
are generally very sparse in such a collection).
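The arithmetic behind this sparsity problem is easily illustrated with some invented figures: say a collection of 100 000 documents, 20 of them relevant to the query in hand, and a random sample of 500 documents that a requester might reasonably scan.

    # Illustrative arithmetic only, with hypothetical figures: N documents in
    # the collection, R of them relevant, and a random sample of n documents.

    N, R, n = 100_000, 20, 500

    expected_relevant = n * R / N          # 0.1 -- on average, a tenth of a relevant document

    # Probability that the sample contains no relevant document at all
    # (hypergeometric, computed as a running product to avoid huge factorials).
    p_no_relevant = 1.0
    for i in range(R):
        p_no_relevant *= (N - n - i) / (N - i)

    print(expected_relevant)               # 0.1
    print(round(p_no_relevant, 2))         # about 0.9

On these assumed figures, a sample small enough to scan contains on average a tenth of a relevant document, and roughly nine times out of ten it contains none at all.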
Most tests rely on methods that are not so satisfactory in a formal sense,
but are dictated by pragmatic considerations. In fact, if the object of the test
is simply to make a decision between two (or more) existing systems, then
there is no need to find these unretrieved relevant documents; one need only
compare the relevant documents retrieved by one system with those retrieved
by the other. If, on the other hand, we are testing more than one system with
a view to analysing failures or assessing absolute performance, we might use
the relevant documents retrieved by system B but not system A to investigate
the failures of system A, and vice versa. This procedure may suffer from a
form of bias: those relevant documents retrieved by B but not A may well not