aspects, and it is a little difficult to generalize about the design of retrieval
tests in this sense. There is, however, one fairly clear-cut example which can
be discussed here. Most experiments to date have involved using just one set
of requests, and trying each request on both or all the systems to be compared
(i.e. `replicating' the searches). There are clear statistical reasons for doing
this, if possible: since requests are difficult to obtain for the reasons discussed
above, one is usually working with relatively small numbers of them; and any
statistical significance testing to be done on the results can be made much
more efficient by a `matched pairs' procedure, whereby the performance of
the two (or more) systems on any one request is compared.
However, there are some circumstances under which this is not possible.
If one wishes to compare highly interactive systems, for example, where the
user is encouraged by the system to provide additional information about
his/her need, then one cannot put the same `request' (i.e. user need) to two
different systems, since the user will have learnt too much from the first
system.
Statistical aspects of retrieval testing are discussed further below, and by
Tague in Chapter 5.
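To make the `matched pairs' idea concrete, the sketch below compares two hypothetical systems on the same set of requests. It is an illustration only, not part of the original discussion: the per-request effectiveness scores are invented, and the Wilcoxon signed-rank test (one of several tests that exploit pairing) is taken from the scipy library purely as an example of a paired procedure.

```python
# A minimal sketch of a `matched pairs' comparison of two systems.
# The per-request scores are invented for illustration only; in a real
# test each value would be some per-request effectiveness measure
# obtained by running the same request on both systems.

from scipy.stats import wilcoxon

# One effectiveness score per request for each system, with the i-th
# entry of each list referring to the same request.
system_a = [0.42, 0.55, 0.31, 0.60, 0.48, 0.52, 0.38, 0.45]
system_b = [0.47, 0.58, 0.35, 0.59, 0.55, 0.60, 0.41, 0.50]

# The signed-rank test works on the per-request differences, so each
# request acts as its own control -- the gain in statistical efficiency
# that motivates replicating the searches across systems.
statistic, p_value = wilcoxon(system_a, system_b)
print(f"Wilcoxon signed-rank statistic = {statistic}, p = {p_value:.3f}")
```

Because the requests are paired, the variation between requests (which is typically large) does not obscure the difference between systems, which is why a small request set can still yield a usable significance test.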
Measurement: performance
What are the basic measurements with which a retrieval test is likely to be
concerned? Most information retrieval tests are ultimately concerned with
the effectiveness or performance of each system, or the benefits which derive
from its use, or cost-effectiveness or -benefit. Central to all of these questions
is the question of how well the system responds to each query presented to it.
This `how well' can be looked at in many different ways: how closely each
document output by the system matches the user's need; how useful each
document is in satisfying the need; how satisfied the user is with the output
as a whole; and so on.
It may seem strange, to anyone more familiar with the harder sciences,
that I refer to such an obviously subjective matter under the heading of
`measurement'. However, it is clearly a direct consequence of my definition
above of the function of an information retrieval system that some such
subjective notion must enter into any assessment of information retrieval
system performance.
Most commonly, documents output by the system are individually assessed
for relevance to the user's need. The word `relevance' has been used in many
different ways, but broadly it corresponds to the first of the three questions
above: that is, how closely does the document match the user's need? Both the
notion itself and its appropriateness to retrieval tests are the subject of much
debate and also some experiment. Generally speaking, the assessment of
relevance allows of a `harder' form of analysis than any other assessment in
this category of subjective responses to system output, since for example it
allows one to ask the question: Why did the system fail on such-and-such a
document? On the other hand, utility or user satisfaction may be regarded as
being closer to the true objective of an information retrieval system, and
therefore as better or more valid measurements to make when trying to assess
system performance. The debate continues.
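As an illustration of the `harder' form of analysis that relevance assessments permit, the sketch below turns individual relevance judgements on the output for a single request into a simple per-request score, and lists the relevant documents the system missed, so that the question `why did the system fail on such-and-such a document?' can be pursued. The document identifiers and judgements are invented for the purpose of illustration.

```python
# A minimal sketch, with invented data, of how per-document relevance
# assessments for one request might be summarized.

# Documents output by the system for the request, in retrieval order.
retrieved = ["d3", "d7", "d1", "d9", "d4"]

# Documents the user judged relevant to the underlying need.
judged_relevant = {"d1", "d2", "d3", "d8"}

# Proportion of the output judged relevant -- one simple per-request score.
retrieved_and_relevant = [d for d in retrieved if d in judged_relevant]
score = len(retrieved_and_relevant) / len(retrieved)

# Relevant documents the system did not retrieve: the individual failures
# that this kind of assessment allows one to investigate.
missed = judged_relevant - set(retrieved)

print(f"Proportion of output judged relevant: {score:.2f}")
print(f"Relevant documents not retrieved: {sorted(missed)}")
```

A utility or satisfaction judgement, by contrast, applies to the output as a whole and does not point to particular documents in this way, which is the trade-off the passage above describes.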