Information Retrieval Experiment
The methodology of information retrieval experiment
Stephen E. Robertson
Edited by Karen Sparck Jones
Butterworth & Company
circumstances, one should try to avoid any bias that may be introduced by
the ordering: one should not (if this is compatible with other aspects of the
experimental design) present the output of one system and then the output of
another; instead, the two should be mixed together.
Many of these recommendations may in fact conflict with other aspects of
the experimental design. Thus in testing highly interactive systems, one may
need to obtain relevance judgements from the requester in the course of the
search. In such circumstances, it may be necessary to introduce new
experimental techniques in order to avoid some of the problems mentioned
above.
Finally, what instructions should be given to the judges, and in what form
should the assessments be obtained? The usual method is to describe a small
number of categories, such as 'Answers the question completely', 'Is of major
importance in answering the question', 'Is of marginal importance', 'Is of no
help at all'. Normally more than two categories are provided, although they
are usually conflated into just two (relevant/non-relevant) at the analysis
stage. This may seem a strange procedure, but it may be easier for a judge to
use more than two categories, even if there is no experimental reason for
obtaining the additional information. Also, there remains a feeling that we
should have methods of analysis that take account of degrees of relevance;
but, on the whole, no such methods exist. Some experiments on relevance
have included attempts to get the judges to rank the documents rather than
assign them to categories, and indeed there is some evidence that different
judges are more consistent in their rankings than in assignments to categories.
But again, no suitable methods of analysis exist for using ranked relevance
judgements in retrieval tests.
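The conflation of graded judgements into a binary relevant/non-relevant distinction at the analysis stage can be sketched as follows. This is a minimal illustration: the category labels, the numeric grades, and the threshold are all assumptions for the example, not prescribed by the text.

```python
# Hypothetical graded relevance categories, roughly following the
# labels quoted above; the numeric scale is an assumption.
GRADES = {
    "answers completely": 3,
    "major importance": 2,
    "marginal importance": 1,
    "no help at all": 0,
}

def conflate(judgement: str, threshold: int = 2) -> bool:
    """Map a graded judgement to binary relevance.

    Documents at or above `threshold` count as relevant; the choice
    of threshold is itself an analysis-time decision.
    """
    return GRADES[judgement] >= threshold

judgements = ["answers completely", "marginal importance", "major importance"]
binary = [conflate(j) for j in judgements]
# binary == [True, False, True]
```

Note that the extra information gathered from the judges is simply discarded here, which is exactly the point made above: the finer categories may ease the judge's task even when the analysis uses only two.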
The problem of relevance is discussed in a wider context by Belkin in
Chapter 4.
Analysis
Having obtained the basic measurements, one then has to analyse the data
in such a way as to answer the questions which were the raison d'être of the
project. Such analysis may involve several stages: for example, we might
successively:
(1) calculate, for each request and system, an appropriate measure of the
effectiveness or efficiency of the system's response to the request;
(2) average this measure over the request set, for each system;
(3) compare the averages for the different systems; and
(4) perform a statistical significance test on the difference.
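The four stages might be sketched in code as follows. This is a hypothetical illustration only: precision is used as the per-request measure and a sign test's win counts as the request-level comparison, but the chapter does not prescribe either choice, and the data shapes are invented.

```python
from statistics import mean

def precision(retrieved, relevant):
    """Stage 1: an effectiveness measure for one system's response
    to one request (precision chosen purely as an example)."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0
    return sum(1 for d in retrieved if d in relevant) / len(retrieved)

def analyse(system_a, system_b, relevant_by_request):
    """Stages 2-4, over a request set common to both systems."""
    requests = sorted(relevant_by_request)
    pa = [precision(system_a[q], relevant_by_request[q]) for q in requests]
    pb = [precision(system_b[q], relevant_by_request[q]) for q in requests]
    avg_a, avg_b = mean(pa), mean(pb)      # stage 2: average over requests
    difference = avg_a - avg_b             # stage 3: compare the averages
    # Stage 4: counts for a sign test, which compares the systems at the
    # individual request level rather than after averaging (ties ignored).
    wins_a = sum(1 for a, b in zip(pa, pb) if a > b)
    wins_b = sum(1 for a, b in zip(pa, pb) if a < b)
    return avg_a, avg_b, difference, wins_a, wins_b
```

The sign test here illustrates the point made below: some significance tests operate on per-request comparisons, so the choice of test cannot be separated from the other stages of the analysis.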
In fact, the subject of how to analyse retrieval test data has been, with the
problem of relevance, one of the two most highly debated topics in the field.
The debate was originally simply about the choice of appropriate measure
(of effectiveness, cost-effectiveness, benefit or whatever), but lately it has
come to include all three other aspects as well. Indeed, it is difficult to
separate the four: for example, there is a statistical significance test which
has been used in this context which requires that the comparison between
different systems be made at the individual request level, rather than after
averaging.