Information Retrieval Experiment
Edited by Karen Sparck Jones
Butterworth & Company

Chapter: Retrieval effectiveness
Cornelis J. van Rijsbergen
Since P(w2/x) = 1 - P(w1/x), we can rewrite this inequality as

P(w1/x) > l12/(l12 + l21)  (= p)
This rule is similar to the one specified by the probability ranking principle,
the difference being that now we have an explicitly defined cut-off in terms
of a cost function. If we choose l12 = l21, i.e. l12/l21 = 1, our retrieval
rule becomes:

If P(w1/x) > 1/2 then retrieve
else do not retrieve
Or in other words, if the probability of relevance is greater than the
probability of non-relevance for a document we should retrieve that
document. We vary the importance we attach to the cost of a false drop in
comparison with the cost of a recall failure by changing the cut-off. For
example, we may decide that
l12/l21 = 1/2

which means that the failure to retrieve a relevant document is twice as costly
as the retrieval of a non-relevant document. For this the cut-off p is set to 1/3.
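As a small numerical check (a Python sketch, not part of the original chapter),
the cut-off arithmetic can be written out directly, using l12 for the cost of a
false drop and l21 for the cost of a recall failure as above:

```python
def cutoff(l12, l21):
    """Decision-theoretic cut-off p = l12/(l12 + l21): retrieve a
    document x when P(w1/x) exceeds p."""
    return l12 / (l12 + l21)

print(cutoff(1.0, 1.0))  # equal costs: p = 0.5
print(cutoff(1.0, 2.0))  # recall failure twice as costly: p = 1/3
```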
By ranking we avoid having to specify the cut-off in advance; on the other
hand, we pay the price of ranking. It is important to realize that setting a
cut-off on P(w1/x) still maximizes the expected number of relevant documents
in the retrieved set.
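The interplay between ranking and a fixed cut-off can be sketched as follows;
the probability estimates here are invented purely for illustration:

```python
# Hypothetical estimates of P(w1/x) for five documents (illustrative only).
scores = {"d1": 0.9, "d2": 0.6, "d3": 0.4, "d4": 0.25, "d5": 0.1}

# Ranking: order documents by decreasing probability of relevance; no
# cut-off has to be chosen at this stage.
ranking = sorted(scores, key=scores.get, reverse=True)

# Applying a cut-off afterwards recovers the fixed-threshold rule.
p = 1 / 3  # cut-off for l12/l21 = 1/2, as in the text
retrieved = [d for d in ranking if scores[d] > p]

print(ranking)    # ['d1', 'd2', 'd3', 'd4', 'd5']
print(retrieved)  # ['d1', 'd2', 'd3']

# The expected number of relevant documents in the retrieved set is the
# sum of P(w1/x) over the retrieved documents.
print(sum(scores[d] for d in retrieved))  # 1.9
```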
3.4 Measurement of effectiveness
The measurement of retrieval effectiveness within an experimental set-up is
beset with many difficulties. These difficulties have been with us for many
years, and are likely to remain unresolved for many years yet. A typical
retrieval experiment has been described in Chapter 2, so I shall not repeat it
here, except to emphasize that its aim is usually to establish the absolute or
relative effectiveness of some search strategy, information structure, ordering
process, etc., within the context of an overall retrieval system. The output for
such an experiment may be a ranking (partial or full) of documents, or simply
an unordered set. Each query will have associated with it some output for
which retrieval effectiveness measures can be calculated. In comparing the
results for different tests with the same queries and document collection, one
aims to produce statistical summary data which will enable statements to be
made about the comparative merits of differently designed subsystems. In
the main, experimentalists have concentrated on two types of statement:
(1) What is the probability that a retrieved document is relevant for the
system operating at a given level of recall?
(2) What is the probability of a retrieved document being relevant for a
query at a particular recall value?
How well these two questions are answered depends on the method of
evaluation adopted. The discussion will concentrate on evaluation of
rankings; evaluating unordered sets is a special case of this.
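As a rough sketch of how such summary data might be computed (the rankings and
relevance judgements below are invented, and this is only one way of reading
precision off a ranking at a chosen recall level):

```python
def precision_at_recall(ranking, relevant, recall_level):
    """Precision at the first rank where the given recall level is
    reached, for a single query's ranked output."""
    needed = recall_level * len(relevant)
    found = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
        if found >= needed:
            return found / rank
    return 0.0

# Invented data: each query contributes a ranked output and a set of
# documents judged relevant to it.
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["d5", "d4", "d9", "d6"], {"d4", "d6", "d9"}),
]

# Average over queries of precision at recall 0.5 -- summary data of the
# kind used to compare differently designed subsystems.
values = [precision_at_recall(r, rel, 0.5) for r, rel in runs]
print(sum(values) / len(values))  # 0.583...
```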