Any three of these will determine the fourth. For example, if recall = 1/2,
generality = 2/27 and fallout = 1/100, then
a = c, d = 99b, 27(a+c) = 2(a+b+c+d), a = 4b and
precision = 4/5 = 0.8.
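In terms of the 2 x 2 table of retrieval against relevance (a = relevant retrieved, b = non-relevant retrieved, c = relevant not retrieved, d = non-relevant not retrieved), the result has a closed form: with N = a+b+c+d, a = recall x generality x N and b = fallout x (1-generality) x N, so N cancels in precision = a/(a+b). A minimal Python sketch (the function name is ours) that reproduces the example with exact rational arithmetic:

```python
from fractions import Fraction

def precision_from(recall, fallout, generality):
    # a = recall * generality * N and b = fallout * (1 - generality) * N,
    # so the total N cancels in precision = a / (a + b).
    rg = recall * generality
    return rg / (rg + fallout * (1 - generality))

print(precision_from(Fraction(1, 2), Fraction(1, 100), Fraction(2, 27)))
# -> 4/5
```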
However, two alone do not determine the others. For example, if generality
remains constant at 2/27, but recall increases to 3/5, then precision may
increase, remain constant, or decrease depending on fallout. If fallout
remains constant at 1/100, then
precision = 24/29 = 0.828.
If fallout increases to 1/50, then
precision = 12/17 = 0.706.
If fallout increases only to 3/250, then
precision = 4/5 = 0.8,
i.e. remains constant. Thus the statement that as recall increases precision
decreases may be an empirical characteristic of a particular retrieval system,
but it does not follow formally from the properties of recall and precision.
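Continuing the sketch above, the three fallout cases at recall 3/5 can be checked directly:

```python
for f in (Fraction(1, 100), Fraction(1, 50), Fraction(3, 250)):
    print(f, precision_from(Fraction(3, 5), f, Fraction(2, 27)))
# 1/100 -> 24/29, 1/50 -> 12/17, 3/250 -> 4/5
```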
If output is ranked, totally or with ties, then recall and precision can be
calculated at each rank, using the rank as a retrieval threshold. If output is
weighted, recall and precision can be calculated at standard weight
thresholds. In both cases, values may be averaged to obtain a single value.
However, this approach is not very realistic, as not all threshold levels are
equally likely to be appropriate for a query. Some form of weighted averaging
may be more appropriate.
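As an illustrative sketch of the ranked case (the function names and example figures are ours, not the chapter's): treating each rank as a retrieval threshold yields one (recall, precision) pair per rank, and a weighted average can favour the thresholds most plausible for the query rather than treating all thresholds alike:

```python
def recall_precision_at_ranks(judgements, total_relevant):
    """judgements: 0/1 relevance values in ranked output order.
    Each rank k is treated as a retrieval threshold."""
    points, hits = [], 0
    for k, rel in enumerate(judgements, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / k))  # (recall, precision)
    return points

def weighted_average_precision(points, weights):
    """Average precision over thresholds, weighting each threshold by
    its assumed likelihood of being appropriate for the query."""
    return sum(w * p for w, (_, p) in zip(weights, points)) / sum(weights)

pts = recall_precision_at_ranks([1, 0, 1, 1, 0], total_relevant=4)
# pts is approximately [(0.25, 1.0), (0.25, 0.5), (0.5, 0.67),
#                       (0.75, 0.75), (0.75, 0.6)]
```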
Two practical problems arise in determining recall and precision:
How is relevance of the references to be assessed?
How are all relevant items in the file to be found?
A thorough review of the concept of relevance will be found in Saracevic8.
Pragmatically, the problem lies in deciding on the scale of relevance and then
instructing the evaluators so that they will carry out the relevance assessments
in a consistent manner. The following scales have been used:
Binary relevance: a reference is either relevant or non-relevant.
Three-value relevance: a reference may be relevant or highly relevant,
probably or partially relevant, or non-relevant.
Ranked relevance: references are ranked with respect to relevance. Ties
may or may not be permitted.
Relevance weights: each reference is assigned a weight by the user,
indicating the strength of its relevance to the query.
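A minimal sketch of how judgements on the four scales might be recorded, and of collapsing a richer scale to the binary one that recall and precision require (the representations and the 0.5 cut-off are illustrative assumptions, not prescribed by the text):

```python
# Binary relevance: relevant or not.
binary = {"d1": True, "d2": False}

# Three-value relevance: one intermediate category.
three_value = {"d1": "relevant", "d2": "partially relevant", "d3": "non-relevant"}

# Ranked relevance: document ids in decreasing order of relevance.
ranked = ["d1", "d2", "d3"]

# Relevance weights: user-assigned strength of relevance, here on [0, 1].
weighted = {"d1": 0.9, "d2": 0.4, "d3": 0.0}

# Recall and precision need a binary judgement, so a richer scale must be
# collapsed, e.g. by an (arbitrary) cut-off on the weights:
binary_from_weights = {d: w >= 0.5 for d, w in weighted.items()}
```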
In choosing among these scales, consideration must be given to reliability,
i.e. whether the relevance ratings are consistent for the same individual at
different times and across different individuals for the same query. Studies, for
example Lesk and Salton9 and Rees and Schultz10, have shown relevance
rankings to be relatively stable. Borderline problems frequently arise in
making a binary distinction, and these are not really solved by the three-
value scale. This simply replaces one borderline by two. Relevance weights