Any three of these will determine the fourth. (Here a, b, c and d are the cells of the usual 2 x 2 retrieval table: a = relevant and retrieved, b = non-relevant and retrieved, c = relevant and not retrieved, d = non-relevant and not retrieved, so that recall = a/(a+c), precision = a/(a+b), fallout = b/(b+d) and generality = (a+c)/(a+b+c+d).) For example, if recall = 1/2, generality = 2/27 and fallout = 1/100, then a = c, d = 99b and 27(a+c) = 2(a+b+c+d), whence a = 4b and precision = 4/5 = 0.8. However, two alone do not determine the others. For example, if generality remains constant at 2/27 but recall increases to 3/5, then precision may increase, remain constant, or decrease, depending on fallout. If fallout remains constant at 1/100, then precision = 24/29 = 0.828. If fallout increases to 1/50, then precision = 12/17 = 0.706. If fallout increases only to 3/250, then precision = 4/5 = 0.8, i.e. remains constant. Thus the statement that as recall increases, precision decreases, may be an empirical characteristic of a particular retrieval system, but it does not follow formally from the properties of recall and precision (these figures are checked in the first sketch at the end of this section).

If output is ranked, totally or with ties, then recall and precision can be calculated at each rank, using the rank as a retrieval threshold. If output is weighted, recall and precision can be calculated at standard weight thresholds. In both cases, the values may be averaged to obtain a single figure. However, this approach is not very realistic, as not all threshold levels are equally likely to be appropriate for a query. Some form of weighted averaging may be more appropriate (see the second sketch below).

Two practical problems arise in determining recall and precision: How is the relevance of the references to be assessed? How are all the relevant items in the file to be found? A thorough review of the concept of relevance will be found in Saracevic [8]. Pragmatically, the problem lies in deciding on a scale of relevance and then instructing the evaluators so that they carry out the relevance assessments in a consistent manner. The following scales have been used:

Binary relevance: a reference is either relevant or non-relevant.

Three-value relevance: a reference may be relevant or highly relevant, probably or partially relevant, or non-relevant.

Ranked relevance: references are ranked with respect to relevance. Ties may or may not be permitted.

Relevance weights: each reference is assigned a weight by the user, indicating the strength of its relevance to the query.

In choosing among these scales, consideration must be given to reliability, i.e. whether the relevance ratings are consistent for the same individual at different times and for different individuals judging the same query. Studies, for example Lesk and Salton [9] and Rees and Schultz [10], have shown relevance rankings to be relatively stable. Borderline problems frequently arise in making a binary distinction, and these are not really solved by the three-value scale, which simply replaces one borderline by two. Relevance weights
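The worked figures above can be checked mechanically. The following is a minimal sketch in Python (not part of the original text; the helper name precision_from is illustrative) which solves for precision given recall, generality and fallout, using the 2 x 2 table identities quoted above. Because all four measures are ratios, the table can be scaled freely, so a is fixed at 1.

    from fractions import Fraction as F

    def precision_from(recall, generality, fallout):
        """Derive precision from recall, generality and fallout.

        Uses the 2x2 table identities:
          recall     = a / (a + c)
          fallout    = b / (b + d)
          generality = (a + c) / (a + b + c + d)
          precision  = a / (a + b)
        """
        a = F(1)
        c = a / recall - a            # from recall = a/(a+c)
        n = (a + c) / generality      # total file size a+b+c+d
        b = fallout * (n - (a + c))   # from fallout = b/(b+d)
        return a / (a + b)

    # The four worked examples from the text:
    print(precision_from(F(1, 2), F(2, 27), F(1, 100)))   # 4/5
    print(precision_from(F(3, 5), F(2, 27), F(1, 100)))   # 24/29
    print(precision_from(F(3, 5), F(2, 27), F(1, 50)))    # 12/17
    print(precision_from(F(3, 5), F(2, 27), F(3, 250)))   # 4/5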
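The second sketch, again in Python and again illustrative rather than from the text, computes recall and precision at each rank threshold of a ranked output, then collapses them to a single figure by a plain average and by a weighted average. The relevance judgements and the threshold weights are hypothetical; the text does not prescribe particular weights.

    def recall_precision_at_ranks(ranked_rel, total_relevant):
        """Recall and precision at each rank cutoff.

        ranked_rel: 0/1 relevance judgements in ranked output order.
        total_relevant: number of relevant items in the whole file.
        """
        out, hits = [], 0
        for k, rel in enumerate(ranked_rel, start=1):
            hits += rel
            out.append((hits / total_relevant, hits / k))  # (recall, precision)
        return out

    # Hypothetical judgements for one query: 1 = relevant, 0 = not.
    points = recall_precision_at_ranks([1, 0, 1, 1, 0], total_relevant=4)

    # Plain average of precision over all cutoffs, and a weighted average
    # reflecting that not every threshold is equally plausible for the
    # query (the weights here are purely illustrative).
    plain = sum(p for _, p in points) / len(points)
    weights = [0.1, 0.2, 0.4, 0.2, 0.1]
    weighted = sum(w * p for w, (_, p) in zip(weights, points)) / sum(weights)
    print(points)
    print(plain, weighted)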
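Finally, reliability as described above can be operationalized as agreement between repeated judgements. A minimal sketch, with hypothetical binary judgements of the same ten documents by two assessors (or by one assessor at two times); the helper name agreement is illustrative:

    def agreement(judge1, judge2):
        """Proportion of documents on which two binary judgements agree."""
        assert len(judge1) == len(judge2)
        return sum(x == y for x, y in zip(judge1, judge2)) / len(judge1)

    print(agreement([1, 0, 1, 1, 0, 0, 1, 0, 0, 1],
                    [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]))  # 0.8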