have been little used, perhaps in recognition of their inherent unreliability. Most psychologists use no more than seven points in a scale, perhaps because it has been found that humans can rarely make distinctions beyond this range (see Miller¹¹, for example).

It is important that all individuals making relevance assessments receive the same instructions. It has frequently been pointed out that relevance embodies two distinct notions: Is the document an answer, i.e. is it about the subject of the query? Will it be useful to the user? If, for example, the user has already read the document, it will not be useful. Users should be clear whether they are assessing subject relevance or pertinence. Sometimes pertinence is operationalized by asking the question: Would you look at the document represented by this reference? Or by checking whether, in fact, the user did order or read the document. In addition, users should be instructed what form of output should be used in making relevance judgements. Previous experiments have shown (see Saracevic) that relevance judgements are influenced by form: title, full citation, abstract, full text.

Determination of the full set of relevant documents in the file, which is necessary for determining recall, is a problem which has dogged information retrieval experimentation since Cranfield 1. Some solutions which have been used are as follows:

(1) One or more predetermined relevant documents are included in the file. The problem here is that unless the full file is perused, one cannot be sure that other documents are not also relevant. Two ways of predetermining the relevant set are (a) asking the author of a paper to state a query based on the paper and assess the relevance of all papers cited, and (b) using the title of a paper as a query and its cited papers as the relevant set. This second approach, in particular, has the disadvantage that relevance is operationalized in a very arbitrary, non-judgemental fashion and hence is of questionable validity.

(2) Use a small document set and have the relevance of all documents for all queries assessed by users or system personnel. Here, of course, the problem is that small files are not very reliable, i.e. results based on them are subject to wide variation from file to file.

(3) Take a random sample from the file and assess all documents in the sample as to relevance. The problem here is similar to that with small files. In most operational systems the generality will be very low, so that the sample size needed to assess it accurately will be very large. For example, if there are actually 50 relevant documents in a file of 50 000 (a not unreasonable generality of 1/1000), then the sample size needed to estimate the total number of relevant documents at a 95 per cent confidence level and error less than 0.0001 will be the value of n which satisfies the equation

0.0001 = 1.96 \sqrt{\frac{(0.001)(0.999)}{n} \cdot \frac{50000 - n}{50000}}
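To make the arithmetic concrete, the following sketch solves the equation above for n. It is not part of the original text: the form of the finite-population correction, (50 000 − n)/50 000, is taken from the reconstructed equation, and the function name and closed-form rearrangement are illustrative only.

```python
import math

def required_sample_size(N, p, error, z=1.96):
    """Smallest sample size n such that the half-width of the approximate
    confidence interval for the generality p, with a finite-population
    correction, does not exceed the stated error.

    Solves  error = z * sqrt( p*(1-p)/n * (N - n)/N )  for n.
    Rearranging:  n = p*(1-p)*N / ( (error/z)**2 * N + p*(1-p) )
    """
    pq = p * (1 - p)
    n = pq * N / ((error / z) ** 2 * N + pq)
    return math.ceil(n)

# Figures from the text: 50 relevant documents in a file of 50 000,
# i.e. generality p = 1/1000, error < 0.0001 at the 95 per cent level.
N, p, error = 50_000, 0.001, 0.0001
print(required_sample_size(N, p, error))  # roughly 44 000 documents
```

Under these assumptions the required sample is on the order of 44 000 documents, i.e. most of the file, which is consistent with the author's point that very low generality makes accurate estimation of the relevant set by sampling impractical.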