have been little used, perhaps in recognition of their inherent unreliability.
Most psychologists use no more than seven points in a scale, perhaps because
it has been found that humans can rarely make distinctions beyond this
range. (See Miller11, for example.)
It is important that all individuals making relevance assessments receive
the same instructions. It has frequently been pointed out that relevance
embodies two distinct notions:
(1) Is the document an answer, i.e. is it about the subject of the query?
(2) Will it be useful to the user? If, for example, the user has already read the document, it will not be useful.
Users should be clear whether they are assessing subject relevance or
pertinence. Sometimes pertinence is operationalized by asking the question 'Would you look at the document represented by this reference?', or by checking whether, in fact, the user did order or read the document.
In addition, users should be instructed what form of output should be used
in making relevance judgements. Previous experiments have shown (see
Saracevic) that relevance judgements are influenced by form: title, full
citation, abstract, full text.
Determination of the full set of relevant documents in the file, which is
necessary for determining recall, is a problem which has dogged information
retrieval experimentation since Cranfield 1. Some solutions which have been
used are as follows:
(1) One or more predetermined relevant documents are included in the file.
The problem here is that unless the full file is perused, one cannot be sure
that other documents are not also relevant. Two ways of predetermining the
relevant set are (a) asking the author of a paper to state a query based on
the paper and assess the relevance of all papers cited, and (b) using the title
as a query and the cited papers as the relevant set. This second approach,
in particular, has the disadvantage that relevance is operationalized in a
very arbitrary, non-judgemental fashion and hence is of questionable
validity.
(2) Use a small document set and have the relevance of all documents for all
queries assessed by users or system personnel. Here, of course, the
problem is that results from small files are not very reliable, i.e. they are
subject to wide variation from file to file.
(3) Take a random sample from the file and assess all documents in the
sample as to relevance. The problem here is similar to that with small
files. In most operational systems, the generality will be very low, so that
the sample size needed to assess it accurately will be very large. For
example, if there are actually 50 relevant documents in a file of 50000 (a
not unreasonable generality of 1/1000), then the sample size needed to
estimate the total number of relevant documents at a 95 per cent
confidence level and error less than 0.0001 will be the value of n which
satisfies the equation
0.0001 = 1.96\sqrt{\frac{(0.001)(0.999)}{n}\cdot\frac{50000 - n}{50000}}
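Concretely, this equation can be solved for n in closed form: squaring both sides and rearranging gives n = z²p(1 − p)N / (e²N + z²p(1 − p)), with z = 1.96, p = 0.001, N = 50000 and e = 0.0001. The short Python sketch below works this out; the reconstructed form of the equation (in particular the finite population correction (N − n)/N) is an assumption, but the figures are those given in the text.

# Sketch: sample size needed to estimate generality by simple random
# sampling without replacement, assuming an error bound of the form
#   e = z * sqrt( p*(1 - p)/n * (N - n)/N )
# (the finite population correction (N - n)/N is assumed from the
# reconstructed equation above).

def required_sample_size(N, p, e, z=1.96):
    """Solve e = z*sqrt(p*(1-p)/n * (N-n)/N) for n."""
    # Squaring both sides: e^2 = z^2 * p*(1-p) * (N - n) / (n * N)
    # Rearranging:         n   = z^2 * p*(1-p) * N / (e^2 * N + z^2 * p*(1-p))
    v = z * z * p * (1.0 - p)
    return v * N / (e * e * N + v)

if __name__ == "__main__":
    N = 50_000        # file size
    p = 50 / N        # generality: 50 relevant documents in 50 000
    e = 0.0001        # required error on the estimated proportion
    n = required_sample_size(N, p, e)
    print(f"required sample size: about {n:.0f} of {N} documents")

With these figures n comes to roughly 44 000, i.e. nearly the whole file would have to be assessed, which is precisely the difficulty with random sampling when generality is low.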