IRE
Information Retrieval Experiment
The methodology of information retrieval experiment
chapter
Stephen E. Robertson
Butterworth & Company
Karen Sparck Jones
articles) as our test collection of documents, but suppose that five years hence
the proportion of journal articles will be more like 50 per cent.
System A, as it happens, is based on the title of documents, whereas system
B involves some intellectual indexing. Because research reports are on the
whole longer and more substantial documents than journal articles, they are
represented (on the whole) by more index terms in system B; but their titles
are of very similar length, so in A the two types of documents tend to have
similar size representations.
Under these conditions, we might surmise, system B is a good deal more
expensive than system A, and works considerably better on reports but
roughly the same on articles. Thus our test will show a marginal performance
advantage to B, but at greatly increased cost; on a cost-effectiveness basis, we
might well feel justified in choosing A.
But as the proportion of research reports rises in the future, the average
performance difference between the systems will increase. So we may have
made a mistake, as far as the situation in five years' time is concerned.
The questions that arise from this example are: how could we detect this
change in the makeup of the collection; how could we assess its importance;
and how could we make appropriate adjustments to our results. These
questions are closely connected because we are only interested in looking for
changes that may be important. The problem is, we have little idea of which
variables may have major effects. Below, I discuss the paucity of results from
laboratory tests that might help in this situation.
So, for the tester of operational systems, the only way ahead is to make a
guess at any variables that may be important. The question of how to detect
changes in these variables is clearly one of observation and further guesswork.
In the example discussed above, suppose that we guess, at the time of the test,
that the type of document (or the proportion of different types) might be a
source of problems. Then we could examine current input to the system (as
against the existing cumulated collection) to see whether such a change
might already be happening. We could also look at the sources of documents
and any changes that may be happening in the publication process.
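One way to make the comparison of current input against the cumulated collection concrete is a simple two-proportion z statistic. This is an illustrative choice of technique, not one prescribed in the text. The function name and all counts below are hypothetical, and the sketch assumes that the two sets of documents can be treated as independent samples.

```python
import math


def two_proportion_z(count_1, n_1, count_2, n_2):
    """z statistic for the difference between two proportions.

    count_1 / n_1: e.g. research reports in the recent input.
    count_2 / n_2: e.g. research reports in the cumulated collection.
    Uses the pooled proportion for the standard error.
    """
    p_1 = count_1 / n_1
    p_2 = count_2 / n_2
    pooled = (count_1 + count_2) / (n_1 + n_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    return (p_1 - p_2) / std_err


# Hypothetical counts: 100 reports among 500 recently added documents,
# against 1000 reports among 10000 documents in the cumulated collection.
z = two_proportion_z(100, 500, 1000, 10000)
print(f"z = {z:.2f}")
```

A large absolute value of z suggests only that the makeup of the input is shifting; whether the shift matters for performance is a separate question.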
Having detected a change in some variable, we want to find out whether
it may have important effects. We could, in principle, include this question
in our experimental design: in the example, we may have to divide the
collection into journal articles and research reports, and make separate
measurements on the two collections. Finally, we want to make appropriate
predictions. This would involve guesstimating the possible proportion of
journal articles in five years' time (or at different times over the expected
lifetime of the system), and weighting the results of our test appropriately.
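The weighting step can be sketched as follows. All scores and proportions are hypothetical, chosen only to mirror the example (B equal to A on articles, better on reports). The sketch also assumes that overall performance is a simple proportion-weighted average of the per-type measurements, which is a simplification for many retrieval measures.

```python
def weighted_performance(per_type_scores, mix):
    """Average per-type scores, weighted by the share of each document type."""
    total = sum(mix.values())
    return sum(per_type_scores[doc_type] * share
               for doc_type, share in mix.items()) / total


# Hypothetical per-type effectiveness scores from separate measurements.
scores_a = {"article": 0.50, "report": 0.50}
scores_b = {"article": 0.50, "report": 0.70}

# Hypothetical collection makeup at test time, and a guess for five years on.
mix_now = {"article": 0.9, "report": 0.1}
mix_future = {"article": 0.5, "report": 0.5}


def advantage_of_b(mix):
    """Projected performance difference (B minus A) for a given makeup."""
    return weighted_performance(scores_b, mix) - weighted_performance(scores_a, mix)


print(f"advantage of B now:    {advantage_of_b(mix_now):.2f}")
print(f"advantage of B future: {advantage_of_b(mix_future):.2f}")
```

Repeating the calculation for several guessed proportions over the expected lifetime of the system shows how sensitive the choice between A and B is to the guess.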
Artificial queries
The foregoing discussion of sample adequacy assumes that the samples are
taken from a situation X, we wish to make inferences about a situation Y,
and we can make some reasonable guesses about the relation between X and
Y. Earlier, I suggested that there are sometimes strong reasons for
constructing artificial queries rather than acquiring real ones. Obviously, a
set of artificial queries is in no sense a sample of any real population, either