IRE
Information Retrieval Experiment
The methodology of information retrieval experiment
chapter
Stephen E. Robertson
Butterworth & Company
Karen Sparck Jones
Statistical ideas and questions
hypothesis on the basis of a single test, one test is certainly an insufficient
basis for acceptance: one must look for a number of different ways to test a
hypothesis before accepting it, even if only provisionally. In information
retrieval, this has generally meant testing on several different test collections
(of documents, queries and relevance judgements). The reason for this form
of multiple testing is that the most obvious variable (which could cause a
hypothesis which works under some conditions to fail under others) is
subject: the different test collections are usually in different subject areas. But
little attention has as yet been paid to other variables which might cause
problems, such as document or query type, or heterogeneity of the document
collection in terms of subject matter or date. This lack is partly a function of
availability of resources: as discussed above, test facilities which would allow
such tests to be made do not exist at present and would be expensive to set up.
As I have indicated, this scarcity of results from laboratory tests on the
various variables associated with document and query collections which
might influence the results of retrieval tests is also unfortunate from the point
of view of operational system testers. It is to be hoped that more work will be
done on these problems.
Experimental design
So far, I have assumed the problem to be: 'Given the results of this test, what
can we infer?'. But one can also approach the statistical aspects from the
opposite direction: 'Given the sort of inferences I am looking for, how should
I design my test to ensure that I get suitable results?'.
The obvious and commonest application of this idea is to sample size.
Suppose that we want to ensure (at least to a certain level of confidence) that,
if system A really performs so much better than system B, then the test results
will lead to the correct inference. Assuming we know in advance which
significance test we are going to use, and something about the distributions
of the variables we are measuring, then it is possible to specify a minimum
sample size to achieve this aim.
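As a rough illustration of such a calculation, the sketch below estimates a
minimum number of queries. It rests on assumptions that are not part of the
text: a two-sided paired test on per-query score differences between the two
systems, a normal approximation, and a guessed standard deviation of those
differences. The function name min_queries and its parameters are
hypothetical, chosen only for this example.

```python
import math
from statistics import NormalDist


def min_queries(delta, sigma_d, alpha=0.05, power=0.8):
    """Approximate minimum number of queries for a paired comparison.

    delta   -- smallest true mean difference (A minus B) worth detecting
    sigma_d -- assumed standard deviation of the per-query differences
    alpha   -- two-sided significance level of the test
    power   -- desired probability of detecting a true difference of delta

    Uses the normal-approximation formula
        n = ((z_{1 - alpha/2} + z_{power}) * sigma_d / delta) ** 2
    which is an assumption of this sketch, not a method from the text.
    """
    if delta <= 0 or sigma_d <= 0:
        raise ValueError("delta and sigma_d must be positive")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must lie strictly between 0 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Round up, since a fraction of a query cannot be run.
    return math.ceil(((z_alpha + z_power) * sigma_d / delta) ** 2)


if __name__ == "__main__":
    # Illustrative values only: a difference of 0.05 in some per-query
    # measure, with an assumed spread of 0.2 in the differences.
    print(min_queries(delta=0.05, sigma_d=0.2))
```

The estimate is only as good as the assumed spread of the differences and the
assumed form of the test, which is exactly the knowledge about distributions
that the paragraph above takes as given.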
Because of the difficulties of finding suitable methods, few testers actually
do statistical significance tests, let alone define the minimum sample size in
advance. So this kind of procedure is not yet common in retrieval tests,
though it should become more so.
A second procedure common in experimental design generally is concerned
with the control of variables. Suppose that we are to do a test involving a
small number of searchers (intermediaries) on a number of different systems.
The object of the exercise is to compare the systems, but it may be that the
choice of searcher will have a strong influence on the results for an individual
query. Further, this influence may depend on the combination of searcher
and system, rather than just the searcher. So we must devise a method for
ensuring that the variations between searchers do not in any way distort the
comparison between the systems. There are well established methods, such
as Latin square designs, for coping with this kind of problem; some such
methods have been used to good effect in retrieval tests.
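A Latin square allocation of this kind might be sketched as follows. The
sketch assumes equal numbers of searchers, systems and query groups, and uses
a cyclic square with shuffled rows, columns and system labels. This is a
simple randomization, not a uniform draw from all Latin squares. The names
latin_square and assign, and the labels in the usage example, are
hypothetical.

```python
import random


def latin_square(n):
    """Return the n x n cyclic Latin square over the symbols 0..n-1."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]


def assign(searchers, systems, query_groups, seed=None):
    """Map each (searcher, query group) pair to the system to be used.

    Every searcher uses every system exactly once, and every query group
    is searched on every system exactly once, so differences between
    searchers and between query groups are balanced across the systems.
    """
    n = len(systems)
    if len(searchers) != n or len(query_groups) != n:
        raise ValueError("need equal numbers of searchers, systems and query groups")
    rng = random.Random(seed)
    square = latin_square(n)
    rng.shuffle(square)        # permute rows (searchers)
    cols = list(range(n))
    rng.shuffle(cols)          # permute columns (query groups)
    labels = list(systems)
    rng.shuffle(labels)        # permute which system each symbol stands for
    return {
        (searchers[r], query_groups[c]): labels[square[r][cols[c]]]
        for r in range(n)
        for c in range(n)
    }


if __name__ == "__main__":
    plan = assign(["S1", "S2", "S3"], ["A", "B", "C"], ["Q1", "Q2", "Q3"], seed=1)
    for (searcher, group), system in sorted(plan.items()):
        print(searcher, group, system)
```

Permuting rows, columns and labels preserves the Latin property, so the
balance holds whatever seed is chosen. A basic square of this kind balances
the main effects of searcher and query group; it does not by itself estimate
the searcher-by-system interaction mentioned above, which needs replication
or a fuller design.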
Again, suppose we are testing alternative relevance feedback procedures.
The problem is to isolate, in some way, the effect of the relevance feedback
from the performance of the system without feedback. This is not an entirely