Statistical ideas and questions
One might say, then, that the state of the art consists of a number of more-or-less standard statistical techniques applied to the query set. It should be remembered, however, that such an approach deals with only part of the statistical problem.
Parametric and non-parametric statistics
Most methods of statistical inference (such as significance tests) in common
use are based on assumptions about the population from which the sample is
drawn, and in particular on assumptions about the distribution (in the
population) of the particular variable being measured. Thus for example
many significance tests assume an underlying normal (Gaussian) distribution.
Any such statistical method is described as 'parametric'.
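To make the idea concrete, here is a minimal sketch (in Python, with invented per-query scores) of a typical parametric test, the paired t-test, whose validity rests on the paired differences being approximately normally distributed:

    # Paired t-test: a typical parametric significance test. Its
    # validity rests on the assumption that the per-query score
    # differences are approximately normally distributed.
    from scipy import stats

    # Hypothetical per-query effectiveness scores for two systems.
    system_a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.73, 0.51]
    system_b = [0.58, 0.49, 0.70, 0.45, 0.60, 0.57, 0.69, 0.50]

    t_statistic, p_value = stats.ttest_rel(system_a, system_b)
    print(f"t = {t_statistic:.3f}, p = {p_value:.3f}")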
Unfortunately, many of the variables that one commonly wishes to
measure in retrieval tests do not satisfy these criteria. A good example is
recall: that is, the proportion of the relevant documents that are retrieved.
Because there may well be few relevant documents for any given query, and
because the values of recall for individual queries may be very widely spread,
the distribution of recall values over queries tends to look very strange
indeed. In particular, one tends to find many occurrences of the extreme
values (0 or 100 per cent), and many occurrences of those values that happen
to be low-denominator fractions (e.g. 75, 33, 60 per cent).
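A small simulation illustrates why. In the following sketch (the numbers of relevant documents per query and the retrieval probability are invented for illustration), recall for a query with n relevant documents can only take the values 0/n, 1/n, ..., n/n, so the extremes and the low-denominator fractions dominate:

    # Simulate per-query recall when each query has only a few
    # relevant documents.
    import random
    from collections import Counter

    random.seed(1)
    recall_values = []
    for _ in range(1000):
        n_relevant = random.randint(1, 5)       # few relevant documents
        retrieved = sum(random.random() < 0.5   # each retrieved with p = 0.5
                        for _ in range(n_relevant))
        recall_values.append(round(100 * retrieved / n_relevant))

    # The counts pile up on 0, 100, and values like 50, 33, 67.
    for value, count in sorted(Counter(recall_values).items()):
        print(f"{value:3d} per cent: {count}")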
Under such circumstances it is often difficult to find suitable parametric
assumptions, and one has to have recourse to non-parametric methods. This
is a fairly severe limitation: the range of non-parametric methods is
somewhat restricted.
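By way of example (again with invented per-query recall figures), the Wilcoxon signed-rank test is a standard non-parametric counterpart of the paired t-test: it uses only the signs and ranks of the per-query differences, not their distribution:

    # Wilcoxon signed-rank test: a non-parametric alternative that
    # makes no normality assumption about the differences.
    from scipy import stats

    # Hypothetical per-query recall (per cent) for two systems; note
    # the extreme and low-denominator values typical of recall.
    system_a = [0, 100, 50, 33, 100, 0, 67, 50, 100, 25]
    system_b = [0, 100, 33, 33,  75, 0, 50, 50, 100,  0]

    # zero_method='zsplit' retains queries on which the systems tie.
    result = stats.wilcoxon(system_a, system_b, zero_method='zsplit')
    print(f"W = {result.statistic:.1f}, p = {result.pvalue:.3f}")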
Sample size
Even supposing that the variable we are measuring would allow us, in
principle, to apply some particular statistical test, are we likely to be able to
obtain adequate samples of documents and queries for the test? This question
has several aspects; I will consider first the purely statistical aspect of sample
size.
As implied above, the documents seldom represent a problem in this
context: it is normally easy enough to get hold of, and to input into the
system(s), quite sufficient numbers of documents. (This is easiest if the
documents are available in a suitable form; most difficult if some
fundamentally new form of indexing has to be applied to them; but either
way it can be done given only sufficient resources.)
The real problem arises with the queries. I have suggested that 'trapping'
the queries at an appropriate moment of their existence, and obtaining the
necessary co-operation of the requesters, is by no means a trivial task. There
is some evidence to suggest that the results of many past tests, relying on tens
rather than hundreds of queries, are of doubtful validity for that reason if for
no other. The problem is compounded by the large range of variation between
queries of almost any variable of interest, and the comparatively small
differences between systems that seem to be common.
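A rough power calculation makes the point about numbers concrete. The following sketch uses a normal-approximation formula only as a rough guide (the mean difference and the spread of per-query differences are invented, though of plausible orders of magnitude):

    # Rough estimate of the number of queries needed to detect a
    # small mean difference between systems, given a large spread
    # of per-query differences.
    from scipy.stats import norm

    delta = 5.0    # hypothesized mean recall difference (per cent)
    sigma = 30.0   # std. dev. of per-query differences (wide spread)
    alpha, power = 0.05, 0.80

    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_beta = norm.ppf(power)
    n = ((z_alpha + z_beta) * sigma / delta) ** 2
    print(f"about {n:.0f} queries")     # hundreds rather than tens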
But the question of sample adequacy is very much wider than that of
numbers. We have to consider whether we can take a genuinely random