Information Retrieval Experiment
Chapter: The pragmatics of information retrieval experimentation
Author: Jean M. Tague
Editor: Karen Sparck Jones
Publisher: Butterworth & Company
Most information retrieval experiments are carried out for one or more of
the following purposes:
estimation;
comparison;
exploring relationships;
prediction.
To describe all the statistical tests which have been proposed for these
problems would require many volumes. This chapter, for the most part, will
simply indicate, for each of the four categories above, the factors which
determine the tests to use, rather than give the details of the tests themselves.
These can be found in many standard statistical texts. Good introductory
texts are Mendenhall et al.22 and Winkler and Hayes23. A rigorous
mathematical development will be found in Kendall and Stuart24. Noether25
is useful for non-parametric statistics.
The scale of measurement of the data determines whether classical or non-
parametric tests are appropriate. Classical statistical techniques such as T
tests, F tests (analysis of variance or ANOVA), regression, and product-
moment correlation may be applied if:
(1) the population is known to be normally distributed, or
(2) the data is continuous or discrete with a large set of values and the sample
size is large.
The inclusion of this second category is justified by the Central Limit
Theorem of statistics, which states that, even when the population is non-
normal, the distribution of the sample mean is approximately normal for large samples.
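The Central Limit Theorem can be seen directly by simulation. The following sketch (not from the original text; it uses only the Python standard library, and the choice of an exponential population is an illustrative assumption) draws many samples from a strongly skewed population and shows that the sample means cluster tightly around the population mean, with spread close to the theoretical sigma/sqrt(n):

```python
import random
import statistics

random.seed(42)

# Population: exponential with mean 1.0 -- clearly non-normal (skewed).
POP_MEAN = 1.0

def sample_mean(n):
    """Mean of one random sample of size n from the exponential population."""
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# Distribution of the sample mean for n = 100, over many repeated samples.
means = [sample_mean(100) for _ in range(2000)]

print(statistics.fmean(means))  # close to the population mean, 1.0
print(statistics.stdev(means))  # close to sigma / sqrt(n) = 1 / 10 = 0.1
```

A histogram of `means` would look approximately bell-shaped even though the underlying exponential population is not.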
The Chi Square goodness-of-fit test or the more efficient Kolmogorov-
Smirnov test may be used to test whether or not a sample appears to come
from a normal distribution.
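To make the normality check concrete, here is a minimal stdlib-only sketch of the one-sample Kolmogorov-Smirnov statistic against a normal distribution fitted to the sample (a simplifying assumption, not the original text's procedure; in practice one would use a library routine, and estimating the mean and standard deviation from the sample itself alters the critical values, the so-called Lilliefors correction):

```python
import math
import random
import statistics

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample):
    """One-sample K-S statistic D: the largest gap between the empirical
    CDF and a normal CDF with the sample's own mean and std deviation."""
    xs = sorted(sample)
    n = len(xs)
    mu = statistics.fmean(xs)
    sigma = statistics.stdev(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # The empirical CDF steps from i/n to (i+1)/n at x.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

random.seed(1)
normal_sample = [random.gauss(0, 1) for _ in range(500)]
skewed_sample = [random.expovariate(1.0) for _ in range(500)]

print(ks_statistic(normal_sample))  # small: consistent with normality
print(ks_statistic(skewed_sample))  # larger: departs from normality
```

A small D is consistent with a normal population; a large D, compared against tabulated critical values, leads to rejecting normality and hence to preferring non-parametric tests.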
Estimation
In estimation, one uses a sample statistic (the estimator) to estimate a
population parameter. By a sample statistic we mean some quantity which is
calculated from a set of sample observations. For example, the average
precision for all queries put to a retrieval system or the proportion of all users
of a system who are satisfied with it. A sample estimator, such as the sample
mean, is said to be unbiased if its expected value is equal to the population
parameter being estimated.
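Unbiasedness is easy to demonstrate by simulation. The sketch below (illustrative only; the uniform population and sample sizes are assumptions, not from the text) shows that the sample mean is unbiased, and contrasts the unbiased variance estimator (n - 1 divisor) with the biased one (n divisor), whose long-run average falls short of the true variance:

```python
import random
import statistics

random.seed(7)

# Population: uniform on [0, 10]; true mean 5.0, true variance 100/12.
TRUE_MEAN, TRUE_VAR = 5.0, 100.0 / 12.0

n, trials = 5, 20000
means, biased_vars, unbiased_vars = [], [], []
for _ in range(trials):
    sample = [random.uniform(0, 10) for _ in range(n)]
    means.append(statistics.fmean(sample))
    biased_vars.append(statistics.pvariance(sample))   # divides by n
    unbiased_vars.append(statistics.variance(sample))  # divides by n - 1

print(statistics.fmean(means))          # close to 5.0: the sample mean is unbiased
print(statistics.fmean(unbiased_vars))  # close to 8.33: n - 1 divisor is unbiased
print(statistics.fmean(biased_vars))    # close to 6.67: n divisor underestimates
```

The expected value of the n-divisor estimator is (n - 1)/n times the true variance, which is why the n - 1 divisor is used for unbiased estimation.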
The estimator is a random variable, i.e. its value will vary from sample to
sample and will not necessarily be equal to the population parameter. Thus,
in inference, it is useful to associate some kind of probabilities with the errors
that may be made in using the estimator in place of the true population value.
There are two ways of making such probability statements: standard errors
and confidence intervals.
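Both quantities are simple to compute. The sketch below (the precision scores are simulated, hypothetical data, and the large-sample 1.96 multiplier for a 95 per cent interval is an assumption appropriate only when the sample is reasonably large) estimates the mean precision over a sample of queries, its standard error, and a confidence interval:

```python
import math
import random
import statistics

random.seed(3)

# Hypothetical data: precision scores for a sample of 50 queries,
# clipped to the valid [0, 1] range.
precision = [min(1.0, max(0.0, random.gauss(0.45, 0.15))) for _ in range(50)]

n = len(precision)
mean = statistics.fmean(precision)
se = statistics.stdev(precision) / math.sqrt(n)  # standard error of the mean

# Large-sample 95% confidence interval: mean +/- 1.96 standard errors.
lo, hi = mean - 1.96 * se, mean + 1.96 * se
print(f"mean precision = {mean:.3f}, SE = {se:.3f}")
print(f"95% CI: ({lo:.3f}, {hi:.3f})")
```

For small samples the 1.96 would be replaced by the appropriate value from the t distribution with n - 1 degrees of freedom.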
The standard error of an estimator is its standard deviation. It indicates
how much the estimator varies from sample to sample. The greater the
standard error, the lower the reliability of the estimator. The reliability of an
estimator is thus related to its probability distribution. Some useful theorems
of probability theory provide information about two of the most important