IRE Information Retrieval Experiment
The pragmatics of information retrieval experimentation
Jean M. Tague
Karen Sparck Jones (ed.)
Butterworth & Company

Most information retrieval experiments are carried out for one or more of the following purposes: estimation; comparison; exploring relationships; prediction. To describe all the statistical tests which have been proposed for these problems would require many volumes. This chapter will, for the most part, simply indicate, for each of the four categories above, the factors which determine the tests to use, rather than give the details of the tests themselves. These can be found in many standard statistical texts. Good introductory texts are Mendenhall et al.22 and Winkler and Hayes23. A rigorous mathematical development will be found in Kendall and Stuart24. Noether25 is useful for non-parametric statistics.

The scale of measurement of the data determines whether classical or non-parametric tests are appropriate. Classical statistical techniques such as t tests, F tests (analysis of variance or ANOVA), regression, and product-moment correlation may be applied if: (1) the population is known to be normally distributed, or (2) the data is continuous, or discrete with a large set of values, and the sample size is large. The inclusion of this second category is justified by the Central Limit Theorem of statistics, which says that, even though the population is non-normal, the sample means will be approximately normal for large samples.
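The effect the theorem describes is easy to see by simulation. The following sketch (not from the chapter; the exponential population, sample size, and number of samples are arbitrary illustrative choices) draws repeated samples from a markedly non-normal population and shows that their means cluster tightly and symmetrically around the population mean:

```python
import random
import statistics

# Illustrating the Central Limit Theorem: the population here is
# exponential with mean 1 (strongly right-skewed), yet the means of
# repeated samples of size 50 are approximately normally distributed
# around the population mean, with standard deviation sigma/sqrt(n).
random.seed(42)

sample_size = 50    # observations per sample
n_samples = 2000    # number of repeated samples

sample_means = [
    statistics.mean(random.expovariate(1.0) for _ in range(sample_size))
    for _ in range(n_samples)
]

grand_mean = statistics.mean(sample_means)  # close to the population mean, 1.0
spread = statistics.stdev(sample_means)     # close to 1 / sqrt(50) ≈ 0.141

print(f"mean of sample means: {grand_mean:.3f}")
print(f"std. dev. of sample means: {spread:.3f}")
```

Increasing the sample size shrinks the spread of the sample means in proportion to one over the square root of the sample size, which is why large samples justify classical tests even for non-normal data.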
The chi-square goodness-of-fit test or the more efficient Kolmogorov-Smirnov test may be used to test whether or not a sample appears to come from a normal distribution.

Estimation

In estimation, one uses a sample statistic (the estimator) to estimate a population parameter. By a sample statistic we mean some quantity which is calculated from a set of sample observations: for example, the average precision for all queries put to a retrieval system, or the proportion of all users of a system who are satisfied with it. A sample estimator, such as the sample mean, is said to be unbiased if its expected value is equal to the population parameter being estimated. The estimator is a random variable, i.e. its value will vary from sample to sample and will not necessarily be equal to the population parameter. Thus, in inference, it is useful to associate probabilities with the errors that may be made in using the estimator in place of the true population value. There are two ways of making such probability statements: standard errors and confidence intervals.

The standard error of an estimator is its standard deviation. It indicates how much the estimator varies from sample to sample. The greater the standard error, the lower the reliability of the estimator. The reliability of an estimator is thus related to its probability distribution. Some useful theorems of probability theory provide information about two of the most important