One might say, then, that the state of the art consists of a number of more-or-less standard statistical techniques applied to the query set. It should be remembered, however, that such an approach deals with only part of the statistical problem.

Parametric and non-parametric statistics

Most methods of statistical inference (such as significance tests) in common use are based on assumptions about the population from which the sample is drawn, and in particular on assumptions about the distribution (in the population) of the particular variable being measured. Thus, for example, many significance tests assume an underlying normal (Gaussian) distribution. Any such statistical method is described as 'parametric'.

Unfortunately, many of the variables that one commonly wishes to measure in retrieval tests do not satisfy these criteria. A good example is recall: that is, the proportion of the relevant documents that are retrieved. Because there may well be few relevant documents for any given query, and because the values of recall for individual queries may be very widely spread, the distribution of recall values over queries tends to look very strange indeed. In particular, one tends to find many occurrences of the extreme values (0 or 100 per cent), and many occurrences of those values that happen to be low-denominator fractions (e.g. 75, 33, 60 per cent). Under such circumstances it is often difficult to find suitable parametric assumptions, and one has to have recourse to non-parametric methods. This is a fairly severe limitation: the range of non-parametric methods is somewhat restricted.

Sample size

Even supposing that the variable we are measuring would allow us, in principle, to apply some particular statistical test, are we likely to be able to obtain adequate samples of documents and queries for the test? This question has several aspects; I will consider first the purely statistical aspect of sample size.

As implied above, the documents seldom represent a problem in this context: it is normally easy enough to get hold of, and to input into the system(s), quite sufficient numbers of documents. (This is easiest if the documents are available in a suitable form; most difficult if some fundamentally new form of indexing has to be applied to them; but either way it can be done given only sufficient resources.) The real problem arises with the queries. I have suggested that 'trapping' the queries at an appropriate moment of their existence, and obtaining the necessary co-operation of the requesters, is by no means a trivial task. There is some evidence to suggest that the results of many past tests, relying on tens rather than hundreds of queries, are of doubtful validity for that reason if for no other. The problem is compounded by the large range of variation between queries in almost any variable of interest, and the comparatively small differences between systems that seem to be common.
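To make the contrast concrete, the sketch below (not from the original text) compares two hypothetical systems on per-query recall over the same ten queries, applying both a parametric paired t-test and its non-parametric counterpart, the Wilcoxon signed-rank test, via SciPy. The recall figures are invented for illustration; note how they cluster at the extremes and at low-denominator fractions, exactly the behaviour that makes parametric assumptions doubtful.

```python
# A minimal sketch, assuming invented per-query recall values for two
# hypothetical systems. Note the extreme values (0.0, 1.0) and the
# low-denominator fractions (1/3, 2/3, 3/4) typical of queries with
# few relevant documents.
from scipy.stats import ttest_rel, wilcoxon

recall_system_a = [0.00, 1.00, 0.33, 0.75, 0.50, 1.00, 0.00, 0.67, 0.25, 0.60]
recall_system_b = [0.20, 1.00, 0.67, 0.50, 1.00, 0.75, 0.40, 0.67, 0.55, 0.45]

# Parametric paired t-test: assumes the per-query differences are drawn
# from a normal distribution -- doubtful for a variable distributed as
# lumpily as recall.
t_stat, t_p = ttest_rel(recall_system_a, recall_system_b)

# Non-parametric Wilcoxon signed-rank test: ranks the signed differences
# instead of assuming any particular underlying distribution.
w_stat, w_p = wilcoxon(recall_system_a, recall_system_b)

print(f"paired t-test:        p = {t_p:.3f}")
print(f"Wilcoxon signed-rank: p = {w_p:.3f}")
```

With a sample of only ten queries, even a genuine difference between the systems is unlikely to reach significance given the wide per-query variation, which illustrates the weakness of tests relying on tens rather than hundreds of queries.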
But the question of sample adequacy is very much wider than that of numbers. We have to consider whether we can take a genuinely random