The pragmatics of information retrieval experimentation
Jean M. Tague

In: Information Retrieval Experiment, edited by Karen Sparck Jones. Butterworth & Company.
Similarly, the standard error for the system precision estimator, π̂, is

\[
\sigma(\hat{\pi}) = \left[ \frac{\pi(1-\pi)}{\sum_i (a_i + b_i)} \right]^{1/2}
\]

and is approximated by

\[
s(\hat{\pi}) = \left[ \frac{\left(\sum_i a_i\right)\left(\sum_i b_i\right)}{\left(\sum_i (a_i + b_i)\right)^{3}} \right]^{1/2}
\]
where a_i is the number of relevant and retrieved documents for the ith query, b_i is
the number of non-relevant and retrieved documents for the ith query, and c_i is
the number of relevant and non-retrieved documents for the ith query. The
estimators ρ̂ and π̂ will be approximately normal for large samples and hence
their standard errors can be used to set up confidence intervals in the manner
described above.
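As a rough illustration of these formulas, the following Python sketch computes the system precision estimate π̂ = Σa_i / Σ(a_i + b_i), its approximate standard error from the expression above, and a large-sample 95 per cent confidence interval. The per-query counts are invented purely for the example.

```python
import math

# Hypothetical per-query counts (illustrative only):
# a[i] = relevant and retrieved, b[i] = non-relevant and retrieved,
# c[i] = relevant and not retrieved, for the ith query.
a = [12, 8, 15, 5, 10]
b = [18, 22, 10, 25, 15]
c = [6, 4, 9, 7, 5]

sum_a, sum_b, sum_c = sum(a), sum(b), sum(c)

# System precision and recall estimators (ratios taken over all queries).
precision = sum_a / (sum_a + sum_b)
recall = sum_a / (sum_a + sum_c)

# Approximate standard error of the precision estimator:
# s(pi_hat) = sqrt( (sum a_i)(sum b_i) / (sum (a_i + b_i))^3 )
se_precision = math.sqrt(sum_a * sum_b / (sum_a + sum_b) ** 3)

# Large-sample 95% confidence interval using the normal approximation
# (z = 1.96 for a two-sided 0.95 interval).
z = 1.96
lower = precision - z * se_precision
upper = precision + z * se_precision

print(f"precision = {precision:.3f}, s.e. = {se_precision:.3f}")
print(f"95% confidence interval: ({lower:.3f}, {upper:.3f})")
```

The same pattern applies to the recall estimator ρ̂, with the c_i counts taking the place of the b_i counts in the denominator and in the standard error.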
Comparison
In statistical inference, comparisons are carried out by hypothesis tests, i.e.
tests of the null hypothesis that there is no difference among or between the
treatments or factor levels. If the null hypothesis is rejected, then a difference
has been shown to exist, with specified probabilities of making a wrong
decision. The general procedure for hypothesis testing is as follows:
(1) State the null hypothesis H0 and the alternative hypothesis H1. The null
hypothesis is generally the hypothesis of no difference; the alternative
hypothesis may specify a difference in either direction or a difference in
one direction only (e.g. one value > the other).
(2) Set a significance level, usually denoted α. The significance level is the
probability that the null hypothesis will be rejected when it is actually true. It
limits the probability of such Type 1 errors. A Type 2 error occurs when
the null hypothesis is accepted when it is false. Its probability is denoted
β, and 1 - β is called the power of the test. For a fixed sample size, usually
as α is increased β decreases. In the usual hypothesis test only α is limited;
however, β may also be limited, in some tests, by an appropriate choice of sample
size. The usual significance levels are 0.05 or 0.01.
(3) Select a random sample from the population or populations being tested.
(4) From the sample values calculate the value of an appropriate test
statistic. Like an estimator, the test statistic is a random variable. Its
distribution under the null hypothesis must be known. Some commonly
occurring test statistics are Z (standard normal deviate), Student's t, F
(in ANOVA), and chi-square.
(5) Compare the value of the test statistic with the critical value or values in
tables of the appropriate probability distribution under the null
hypothesis. The critical value will be that table value which gives a
probability of α of rejecting H0.
(6) If the test statistic value lies outside (usually greater than and/or less
than) the critical value or values, the null hypothesis is rejected. Otherwise
it is accepted.
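The six steps can be made concrete with a small Python sketch. It compares the mean per-query precision of two hypothetical search strategies using a paired Student's t test (one of the statistics listed in step (4)); the per-query scores are invented, and the sketch assumes the scipy library is available.

```python
from scipy import stats

# Step (1): H0: no difference in mean per-query precision between strategies
#           A and B; H1: the means differ (two-sided alternative).
# Step (2): significance level alpha = 0.05.
alpha = 0.05

# Step (3): per-query precision scores for the same random sample of queries
# run under both strategies (invented values for illustration).
strategy_a = [0.40, 0.55, 0.30, 0.62, 0.48, 0.51, 0.35, 0.44]
strategy_b = [0.33, 0.50, 0.28, 0.55, 0.47, 0.42, 0.30, 0.41]

# Step (4): compute the test statistic. With paired observations whose
# differences are roughly normal, Student's t is an appropriate statistic.
t_stat, p_value = stats.ttest_rel(strategy_a, strategy_b)

# Step (5): the critical value for a two-sided test with n - 1 degrees of freedom.
n = len(strategy_a)
t_critical = stats.t.ppf(1 - alpha / 2, df=n - 1)

# Step (6): reject H0 if the test statistic lies outside the critical values.
print(f"t = {t_stat:.3f}, critical value = ±{t_critical:.3f}, p = {p_value:.3f}")
if abs(t_stat) > t_critical:
    print("Reject H0: a difference between the strategies has been shown.")
else:
    print("Accept H0: no difference has been demonstrated at this level.")
```

With a one-directional alternative hypothesis in step (1), the critical value would instead be taken at probability α in a single tail of the t distribution.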
Hypothesis tests which may be used to compare two factors or treatments,
for example two search strategies, two indexing methods, two kinds of