Similarly, the standard error for the system precision estimator, $\pi$, is $\sqrt{\mathrm{Var}(\pi)}$ and is approximated by

$$\left\{ \frac{\left(\sum_{i=1}^{n} a_i\right)\left(\sum_{i=1}^{n} b_i\right)}{\left(\sum_{i=1}^{n} (a_i + b_i)\right)^{3}} \right\}^{1/2}$$

where

$a_i$ is the number of relevant and retrieved documents for the ith query,
$b_i$ is the number of non-relevant and retrieved documents for the ith query,
$c_i$ is the number of relevant and non-retrieved documents for the ith query.

The estimators p and $\pi$ will be approximately normal for large samples and hence their standard errors can be used to set up confidence intervals in the manner described above.

Comparison

In statistical inference, comparisons are carried out by hypothesis tests, i.e. tests of the null hypothesis that there is no difference among or between the treatments or factor levels. If the null hypothesis is rejected, then a difference has been shown to exist, with specified probabilities of making a wrong decision. The general procedure for hypothesis testing is as follows; a small worked example appears at the end of this section:

(1) State the null hypothesis H0 and the alternative hypothesis H1. The null hypothesis is generally the hypothesis of no difference; the alternative hypothesis may specify a difference in either direction or a difference in one direction only (e.g. one value > the other).
(2) Set a significance level, usually denoted α. The significance level is the probability that the null hypothesis will be rejected when it is actually true. It limits the probability of such Type 1 errors. A Type 2 error occurs when the null hypothesis is accepted when it is false. Its probability is denoted β, and 1 - β is called the power of the test. For a fixed sample size, as α is increased β usually decreases. In the usual hypothesis test only α is limited; however, β may also be limited, in some tests, by an appropriate choice of sample size. The usual significance levels are 0.05 or 0.01.
(3) Select a random sample from the population or populations being tested.
(4) From the sample values calculate the value of an appropriate test statistic. Like an estimator, the test statistic is a random variable. Its distribution under the null hypothesis must be known. Some commonly occurring test statistics are Z (the standard normal deviate), Student's t, F (in ANOVA), and chi-square.
(5) Compare the value of the test statistic with the critical value or values in tables of the appropriate probability distribution under the null hypothesis. The critical value will be that table value which gives a probability of α of rejecting H0.
(6) If the test statistic value lies outside (usually greater than and/or less than) the critical value or values, the null hypothesis is rejected. Otherwise it is accepted.

Hypothesis tests which may be used to compare two factors or treatments - for example, two search strategies, two indexing methods, two kinds of
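To make the estimation procedure described earlier in this section concrete, the following sketch (in Python, which the chapter itself does not use) computes the system precision estimator, its approximate standard error using the formula given above, and a 95 per cent confidence interval from the normal approximation. The per-query counts in the lists a and b are invented purely for illustration.

import math

# Hypothetical per-query counts for five queries (invented for illustration):
# a[i] = relevant and retrieved, b[i] = non-relevant and retrieved.
a = [12, 8, 15, 5, 10]
b = [18, 22, 10, 25, 15]

A = sum(a)          # total relevant and retrieved over all queries
B = sum(b)          # total non-relevant and retrieved
N = A + B           # total retrieved

precision = A / N                  # system precision estimator
se = math.sqrt(A * B / N ** 3)     # approximate standard error, as above

# 95 per cent confidence interval from the normal approximation
z = 1.96
lower, upper = precision - z * se, precision + z * se
print(f"precision = {precision:.3f}, s.e. = {se:.3f}, "
      f"95% interval = ({lower:.3f}, {upper:.3f})")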
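The six-step procedure above can be sketched in the same way. The fragment below, again in Python with invented per-query precision scores, compares two hypothetical search strategies by a paired t-test (Student's t of step (4)); the critical value 2.262 is the two-sided 0.05 table value of t with 9 degrees of freedom. It is a minimal sketch under the assumption that per-query precision is the effectiveness measure being compared, not an example taken from the chapter.

import math

# Invented per-query precision scores for ten queries under two search strategies.
strategy_1 = [0.40, 0.55, 0.30, 0.62, 0.48, 0.35, 0.70, 0.52, 0.44, 0.58]
strategy_2 = [0.36, 0.50, 0.28, 0.55, 0.49, 0.30, 0.64, 0.47, 0.40, 0.51]

# (1) H0: no difference in mean precision; H1: a difference in either direction.
# (2) Significance level alpha = 0.05, two-sided.
alpha = 0.05

# (3)-(4) Compute the paired-differences t statistic from the sample values.
d = [x - y for x, y in zip(strategy_1, strategy_2)]
n = len(d)
mean_d = sum(d) / n
var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
t = mean_d / math.sqrt(var_d / n)

# (5) Two-sided critical value of Student's t with n - 1 = 9 degrees of freedom
#     at alpha = 0.05, taken from tables.
t_crit = 2.262

# (6) Reject H0 if the test statistic lies outside the critical values.
decision = "reject" if abs(t) > t_crit else "accept"
print(f"t = {t:.2f}, critical value = {t_crit}; {decision} H0")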