IRE Information Retrieval Experiment The pragmatics of information retrieval experimentation chapter Jean M. Tague Butterworth & Company Karen Sparck Jones

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

the interquartile range, or range within which the middle 50 per cent of the observations lie. Measures of association will be considered later.

The appropriate measure to use depends to a large degree upon the scale of the observations. Four scales of measurement are distinguished in social science research: nominal (names or categories); ordinal (ranks); interval (numbers); ratio (numbers with a zero point). The last two are the true quantitative scales. In statistical analysis, the distinction between interval and ratio is of no particular value. A much more important distinction so far as type of analysis is concerned is between discrete and continuous variables (i.e. variables which are counts versus variables which can take any real value in an interval).

Arithmetic operations can be properly applied only to numbers. Thus, means and standard deviations should not be calculated for ordinal data, but medians and interquartile ranges, which require only a ranking of the observations, may be. Any variable that is essentially a count (such as number of relevant documents) or some function of counts (such as recall and precision) can be considered to lie on a ratio scale. No value judgement is implied by saying that one method has twice the precision of a second; one is simply stating a numerical fact about the ratio of the two values.
It does not necessarily mean that the first method is twice as good as the second, any more than a height of 8 feet is twice as good as a height of 4 feet. Appropriate methods depend on the scale of the observations, not on their value to the user or other individual.

This point is important because many information retrieval investigators have shied away from classical statistics when there was no real reason to do so. Any set of numbers (counts, proportions, logarithms) can be averaged. Normality is not essential. It does not affect the validity of descriptive statistics, although it may affect their value. Normality is important in determining appropriate tests in statistical inference.

There are, however, problems with averaging recall and precision over a set of queries. These relate to the method of averaging. Two kinds are possible: average of numbers (microaveraging), in which the counts are totalled over all queries before the ratio is formed; average of ratios (macroaveraging), in which the ratio is formed for each query and the ratios are then averaged. If four queries have the precision values shown in Table 5.1,

TABLE 5.1

Query No.   No. of retrieved references   No. of relevant references   Precision
1           25                            10                           0.6
2            5                             2                           0.4
3           10                             5                           0.5
4            1                             1                           1.0
Total       41                            18                           2.5
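
The difference between the two kinds of averaging can be made concrete with a minimal sketch in Python. The function names and the query counts below are hypothetical, chosen only for illustration; they are not the values of Table 5.1. Each query is assumed to have retrieved at least one reference, since precision is undefined otherwise.

```python
def micro_average_precision(counts):
    """Average of numbers: total the counts first, then form one ratio.

    counts is a list of (retrieved, relevant_retrieved) pairs, one per query.
    """
    total_retrieved = sum(retrieved for retrieved, _ in counts)
    total_relevant = sum(relevant for _, relevant in counts)
    return total_relevant / total_retrieved


def macro_average_precision(counts):
    """Average of ratios: form precision per query, then average the ratios."""
    ratios = [relevant / retrieved for retrieved, relevant in counts]
    return sum(ratios) / len(ratios)


# Hypothetical data (assumption, not taken from the text):
# (number retrieved, number of those that are relevant) for four queries.
queries = [(20, 5), (10, 5), (4, 3), (1, 1)]

micro = micro_average_precision(queries)  # 14 relevant out of 35 retrieved
macro = macro_average_precision(queries)  # mean of 0.25, 0.5, 0.75 and 1.0

print(f"microaverage: {micro:.3f}")
print(f"macroaverage: {macro:.3f}")
```

With these illustrative counts the microaverage is 14/35 = 0.4, whereas the macroaverage is 2.5/4 = 0.625. The microaverage is dominated by the queries that retrieve many references, while the macroaverage gives every query equal weight, so a query that retrieves a single relevant reference pulls it up sharply.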