IRE
Information Retrieval Experiment
The pragmatics of information retrieval experimentation
chapter
Jean M. Tague
Butterworth & Company
Karen Sparck Jones
the interquartile range, or range within which the middle 50 per cent of the
observations lie.
Measures of association will be considered later.
The appropriate measure to use depends to a large degree upon the scale
of the observations. Four scales of measurement are distinguished in social
science research:
nominal: names or categories;
ordinal: ranks;
interval: numbers;
ratio: numbers with a zero point.
The last two are the true quantitative scales. In statistical analysis, the
distinction between interval and ratio is of no particular value. A much more
important distinction so far as type of analysis is concerned is between
discrete and continuous variables (i.e. variables which are counts versus
variables which can take any real value in an interval).
Arithmetic operations can be properly applied only to numbers. Thus,
means and standard deviations should not be calculated for ordinal data, but
medians and interquartile ranges, which require only a ranking of the
observations, may be.
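As a minimal sketch of the point above, the following Python fragment summarizes ordinal data using ranking alone: each summary value is an observation picked out by its position in the sorted list, so no arithmetic is performed on the observations themselves. The function names, the rule used to choose the quartile positions (one of several conventions in use) and the relevance judgements are assumptions made for illustration only.

```python
import math


def order_statistic(sorted_values, fraction):
    """Return the smallest observation that has at least `fraction`
    of the observations at or below it (hypothetical helper)."""
    n = len(sorted_values)
    rank = max(1, math.ceil(fraction * n))  # 1-based position in the ranking
    return sorted_values[rank - 1]


def ordinal_summary(values):
    """Median and quartiles obtained by ranking only."""
    ordered = sorted(values)
    return {
        "lower_quartile": order_statistic(ordered, 0.25),
        "median": order_statistic(ordered, 0.50),
        "upper_quartile": order_statistic(ordered, 0.75),
    }


# Hypothetical relevance judgements on a five-point ordinal scale.
judgements = [3, 1, 4, 2, 5, 3, 2, 4, 3]
summary = ordinal_summary(judgements)
print(summary)
# {'lower_quartile': 2, 'median': 3, 'upper_quartile': 4}
```

Here the interquartile range is reported as the interval from the lower to the upper quartile (2 to 4) rather than as a difference, since subtraction would again assume an interval scale.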
Any variable that is essentially a count (such as the number of relevant
documents) or some function of counts (such as recall and precision) can
be considered to be on a ratio scale. No value judgement is implied by saying
that one method has twice the precision of a second; one is simply stating a
numerical fact about the ratio of the two values. It does not necessarily mean that the
first method is twice as good as the second, any more than a height of 8 feet
is twice as good as a height of 4 feet. Appropriate methods depend on the
scale of the observations, not their value to the user or other individual. This
point is important because many information retrieval investigators have
shied away from classical statistics when there was no real reason to do so.
Any set of numbers (counts, proportions, logarithms) can be averaged.
Normality is not essential. It does not affect the validity of descriptive
statistics, although it may affect their value. Normality is important in
determining appropriate tests in statistical inference.
There are, however, problems with averaging recall and precision over a
set of queries. These relate to the method of averaging. Two kinds are
possible:
average of numbers (microaveraging);
average of ratios (macroaveraging).
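The two kinds of averaging listed above can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the (retrieved, relevant) counts are invented for the example and are not those of Table 5.1.

```python
def microaverage_precision(queries):
    """Average of numbers: total the counts over all queries,
    then form a single ratio."""
    total_retrieved = sum(retrieved for retrieved, _ in queries)
    total_relevant = sum(relevant for _, relevant in queries)
    return total_relevant / total_retrieved


def macroaverage_precision(queries):
    """Average of ratios: compute precision for each query,
    then take the mean of those values."""
    ratios = [relevant / retrieved for retrieved, relevant in queries]
    return sum(ratios) / len(ratios)


# Hypothetical (retrieved, relevant retrieved) counts for four queries.
queries = [(20, 5), (4, 2), (10, 5), (1, 1)]

micro = microaverage_precision(queries)  # 13 / 35
macro = macroaverage_precision(queries)  # (0.25 + 0.5 + 0.5 + 1.0) / 4
print(round(micro, 3), round(macro, 4))
# 0.371 0.5625
```

The two values differ because macroaveraging gives every query equal weight, so the query with a single retrieved reference counts as much as the query with twenty, whereas microaveraging weights each query by the number of references it retrieved.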
If four queries have the precision values shown in Table 5.1,
TABLE 5.1

Query No.   No. of retrieved   No. of relevant   Precision
            references         references
1           25                 10                0.6
2           5                  2                 0.4
3           10                 5                 0.5
4           1                  1                 1.0
Total       41                 18                2.5