IRE
Information Retrieval Experiment
The pragmatics of information retrieval experimentation
chapter
Jean M. Tague
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
The usual method of determining whether or not two variables are related
when at least one of the variables is non-quantitative is the chi-square
contingency table test. For example, a sample survey of personnel in
an organization gave the counts shown in Table 5.7.
TABLE 5.7
Employment class      Used online retrieval systems
                      Yes     No
Manager                 5     28
Scientist/engineer     30      4
Technician             15     29
Clerical               10     19
A chi-square statistic calculated for the table indicates whether
system use is dependent on or independent of employment classification.
(In fact, the null hypothesis of no relationship is rejected.)
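The calculation for Table 5.7 can be sketched as follows. This is a minimal illustration in plain Python, not a procedure given in the text: the function name chi_square and the choice of a 5 per cent significance level (critical value 7.815 for 3 degrees of freedom) are assumptions made for the example. Expected counts are taken, as usual, to be row total times column total divided by the grand total.

```python
# Observed counts from Table 5.7: (used online systems, did not use them).
observed = {
    "Manager": (5, 28),
    "Scientist/engineer": (30, 4),
    "Technician": (15, 29),
    "Clerical": (10, 19),
}


def chi_square(table):
    """Return (statistic, degrees of freedom) for a table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Assumed significance level: upper 5% point of chi-square with 3 d.f.
CRITICAL_5PCT_DF3 = 7.815

stat, df = chi_square(list(observed.values()))
reject_independence = stat > CRITICAL_5PCT_DF3
print(f"chi-square = {stat:.2f} on {df} d.f.; reject independence: {reject_independence}")
```

For these counts the statistic is about 41.1 on 3 degrees of freedom, well above the assumed critical value, which agrees with the rejection of the null hypothesis noted above.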
When neither variable is qualitative, the relationship between two
variables can best be expressed by a single number, a measure of association.
The hypothesis of no relationship then reduces to a test of the hypothesis that
the measure of association is 0.
If both variables are continuous or discrete with many values, the product
moment or Pearson correlation coefficient may be used. If sample sizes are
large, a transformation of the coefficient will have an approximately normal
distribution. This may then be used to test the hypothesis that the correlation,
i.e. the linear relationship, between the two variables is 0. It is also possible
to test whether or not two samples come from populations with the same
correlation, for example, to test whether the correlation between search time
and number of retrieved references was the same for two different online
systems.
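The two tests just described can be sketched in a few lines. The text does not name the transformation; the sketch assumes it is Fisher's z transformation, z = atanh(r), whose standard error is approximately 1/sqrt(n - 3) for large samples. The function names and the sample values (r = 0.60 and r = 0.35, each from 103 searches) are hypothetical and serve only to illustrate the arithmetic.

```python
import math


def pearson_r(x, y):
    """Product moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_z(r):
    """Fisher's z transformation (assumed here; requires -1 < r < 1)."""
    return math.atanh(r)


def z_zero_correlation(r, n):
    """Approximate standard normal statistic for H0: correlation = 0."""
    return fisher_z(r) * math.sqrt(n - 3)


def z_equal_correlations(r1, n1, r2, n2):
    """Approximate standard normal statistic for H0: the two correlations are equal."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se


def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical correlations between search time and number of retrieved
# references for two online systems, 103 searches each.
z_diff = z_equal_correlations(0.60, 103, 0.35, 103)
print(f"z = {z_diff:.2f}, two-sided p = {two_sided_p(z_diff):.3f}")
```

In practice r would be computed from the paired observations with pearson_r before being passed to either test; the approximation is only trustworthy when the sample sizes are large, as the text notes.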
If one or both of the variables are measured on a scale which is ordinal or
better, then a rank correlation coefficient, either Kendall's tau or Spearman's
rho, may be used to measure association. Hypothesis tests similar to those for
the product moment coefficient may be applied. When the populations are
normal, the test of no correlation based on tau has a relative efficiency of
0.91 compared with the test based on the product moment correlation.
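Both rank coefficients are simple to compute directly. The sketch below is illustrative only: it computes Spearman's rho as the product moment correlation of the ranks (tied values receive average ranks) and Kendall's tau in its simplest form, tau-a, which makes no correction for ties. The function names and the five pairs of values are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: product moment correlation of the two sets of ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical data: five documents ranked by two judges.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
rho = spearman_rho(judge_a, judge_b)
tau = kendall_tau_a(judge_a, judge_b)
print(f"rho = {rho:.2f}, tau = {tau:.2f}")
```

For these rankings rho is 0.8 and tau is 0.6; the two coefficients are on different scales and are not expected to agree numerically. When many ties are present, a tie-corrected form such as tau-b would be the more usual choice.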
Prediction
Regression techniques are used to predict the value of a dependent variable
from other independent variables. In linear regression, the dependent
variable is expressed as a linear function of another variable, for example,
cost of a search as a function of number of retrieved documents. In multiple
linear regression, it is expressed as a linear function of several other variables,
for example search time as a function of number of search statements,
number of retrieved documents, and number of unique descriptors. In non-
linear regression, it is expressed as a non-linear function of one or more other