Information Retrieval Experiment (IRE)
Chapter: The pragmatics of information retrieval experimentation
Jean M. Tague
Butterworth & Company
Karen Sparck Jones

The usual method of determining whether or not two variables are related when at least one of the variables is non-quantitative is by means of the Chi square contingency table test. For example, a sample survey of personnel in an organization gave Table 5.7.

TABLE 5.7

Employment class      Used online retrieval systems
                      Yes    No
Manager                 5    28
Scientist/engineer     30     4
Technician             15    29
Clerical               10    19

A Chi square statistic calculated for the table would indicate whether system use was dependent on or independent of employment classification. (In fact, the null hypothesis of no relationship is rejected.)

When neither variable is qualitative, the relationship between two variables can best be expressed by a single number, a measure of association. The hypothesis of no relationship then reduces to a test of the hypothesis that the measure of association is 0.

If both variables are continuous or discrete with many values, the product moment or Pearson correlation coefficient may be used. If sample sizes are large, a transformation of the coefficient will have an approximately normal distribution. This may then be used to test the hypothesis that the correlation, i.e. the linear relationship, between the two variables is 0.
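As a minimal sketch of the two tests described above, the Python below computes the Chi square statistic for the counts in Table 5.7 and then tests a correlation against 0. It assumes that the transformation in question is Fisher's z, which the text does not name. The function names, the tabulated critical values, and the sample values r = 0.30 and n = 103 are illustrative assumptions, not figures from the survey.

```python
import math

# Observed counts from Table 5.7: (used online systems: yes, no)
table = {
    "Manager": (5, 28),
    "Scientist/engineer": (30, 4),
    "Technician": (15, 29),
    "Clerical": (10, 19),
}


def chi_square(rows):
    """Return (statistic, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            # Expected count if row and column classifications are independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return stat, df


def fisher_z_statistic(r, n):
    """Assumed transformation (Fisher's z): approximately standard normal
    when the population correlation is 0 and n is large."""
    return math.atanh(r) * math.sqrt(n - 3)


stat, df = chi_square(list(table.values()))
CRITICAL_CHI2 = 7.815  # tabulated Chi square value, 3 df, 5 per cent level
print(f"chi-square = {stat:.2f}, df = {df}, reject = {stat > CRITICAL_CHI2}")

# Hypothetical sample (not from the text): r = 0.30 observed on n = 103 pairs
z = fisher_z_statistic(0.30, 103)
CRITICAL_Z = 1.96  # two-sided 5 per cent point of the standard normal
print(f"z = {z:.2f}, reject = {abs(z) > CRITICAL_Z}")
```

With the Table 5.7 counts the statistic comes to about 41.1 on 3 degrees of freedom, well above the tabulated 5 per cent value, which agrees with the rejection of the null hypothesis reported for that table.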
It is also possible to test whether or not two samples come from populations with the same correlation, for example, to test whether the correlation between search time and number of retrieved references was the same for two different online systems.

If one or both of the variables are measured on a scale which is ordinal or better, then a rank correlation coefficient, either Kendall's tau or Spearman's rho, may be used to measure association. Hypothesis tests similar to those for the product moment coefficient may be applied. The relative efficiency of the test of no correlation using tau as opposed to the product moment correlation when populations are normal is 0.91.

Prediction

Regression techniques are used to predict the value of a dependent variable from other independent variables. In linear regression, the dependent variable is expressed as a linear function of another variable, for example, cost of a search as a function of number of retrieved documents. In multiple linear regression, it is expressed as a linear function of several other variables, for example search time as a function of number of search statements, number of retrieved documents, and number of unique descriptors. In non-linear regression, it is expressed as a non-linear function of one or more other