Information Retrieval Experiment
Chapter: The pragmatics of information retrieval experimentation
Jean M. Tague
Editor: Karen Sparck Jones
Publisher: Butterworth & Company
variables, for example vocabulary size as a logarithmic function of collection
size. Confidence intervals may be set up for predicted values; however, their
accuracy and reliability depend upon an assumption of at least approximate
normality.
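A fit of this kind can be sketched in a few lines of ordinary least squares. The observations, the prediction point, and the tabulated t value below are invented for illustration, not drawn from any study:

```python
import math

# Hypothetical (collection size, vocabulary size) observations; a real
# study would substitute its own measurements.
data = [(1_000, 1_200), (5_000, 3_100), (10_000, 4_500),
        (50_000, 8_200), (100_000, 10_400)]

# Fit vocabulary = a + b * log(collection size) by ordinary least squares.
xs = [math.log(size) for size, _ in data]
ys = [vocab for _, vocab in data]
n = len(data)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
a = ybar - b * xbar

# Residual variance, and a 95% confidence interval for the predicted mean
# at a new collection size (t value taken from a table, n - 2 = 3 d.f.).
s2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
x0 = math.log(250_000)
pred = a + b * x0
half = 3.182 * math.sqrt(s2 * (1 / n + (x0 - xbar) ** 2 / sxx))
print(f"predicted vocabulary: {pred:.0f} +/- {half:.0f}")
```

The width of the interval depends on the residual variance, so the normality caveat above applies to it directly.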
Although superficially like the preceding problem, forecasting future
values of some variable on the basis of past values is not really amenable to
regression techniques. This is because regression is based on the assumption
of independent observations. Time series, such as daily use of a system or
monthly recall/precision figures for an SDI profile, are obviously dependent
observations: one day's or month's value is related to previous ones. Time
series analysis, which consists of analysing a series in terms of trends,
periodic or seasonal components, and random fluctuations, is discussed in
detail in a number of monographs, for example Gilchrist26 and Box and
Jenkins27.
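As a rough illustration of these components, the following sketch separates a constant level, a periodic (seasonal) effect, and a residual random fluctuation. The series values and the period of 4 are invented, and trend estimation (for example by a centred moving average) is omitted for brevity:

```python
# Additive decomposition sketch: series = level + seasonal + residual.
# An invented series with a period-4 (quarterly) pattern.
period = 4
series = [10, 14, 8, 12, 13, 17, 11, 15, 16, 20, 14, 18]

level = sum(series) / len(series)

# Seasonal effect at each position in the period: the mean of the values
# at that position, minus the overall level.
seasonal = []
for pos in range(period):
    vals = series[pos::period]
    seasonal.append(sum(vals) / len(vals) - level)

# Whatever the level and seasonal component do not explain is treated
# as random fluctuation.
residual = [series[t] - level - seasonal[t % period]
            for t in range(len(series))]
print("seasonal effects:", seasonal)
```

Monographs such as those cited above develop this idea properly, including trend removal and models for the dependence in the residuals.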
Implementation
Finally, there is the question of the medium for data analysis. There are two
ways:
(1) Manual tabulations, possibly using hand calculators. This is convenient
in the sense that it can be done internally, but may not in the long run be
the least expensive method. It is necessary, of course, for the analysis of
non-formatted data. Manual tabulations have a very high probability of
error, so that, to be sure of results, all calculations must be verified. This
can be very tedious, particularly if results do not tally the first time.
(2) Computer-based statistical packages. The chance of error is much
reduced here, though, of course, data input must still be verified. The best
known statistical packages are SPSS (Statistical Package for the Social
Sciences), SAS (Statistical Analysis System), and BMD (Biomedical
Computer Programs), and it is probably best to use one of these if you are
carrying out a wide range of different types of analysis on the same data.
The actual tests available with these packages vary, to some extent, from
installation to installation. For example, some installations have non-
parametric tests not described in the SPSS Manual. A useful introduction
to the three packages listed above will be found in Moore28.
It is important, however, to understand the function of the different tests
in the packages. Their very comprehensiveness makes them susceptible to
misuse. Anyone contemplating the use of statistical packages should study
the manual carefully prior to data collection. Much time and expense at the
data analysis stage can be saved by collecting data in a form that is amenable
to entry into an SPSS or other package file. Basically, data is entered case by
case, each case consisting of several fields defining characteristics of the case.
Sometimes there is a problem in deciding what is a case. For example, in a
study of retrieval, is a case a searcher, a user, a query, a search, or a search
statement? It all depends on the purpose of the analysis. A case should be the
simplest, most atomic experimental unit to be examined in the study. If users
have several queries and queries consist of a sequence of search statements,
and if interest is in the effectiveness of various ways of structuring search
statements, then a case is a single search statement.
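A hypothetical layout for such case-by-field data, taking a single search statement as the case, might be written out as a delimited file ready for entry into a package. All field names and values below are invented for illustration:

```python
import csv
import io

# Each row is one case (a search statement); identifier fields tie it back
# to the user and query it belongs to, and the remaining fields are the
# characteristics to be analysed.
fields = ["user_id", "query_id", "statement_no",
          "n_terms", "operators", "postings_retrieved"]
cases = [
    ("U01", "Q01", 1, 3, "AND", 120),
    ("U01", "Q01", 2, 4, "AND OR", 85),
    ("U02", "Q02", 1, 2, "OR", 430),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(fields)
writer.writerows(cases)
print(buf.getvalue())
```

Because each row already carries the user and query identifiers, the same file can later be aggregated upward to queries or users without re-entering the data.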