IRE
Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
of effectiveness, somehow defined. The difference between experiment and
investigation is therefore that in experiment explicit comparative measurements
are required for different values of the test variables; in investigation,
comparison may be only implicit. Further, since retrieval systems have a
function, evaluation experiments relating specifically to performance
effectiveness, i.e. the ability of the system to retrieve relevant documents and
to suppress non-relevant ones, have a special status as the most important
kind of experiment.
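
The notion of performance effectiveness described above can be made concrete with a small sketch. It assumes the conventional recall and precision ratios as the measures; the text here speaks only of effectiveness "somehow defined", so the choice of measures, the function name `recall_precision` and the sample document identifiers are illustrative assumptions rather than anything prescribed by the chapter.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a single request.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents judged relevant to the request
    Both measures are taken as 0.0 when their denominator is empty
    (a convention assumed for this sketch).
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical request: four documents retrieved, three judged relevant.
recall, precision = recall_precision({1, 2, 3, 4}, {2, 4, 6})
print(recall, precision)
```

Recall reflects the ability to retrieve relevant documents, and precision the ability to suppress non-relevant ones.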
Unfortunately, the distinction between experiment and investigation just
summarized is an ideal which is very difficult to maintain when discussing
actual system tests. Much of the work done cannot be described as
unequivocally experimental or investigative, especially where studies of
operational systems are concerned. The problem is really that information
retrieval systems are so complicated, and so little understood, and there is
such a lack of solid theory about them, that really high class experiment can
hardly be expected. In a way a review of information retrieval experiment is
a review of the inadequacy of information retrieval experiment. The work
discussed in this chapter thus ranges from experiments proper to better
conducted and relatively systematic investigations. A particular problem is
that while both experiment and investigation can in principle refer to
operational system studies, in practice there have been few thoroughly
controlled operational system tests, and experiment and investigation
typically imply laboratory and operational environments respectively. There
are indeed, as is noted in other chapters, considerable difficulties about
conducting rigorous operational system experiments.
Within the area of information retrieval experiment we can then sort the
tests done according to the degree of control they involved, and according to
the type of hypothesis they invoked. Control is exhibited by comparison, and
the degree of control corresponds largely to the scope or level of the factor
being studied as the primary experimental variable, i.e. the variable on which
the experimenter's interest is focused. Thus at the highest level we may
compare whole indexing and searching subsystems within the fixed
environment represented by a certain body of users and of literature; at the
medium level we may compare different indexing thesauri; and at the lowest
level we may vary indexing exhaustivity using a given thesaurus. As long as
the environment parameters are held constant, all of these are comparisons
implying some degree of control, but in the case of whole indexing or
searching subsystems control will be minimal. The consequence is that any
observed differences (or similarities) in system performance will not be
explicable in any detail since the indexing and searching subsystem as a
whole subsumes many lower-level variables. The problem most severely felt
by research workers has been that of identifying useful, meaningful unit
variables in retrieval systems, i.e. those variables capable of affecting
performance for determinable reasons. A closely related problem is that of
managing secondary, related variables, since their identification and
manipulation are associated with the treatment of the primary variables. The
treatment of indexing exhaustivity and specificity in relation to an index
language is a good example of this problem.
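
A controlled comparison at the lowest level described above might be sketched as follows. The environment (documents, requests and relevance judgements) is held fixed, and only indexing exhaustivity is varied. Everything in the sketch is an assumption made for illustration: the toy data are invented, exhaustivity is modelled crudely as the number of terms assigned per document, and recall and precision are assumed as the performance measures.

```python
# Fixed environment: the same documents, requests and relevance judgements
# are used under every test condition. All data here are invented.
DOCS = {  # candidate index terms, listed most important first (an assumption)
    "d1": ["retrieval", "indexing", "thesaurus"],
    "d2": ["indexing", "retrieval", "classification"],
    "d3": ["chemistry", "polymers", "retrieval"],
}
QUERIES = {  # request id -> (request terms, ids of relevant documents)
    "q1": ({"thesaurus"}, {"d1"}),
    "q2": ({"retrieval"}, {"d1", "d2"}),
}


def build_index(exhaustivity):
    """Index each document by its first `exhaustivity` terms only."""
    return {doc_id: set(terms[:exhaustivity]) for doc_id, terms in DOCS.items()}


def search(index, query_terms):
    """Retrieve every document sharing at least one term with the request."""
    return {doc_id for doc_id, terms in index.items() if terms & query_terms}


def run_condition(exhaustivity):
    """Return (mean recall, mean precision) over the fixed request set."""
    index = build_index(exhaustivity)
    recalls, precisions = [], []
    for query_terms, relevant in QUERIES.values():
        retrieved = search(index, query_terms)
        hits = len(retrieved & relevant)
        recalls.append(hits / len(relevant))
        # Precision of an empty result is taken as 0.0 (assumed convention).
        precisions.append(hits / len(retrieved) if retrieved else 0.0)
    n = len(QUERIES)
    return sum(recalls) / n, sum(precisions) / n


# The primary variable takes two values; nothing else changes between runs.
for level in (1, 3):
    print(level, run_condition(level))
```

Because only exhaustivity differs between the two runs, a difference in the figures can be attributed to it. Replacing the whole of `build_index` and `search` at once would correspond to the highest-level comparison, where the cause of any difference is much harder to isolate.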
Similar points can be made about the hypotheses underlying information
retrieval experiment. Some hypotheses are rather general, for instance that