Information Retrieval Experiment
Edited by Karen Sparck Jones
Butterworth & Company

Gedanken experimentation: An alternative to traditional system testing?
William S. Cooper
matter be tried out in all possible combinations. There are logical and conceptual difficulties involved in choosing a measure of retrieval effectiveness from among the dozens that have been proposed, and though from the experimenter's point of view the problem can be sidestepped to some extent by the simple expedient of reporting all results in terms of several different measures, the nagging question of which measure to put faith in is not thereby answered but merely passed along from the experimenter to the ultimate decision-maker. This is only a partial listing of the hazards of experimentation of the traditional sort. It is hardly surprising, then, that on the rare occasions when the methodology used in a large-scale retrieval test has been subjected to careful independent scrutiny, the results have been far from reassuring (see especially Swanson's and Harter's critiques of the Cranfield experiments 1-3).
Thus there is a serious question whether full-scale retrieval experiments
are worth their high cost. True, it often seems possible to glean at least some
hint of a useful generality from reports of such experimentation. However,
the danger of mistaking an experimental artifact for a generalizable
conclusion is great, and the likelihood that a test result will eventually affect
the design of future information systems for the better is small. No single
obstacle seems insurmountable in itself, but in combination they are
formidable. I raise the question of whether traditional experimentation is
worth while here partly in order to provide a foil for the other chapter
writers, but partly too because it seems to me that `inspired tinkering' may be
an alternative path to retrieval progress which is both easier and likelier of
success.
11.2 Probability and utility theory in system design
The truism was mentioned earlier that a system should be designed to retrieve
those documents most likely to satisfy the user. This being so, a retrieval
system may be regarded as a device for estimating probabilities of
satisfaction, and possibly also degrees of satisfaction, the aim being to lead
the user to examine first those documents with a high probability of providing
a high degree of satisfaction. If there is any general underlying theory of
retrieval, then, it would appear that we must seek it in a theory of probability
and, in order to quantify the notion of `satisfaction', possibly also in the
theory of utility, or decision theory. The theory of retrieval per se may be thin, but
by regarding the retrieval problem as a problem in probability (and possibly
utility) estimation, we may at least endow it with the structure of a branch of
applied statistics.
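To make the suggestion concrete, consider a small worked illustration (the notation and the numbers here are mine, not the chapter's). Let p(d) denote the system's estimate of the probability that document d will satisfy the user, and u(d) its estimate of the degree of satisfaction d would provide. The design aim just described is then to present documents in decreasing order of expected utility,

\[
  E[U \mid d] \;=\; p(d)\, u(d).
\]

On this ordering a document with p = 0.9 but u = 0.3 (expected utility 0.27) is ranked below one with p = 0.5 and u = 0.8 (expected utility 0.40), whereas a ranking by probability of satisfaction alone would order the pair the other way.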
The probabilistic approach is of course implicit in all sensible system
designs insofar as it is expected that documents in the retrieved set are more
likely to satisfy the user than the rest, or in the case of ranked output that
documents higher in the ranking are more likely to satisfy than those of lower
rank. However, in most present-day systems there is no explicit computation
of numeric probability estimates. Rather, crude qualitative criteria are
applied which are thought to produce an approximate probability-ranking