Gedanken experimentation: An alternative to traditional system testing?

William S. Cooper

matter be tried out in all possible combinations. There are logical and conceptual difficulties involved in choosing a measure of retrieval effectiveness from among the dozens that have been proposed, and though from the experimenter's point of view the problem can be sidestepped to some extent by the simple expedient of reporting all results in terms of several different measures, the nagging question of which measure to put faith in is not thereby answered but merely passed along from the experimenter to the ultimate decision-maker.

This is only a partial listing of the hazards of experimentation of the traditional sort. It is hardly surprising, then, that on the rare occasions when the methodology used in a large-scale retrieval test has been subjected to careful independent scrutiny, the results have been far from reassuring (see especially Swanson's and Harter's critiques of the Cranfield experiments 1-3). Thus there is a serious question whether full-scale retrieval experiments are worth their high cost. True, it often seems possible to glean at least some hint of a useful generality from reports of such experimentation. However, the danger of mistaking an experimental artifact for a generalizable conclusion is great, and the likelihood that a test result will eventually affect the design of future information systems for the better is small. No single obstacle seems insurmountable in itself, but in combination they are formidable.

I raise the question of whether traditional experimentation is worth while here partly in order to provide a foil for the other chapter writers, but partly too because it seems to me that 'inspired tinkering' may be an alternative path to retrieval progress which is both easier and likelier of success.

11.2 Probability and utility theory in system design

The truism was mentioned earlier that a system should be designed to retrieve those documents most likely to satisfy the user. This being so, a retrieval system may be regarded as a device for estimating probabilities of satisfaction, and possibly also degrees of satisfaction, the aim being to lead the user to examine first those documents with a high probability of providing a high degree of satisfaction. If there is any general underlying theory of retrieval, then, it would appear that we must seek it in a theory of probability, and, in order to quantify the notion of 'satisfaction', possibly also in the theory of utility, or decision theory. The theory of retrieval per se may be thin, but by regarding the retrieval problem as a problem in probability (and possibly utility) estimation, we may at least endow it with the structure of a branch of applied statistics.
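To fix ideas, the ranking rule implicit in this view can be written out explicitly; the notation below is mine, not the chapter's. Writing u(s) for the utility of an outcome s, and P(s | d, q) for the system's estimate of the probability that examining document d yields outcome s for query q, documents should be presented in decreasing order of expected utility:

```latex
% Illustrative formalization (my notation, not the chapter's):
% present documents d in decreasing order of expected utility
EU(d \mid q) = \sum_{s} P(s \mid d, q)\, u(s)
% With a binary notion of satisfaction, u(\mathrm{satisfied}) = 1 and
% u(\mathrm{not\ satisfied}) = 0, this reduces to ranking by the
% probability of satisfaction alone:
EU(d \mid q) = P(\mathrm{satisfied} \mid d, q)
```

In the binary special case the rule collapses into the truism quoted above: retrieve first those documents most likely to satisfy the user.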
The probabilistic approach is of course implicit in all sensible system designs, insofar as it is expected that documents in the retrieved set are more likely to satisfy the user than the rest, or, in the case of ranked output, that documents higher in the ranking are more likely to satisfy than those of lower rank. However, in most present-day systems there is no explicit computation of numeric probability estimates. Rather, crude qualitative criteria are applied which are thought to produce an approximate probability ranking.
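A small sketch may make the contrast concrete. The following Python fragment is an illustration of the general idea only, not a description of any system discussed in the chapter; the estimator, the term weights, and the toy data are all invented for the example. It ranks documents by an explicitly computed numeric probability estimate rather than by a qualitative criterion such as membership in a Boolean retrieved set:

```python
import math

def estimate_satisfaction_probability(query_terms, doc_terms, term_weights,
                                       bias=-2.0):
    """Crude illustrative estimator: treat the log-odds of user satisfaction
    as a weighted count of the query terms the document matches, then map
    the log-odds to a probability with the logistic function. The weights
    and bias are hypothetical, not drawn from any real system."""
    log_odds = bias + sum(term_weights.get(t, 0.0)
                          for t in query_terms if t in doc_terms)
    return 1.0 / (1.0 + math.exp(-log_odds))

def probability_ranking(query_terms, documents, term_weights):
    """Return (probability, doc_id) pairs in decreasing order of the
    estimated probability of satisfaction -- an explicit numeric analogue
    of ranked output."""
    scored = [(estimate_satisfaction_probability(query_terms, terms,
                                                 term_weights), doc_id)
              for doc_id, terms in documents.items()]
    return sorted(scored, reverse=True)

# Invented toy data: three documents and per-term evidence weights.
docs = {'d1': {'retrieval', 'probability'},
        'd2': {'retrieval'},
        'd3': {'utility'}}
weights = {'retrieval': 1.5, 'probability': 2.0, 'utility': 1.0}
print(probability_ranking({'retrieval', 'probability'}, docs, weights))
# -> d1 (~0.82), then d2 (~0.38), then d3 (~0.12)
```

In practice the crude qualitative criteria of present-day systems may induce much the same ordering; the point of the explicit formulation is that the probability estimates themselves become quantities that can be examined, compared, and improved.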