effect. And when a number of qualitative clues have been combined in order to determine the rank of each document, it is sometimes far from clear that this effect is actually obtained. As a simple example, suppose that in a system accepting weighted requests document A is indexed with one term appearing in the request with weight 0.6, while document B is indexed with two terms each appearing in that same request with weight 0.3. Under a retrieval rule commonly used in such systems (the 'vector product' rule), the two documents would be given equal priority in the output ranking. Yet it is not at all clear that the two have an equal probability of satisfying the user, for the request weights do not necessarily represent estimates of probabilities or functions of probabilities, nor does the retrieval rule constitute an appropriate probabilistic computation. In a system based more firmly on probability theory (or utility theory) this problem would be alleviated.

Let us call an information retrieval system explicitly probabilistic if it has the following characteristics:

(1) all numeric system parameters, including any request term weights, index term weights, constants used in the ranking algorithm, numbers used as linkage strength indicators in a thesaurus, etc., have clear probabilistic interpretations as estimates of values of algebraic expressions within the standard probability calculus;

(2) all 'binary' system parameters, e.g. the assignment or non-assignment of an index term to a document in a system with unweighted indexing, have clear interpretations as judgements of whether the value of one probabilistic expression exceeds that of another;

(3) the retrieval rule is essentially to rank the documents of the collection for the user in order of decreasing estimated probability of satisfaction to him, where the probability estimates in question are calculated from the various numeric and binary parameters already mentioned, possibly with the aid of appropriate probabilistic independence assumptions, and all on the basis of formulae derivable within the probability calculus.

(An explicitly utility-theoretic information retrieval system would have a slightly more general definition admitting terms for expected utilities as well as probabilities.)

So far as I am aware no explicitly probabilistic (or explicitly utility-theoretic) systems have ever been put into operation and exploited as such, though there is by now a growing literature bearing on various aspects of how such systems might be designed [4-8]. An explicitly probabilistic system is presently being programmed for experimental and demonstration purposes at the School of Library and Information Studies, University of California, Berkeley.
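To make the weighted-request example above concrete, here is a minimal sketch in Python. The 'vector product' score is computed as in the text (binary indexing, so the score is simply the sum of the request weights of the matching terms). The probabilistic alternative beside it rests on an interpretation assumed purely for illustration: each request weight is read as the probability that a matching term alone satisfies the user, and the terms are treated as probabilistically independent. The term names and the noisy-OR combination are hypothetical; as the text stresses, ordinary request weights carry no guarantee of any such interpretation.

    # Sketch contrasting the 'vector product' rule with one hypothetical
    # probabilistic reading of the same request weights. The noisy-OR
    # interpretation is an illustrative assumption, not an established rule.

    def vector_product_score(request_weights, doc_terms):
        """Binary indexing: the score is the sum of the request weights
        of the terms assigned to the document."""
        return sum(w for t, w in request_weights.items() if t in doc_terms)

    def noisy_or_score(request_weights, doc_terms):
        """Assumed reading: each weight is the probability that a matching
        term alone satisfies the user; terms are taken as independent, so
        the document fails to satisfy only if every matching term fails."""
        p_fail = 1.0
        for t, w in request_weights.items():
            if t in doc_terms:
                p_fail *= 1.0 - w
        return 1.0 - p_fail

    request = {'alpha': 0.6, 'beta': 0.3, 'gamma': 0.3}  # hypothetical terms
    doc_a = {'alpha'}          # one matching term of weight 0.6
    doc_b = {'beta', 'gamma'}  # two matching terms of weight 0.3 each

    print(vector_product_score(request, doc_a))  # 0.6
    print(vector_product_score(request, doc_b))  # 0.6  -- tied with A
    print(noisy_or_score(request, doc_a))        # ~0.6
    print(noisy_or_score(request, doc_b))        # ~0.51 -- now ranked below A

Under the vector product rule the two documents tie at 0.6, while under the assumed probabilistic reading document B's estimated probability of satisfaction is 1 - (0.7)(0.7) = 0.51 and it falls below document A. Whether that ordering is the right one depends entirely on whether the weights really bear the assumed interpretation, which is precisely the point of the example.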
The significance of explicitly probabilistic systems for us here is that, since the system parameters have clear probabilistic (or utility-theoretic) interpretations, the task of estimating them becomes susceptible to techniques of gedanken experimentation. The fact that the retrieval rule is based on the probability calculus guarantees that these parameter estimates will be exploited to produce the best output ranking obtainable from the data available to the system. True, the output ranking will be no better than the input estimates, but neither will it be any worse. It follows that by comparing parameter estimation methods directly, it may be possible to replace the comparative testing of whole systems with restricted data-gathering aimed at the question of how good the estimates are. In cases where one method of estimation is obviously more accurate than another, the need for experimental comparison is eliminated entirely.
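The kind of restricted data-gathering envisaged here can also be pictured concretely. In the following sketch (all figures invented for illustration) two rival estimation methods supply probability-of-satisfaction estimates for the same document-request pairs, and each method is scored directly against the users' recorded satisfaction judgements by mean squared error, with no need to build and comparatively test two complete retrieval systems around the rival methods.

    # Sketch of comparing two parameter-estimation methods directly
    # against user judgements instead of testing whole systems.
    # All figures are invented for illustration.

    def mean_squared_error(estimates, outcomes):
        """Average squared gap between estimated probabilities of
        satisfaction and observed 0/1 satisfaction judgements."""
        return sum((e - o) ** 2 for e, o in zip(estimates, outcomes)) / len(outcomes)

    # Recorded judgements: 1 means the user was satisfied by the document.
    outcomes = [1, 0, 1, 1, 0, 0, 1, 0]

    # Probability estimates for the same document-request pairs, as
    # produced by two hypothetical estimation methods.
    method_1 = [0.8, 0.2, 0.7, 0.9, 0.3, 0.1, 0.6, 0.4]
    method_2 = [0.6, 0.5, 0.5, 0.6, 0.5, 0.4, 0.5, 0.5]

    print(mean_squared_error(method_1, outcomes))  # ~0.075 -- closer estimates
    print(mean_squared_error(method_2, outcomes))  # ~0.216 -- cruder estimates

On such evidence the first method would be preferred outright, and in the terms of the passage above the need for an experimental comparison of two whole systems would be eliminated.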