Information Retrieval Experiment
Edited by Karen Sparck Jones
Butterworth & Company

Gedanken experimentation: An alternative to traditional system testing?
William S. Cooper
effect. And when a number of qualitative clues have been combined in order
to determine the rank of each document, it is sometimes far from clear that
this effect is actually obtained. As a simple example, suppose that in a system
accepting weighted requests document A is indexed with one term appearing
in the request with weight 0.6, while document B is indexed with two terms
each appearing in that same request with weight 0.3. Under a retrieval rule
commonly used in such systems (the 'vector product' rule), the two documents
would be given equal priority in the output ranking. Yet it is not at all clear
that the two have an equal probability of satisfying the user, for the request
weights do not necessarily represent estimates of probabilities or functions of
probabilities, nor does the retrieval rule constitute an appropriate
probabilistic computation.
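
To make the example's arithmetic concrete, the following is a minimal sketch of the 'vector product' scoring just described, written in Python; the weights are those of the example above, and the probabilistic reading at the end is purely illustrative, since (as just noted) the rule itself performs no such computation.

```python
# Minimal sketch of the 'vector product' rule with binary indexing:
# a document's score is the sum of the request weights of the terms
# that index it. The weights below are the example's assumed values.

def vector_product_score(request_weights, doc_terms):
    return sum(w for term, w in request_weights.items() if term in doc_terms)

request = {'t1': 0.6, 't2': 0.3, 't3': 0.3}
doc_a = {'t1'}        # one matching term of weight 0.6
doc_b = {'t2', 't3'}  # two matching terms of weight 0.3 each

print(vector_product_score(request, doc_a))  # 0.6
print(vector_product_score(request, doc_b))  # 0.6 -- equal output priority

# If, purely for illustration, each weight were read as an independent
# probability that the matching term satisfies the user, document B's
# probability of satisfying would not equal its score:
p_b = 1 - (1 - 0.3) * (1 - 0.3)
print(p_b)  # approximately 0.51, not 0.6
```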
In a system based more firmly on probability theory (or utility theory) this
problem would be alleviated. Let us call an information retrieval system
explicitly probabilistic if it has the following characteristics: (1) all numeric
system parameters, including any request term weights, index term weights,
constants used in the ranking algorithm, numbers used as linkage strength
indicators in a thesaurus, etc., have clear probabilistic interpretations as
estimates of values of algebraic expressions within the standard probability
calculus; (2) all 'binary' system parameters, e.g. the assignment or non-
assignment of an index term to a document in a system with unweighted
indexing, have clear interpretations as judgements of whether the value of
one probabilistic expression exceeds that of another; (3) the retrieval rule is
essentially to rank the documents of the collection for the user in order of
decreasing estimated probability of satisfaction to him, where the probability
estimates in question are calculated from the various numeric and binary
parameters already mentioned, possibly with the aid of appropriate
probabilistic independence assumptions, and all on the basis of formulae
derivable within the probability calculus. (An explicitly utility-theoretic
information retrieval system would have a slightly more general definition
admitting terms for expected utilities as well as probabilities.) So far as I am
aware no explicitly probabilistic (or explicitly utility-theoretic) systems have
ever been put into operation and exploited as such, though there is by now a
growing literature bearing on various aspects of how such systems might be
designed [4-8]. An explicitly probabilistic system is presently being
programmed for experimental and demonstration purposes at the School of
Library and Information Studies, University of California, Berkeley.
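
As an illustration of characteristic (3), the sketch below ranks a toy collection in order of decreasing estimated probability of satisfaction. The independence assumption used to combine the per-term estimates, and the estimates themselves, are assumptions made for the sake of the example; the definition admits any combining formula derivable within the probability calculus.

```python
# Sketch of an explicitly probabilistic retrieval rule: every parameter is
# a probability estimate, and documents are ranked by decreasing estimated
# probability of satisfying the user.

def satisfaction_prob(term_probs, doc_terms):
    # Combine per-term estimates under an independence assumption:
    # P(satisfied) = 1 - product over matching terms of (1 - p_term).
    p_fail = 1.0
    for term in doc_terms:
        p_fail *= 1.0 - term_probs.get(term, 0.0)
    return 1.0 - p_fail

def rank(collection, term_probs):
    # The output ranking required by characteristic (3).
    return sorted(collection,
                  key=lambda d: satisfaction_prob(term_probs, collection[d]),
                  reverse=True)

term_probs = {'t1': 0.6, 't2': 0.3, 't3': 0.3}  # hypothetical estimates
collection = {'A': {'t1'}, 'B': {'t2', 't3'}}
print(rank(collection, term_probs))  # ['A', 'B']: 0.60 versus 0.51
```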
The significance of explicitly probabilistic systems for us here is that, since
the system parameters have clear probabilistic (or utility-theoretic) inter-
pretations, the task of estimating them becomes susceptible to techniques of
gedanken experimentation. The fact that the retrieval rule is based on the
probability calculus guarantees that these parameter estimates will be
exploited to produce the best output ranking obtainable from the data
available to the system. True, the output ranking will be no better than the
input estimates, but neither will it be any worse. From this it may be seen
that, by comparing parameter estimation methods directly, it may be possible
to replace the comparative testing of whole systems with restricted
data-gathering aimed at the question of how good the estimates are. In cases where
one method of estimation is obviously more accurate than another, the need
for experimental comparison is eliminated entirely.
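
The following sketch suggests, on simulated data, what such restricted data-gathering might look like: two hypothetical estimation methods are compared directly on the accuracy of the probability estimates they produce, without any retrieval test. The 'true' parameter values and both estimators are invented for illustration.

```python
# Comparing two parameter-estimation methods directly, rather than testing
# two whole retrieval systems. All data here are simulated assumptions.
import random

random.seed(0)
true_p = [random.random() for _ in range(1000)]  # hypothetical true values

def sample_estimate(p, n):
    # Estimate a probability p from n simulated binary observations.
    return sum(random.random() < p for _ in range(n)) / n

est_small = [sample_estimate(p, 5) for p in true_p]   # method 1: 5 observations
est_large = [sample_estimate(p, 50) for p in true_p]  # method 2: 50 observations

def mean_abs_error(estimates):
    return sum(abs(e - p) for e, p in zip(estimates, true_p)) / len(true_p)

print(mean_abs_error(est_small))  # larger error
print(mean_abs_error(est_large))  # smaller error: the more accurate method

# Because an explicitly probabilistic retrieval rule makes the best use of
# whatever estimates it is given, the method with the smaller estimation
# error can be preferred without a comparative test of whole systems.
```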