Information Retrieval Experiment
Edited by Karen Sparck Jones (Butterworth & Company)

Gedanken experimentation: An alternative to traditional system testing?
William S. Cooper
D: Presence of clue would make it slightly more likely that the document is satisfactory. A positive clue, but of the weakest sort.

E: Presence of clue would make it a little less likely that the document is useful than if nothing were known about it. A mildly negative correlate of usefulness.

F: Presence of clue would make the document a much less likely candidate. A strong indicator of uselessness.
If a finer scale of judgements were desired, these 'grades' could of course be
refined by the usual addition of plus and minus signs. Use of the scale would
make the request in the previous example look something like this:
IRON B, MANUFACTURING C+, POLLUTION C-
Any unweighted terms in a request would be treated by the system as though
they had been assigned weight B.
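As a concrete illustration of the request format, here is a minimal Python sketch; the function name and the parsing details are invented for illustration, not taken from the chapter:

    # Hypothetical sketch: parse a graded request such as
    # "IRON B, MANUFACTURING C+, POLLUTION C-". Terms supplied
    # without a grade default to B, as described above.
    DEFAULT_GRADE = "B"

    def parse_graded_request(request):
        """Return (term, grade) pairs from a graded request string."""
        pairs = []
        for part in request.split(","):
            tokens = part.split()
            term = tokens[0]
            grade = tokens[1] if len(tokens) > 1 else DEFAULT_GRADE
            pairs.append((term, grade))
        return pairs

    print(parse_graded_request("IRON B, MANUFACTURING C+, POLLUTION"))
    # [('IRON', 'B'), ('MANUFACTURING', 'C+'), ('POLLUTION', 'B')]

Note that the unweighted term POLLUTION receives the default grade B.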
The qualitative weights must of course be translated by the system into
numeric weights if they are to be manipulated probabilistically, and the rules
of translation must be supplied to the program at the time the system is put
into operation by the system designer, manager, or other analyst. It would be
the responsibility of this analyst to supply for each grade on the scale a
numeric value judged to be typical of the probability change factor that a
user applying that grade would supply if only he had the time and
understanding to do the required gedanken experimentation. The translation
data supplied by the analyst might be, say, A: 200; B: 50; C: 10; D: 2; E: 0.5;
F: 0.02. Such a table would allow any graded request to be transformed
immediately by the system into a numerically weighted request, after which
retrieval could proceed as in the previous example.
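To make the translation step concrete, here is a minimal Python sketch built on that table. The chapter does not say how the plus and minus refinements are to be valued, so taking the geometric mean of the two adjacent base-grade factors is purely an assumption made here for illustration, as are the function names:

    import math

    # The analyst's translation table from the text: a probability
    # change factor for each base grade.
    BASE_FACTORS = {"A": 200.0, "B": 50.0, "C": 10.0,
                    "D": 2.0, "E": 0.5, "F": 0.02}
    GRADE_ORDER = "ABCDEF"

    def factor_for(grade):
        """Translate a grade such as 'B', 'C+', or 'C-' into a number.

        Assumption: a '+' or '-' refinement is valued as the geometric
        mean of the grade's own factor and that of the adjacent grade.
        """
        base, sign = grade[0], grade[1:]
        i = GRADE_ORDER.index(base)
        if sign == "+" and i > 0:
            return math.sqrt(BASE_FACTORS[base] * BASE_FACTORS[GRADE_ORDER[i - 1]])
        if sign == "-" and i < len(GRADE_ORDER) - 1:
            return math.sqrt(BASE_FACTORS[base] * BASE_FACTORS[GRADE_ORDER[i + 1]])
        return BASE_FACTORS[base]

    request = [("IRON", "B"), ("MANUFACTURING", "C+"), ("POLLUTION", "C-")]
    print([(term, round(factor_for(grade), 2)) for term, grade in request])
    # [('IRON', 50.0), ('MANUFACTURING', 22.36), ('POLLUTION', 4.47)]

The numerically weighted request could then be processed as before, presumably by combining, multiplicatively, each document's prior probability of usefulness with the factors of the request clues it bears.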
How is the translation table to be arrived at? The simplest option is for the
analyst to play the role of user for a few typical requests, perform the
necessary gedanken experiments to translate the grades into numbers, and
note for each grade the typical numeric weight range he finds himself
translating it into. However, since the translation table need only be
constructed once, there is also a possibility of some limited 'real'
experimentation at this stage. That is, the analyst might actually gather
enough data to provide a crude empirical estimate of the probability change
factors experienced for a sampling of graded clues. The data-gathering would
consist in estimating for each clue in the sample the proportion of useful
documents in the subset of the collection bearing that clue as opposed to the
proportion of useful documents in the collection as a whole, and computing
the actual probability change factor from this data.
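In other words, the estimated factor for a clue is the ratio P(useful | clue) / P(useful). A minimal sketch of the computation, with invented counts purely for illustration:

    def probability_change_factor(n_clue, n_clue_useful, n_total, n_useful):
        """Estimated factor = P(useful | clue) / P(useful): the proportion
        of useful documents among those bearing the clue, divided by the
        proportion of useful documents in the collection as a whole."""
        return (n_clue_useful / n_clue) / (n_useful / n_total)

    # Invented counts: in a 10,000-document collection, 200 documents
    # are useful; of the 150 documents bearing the clue, 30 are useful.
    print(probability_change_factor(150, 30, 10_000, 200))
    # (30/150) / (200/10000) = 0.2 / 0.02 = 10.0

On the translation table above, a factor of about 10 would correspond to grade C.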
This sort of data-gathering would, alas, resurrect some of the difficulties
inherent in classical experimentation. The need to establish an empirical
criterion of relevance or usefulness and to apply it to many documents would
be chief among these, and may often be a sufficient obstacle in itself to
discourage the effort. However, it is important to note that many of the worst
difficulties of classical experimentation simply do not arise in the limited,
focused kind of data-gathering envisioned here. For instance, since there is
no comparison of retrieval performances, the problem of choosing a measure
of retrieval effectiveness is avoided. In fact, it may be misleading even to