D: Presence of clue would make it slightly more likely that the document is satisfactory. A positive clue, but of the weakest sort.

E: Presence of clue would make it a little less likely that the document is useful than if nothing were known about it. A mildly negative correlate of usefulness.

F: Presence of clue would make the document a much less likely candidate. A strong indicator of uselessness.

If a finer scale of judgements were desired, these 'grades' could of course be refined by the usual addition of plus and minus signs. Use of the scale would make the request in the previous example look something like this:

IRON B, MANUFACTURING C+, POLLUTION C-

Any unweighted terms in a request would be treated by the system as though they had been assigned weight B.

The qualitative weights must of course be translated by the system into numeric weights if they are to be manipulated probabilistically, and the rules of translation must be supplied to the program at the time the system is put into operation by the system designer, manager, or other analyst. It would be the responsibility of this analyst to assign to each grade on the scale a numeric value judged to be typical of the probability change factor that a user applying that grade would supply if only he had the time and understanding to do the required gedanken experimentation. The translation data supplied by the analyst might be, say, A: 200; B: 50; C: 10; D: 2; E: 0.5; F: 0.02. Such a table would allow any graded request to be transformed immediately by the system into a numerically weighted request, after which retrieval could proceed as in the previous example.

How is the translation table to be arrived at? The simplest option is for the analyst to play the role of user for a few typical requests, perform the necessary gedanken experiments to translate the grades into numbers, and note for each grade the typical numeric weight range he finds himself translating it into. However, since the translation table need only be constructed once, there is also the possibility of some limited 'real' experimentation at this stage. That is, the analyst might actually gather enough data to provide a crude empirical estimate of the probability change factors experienced for a sampling of graded clues. The data-gathering would consist in estimating, for each clue in the sample, the proportion of useful documents in the subset of the collection bearing that clue as opposed to the proportion of useful documents in the collection as a whole, and computing the actual probability change factor as the ratio of these two proportions. This sort of data-gathering would, alas, resurrect some of the difficulties inherent in classical experimentation. Chief among these would be the need to establish an empirical criterion of relevance or usefulness and to apply it to many documents, which may often be a sufficient obstacle in itself to discourage the effort.
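To make the mechanics concrete, the translation step and the crude empirical estimate just described are sketched below in Python. The sketch is illustrative only and is not part of the original discussion: the table values are the illustrative ones given above, the request is the one from the text, the geometric interpolation of plus and minus signs is an assumption (the chapter leaves their numeric treatment to the analyst), and the sample counts in the usage example are hypothetical.

    # Analyst-supplied translation table (the illustrative values above):
    # grade -> probability change factor.
    TRANSLATION = {'A': 200.0, 'B': 50.0, 'C': 10.0, 'D': 2.0, 'E': 0.5, 'F': 0.02}
    ORDER = 'ABCDEF'

    def factor(grade):
        """Translate a qualitative grade such as 'B', 'C+' or 'C-' into a
        numeric probability change factor. Plus and minus refinements are
        interpolated geometrically between adjacent grades; this rule is an
        assumption, not something the chapter prescribes."""
        i = ORDER.index(grade[0])
        f = TRANSLATION[grade[0]]
        if grade.endswith('+') and i > 0:
            f = (f * TRANSLATION[ORDER[i - 1]]) ** 0.5  # shift toward the stronger grade
        elif grade.endswith('-') and i < len(ORDER) - 1:
            f = (f * TRANSLATION[ORDER[i + 1]]) ** 0.5  # shift toward the weaker grade
        return f

    # The graded request from the text, translated term by term into the
    # numerically weighted form on which retrieval would then proceed.
    request = {'IRON': 'B', 'MANUFACTURING': 'C+', 'POLLUTION': 'C-'}
    weights = {term: factor(g) for term, g in request.items()}
    # weights == {'IRON': 50.0, 'MANUFACTURING': 22.36..., 'POLLUTION': 4.47...}

    def empirical_factor(n_docs, n_useful, n_with_clue, n_with_clue_useful):
        """Crude empirical estimate of a clue's probability change factor:
        the proportion of useful documents among those bearing the clue,
        divided by the proportion useful in the collection as a whole."""
        return (n_with_clue_useful / n_with_clue) / (n_useful / n_docs)

    # Hypothetical sample: 10,000 documents of which 100 are useful; 50
    # bear the clue, 20 of them useful. Estimated change factor: 40.0.
    print(empirical_factor(10000, 100, 50, 20))

The geometric rule is only one way of handling the refined grades; the analyst could equally well enter an explicit numeric value for each of them in the translation table.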
However, it is important to note that many of the worst difficulties of classical experimentation simply do not arise in the limited, focused kind of data-gathering envisioned here. For instance, since there is no comparison of retrieval performances, the problem of choosing a measure of retrieval effectiveness is avoided. In fact, it may be misleading even to