… a major part in creating the archetype as I have described it. Clearly they lie at opposite poles of the operational-laboratory spectrum, Cranfield 2 being a highly controlled and artificial experiment, and Medlars being an investigation (in the sense defined in the editor's introduction) of an operational system, as far as possible under realistic conditions. However, between them they illustrate well the main characteristics of the archetype. In particular, Cranfield illustrates the necessity for having a complete system, even if only part of it is under test. Both tests used genuine documents; Medlars used genuine queries and Cranfield artificial (or reconstructed) ones. Cranfield used an experimental design involving replicated searches; Medlars could not. Both tests used relevance judgements by the requester; in both cases this precluded exhaustive scanning of the collection, though for Cranfield one might assume that the relevance sets are almost complete; and so on.

The two experiments described below are on a smaller scale, with more limited objectives (each, in fact, forming part of a PhD project).

Oddy: Thomas

R. N. Oddy developed a program for computer searching, called Thomas, with a strong interactive facility. The basic idea was that the system should build up, from its dialogue with the user, an internal image of the user's need. Oddy conducted a test of the program, designed to establish its feasibility and some approximate idea of its quality, rather than to measure its performance in any very refined sense.

For the document collection, a selection of 225 references (complete with indexing) was made from the Medlars data base; for the queries, 32 searches resulting from genuine requests put to the Medusa system were used. Since relevance judgements on the output of Medusa searches were obtained as a matter of course by the system, these were available to Oddy for the test of Thomas.

The test itself involved simulating a user interaction with the system, on the basis of all the information available to Oddy (statement of the request, record of the search process on Medusa, and relevance judgements on the output). Clearly this information is incomplete, in respect of both the search process that the user might have followed with Thomas and the relevance judgements (the relevance judgements affect the search process as well as the evaluation of the results). Also, the very limited selection (not sample) of documents makes generalizing from such an experiment even more dangerous than usual. However, in the context of the limited aims of the test, Oddy's methods are appropriate. His insistence on using genuine requests and relevance judgements, while remaining unconcerned with the artificiality of other aspects of the test, is strictly in keeping with the philosophy of Thomas, and seems eminently reasonable in the circumstances.
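The text does not give the internals of Thomas, only the idea that the system revises an internal image of the user's need from the dialogue. As a minimal sketch of that kind of interactive loop, assuming a crude term-weight representation and a simulated judge standing in for the user, the following illustration may help; the names, weighting scheme, and dialogue structure are hypothetical and are not Oddy's actual design.

```python
# Illustrative sketch only: an interactive search loop in which the system
# keeps an internal "image" of the user's need (here, simple term weights)
# and revises it after each relevance judgement. Not Oddy's Thomas program.

from collections import Counter

documents = {
    "d1": "indexing of medical literature",
    "d2": "interactive retrieval dialogue systems",
    "d3": "probabilistic models of indexing",
}

def score(image, text):
    """Score a document against the current image of the need."""
    terms = Counter(text.split())
    return sum(weight * terms[term] for term, weight in image.items())

def search_session(initial_terms, judge, rounds=3):
    """Simulate a dialogue: show the best unseen document, obtain a
    relevance judgement, and adjust the image of the need accordingly."""
    image = Counter(initial_terms)      # internal image of the user's need
    seen = set()
    for _ in range(rounds):
        ranked = sorted(
            (d for d in documents if d not in seen),
            key=lambda d: score(image, documents[d]),
            reverse=True,
        )
        if not ranked:
            break
        doc = ranked[0]
        seen.add(doc)
        relevant = judge(doc)           # the user's (here simulated) judgement
        for term in documents[doc].split():
            image[term] += 1 if relevant else -1
    return seen

# Example: a simulated user who accepts any document mentioning "indexing".
shown = search_session(["indexing"], judge=lambda d: "indexing" in documents[d])
print(shown)
```

The point of the sketch is simply the feedback loop: each judgement alters the system's representation of the need, which in turn alters what is shown next, which is why Oddy's simulated interactions (with fixed, pre-existing relevance judgements) could only approximate a live user's session.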
Harter: probabilistic indexing

S. P. Harter has developed a theory which can be used to derive rules for automatic extraction indexing. In order to subject the theory to test, Harter performed an experiment comparing the indexing derived automatically by