means of these rules to human-assigned indexing, and to a simpler automatic method. The comparison was intended to be indicative of the possible quality of the technique, rather than a definitive test of the theory. Because of the scale of the project, and because of the difficulty of setting up a test of a complete information storage and retrieval system using different kinds of indexing, Harter decided to restrict the experiment to the indexing stage only. He therefore required a collection of documents, indexed by a human indexer, but no actual requests. The document collection also had to be in some sense realistic (as a collection, not just as individual documents), and the documents had to be in continuous text form. Harter chose an existing collection of 650 abstracts of the work of Sigmund Freud. The whole collection was used for the statistical analysis of term occurrence on which the automatic indexing rule was based, and the actual comparison of index terms was carried out on a random sample of 38 of these abstracts.

For the purposes of the experiment, the human-assigned indexing was regarded as the norm, and the object of the two automatic methods was assumed to be to duplicate this norm as far as possible. Thus the rationale of the experiment depends heavily on the assumption that the human-assigned indexing is 'good'. Indeed, one might regard this procedure, in terms of the archetype, as using the set of human-assigned index terms as artificial single-term queries, and the human assignments as relevance judgements. Harter's use of a genuine collection of documents but highly artificial queries is justified by the aims and circumstances of the test.

So Harter and Oddy each chose to make certain aspects of their respective experiments as realistic as possible, but to allow artificiality in others, in effect selecting from the archetype in a manner appropriate to their objectives and resources.
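Harter's framing can be made concrete. If each human-assigned index term is treated as a single-term query, and a document counts as relevant to that query exactly when the human indexer assigned the term to it, then an automatic indexing method can be scored by the recall and precision of its term assignments against the human norm. The sketch below is purely illustrative: the document identifiers, term names and micro-averaged scoring are assumptions made for the example, not a reconstruction of Harter's actual procedure.

```python
# Illustrative sketch only (not Harter's actual method): score automatic
# indexing against human-assigned indexing treated as the norm.
# Document identifiers and index terms below are invented for the example.

from typing import Dict, Set, Tuple

# Human-assigned index terms per document: the assumed "norm".
human: Dict[str, Set[str]] = {
    "doc1": {"dreams", "repression"},
    "doc2": {"transference"},
    "doc3": {"dreams", "anxiety"},
}

# Terms assigned to the same documents by some automatic indexing rule.
automatic: Dict[str, Set[str]] = {
    "doc1": {"dreams"},
    "doc2": {"transference", "dreams"},
    "doc3": {"anxiety"},
}

def norm_agreement(human: Dict[str, Set[str]],
                   automatic: Dict[str, Set[str]]) -> Tuple[float, float]:
    """Micro-averaged recall and precision of the automatic assignments,
    treating each human (document, term) pair as the sole relevant answer
    to the corresponding single-term query."""
    relevant = sum(len(terms) for terms in human.values())      # human assignments
    assigned = sum(len(terms) for terms in automatic.values())  # automatic assignments
    agreed = sum(len(automatic[d] & human[d]) for d in human)   # assignments in common
    recall = agreed / relevant if relevant else 0.0
    precision = agreed / assigned if assigned else 0.0
    return recall, precision

recall, precision = norm_agreement(human, automatic)
print(f"recall={recall:.2f} precision={precision:.2f}")
# Here 3 of the 5 human assignments are reproduced (recall 0.60), and
# 3 of the 4 automatic assignments agree with the norm (precision 0.75).
```

On this view the 'queries' cost nothing to obtain and the 'relevance judgements' come free with the human indexing, which is precisely what made the design feasible at the scale Harter chose.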
Portable test collections

It will be clear from all that has gone before that any retrieval test involves a considerable amount of effort, much of which goes into setting up the test collection: that is, the collection of documents, requests and relevance judgements. Even in an operational environment, where the document collection (with indexing) is given, the queries have to be trapped at a suitable point and the relevance judgements obtained. Many laboratory tests also involve some kind of indexing; and in any case, laboratory testers seldom have easy access to sources of queries and relevance judgements. For these reasons, it has become common for complete test collections to be passed from researcher to researcher and re-used many times. The best-known collection to suffer this fate is certainly the collection used in the second Cranfield experiment, described above; indeed, it would be fair to say that this collection has been grossly over-used, in the sense that it has been used for experiments far removed from those for which it was designed. On the other hand, given the existence of such a collection, a researcher in a laboratory environment is unlikely to feel justified or motivated to set up a new one. There are in fact a number of collections which are used in this way; indeed, there are researchers who have become, de facto, the brokers for such collections, notably K. Sparck Jones in the UK and G. Salton in the USA.