Information Retrieval Experiment
Introduction
Karen Sparck Jones

Although results have been obtained and experience gained, a review of this research serves mainly to show how much more needs to be done. Individual experiments are still far too often methodologically inadequate, in some cases in obvious matters of design, in others in less obvious but nevertheless very important matters of scale.

Over the years, the character of experimental work in information retrieval has changed. Many early tests focused on the new information processing methods associated with automation: they were concerned either with the application of post-coordinate searching using manual indexing or, more ambitiously, with wholly automatic indexing techniques. However, the main outcome of this work was the discovery that nobody really knew, in detail, how document retrieval systems behave or, more importantly, why they behave in the way they do. Many test results were unexpected or ambiguous, and it became evident that this was due to weak experimental design, which was in turn a consequence of unexamined assumptions about retrieval systems.

The need for a better standard of experiment, preferably informed by an explicit characterization of system properties, has been slowly recognized, and the quality of experiments has improved. Some research workers have in particular been able to capitalize on previous work through the use of common data, and also through the application of common techniques for system performance measurement. Thus although specific tests may have been done within the framework of particular operational systems and their assumptions, much of the research has been devoted to a painstaking analysis of typical system behaviour over a wide range of factors.

These developments are in many ways illustrated by the objectives of the Cranfield 2 test, its results, and its subsequent influence on information retrieval research. The test was designed to compare the performance of different indexing languages, including simple natural language, in a laboratory environment. It was intended to improve on Cranfield 1 in experimental design, and was systematically conducted. The most striking result, the competitive performance of post-coordinate natural language terms, was not expected, but has been largely supported by subsequent experiments. The test also indicated that only medium levels of performance could be expected of retrieval systems. The inverse relationship between recall and precision was clearly displayed, and was subsequently adopted as a generalization about retrieval systems, if not as a law.

The test was criticized for methodological inadequacies, for example in the way the test data was generated, and for being too small in scale. Further, while the test's concern with index languages and their application was clearly focused on the centre of retrieval systems, important factors, especially those involving users, were not studied.
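To make the two measures concrete: recall is the proportion of the relevant documents that a search retrieves, and precision is the proportion of the retrieved documents that are relevant. The minimal Python sketch below computes both for a single search; the document identifiers and relevance judgements are invented purely for illustration and do not come from any of the tests discussed here.

    def recall_precision(retrieved, relevant):
        # retrieved: set of document identifiers returned by a search
        # relevant:  set of document identifiers judged relevant to the request
        hits = retrieved & relevant                # relevant documents actually retrieved
        recall = len(hits) / len(relevant)         # proportion of relevant documents retrieved
        precision = len(hits) / len(retrieved)     # proportion of retrieved documents that are relevant
        return recall, precision

    # Invented example: a search returns 10 documents, 4 of which are among
    # the 8 documents judged relevant to the request.
    retrieved = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}
    relevant = {"d2", "d4", "d7", "d9", "d11", "d12", "d13", "d14"}

    r, p = recall_precision(retrieved, relevant)
    print(f"recall = {r:.2f}, precision = {p:.2f}")    # recall = 0.50, precision = 0.40

Broadening a search so that it retrieves more documents will typically raise recall while lowering precision; this trade-off is the inverse relationship that the Cranfield 2 results displayed.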
It has also been argued that recall and precision have been overvalued as measures and, more generally, that the bottom-up approach to the understanding of retrieval systems represented by Cranfield-type experiments may be unproductive or misleading, and that a top-down approach, guided by a theory or model, is preferable. The impact of the Cranfield 2 test on later research has nevertheless been considerable. Specific projects have applied Cranfield procedures and used the Cranfield test data for comparative purposes. More broadly, much of the experimental research done since Cranfield 2 has been in the same tradition,