IRE: Information Retrieval Experiment
Chapter: Retrieval system tests 1958-1978
Karen Sparck Jones
Butterworth & Company

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Most importantly, there is an inadequate understanding of controls in experimentation with IR systems and the controls are essential in monitoring the factors under consideration and distinctly sorting out the factors contributing to the performance. There is a lack of an effort to cumulate and synthesise knowledge on IR systems as it exists.' (Ref. 14 Part II, pp. 183A)

In general, therefore, the situation at the end of the first decade of information retrieval system testing was that while the test results tended (broadly) to agree on what happens in retrieval systems, they did not sufficiently explain why it happens. In particular, at the more detailed level, in cases where performance differences were observed, these were not always attributable to specific system factors or, more importantly, to the interplay between system factors. It was thus not at all obvious how systems should be designed to perform well, modulo a preference for recall or precision, in particular environments, especially outside established frameworks like those represented by the Medlars system, or for situations and needs clearly resembling those of existing systems. It was even less evident how 'optimal', i.e. attainably good, performance was to be achieved for a given area of the recall-precision spectrum.
For while there is a general inverse relationship, it does not follow that for a specific value of precision (or recall) one cannot establish a better average recall (or precision) than the current one. One needs at any rate to know whether the current performance level is a good one. Greater understanding was thus the prime need in the next decade's testing.

12.5 The decade 1968-1978

The testing work of the decade 1968-1978 differs from that of 1958-1968. It shows both a shift in the main topics of concern and, especially in laboratory work, greater refinement in the attempt to distinguish and control variables. The volume of experimental work seems to have been greatest in the earlier part of the decade, with a number of projects in particular stimulated by the major tests of the previous decade like CWRU's and Cranfield 2. In the latter part of the 1970s there has been a noticeable decline in the number of laboratory experiments, presumably because the rapid extension of online services has been widely, though in some opinions too uncritically, accepted as solving all the information user's problems. This development has been naturally associated with service investigations and management and cost-oriented studies.

Overall, the evaluation tests of the decade fall into five major groups, compared with the two of the previous decade, though these five groups do perhaps, as we shall see, fall into two very broad classes roughly corresponding to the two groups of the previous decade.

In the early 1970s there were a number of reports on manual index language tests of the kind conspicuous in the previous decade; and indeed these projects had typically been started in the late 1960s: examples are the tests done by Aitchison et al. (Ref. 49) and Olive, Terry and Datta (Ref. 50), and Keen's ISILT experiment (Refs. 51, 52). Some of the tests involved retrieval using different