IRE Information Retrieval Experiment Retrieval system tests 1958-1978 chapter Karen Sparck Jones Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

The decade 1968-1978

common in the 1970s, for unweighted as well as weighted searching. It is customary in Smart, for example. Unfortunately, proper comparative experiments in searching present many problems. As Keen notes in describing EPSILON103, and discusses more fully in Chapter 8 in this volume, the difficulties of designing and conducting satisfactory experiments in manual searching are very great; and Barraclough et al.'s study of users' search behaviour59, for example, was a very simple observational one. It is also not clear how radically different logics, like boolean and ranking ones, should be compared: indeed Evans61,62 asks whether such comparisons are meaningful (see also Chapter 14 in this volume). It should be noted, too, that evaluating recall for very different strategies using pooled output may introduce dangerous biases.

The searching tests done between 1968 and 1978 were chiefly concerned on the one hand with search logics, and on the other with searcher behaviour, in both cases in operational environments. The many experiments on query modification by relevance feedback, like those carried out by the Smart Project, are, as mentioned earlier, more naturally considered under the heading of automatic indexing, since the role of the user in detailed decision making in query modification is very limited, and numerical calculations of a kind justifying a computer are ordinarily involved in the searching.
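The caution about pooled output can be made concrete with a small numerical sketch. The Python below is illustrative only and is not drawn from any of the tests discussed: the strategy names, document identifiers and relevance judgements are invented, and `recall` and `pooled_recall` are hypothetical helpers. It assumes, unrealistically, that the full relevant set is known, so that recall measured against the pool can be set beside true recall.

```python
def recall(retrieved, relevant):
    """Fraction of the given relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def pooled_recall(outputs, relevant):
    """Recall of each strategy measured only against the relevant documents
    found by at least one strategy (the pool), not against all relevant ones."""
    pool = set().union(*outputs.values())
    pooled_relevant = pool & relevant
    return {name: recall(docs, pooled_relevant) for name, docs in outputs.items()}


# Invented data: six relevant documents (r*), two non-relevant ones (n*).
RELEVANT = {"r1", "r2", "r3", "r4", "r5", "r6"}
OUTPUTS = {
    "boolean": {"r1", "r2", "n1"},
    "ranking": {"r1", "r2", "r3", "r4", "n2"},
}

pooled = pooled_recall(OUTPUTS, RELEVANT)
true_recall = {name: recall(docs, RELEVANT) for name, docs in OUTPUTS.items()}

print(pooled)       # {'boolean': 0.5, 'ranking': 1.0}
print(true_recall)  # boolean is 2/6, ranking is 4/6
```

In this toy case the pool contains only four of the six relevant documents, so both strategies appear to do better than they do (0.5 and 1.0 against true values of about 0.33 and 0.67), and the size of the distortion depends on which strategies contributed to the pool.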
At the detailed level of objective, form and result, the tests on manual searching have rather little in common, not surprisingly considering the complexity of the topic. Both systematic comparisons and generalizations about them can therefore only be limited. The tests on search logics included Evans'61,62 and Miller's68-70 experiments, and also Katzer's cost-oriented investigation57. Some of Aitchison et al.'s work49 is also relevant, as is that of Cleverdon101. Studies of searcher behaviour were carried out during the period by Barber, Barraclough and Gray104 and Barraclough et al.59; Olive et al.'s test comparing manual scanning for SDI with automated searching was a 'semi' search test50; and service studies like Lancaster et al.89, Leggate et al.58, and the UKCIS investigation53 involved examination of search specifications and searching. The focus of Evans' experiment (see Chapter 14 in this volume) was to compare a range of different term or term group weighting schemes involving ranked output, and also boolean specifications; of Miller's, to compare the use of weights (in fact relevance weights) with boolean searches; of Katzer's, to compare 'grades' of boolean logic, specifically for cost. Cleverdon compared boolean searching with co-ordination ordering, and Aitchison et al. indirectly compared boolean and co-ordination-ordered searching, and also different query formulations (which they called strategies): broad, medium and narrow. Olive et al. compared automatic scanning with human for current awareness, and Barber et al. compared users and experts as searchers. Barraclough et al. simply observed user behaviour in searching online, and the other investigative projects like the UKCIS one just noted features of searches.
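The contrast between boolean searching and co-ordination ordering that several of these tests examined can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reconstruction of any experimenter's system: the toy collection, the query, and the function names `boolean_and` and `coordination_ranking` are all invented. A strict boolean AND returns an unranked set of documents containing every query term, whereas co-ordination ordering ranks documents by the number of query terms they match.

```python
# Toy collection: document id -> set of index terms (invented for illustration).
DOCS = {
    "d1": {"retrieval", "evaluation", "recall"},
    "d2": {"retrieval", "indexing"},
    "d3": {"evaluation", "recall"},
    "d4": {"thesaurus"},
}


def boolean_and(query_terms, docs):
    """Return the ids of documents containing every query term (an unranked set)."""
    return {doc_id for doc_id, terms in docs.items() if query_terms <= terms}


def coordination_ranking(query_terms, docs):
    """Rank documents by co-ordination level: the number of query terms matched.
    Documents matching no query term are omitted."""
    scored = [(doc_id, len(query_terms & terms)) for doc_id, terms in docs.items()]
    matched = [(doc_id, score) for doc_id, score in scored if score > 0]
    # Highest co-ordination level first; ties broken by document id.
    return sorted(matched, key=lambda pair: (-pair[1], pair[0]))


query = {"retrieval", "evaluation", "recall"}

print(boolean_and(query, DOCS))           # {'d1'}
print(coordination_ranking(query, DOCS))  # [('d1', 3), ('d3', 2), ('d2', 1)]
```

The boolean search returns only the one document matching all three terms, while the ranked output also offers partial matches in decreasing order of overlap. This difference in the form of the output is one reason why comparing the two kinds of logic directly is awkward.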
The motivation for the comparative tests in the first group was to evaluate the less restrictive, especially weighting, schemes; that of the second group, to evaluate the simpler, less expensive approaches, like automatic searching in Olive et al.'s study, or having the user rather than an expert search in Barber et al.'s test. The common assumption was that the simpler approaches were adequate. In form the tests followed the general pattern, the main feature of interest