Schumacher et al.'s study did not, as noted, distinguish closely related factors like indexing source and quality and index description source and quality, and Svenonius' experiment suffered from similar defects. In evaluation, Cleverdon indulged in some not very well justified performance extrapolation. In a good many tests it is difficult to attribute any significance to the absolute values of relative recall obtained, and indeed, as noted in connection with searching, perhaps to comparative values.

The most welcome feature of the decade has been the increase in test collection size, though reports of operational tests in particular tend not to indicate precise size. There are nevertheless far too many tests using quite small document sets and, much more importantly, small numbers of requests. For example, Miller used only 25 requests, Cleverdon has often used fewer than 20, Katzer used 18, and Cameron 12. It is very doubtful whether the smaller sets produce results which can be regarded as more than suggestive for other contexts. The largest request sets used were UKCIS' 193 and Leggate's 160. It is particularly regrettable that the many Smart Project tests have typically used small collections, in recent years three collections consisting of some 25 requests and 450 documents each. Several studies, like UKCIS', have unfortunately also used different sizes of request set in individual, closely related experiments.

To complete the detailed discussion, we may note that, as in the previous decade, the main body of evaluation tests on retrieval system core factors discussed so far was surrounded by other studies of different types. These have also followed the changing trends of the decade. Non-evaluation studies in the core area include a number, especially in the earlier part of the decade, in the area of automatic indexing, like those of Artandi and Wolf111, Carroll and Roeloffs112, Williams113, Harter114,115, and Field116, all concerned with terms, and of Litofsky117 and Schiminovich118 on document clustering. They all claimed some degree of plausibility or merit in the devices studied. Outside the five groups discussed, or at least on a higher and more comprehensive level, have been service-oriented tests and studies like those of Rowlands on SDI88, Hansen on Chemical Abstracts costs90, and Simkins91 and Pollitt92 comparing services for particular user communities. Investigations like those of Lancaster et al.89 and Leggate et al.58, briefly mentioned earlier in connection with searching, really fall into this category. These studies naturally reflect consumer interest in the increasing range of competing services but incidentally, as is shown by Pollitt's study, provide valuable raw data. In the more peripheral areas there have again been many studies of users, and a whole range of bibliometric investigations, for example of citation patterns.
There has also been an increasing interest in data base coverage and overlap.

12.7 The outcome of 20 years' testing

What conclusions can be drawn about the state of information retrieval research from such a survey? More specifically, what progress has been made over the last 20 years in obtaining substantively valuable results from methodologically sound experiments?