IRE Information Retrieval Experiment Retrieval system tests 1958-1978 chapter Karen Sparck Jones Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

based, and so are connected with other statistical approaches to retrieval. The statistical work of 1968-1978 is in turn linked with that of the earlier decade. As noted, much of the automatic indexing research done between 1958 and 1968 did not progress as far as evaluative performance testing. The early 1970s saw reports on statistical term cluster evaluation by Vaswani and Cameron78 and Sparck Jones79,80, and recent experiments by Harper and van Rijsbergen81 have specifically combined the use of term associations with that of relevance weights. Other rather crude methods of non-statistical automatic (or in principle automatic) indexing are represented by O'Connor's work on passage retrieval82,83 and by Atherton's BOOKS project84. Klingbiel and Rinker85 and Evans86 report tests of semi-automatic indexing involving some reference to a dictionary or thesaurus. In general, statistical clustering has proved very disappointing, and the main thrust of statistical work has been on the more promising weighting. For the purposes of discussion we can therefore consider two groups of tests: those on automatic and especially statistical indexing not involving relevance information, and those on relevance feedback and weighting.

Some of the language, indexing, and searching tests were carried out in the context of operational services, for example by Aitchison et al., Barker et al., Olive et al.
and, recently, Cleverdon87. There have also been more restricted investigations, rather than experiments proper, relating to services, such as those carried out by Rowlands88, Lancaster, Rapport and Kiffin Penry89, Leggate et al., Hansen90, Simkins91, and Pollitt92. Such operational tests were often concerned, perhaps more than those of the previous decade, with cost efficiency as well as performance effectiveness, and some studies, like that of Katzer57, have been wholly devoted to costs. The increasing volume of information and development of information services have also been matched by a corresponding growth of user studies, data base coverage investigations, and so on. There have also been many bibliometric studies, some of a very academic character.

Overall during this period we can detect two major strands in testing, reflecting an increasing divergence between the concerns of operational system managers and those of research workers. Projects under the first head concentrated initially on indexing languages and then on the related topics of indexing and searching. Research workers have also concentrated increasingly on searching, as in the relevance weighting experiments, but within the framework of theoretical approaches implying sophisticated procedures like output ranking. It is therefore paradoxical that some research findings which appear particularly suited to modern computer systems should have made no impact on the operational scene. From a more intellectual point of view it will be evident that we can combine the five topic groups listed to form two broader groups of test which in fact continue the previous decade's interests in manual and automatic systems respectively. Thus the work on index languages, on indexing, and on user searching strategies is all oriented towards manual systems or the human elements of automatic systems.
The work on statistical or other 'mechanical' forms of indexing, and on statistical and 'mechanical' techniques for query modification, on the other hand, is a continuation of the automatic indexing research of the 1960s. In what follows, these two broad groupings should be borne in mind, though the detailed discussion is more conveniently and,