IRE Information Retrieval Experiment Retrieval system tests 1958-1978 chapter Karen Sparck Jones Butterworth & Company Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Other tests

Outside the two groups of tests discussed were a few others concerned with what we earlier referred to as the retrieval system core: Tague's study of the role of question terms in matching relevant documents is an example [42]. Further, supporting the evaluation experiments involving the retrieval system core were some non-evaluative studies, usually of an investigative rather than experimental character, concerned with such topics as the character of indexing vocabularies or properties of document sets. Their importance in automatic indexing has already been mentioned: in connection with manual indexing such studies as those of Houston and Wall [43] and Heald [44] can be mentioned.

Round these core tests we can then group studies in other more peripheral areas. Among these are two large subgroups, of user studies and bibliometric studies. User studies naturally began to appear accompanying the development of novel, large, or automated systems in the 1960s, and a great many have been carried out. Early studies were mostly based on questionnaires. Unfortunately, as such reviewers as Menzel [45] and Herner and Herner [46] noted, many of these studies suffered from methodological failings like poor sampling or the use of ill-designed questionnaires.
Bibliometric studies also became popular in the 1960s, boosted by the Science Citation Index, but these too often exhibited methodological failings, especially in the assumptions made about the propriety of the clustering techniques used.

Finally, it should be noted that alongside the work discussed so far, which was explicitly or implicitly concerned with effectiveness, went studies of system efficiency, i.e. cost. Some of the evaluation tests already mentioned, like van Oot et al.'s, included cost analyses, but studies of costs alone were also carried out in the period (see King [47]). The development of techniques for conducting cost analyses is of course relevant to that of testing in general.

12.4 Conclusion on 1958-1968

Looking at the decade 1958-1968 as a whole, it is possible to detect some consolidation of actual findings, and some development of testing methods and improvement in experimental standards. The main findings were those mentioned earlier as conclusions to be drawn from the indexing language tests, with the tentative rider from the automatic indexing work that the simple indexing found competitive in the manual tests can be provided automatically. The main findings of the decade were strikingly exemplified by the Cranfield 2 [2, 3] and CWRU [14, 15] results, and are well expressed by Saracevic's comments on the latter. Thus in his conclusion to the CWRU Report [14] Saracevic notes, as overall observations about information retrieval systems, the importance of human factors in maintaining adequate performance (a comment endorsed by Lancaster in calling for quality control for Medlars [12]); the fact that system performance can nevertheless only reach a middling level; and that an inverse relationship holds between retrieving relevant documents and avoiding non-relevant ones.
The inverse relation of recall and precision was emphasized by Cleverdon and, as Lancaster and Mills noted [48], since the relation is inverse, one should design a system for a particular point along