IRE Information Retrieval Experiment Retrieval system tests 1958-1978 chapter Karen Sparck Jones Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Conclusion on 1958-1968

the recall-precision line. These conclusions were endorsed by the multi-collection tests done in the later part of the decade by the Smart Project.

Methodologically, the Cranfield 2 project showed how informative testing required a more systematic breakdown of a retrieval system into its various factors than was common earlier. Gross comparisons between distinct languages were replaced by a much more detailed study of recall and precision devices generating families of languages. The CWRU project was similarly directed toward a much more careful treatment of system factors in a range of comparative experiments than was generally adopted. However, it is interesting to note that the attempt to ensure control in CWRU led to new difficulties, in this case to a very artificial and perhaps perverted treatment of queries: maintaining constant queries for different languages tended to suppress distinctive features of the languages themselves. The same trend is well shown in the Smart Project work, which by 1968 was well into a very large range of detailed studies.
In this case the emphasis on automatic systems provided not only new opportunities for system design, for example in permitting ranked output, but also new opportunities for system testing, both in the comparative ease with which grinding tests over ranges of slightly different variable values could be conducted and in the application of complex measurement and statistical evaluation techniques. However, the sheer proliferation of explicit parameter settings served to bring out not only the increasing numbers of runs needed for proper comparative experiments, but also the difficulties of ensuring a meaningful experimental design.

The Cranfield 2 and CWRU projects in many ways looked backward, seeking to improve on the initial index language tests of the decade. But they also, in the challenge implied by the comparative flatness of their findings, and in their methodological quality, presented a reference point for the work of the next decade. The Smart Project, while sharing this character to some extent, is a more genuine pointer to future work in more thoroughly embracing the possibilities offered by the computer, particularly for sophisticated search strategies and non-conventional methods of massaging term descriptions, for example by numerical weighting and feedback techniques.

But even these projects suffered, as already indicated, from many limitations; and the general character of the testing done in the field during the decade is well described by Saracevic:

'At present real and productive testing of total retrieval systems, taking into account and controlling all inside and environmental factors, is not feasible and not possible. At present, it seems that generalisable, formal, quantified results of high validity and reliability on all or even on the majority of factors affecting the performance of retrieval systems cannot be attained. The reasons are fairly evident.
There is an absence of a well-formulated theory taking into account all or a majority of the factors operating on retrieval systems. There is only an intuitive understanding of objectives of retrieval systems - thus, the measures indicative of the achievement of objectives are not totally reflective of the real objectives and not comprehensive. There is an inadequate knowledge of processes involved within or outside the IR Systems and, without a thorough understanding of processes, comprehensive testing is unattainable. There is a lack of standardised methodologies for experimentation, which precludes testing.