Information Retrieval Experiment
Chapter: Retrieval system tests 1958-1978
Karen Sparck Jones
Butterworth & Company
Karen Sparck Jones. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

...document retrieval research with much earlier work and with current research on non-bibliographic databases.

12.6 Conclusion on 1968-1978

Overall, when we look at the evaluation tests of the decade from a substantive point of view, we can see on the one hand a rounding out of the work done in the previous decade, and on the other one possible line of non-conventional performance improvement. The experiments as a whole show simple indexing to be as good as sophisticated indexing as far as the indexing language is concerned, with the actual treatment of the query more important as a determiner of performance than anything else, and performance in any case very difficult to raise above a broad 50 per cent precision-50 per cent recall level. It appears, more specifically, that the requirement to be met to reach this level is that of adequate exhaustivity of indexing, primarily of the query (unless the user's only concern is with precision). The relevance weighting techniques studied during the decade may not raise performance above the middling level apparently at best attainable, but they may provide a very helpful and cheap way of raising performance from the lower levels likely to be actually attained in practice.
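The best-known formulation of relevance weighting from this period is the Robertson-Sparck Jones weight of 1976. The sketch below is illustrative only, not a reconstruction of any particular system discussed in the tests; the specific counts in the usage example are invented for demonstration.

```python
import math

def rsj_weight(r, n, R, N):
    """Relevance weight for a query term, in the smoothed form
    associated with Robertson and Sparck Jones (1976).

    r: relevant documents containing the term
    n: documents in the collection containing the term
    R: known relevant documents for the query
    N: documents in the collection

    The 0.5 corrections keep the weight finite when counts are
    zero, which is common with the small relevance samples these
    experiments had to work with.
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# A term occurring in 8 of 10 known relevant documents but only
# 20 of 1000 documents overall gets a strongly positive weight.
w = rsj_weight(r=8, n=20, R=10, N=1000)
```

Terms are then ranked or matched using these weights in place of simple presence/absence, which is what makes the technique cheap: it reweights an existing index rather than re-indexing the collection.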
It is however the case that the substantive remarks that can be made about the results obtained between 1968 and 1978 are, like those of the previous decade, very general, and the more novel ones are rather tentative. Methodologically, the tests taken together show some improvement over those of the previous decade, but regrettably not enough. A particular contribution has been made by the use of individual test collections by more than one project, and of several collections by individual projects. In the first case the Cranfield 2 data especially have been widely used, but other test collections, like Keen's ISILT one and a UKCIS one, have also been utilized by more than one project. This does not of course mean that any defects of the data as created are removed, but at least the results of one project can be related to those of others, and equally, particular results can be supported by related tests, including those involving different methods of performance representation. The Smart Project was a pioneer of multi-collection tests, and the importance alike of those tests showing different results for different collections and of those showing the same result for different collections cannot be overestimated. Sparck Jones has also used a range of increasingly large test collections in laboratory experiments. As noted earlier, a good many of the tests of the decade exhibited more careful control of both primary and secondary variables, and a concern with the validity of findings which has led to a wider application of statistical significance tests, as for example by the Smart Project and by Keen. (It must nevertheless be admitted that the basis for applying significance tests to retrieval results is not well established, and it should also be noted that statistically significant performance differences may be too small to be of much operational interest.) Unfortunately, the tests of 1968-1978 still show many methodological deficiencies.
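One simple significance test that has been applied to retrieval results is the sign test over paired per-query differences, which avoids distributional assumptions about the performance measure itself. The sketch below is a generic illustration, not the procedure of any project named above, and the sample differences are invented.

```python
from math import comb

def sign_test(diffs):
    """Two-sided sign test on per-query performance differences
    (system A minus system B).

    Queries where the systems tie are discarded; under the null
    hypothesis each remaining difference is positive with
    probability 0.5, so the count of positives is Binomial(n, 0.5).
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    k = sum(1 for d in nonzero if d > 0)
    # Two-sided p-value: probability of a split at least as
    # lopsided as the one observed, in either direction.
    extreme = min(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Precision differences over ten queries: A beats B on 7 of the
# 9 non-tied queries, yet the result is not significant at 0.05,
# illustrating how weak evidence from small query sets can be.
p = sign_test([0.12, 0.05, -0.02, 0.08, 0.0,
               0.11, 0.03, 0.07, -0.01, 0.04])
```

The example also shows the converse of the parenthetical caveat above: just as significant differences may be operationally trivial, apparently consistent differences over a few dozen queries may not be significant at all.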
Thus it is to be regretted that in Yates-Mercer's test of relational indexing, indexers and searchers were not sufficiently independent; tests like