Information Retrieval Experiment. Chapter: Retrieval system tests 1958-1978, by Karen Sparck Jones. Butterworth & Company. Copyright Karen Sparck Jones.

The decade 1968-1978

(i.e. collection frequency) weighting of some utility. However, neither Vaswani and Cameron nor Sparck Jones, in substantial series of experiments, could obtain real performance improvements with clustered rather than unclustered terms. Cagan found clustering useful, but in a highly eccentric test. In document clustering, precision can be maintained, but recall suffers. Taken together, the tests would imply that simple statistical techniques are as good as more elaborate ones, but even then yield only modest performance improvements. The main inference to be drawn from the tests was that vocabulary distribution properties may be important for retrieval: for example, Sparck Jones found clustering rare terms far more useful than clustering common ones. This observation contributed to the work on weighting.

Automatic methods using relevance information

The final group of tests to be considered are those concerned with relevance feedback and weighting. The automatic indexing methods discussed so far are based on information about the occurrences and co-occurrences of terms in any documents. The use of more specific information about term occurrences and co-occurrences in relevant documents leads to the relevance feedback and relevance weighting schemes which have been especially important in the research work of the decade.
The Smart Project's early tests4, 63, 64, especially those of Ide109, concentrated on feedback methods for adding terms to, or removing them from, queries; later experiments65, 66 were concerned with relevance or 'precision' weighting. Sparck Jones has carried out a series of experiments with relevance weights75-77, 95, as have Harper and van Rijsbergen81. In an operational context, Miller68-70 and the UKCIS staff (Barker et al.54, 71, and subsequently Robson and Longman72, 73) have studied relevance-controlled query expansion or weighting schemes. Cameron's approach was rather different, clustering documents using relevance information74.

The character of these experiments has been very like that of other automatic indexing tests. Thus the focus of the tests has been a comparative evaluation of searching with and without relevance information. The context has nearly always been that of natural language indexing, though Miller applied relevance weights to MeSH terms. The motivation has been to demonstrate the value of statistical methods of indexing utilizing relevance information. An additional motivation in laboratory tests like those of Robertson and Sparck Jones75 or Harper and van Rijsbergen has been to validate a formal theory; while in the service studies like the UKCIS ones, the statistical feedback techniques have been seen as devices assisting the user in reducing the effort of profile preparation.

In form the experiments follow the mainstream pattern with recall and precision evaluation, but with output ordering in the laboratory tests, whereas the operational tests have concentrated on comparisons with Boolean performance. It is of interest that in this group of experiments quite large test collections were used, not only in operational tests like Miller's, but in some laboratory tests: thus Cameron used some 12 000 documents (though few requests), and Sparck Jones over 27 000 documents.
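The formal theory behind relevance weighting can be illustrated with a minimal sketch. One standard formulation, in the spirit of the Robertson and Sparck Jones work (the function and variable names below are illustrative, not taken from the original papers), weights a term by the log odds of its occurring in relevant versus non-relevant documents, with 0.5 added to each cell of the contingency table as smoothing:

```python
import math

def relevance_weight(N, R, n, r):
    """Relevance weight of a query term, in the style of the
    Robertson/Sparck Jones formulation.

    N: total documents in the collection
    R: documents known (from feedback) to be relevant
    n: documents containing the term
    r: relevant documents containing the term

    Returns log of the smoothed odds ratio: how much more often
    the term occurs in relevant than in non-relevant documents.
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# A term concentrated in the relevant set receives a large positive weight:
w_good = relevance_weight(N=1000, R=10, n=20, r=8)

# A term occurring at the same rate inside and outside the relevant set
# (10% of relevant documents, 10% of the collection) scores near zero:
w_flat = relevance_weight(N=1000, R=10, n=100, r=1)
```

In a feedback search of the kind discussed above, such weights would be recomputed from the user's relevance judgements and used to reorder the output, which is why the laboratory tests in this group report ranked recall and precision rather than Boolean set comparisons.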