Information Retrieval Experiment
Chapter: Retrieval system tests 1958-1978, by Karen Sparck Jones
Butterworth & Company. © Karen Sparck Jones. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

The decade 1968-1978 has been devoted to much more thorough performance evaluation than that of the previous decade. A significant feature of the generally statistical approaches adopted has been the idea of relative rather than absolute merit, whether in the characterization of individual documents, of a collection, of requests, or of document-query matches. Manual indexing tends to involve an all-or-nothing approach to indexing and retrieval. Numerical measures of merit can of course be used with a threshold to select items in indexing or searching, but more power, because more discrimination, is involved in the general idea of weighting; and, as indicated earlier, a good deal of the theoretical work in information retrieval in this decade has been concerned with the notion of ranking determined by probability. Evaluation tests on automatic indexing and searching were chiefly devoted to statistical methods, not simply in the absence of non-statistical techniques, but with the support of the theories justifying statistical approaches to indexing and matching. The tests have included ones on individual document index term weighting, though not selection, on vocabulary selection and weighting, on term clustering and document clustering, and on query term selection and weighting.
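The contrast drawn above, between all-or-nothing selection by a threshold and the greater discrimination afforded by ranking on weights, can be made concrete with a small sketch. The documents, query and weights below are invented for illustration and do not come from any of the tests discussed; scoring is a simple sum of matched term weights.

```python
# Illustrative sketch (invented toy data): thresholded selection versus
# ranking by a numerical measure of merit.

def score(doc_terms, query_weights):
    """Sum the query's term weights over the terms present in the document."""
    return sum(w for t, w in query_weights.items() if t in doc_terms)

# Hypothetical documents, each represented as a set of index terms.
docs = {
    "d1": {"retrieval", "indexing", "evaluation"},
    "d2": {"retrieval", "clustering"},
    "d3": {"indexing"},
}
# A weighted query (weights chosen arbitrarily for the example).
query = {"retrieval": 2.0, "evaluation": 1.5}

scores = {d: score(terms, query) for d, terms in docs.items()}

# All-or-nothing use of the scores: a threshold turns them back into a set.
selected = {d for d, s in scores.items() if s >= 2.0}

# Ranking preserves the relative merit that the threshold throws away.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here the threshold admits d1 and d2 but cannot distinguish between them, whereas the ranking orders every document by its degree of match.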
A number of projects have carried out experiments on more than one of these: the Smart Project work in this decade in particular has included tests in all of these areas [4, 63-66, 96-98]. Sparck Jones has been concerned with vocabulary selection and weighting, term clustering [75-77, 79, 80, 93-95], and query weighting, and van Rijsbergen with term clustering, document clustering, and query weighting [81, 105-107].

Automatic methods not using relevance information

Consider first work on automatic methods other than those involving relevance information. There has been no evaluation testing of methods for the direct selection of terms for documents along the lines of Damerau's earlier investigation, though Evans tested indexing by automatic assignment of manual thesaurus terms [86]. Simple weighting by within-document term frequencies has been studied by the Smart Project [96, [OCRerr]]. More attention has been devoted to the treatment of the collection vocabulary, as in Salton's use of discrimination functions to select and weight vocabulary terms [96, [OCRerr]], or the use by Salton [96, [OCRerr]] and Sparck Jones [93, 95] of inverse document frequency weights. A whole range of tests with term clusters, used either to define classes of substitute terms or sets of additional terms, was carried out by Vaswani and Cameron [78] and by Sparck Jones [80, 95], and a more restricted test by Cagan [108]. Smart Project tests on term clustering during the decade have been rather restricted ones with modified manual thesauri and 'statistical phrases' [65, 97, 98]. Document clustering has been studied by the Smart workers [63] and by van Rijsbergen [105-107]. The focus, motivation and assumptions of these tests were very much those of the previous decade. The general aim has been to demonstrate the value of statistical selection, weighting and classification techniques for retrieval, mostly by comparison with their absence, but sometimes, as in some Smart tests, by comparison with manual alternatives.
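The two weighting ideas mentioned above, within-document term frequency and inverse document frequency, can be sketched as follows. This is an illustration of the general idea only, not of any particular tested system; the toy collection is invented.

```python
import math
from collections import Counter

# Invented toy collection: each document is a list of its index terms.
collection = [
    ["retrieval", "system", "evaluation"],
    ["indexing", "retrieval"],
    ["clustering", "indexing", "indexing"],
]

N = len(collection)

# Document frequency: in how many documents each term occurs.
df = Counter(t for doc in collection for t in set(doc))

# Inverse document frequency: terms confined to few documents weigh more,
# reflecting relative rather than absolute merit across the collection.
idf = {t: math.log(N / n) for t, n in df.items()}

def tf_idf(doc):
    """Combine within-document frequency with collection-wide idf."""
    tf = Counter(doc)
    return {t: f * idf[t] for t, f in tf.items()}

weights = tf_idf(collection[2])
```

In the third document, 'clustering' outweighs the more frequent 'indexing' because 'indexing' also occurs elsewhere in the collection, which is precisely the discrimination that collection-relative weighting is meant to supply.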
More specific concerns have been to evaluate competing statistical methods for providing a given device, for example approaches to term classification in Vaswani and Cameron's and Sparck Jones' experiments, and to term weighting in many