IRE Information Retrieval Experiment Retrieval system tests 1958-1978 chapter Karen Sparck Jones Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

conduct for systematic comparisons. However, taken together the results show that large differences in exhaustivity do affect performance, typically trading precision for recall. Schumacher et al.'s findings (assuming constant indexing quality) show this very clearly: with increasing exhaustivity they obtained a substantial gain in recall, with a gradual, though not enormous, decline in precision. Thus recall, measured relative to the relevant documents retrieved using full text, progressed from 25 per cent for titles to 72 per cent for titles plus abstracts, contents lists and author keys, while precision dropped from 65 to 56 per cent. Keen found that recall rose from 74.7 to 85.8 per cent, but at the cost of an increase in the median number of non-relevant documents retrieved from 18.9 to 24.4, for controlled language document indexing on two levels of exhaustivity. Cleverdon found that for varying natural language exhaustivity for both requests and documents, performance ranged from 70.5 per cent recall (relative to an independent sample) and 32.2 per cent precision to 80.6 per cent recall and 18.1 per cent precision. However, as Sparck Jones suggests, small differences are not important, and exhaustivity in document indexing can be consciously counterbalanced by the treatment of requests. This is indeed implicit in the use of extended profiles for title searching in operational services.
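The recall and precision figures quoted for these tests can be illustrated with a minimal sketch. It assumes the standard set-based definitions (recall as the fraction of relevant documents retrieved, precision as the fraction of retrieved documents that are relevant). The function name `recall_precision` and the document identifiers are hypothetical and are not taken from any of the tests cited.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query, given collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical toy data: four relevant documents for a single request.
relevant = {1, 2, 3, 4}

# Assumed output under low-exhaustivity indexing (e.g. titles only).
low_exhaustivity = {1, 2, 9}

# Assumed output under higher exhaustivity: more relevant documents
# are matched, but so are more non-relevant ones.
high_exhaustivity = {1, 2, 3, 9, 10}

low = recall_precision(low_exhaustivity, relevant)
high = recall_precision(high_exhaustivity, relevant)

print("low exhaustivity:  recall=%.2f precision=%.2f" % low)
print("high exhaustivity: recall=%.2f precision=%.2f" % high)
```

In this invented case recall rises while precision falls as exhaustivity increases, which is the direction of the trade-off reported above. The magnitudes carry no significance.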
Cleverdon's results also suggest the possibility of trade-offs, as do Aitchison et al.'s tests of different query formulations, broad, medium or narrow.

Searching tests

The evaluation tests on searching include some of the most interesting of the decade. It is, however, difficult to give a coherent account of them, since the whole searching subcomponent of a retrieval system is an extremely complicated one, and one which is not well understood, and the different tests done have been scattered over the large area of searching as a whole.

Searching refers both to the entire interaction between a user seeking documents relevant to a need from a document file, and to any particular expression of this need used to scan some or all of the file. The latter includes the treatment of individual terms and that of the logical structure of the query, and the complex relationship between the two. This is not the place for a detailed discussion of searching, and in the summary account which follows its different aspects will be referred to very crudely. For this purpose we will therefore simply use the term 'strategy' for the searching process for a query as a whole, 'specification' for any individual matching prescription, 'logic' for the formal structure of such a prescription, and 'formulation' for the broad or narrow scope of a specification.

With respect to logic, the great majority of experiments and investigations have, following operational practice, been concerned with Boolean queries, and hence with the measurement of performance for simple sets of retrieved documents. However, the idea of subsearches (especially broadening a search) naturally allows for an ordering of output, and some approaches to indexing, notably those involving weighting, can only be properly, or at any rate sensibly, interpreted as generating a ranked, i.e. ordered, output.
(It should be emphasized that this has nothing to do with the representation of Boolean structure by weights, which is merely a matter of notation.) The Cranfield 2 experiments provided an ordered output, and as noted earlier, it became