IRE Information Retrieval Experiment. Retrieval system tests 1958-1978 (chapter). Karen Sparck Jones. Butterworth & Company.

Methodological and substantive achievements

document indexing options, Evans investigated a fairly heterogeneous selection of search strategy options. More generally, while it might be claimed that multi-collection experiments comparing search procedures, like those of Sparck Jones (Ref. 95), represent some methodological and substantive advance over Dale and Dale's test of some ten years earlier (Ref. 21), Atherton's BOOKS test (Ref. 84) does not represent much methodological advance over Cranfield 1 (Ref. 1), and is indeed substantively focused on the same system component, document indexing.

Is it nevertheless possible to point to any general methodological and substantive advances?

To take methodology first: it is possible to point to a general advance in the quality of retrieval experiments.
Specifically, we can say that experiments now are more likely than they were twenty years ago:

(1) to use real data, for example not requests based on source documents;
(2) to have enough data, for example fifty requests rather than ten, and five thousand documents rather than five hundred;
(3) to introduce more control in relation to data variables, for example by using more than one collection;
(4) to discriminate better among mechanism variables, for example by distinguishing language specificity from indexing specificity;
(5) to utilize more appropriate performance measures, for example by interpreting recall in relation to sets of documents rather than single sought documents;
(6) to conduct tests more carefully, for example by utilizing Latin square designs for assigning tasks to people;
(7) to evaluate findings properly, for example by applying significance tests.

But, as the survey has shown, not all experiments meet these conditions. The effort of conducting proper experiments, which proposals like those for the `ideal test collection' were intended to reduce, remains very great, so many tests are limited in scope. Again, though tests appear to reflect a growing consensus, for example in the use of recall and precision, many test reports suggest that little attempt has been made to learn from the experience of previous workers. Where such complicated matters as techniques for deriving recall/precision graphs are concerned, the lack of rigour is not surprising, but it is still unfortunate.

These defects of current test methodology are nevertheless really only the manifestation of deeper problems about the substantive aspects of retrieval systems. Differences of test design in part reflect genuine differences of test purpose and emphasis.
Thus there is no very good reason why an efficiency-oriented test to determine indexing speed under different working conditions should have much in common with an effectiveness-oriented test to evaluate the utility of not-logic in boolean searching. However, there are many features of information retrieval systems which are not sufficiently understood, even at the level of reliable description, let alone analytical modelling: appropriate choices of performance measure are an example.

Looking at the substantive side of retrieval systems, what contributions have twenty years of testing made to system understanding and hence system design? As pointed out earlier, statements here can only be very general ones. Thus it appears that the tests which have been carried out show: