Index language test overview

The most fruitful way of looking at the results obtained in the tests of the period is to see how the specific findings, and the interpretations given them by those concerned, show common trends when the tests are taken together, with implications for our understanding of information retrieval system behaviour in general. What emerges from the tests of 1958-1968 is unlikely to be novel to those familiar with the work, but it is worth emphasizing that these very broad conclusions are supported by the results obtained over a range of projects, and are not simply based on single tests.

Thus the tests comparing different languages, like Spencer's, Shaw and Rothman's, the various Cranfield tests, and the CWRU project, show that, other things being equal, different languages achieve comparable levels of performance, though they may retrieve different sets of documents, and in particular different relevant documents. A natural corollary is that fancy indexing has no especial merit, as Cranfield 1, Montague, and Melton show, or, to put it the other way round, that simple indexing has merit, as Blagden, Cranfield 2, and Shaw and Rothman indicate. A related conclusion, supported by Cranfield 1 and CWRU, is that the indexing subsystem is not the overwhelmingly important factor in determining system performance. In Lancaster's and Saracevic's view, based on the Medlars and CWRU studies respectively, the treatment of the question, and specifically its proper development, emerges as much more important. Saracevic's general conclusion is that human factors are the most important ones.

The related system factors most affecting language performance seem to be the exhaustivity or depth of indexing, noted for Cranfield 1 and 2 and CWRU, and also, according to Cranfield 2, the specificity of the indexing language. Taken together, the tests indeed support the statement that there is an inverse relationship between recall and precision, a relationship explicitly studied in the Cranfield 2 experiments and influenced both by indexing policies for documents or requests, which determine exhaustivity, and by the indexing resources of languages, which determine specificity. The more detailed studies of the link/role test subgroup provide particular evidence here, showing that both links and roles are precision devices, with roles especially restrictive; and other studies of indexing with relations, like the Syntol test and Cranfield 2, show a similar restrictiveness.
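Since the inverse relationship between recall and precision recurs throughout these findings, it may be useful to set the two measures out in their conventional form; the definitions below are the standard ones assumed in tests of the period, not formulas drawn from any one of the reports cited here.

\[
\text{recall} = \frac{|R \cap A|}{|R|}, \qquad \text{precision} = \frac{|R \cap A|}{|A|}
\]

where R is the set of documents relevant to a given request and A is the set the system retrieves for it. On these definitions a device that restricts A, as roles do, can raise precision only at the risk of losing relevant documents and so lowering recall, while a policy that enlarges A, as more exhaustive indexing does, works in the opposite direction; this is the mechanism underlying the recall/precision trade-off the tests report.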
Thus the statement that, other things being equal, languages perform the same has to be read as meaning that languages perform the same if document-dependent factors are held constant and the languages are not explicitly oriented in opposite directions with respect to recall and precision: if good levels of both recall and precision are required, then, when document variables are held constant, languages representing, for example, rather different classificatory philosophies do not differ materially in behaviour.

Of course, these observations can only be taken as very broad generalizations, given the great variations in the details of the tests of the period, and also their many methodological deficiencies. The latter might indeed be regarded as sufficiently gross in many cases to undermine any conclusions to be drawn from the tests, but an alternative view is that the tests, however defective, were sufficiently varied that any common result can be regarded as