likely to represent a persistent underlying reality. The methodological inadequacies of many of the tests, as illustrated by Sinnett, Cohen et al., or Montague, for example, are nevertheless very conspicuous. The CWRU project was indeed specifically intended to constitute a study and development of retrieval system testing methodology. The defects of many of the tests are incidentally compounded, for subsequent criticism, by inadequate reporting, for example about the number of documents searched.

Many experiments suffered from a general lack of control, both with respect to the values of the variables of interest and those of other obviously or possibly related variables. Thus if the variable of interest is the indexing language, when only one is studied it is not obvious how far the resulting performance should be attributed to the language itself. Conversely, since, for example, indexing depth may be a dependent variable, depth should at least be held constant, and preferably also systematically changed. A minimal test would therefore compare languages A and B with respect to indexing depths I and II (see the sketch below). The Cranfield 2 project was deliberately intended to improve on Cranfield 1 in such respects, and, as just noted, the CWRU project was designed to make such properly controlled comparisons, and indeed included subsidiary tests to validate the effectiveness of the controls. Other tests, notable examples being several of those on links and roles, attempted reasonably careful comparisons.

A number of studies, though perhaps not involving a high degree of control, included failure analysis. This was done by van Oot, for instance, and on a large scale by Lancaster. Failure analysis is not part of an experiment proper, but it makes a very important contribution to the broader study of retrieval system behaviour. Some authors indeed comment, like Schuller, on the problem of testing, or at least recognize the limitations of their own tests, for example in sample size.

However, some particular methodological inadequacies recur in the tests of the period, along with the specific failings of individual tests. These defects can be categorized as, first, those concerned with the propriety of the way a real system is being modelled; second, those concerned with statistical aspects of the tests; and third, those of evaluation.

In the first category the most noticeable deficiency is the wide use of 'bogus' queries, i.e. queries not put to the system in the ordinary way by its users. In Cranfield 1 and tests influenced by it, like that of Herner et al., source document questions were used, i.e. questions based on and designed to retrieve specific documents; and in other tests, like those of Montague and Cros et al., synthetic, made-up questions, assumed typical of real ones, were used. The results obtained with real and artificial queries may not differ, but where this has not been demonstrated, there must be doubts about the validity of tests with artificial queries.
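As a purely illustrative aside, and not part of any of the tests discussed, the minimal controlled comparison described above can be expressed as a simple factorial design: every combination of indexing language and depth is evaluated, so that a difference between languages A and B cannot be an artefact of an uncontrolled depth effect. In the Python sketch below, the evaluate() function is a hypothetical stand-in for a complete indexing-and-search run over a test collection; all names are illustrative assumptions.

    # Sketch of a 2 x 2 controlled comparison: two indexing languages
    # crossed with two indexing depths. All names are hypothetical.

    def evaluate(language: str, depth: str) -> float:
        # Hypothetical: index the collection with the given language at
        # the given depth, run the test queries, and return a single
        # performance figure (e.g. mean precision). Placeholder value.
        return 0.0

    languages = ["A", "B"]
    depths = ["I", "II"]

    # Evaluate every language/depth combination, so that depth is
    # systematically varied rather than left uncontrolled.
    results = {(lang, d): evaluate(lang, d)
               for lang in languages for d in depths}

    for (lang, d), score in sorted(results.items()):
        print(f"language {lang}, depth {d}: {score:.3f}")

A real test would replace evaluate() with the full experimental procedure; the point of the design is simply that each language is observed at each depth.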
Some tests, like the subset ones with 200 documents in Cranfield 2, or Newell and Goffman's, used specifically constructed document files, i.e. ones with a high density of related papers. Others, like van Oot et al.'s, used languages specially constructed for the test document set.

As far as the statistical aspects of testing are concerned, one of the most striking features of the tests of the period, taken as a whole, is the small number of requests used. For example, Sinnett used 22, Shaw and Rothman 9, Cohen et al. 14-33, Montague 29, 33 and 10, Melton 12, Spencer 1 (admittedly not a query in the ordinary sense), and