… Herner et al., for calculating recall10. Document samples were better, but were sometimes small: for example, Cranfield 2 used 200 documents in many experiments, and Newell and Goffman 210.

In the measurement of performance, the most pervasive methodological inadequacy is the arbitrary treatment of recall, or at any rate the rather particular interpretation given to it, without any awareness of possible bias. Newell and Goffman, and Melton, for instance, define recall as the retrieval of cited documents; Cranfield 1 and 1½ (Warburton and Cleverdon), as recall relative to a source document; the Syntol group, as recall for automatic abstracts relative to manual ones; and Lancaster, as recall relative to an independently obtained set of relevant documents. In other cases recall is measured relative to the pooled output of alternative searches, but then performance for an individual language being tested depends on the character of the different pool contributions, which may not be strictly comparable. Individual tests moreover reveal a variety of other dubious procedures: for instance, Cohen et al. compare various link/role combinations using different numbers of queries, and even in the CWRU tests9, in many ways a model, results are lumped together oddly, for example those for different indexing languages being combined to provide performance figures for different indexing sources.

Overall, the tests taken together can only support the broadest and most tentative conclusions: the variation in data was vast, and the performance measures used were not only directly incomparable, for instance where one project uses precision and another opts for specificity, but incomparable in more subtle ways, for example in averaging technique. Moreover, as relative recall depends on an 'arbitrary' base, it can give very different results according to that base: specifically, values will be absolutely higher for languages with a similar performance than for those with a different, but complementary1, performance.

Automatic indexing tests

As noted earlier, the character of work in automatic indexing was rather different from that done on manual indexing. The many theoretical and computational problems involved meant that more work had to be put into simply establishing the feasibility of procedures and the prima facie plausibility of results. There were therefore more studies of a non-evaluative kind, and fewer evaluative ones.

This is not the place to review work on automatic indexing in detail. Briefly, it was primarily concerned, on the one hand, with statistical methods of identifying, by extraction from text, words representing individual documents or sets of documents, and on the other with statistical methods of recognizing relations between words to supply substitute or additional search keys. The work was very much within the framework of post-coordinate indexing, and was chiefly devoted to the examination and programming of statistical methods.
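To make concrete the kind of statistical procedure these studies were programming, the sketch below selects index terms for a document purely by within-document word frequency after discarding common words. It is a minimal illustration under assumed choices (the stop-list, the length filter and the cut-off are all invented here), not a reconstruction of any particular project's method.

```python
from collections import Counter
import re

# Illustrative common-word list only; systems of the period used much longer stop-lists.
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "to", "is", "are",
              "for", "on", "by", "with", "that", "as", "from"}

def extract_keywords(text, max_terms=10):
    """Frequency-based extraction indexing, reduced to its simplest form.

    Tokenize the text, drop common and very short words, count what remains,
    and keep the most frequent words as post-coordinate search keys.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)
    return [term for term, _ in counts.most_common(max_terms)]

if __name__ == "__main__":
    sample = ("Statistical indexing selects content words from document text; "
              "frequent content words are then taken as index terms for searching.")
    print(extract_keywords(sample))
```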
Related research concentrated on simpler methods of keyword selection, for example by text location, as in O'Connor's work221, or on the application of statistical techniques to the assignment of items from a manual indexing vocabulary, as in Gotlieb and Kumar's test39. Closely related ideas studied were those of term weighting and output ranking.
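As a rough illustration of how term weighting and output ranking fit together, the sketch below weights each term by its scarcity across a document collection and ranks documents by the summed weights of the query terms they match. The logarithmic weight and the toy collection are assumptions made for the example; they stand for the general idea rather than for any specific scheme used in the tests discussed.

```python
import math
from collections import Counter

def collection_weights(docs):
    """Assign each term a scarcity weight: terms occurring in few documents score high.

    This is one simple weighting choice; many variants were studied.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequencies
    return {term: math.log(n / df[term]) for term in df}

def ranked_output(query, docs, weights):
    """Rank documents by the summed weights of the query terms they contain."""
    scored = []
    for i, doc in enumerate(docs):
        score = sum(weights.get(t, 0.0) for t in set(query) & set(doc))
        scored.append((score, i))
    return sorted(scored, reverse=True)

if __name__ == "__main__":
    docs = [{"indexing", "language", "recall"},
            {"automatic", "indexing", "term", "weighting"},
            {"recall", "precision", "measurement"}]
    weights = collection_weights(docs)
    print(ranked_output({"indexing", "recall"}, docs, weights))
```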