…scheme, neatly described as based on `literary warrant', the main problem was in splitting terms into elementary units. Taken together, these details illustrate both how familiar problems of indexing recur in any environment, and how they can perhaps be handled for medium-sized systems.

The Report also gives statistical details about the indexing which are both of interest in themselves and relevant to the indexing performance observed. For example, for the indexing subprogramme for the last 6000 documents there were 2350 different notational elements for UDC, 2684 main alphabetical headings, 1686 facet notational elements, and 3174 Uniterm terms. For individual documents, variations in the numbers of terms assigned by the different indexers were noted, and the average numbers fell as the indexing time allowed was reduced. For example, average postings for journal articles were 7.6 for UDC at 16 min as opposed to 2.3 at 2 min, 4.7 and 2.2 respectively for alphabetical subject headings, 1.7 and 1.2 for facetted classification, and 11.0 and 6.0 for Uniterms (Table 6). To check the main indexing, alternative, independent (`supplementary') index descriptions were supplied by people outside the project. These suggested that the project indexing was well enough organized, but also tended to show a lack of agreement in the indexing done by different people for the same document (an overlap measure of the kind sketched below makes this notion concrete).

In the course of the indexing a good many low-level administrative decisions had to be taken, for example about index formats; and while the project was, as the Report notes, primarily concerned with intellectual questions, the account of the various low-level devices and procedures is very useful in illustrating the large amount of nitty-gritty involved in any substantial test, and the need for care in these aspects of testing. These processes incidentally provided useful information about the time taken in clerical operations, and also emphasize the scale of this early project: 100 000 cards were punched for the Uniterm indexing, for example.

In introducing the account of the searching, representing the test proper of the four indexing systems, in Volume 2 of the Report3, Cleverdon comments on the need for `a method which would enable an assessment to be made of the effects of the variables which had been built into the indexing' (p. 7). It was decided that though each group of 100 documents had distinctive characteristics, and so should be studied individually, testing over a 6000-document subprogramme set, and more specifically that of the final subprogramme reflecting the indexers' established indexing experience, would be sufficient. As Cleverdon notes, there was very little guidance available on the conduct of tests, the ASTIA Uniterm test of 1953 being both inadequately reported and evidently unsound methodologically in lacking controls, especially with respect to relevance decisions.
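The lack of agreement revealed by the supplementary indexing can be made concrete with a simple overlap measure over two indexers' term sets for the same document. The following Python sketch is purely illustrative: it is not a measure used by the project, and the term lists in it are invented rather than taken from the Report.

# Purely illustrative sketch: a simple overlap (Jaccard) measure of agreement
# between two indexers' term sets for the same document. The term lists are
# invented examples; they are not data from the Cranfield 1 Report.

def indexing_agreement(terms_a, terms_b):
    """Shared terms divided by all distinct terms either indexer assigned."""
    a, b = set(terms_a), set(terms_b)
    if not (a or b):
        return 1.0  # both indexers assigned nothing: treat as full agreement
    return len(a & b) / len(a | b)

# Hypothetical descriptions of one journal article by a project indexer and
# an outside 'supplementary' indexer.
project_terms = ["boundary layer", "heat transfer", "supersonic flow", "flat plate"]
supplementary_terms = ["heat transfer", "supersonic flow", "skin friction"]

print(f"agreement = {indexing_agreement(project_terms, supplementary_terms):.2f}")
# Prints 'agreement = 0.40': two shared terms out of five distinct ones.

Partial overlap of this kind, rather than outright contradiction, is what the supplementary indexing exercise tended to reveal.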
The organization of the search testing was thus strongly motivated by a desire to avoid getting `bogged down in the quagmire of arguments concerning relevancy' (p. 8), and also by the need for statistically valid results. The first requirement was met by using questions based on source documents and searching for these, the second by complicated (and, it must be said, very opaquely described) sampling procedures. A set of 1200 questions was therefore obtained by