themselves than for its implications for evaluation, due to the close relationship between the question searched and the document sought: `in a real situation, a source document generally does not exist. In information-retrieval experiments, if specific documents are used as sources of questions, any meaningful tests of retrieval-system effectiveness must be made on new documents which can be presumed free of any unusual or direct influence on the wording or nature of the question.' (p.6)

Swanson maintains that the source documents should have been excluded, since there might be a difference in retrievability between source and non-source documents. He argues that the `bias' of the test is likely to have been exacerbated by the very close verbal relation between source document titles and questions, and concludes from a sample he investigated that simply using a machine-based match between questions and titles would have given recall of 85 per cent. Thus if the different index language descriptions covered the titles well, a high and similar level of recall would be reached whatever the properties of the languages.

Swanson moreover argues that a similar bias existed in the Cranfield 1+ test, which also used source documents. Thus the conditions of the test in both Cranfield 1 and Cranfield 1+ could not be expected to distinguish the languages tested adequately, even if they are genuinely distinguishable in performance. As a corollary, Swanson argues, the results for indexing times are hardly surprising, nor is the fact that maximum recall was approximated: if the real match is in fact focused on the document title, more sophisticated or extensive indexing is unlikely to be useful.

With respect to indexer/searcher memory, Swanson contends that while indexing memory did not demonstrably influence searching, the fact that the same people were involved means that possible memory effects cannot be excluded as an influence on the results. The indexers' lack of technical knowledge might well not have been important given the question/document link.

Finally, Swanson's view is that the heavily stressed result that the four languages performed the same is of little value when recall is considered in relation to precision. He points out that the test data included figures for the average number of documents retrieved which show that, assuming only one of the retrieved documents is relevant, as the test design requires, performance is very different for the four languages, Uniterms being much superior to UDC, with alphabetical and facet in between. Swanson maintains that the figures for recall cannot be taken at face value.
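Swanson's point about a machine-based question/title match can be made concrete. The sketch below is a minimal reconstruction for illustration only, not his actual procedure: it scores each title by simple word overlap with the question and reports the recall of the known source documents. The function names, the tokenization, and the overlap cutoff are all assumptions.

```python
# Minimal sketch of a machine-based question/title match
# (illustrative reconstruction, not Swanson's actual procedure).

def tokenize(text):
    """Lower-case the text and split it into a set of words."""
    return {word.strip(".,;:?!").lower() for word in text.split()}

def title_match(question, titles, cutoff=2):
    """Return the indices of titles sharing at least `cutoff`
    words with the question (a crude verbal-overlap match)."""
    question_words = tokenize(question)
    return [i for i, title in enumerate(titles)
            if len(question_words & tokenize(title)) >= cutoff]

def source_recall(questions, source_ids, titles, cutoff=2):
    """Fraction of questions whose source document appears among
    the title-matched documents, i.e. recall of source documents."""
    hits = sum(1 for question, source in zip(questions, source_ids)
               if source in title_match(question, titles, cutoff))
    return hits / len(questions)
```

On questions derived from source-document titles, even so crude a match can score highly, which is exactly Swanson's worry: a high recall figure may reflect the question/title link rather than any property of the index languages being compared.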
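The precision argument rests on simple arithmetic. In Cranfield 1 each question has exactly one relevant document, its source, so the precision of a search that finds it is fixed entirely by the number of documents retrieved. The following worked illustration uses hypothetical averages, since the Cranfield figures are not reproduced here:

```latex
% With exactly one relevant document per question, a search that
% retrieves n documents including the source has precision
\[
  P \;=\; \frac{\text{relevant retrieved}}{\text{total retrieved}}
    \;=\; \frac{1}{n}.
\]
% Hypothetical illustration (not the Cranfield averages): if Uniterm
% searches retrieve n = 5 documents on average while UDC searches
% retrieve n = 50, then at the same recall
\[
  P_{\mathrm{Uniterm}} = \tfrac{1}{5} = 0.20,
  \qquad
  P_{\mathrm{UDC}} = \tfrac{1}{50} = 0.02,
\]
% a tenfold precision difference that equal recall figures conceal.
```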
Swanson further points out that the supplementary Cranfield experiment designed to test recall of non-source documents was defective in using relevance assessments made by comparing other documents with the source document, and he notes that, given the different search procedure also used, and in the absence of precision figures, comparison for recall with the main test results is very dubious. Swanson also argues that the Cranfield 1+ results do not support the Cranfield 1 findings in any clear way, and in particular show different performance for source and non-source documents. He nevertheless agrees that `so far as average behaviour of information-retrieval systems is concerned