Information Retrieval Experiment
The Cranfield tests
Karen Sparck Jones
Butterworth & Company
Criticisms of Cranfield 1
themselves than for its implications for evaluation due to the close
relationship between the question searched and the document sought:
`in a real situation, a source document generally does not exist. In
information-retrieval experiments, if specific documents are used as sources
of questions, any meaningful tests of retrieval-system effectiveness must be
made on new documents which can be presumed free of any unusual or direct
influence on the wording or nature of the question.' (p.6)
Swanson maintains that the source documents should have been excluded
since there might be a difference in retrievability between source and non-
source documents. He argues that the `bias' of the test is likely to have been
exacerbated by the very close verbal relation between source document titles
and questions, estimating from a sample he investigated that simply using a
machine-based match between questions and titles would have given recall
of 85 per cent. Thus if the different index language descriptions covered the
titles well, a high and similar level of recall would be reached whatever the
properties of the languages. Swanson moreover argues that a similar bias
existed in the Cranfield 1+ test, which also used source documents. Thus the
conditions of the test in both Cranfield 1 and Cranfield 1+ could not be
expected to distinguish the languages tested adequately, even if they are
genuinely distinguishable in performance.
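Swanson's title-matching point can be made concrete with a small sketch. The following Python fragment is a hypothetical reconstruction, not Swanson's actual procedure: the titles, question, and overlap threshold are invented purely for illustration. It shows how a crude word-overlap match between a question and document titles behaves: when a question is paraphrased from a source document's title, the shared vocabulary retrieves that document almost automatically, whatever the index language contributes.

    # A minimal sketch, assuming a simple word-overlap criterion; not
    # Swanson's actual matching procedure. If a question derives from a
    # source document's title, the shared vocabulary makes that document
    # easy to retrieve regardless of how it was indexed.

    def tokens(text):
        return set(text.lower().split())

    def title_match(question, titles, threshold=0.5):
        """Return titles sharing at least `threshold` of the question's words."""
        q = tokens(question)
        return [t for t in titles if len(q & tokens(t)) / len(q) >= threshold]

    # Hypothetical example: the question paraphrases the source title.
    titles = [
        'Heat transfer in laminar boundary layers',    # source document
        'Supersonic wing flutter at high altitude',
    ]
    question = 'heat transfer in laminar boundary layers at hypersonic speeds'
    print(title_match(question, titles))    # retrieves the source document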
As a corollary, Swanson argues, the results for indexing times are hardly
surprising, nor is the fact that maximum recall was approximated: if the real
match is in effect made on the document title, more sophisticated or extensive
indexing is unlikely to be useful.
With respect to indexer/searcher memory, Swanson contends that, although
indexing memory was held not to influence searching, the fact that the same
people were involved means that possible memory effects cannot be excluded
as an influence on the results. Equally, the indexers' lack of technical
knowledge might well not have been important, given the question/document link.
Finally, Swanson's view is that the heavily stressed result that the four
languages performed the same is of little value when recall is considered in
relation to precision; and he notes that the test data included figures for
the average number of documents retrieved which show that, assuming only one
of the retrieved documents is relevant, as is required, performance is very
different for the four languages, Uniterms being much superior to UDC, with
alphabetical and facet in between. Swanson maintains that the figures for
recall cannot therefore be taken at face value.
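The arithmetic behind this point is easy to lay out. If each search is assumed to find exactly one relevant document, recall is the same whenever that document is retrieved, but precision is simply the reciprocal of the number of documents retrieved, so languages retrieving very different numbers of documents diverge sharply on precision. The counts in the sketch below are hypothetical, chosen only to illustrate the calculation; they are not the actual Cranfield averages.

    # Illustrative only: with exactly one relevant document per search,
    # precision = 1 / (number retrieved). The counts are hypothetical,
    # not the Cranfield test data.

    avg_retrieved = {
        'Uniterm': 2,          # hypothetical average documents retrieved
        'alphabetical': 5,
        'facet': 8,
        'UDC': 20,
    }

    for language, n in avg_retrieved.items():
        precision = 1.0 / n    # one relevant document among n retrieved
        print(f'{language}: precision = {precision:.2f}')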
Swanson further points out that the supplementary Cranfield experiment
designed to test recall of non-source documents was defective in that its
relevance assessments were made by referring other documents to the source
document; and he notes that, given the different search procedure also used
and the absence of precision figures, comparison on recall with the main test
results is very dubious.
Swanson also argues that the Cranfield 1+ results do not support the
Cranfield 1 findings in any clear way, and in particular show different
performance for source and non-source documents. He nevertheless agrees
that
`so far as average behaviour of information-retrieval systems is concerned