Information Retrieval Experiment
The Cranfield tests
Karen Sparck Jones
Butterworth & Company
Vickery notes that normalized recall, used for the overall language ranking, is
not very realistic, since it does not depend on the cutoff point. At the same
time, commenting on the overall observations based on the ranking (namely that
normalized recall is low for concepts as opposed to single terms, that it is
low for generic single terms, and that it rises for generic concepts), he
remarks that the third observation is widely accepted, that the second is not
unexpected, but that the first is unexpected and so requires specific refutation
by the advocates of controlled languages. In Vickery's view,
`the volumes of this report are an impressive account of a complex piece of
research, undertaken with care and diligence. They give no final answers,
and their conclusions must be treated with caution, but they are a valuable
exploration of the retrieval process.' (p.340)
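(Vickery's point about cutoff independence is explicit in the usual definition
of normalized recall, due to Rocchio; the formula below is supplied here for
reference and is not quoted from the report. If the n documents relevant to a
question are retrieved at ranks r_1, ..., r_n in a collection of N documents,
then

\[
R_{\mathrm{norm}} \;=\; 1 \;-\; \frac{\sum_{i=1}^{n} r_i \;-\; \sum_{i=1}^{n} i}{n\,(N-n)} ,
\]

so every relevant document's rank in the whole ordering contributes and no
cutoff enters.)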
Following the line of his attack on Cranfield 1, Swanson21 suggests that the
relation between questions and relevant documents is much too close, owing to
features of the assessment procedure, especially its second step: the students
screening for additional documents relevant to a question used the documents
already known to be relevant to it, while the user was allowed to modify his
initial query after assessing the extra documents.
Swanson indeed maintains that, in the initial provision of questions, the
documents cited by the source paper, having been read, could have influenced
the verbalization of the question. However, a more serious criticism, in
Swanson's view, is implied by some facts about the additional relevant
documents: namely that bibliographic coupling gave a good many relevant
documents not identified by the student screeners. This suggests that a large
number of relevant documents were in fact missed altogether, the explanation
being that the students were poor screeners as they only used titles, while the
bibliographic coupling was only done at a high level. The implication is that
the results for the whole set of experiments may be unreliable.
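For readers unfamiliar with the device: bibliographic coupling relates two
documents in proportion to the number of references they cite in common, and
coupling 'at a high level' means that only pairs sharing many references are
accepted. A minimal sketch of the computation, in Python, with hypothetical
document and reference identifiers:

    # Bibliographic coupling: the coupling strength of two documents is
    # the number of cited references they share. All data here are
    # hypothetical, purely to illustrate the mechanism.

    def coupling_strength(refs_a, refs_b):
        """Number of references cited by both documents."""
        return len(set(refs_a) & set(refs_b))

    citations = {
        "paper_1": ["ref_a", "ref_b", "ref_c", "ref_d"],
        "paper_2": ["ref_b", "ref_c", "ref_d", "ref_e"],
        "paper_3": ["ref_e", "ref_f"],
    }

    # Coupling 'at a high level' corresponds to a high threshold: only
    # pairs sharing at least THRESHOLD references count as coupled.
    THRESHOLD = 3
    pairs = [("paper_1", "paper_2"), ("paper_1", "paper_3"),
             ("paper_2", "paper_3")]
    for a, b in pairs:
        s = coupling_strength(citations[a], citations[b])
        if s >= THRESHOLD:
            print(a, b, "coupled with strength", s)

Here only paper_1 and paper_2, which share three references, pass the
threshold; a lower threshold would admit more, and weaker, candidate pairs.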
This point is considered in detail by Harter22, who seeks to show, by
formal arguments applied to real data, first, that a good many relevant
documents were missed; second, that changes in the relative proportions of
missed to non-missed can affect recall/precision point values (as well as
values over a cutoff range), with the important consequence that relative
performance ratings can change; and third, using additional relevance data
for a sample of languages, that the picture of relative language merit given by
Cranfield 2 for these languages is changed when the additional relevance
information is utilized. Unfortunately, while Harter's general point is sound,
he indulges in some wild statistical extrapolation and very speculative global
statements about the Cranfield results. He makes a good case for the principle
that there may well be missed relevant documents unless evaluation is truly
exhaustive, and that omissions can affect performance; but his actual
investigation of the data suggests that it was mostly the less important result
for the title language, for which test biases are evident, that was really
affected in practice.
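Harter's central mechanism can be illustrated with figures that are
hypothetical, not taken from his paper. Recall is the proportion of all
relevant documents retrieved, so relevant documents missing from the
assessments inflate measured recall; and if the missed documents fall unevenly
across index languages, relative ratings can reverse. Suppose 10 relevant
documents are judged for a question and 5 more were missed, and that language
A retrieves 8 of the judged 10 but none of the missed 5, while language B
retrieves 7 of the judged and 4 of the missed. Then

\[
\text{judged only: } R_A = \tfrac{8}{10} = 0.80 > R_B = \tfrac{7}{10} = 0.70,
\qquad
\text{complete: } R_A = \tfrac{8}{15} \approx 0.53 < R_B = \tfrac{11}{15} \approx 0.73,
\]

and the ranking of the two languages is reversed.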
Other more general comments were made by, for example, Sharp and
Rees. Sharp23 notes that
`Cleverdon et al. . . . have qualified the basic recall/relevance thesis so that
its application now seems so limited as to be confined to those conditions
where . . . its truth is obvious.' (p.92)