Information Retrieval Experiment
The Cranfield tests
chapter
Karen Sparck Jones
Butterworth & Company
Cranfield 2
detailed figures and graphs presented in Volume 2. However, since relying
on only one form of measure might be dangerous, fallout figures were worked
out for the main runs; and an alternative representation of recall and precision
using ranking rather than levels, the so-called 'document output cutoff'
method, was also supplied for the main run outputs.
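The three measures mentioned can be sketched as follows. This is a minimal illustration with invented sets, not the Report's own computations; Cranfield 2 tabulated these figures over many runs and levels.

```python
# Illustrative sketch only: recall, precision and fallout for a single search,
# computed from hypothetical retrieved and relevant document sets.

def evaluate(retrieved, relevant, collection_size):
    """Return (recall, precision, fallout) for one search over a collection."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)          # relevant documents retrieved
    non_relevant_total = collection_size - len(relevant)
    recall = hits / len(relevant)             # proportion of relevant retrieved
    precision = hits / len(retrieved)         # proportion of retrieved that are relevant
    fallout = (len(retrieved) - hits) / non_relevant_total  # non-relevant retrieved
    return recall, precision, fallout

# Invented example: 3 of 4 relevant documents found among 10 retrieved,
# in a 200-document collection (the size of the smaller Cranfield collection).
r, p, f = evaluate(range(10), [0, 1, 2, 100], 200)
```

Fallout is the measure that guards against the "dangerous" reliance on recall and precision alone, since it registers how much of the non-relevant collection a strategy drags in.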
The discussion of these extremely difficult issues in the Report is important
both in showing the attention paid to the question by the project, and in
emphasizing their intractability for any project.
It is impossible here to do more than refer briefly to the great mass of
individual results presented in Volume 2: in providing this detail for reader
study the Cranfield 2 Report is much superior to that for Cranfield 1. It is
sufficient to note that the main results fall into 9 groups: the first group (4.1)
gives performance for the 221 x 1400 and 42 x 200 collections for several
single term languages, supporting the authors' view that the smaller
collection could justifiably be used for most of the experiments; the second
(4.2) compares all the recall devices for single terms for the 42 questions and
200 documents, showing some loss of performance with the most gross term
reduction; group 4.3 tests concepts and themes, for small query sets but 1400
documents, showing not much difference in performance; group 4.4 examines
exhaustivity levels with single terms, for both large and small collections,
again showing not much variation in performance; group 4.5 studies search
rules for the single term languages, for small query sets but 1400 documents,
suggesting some superiority in the more stringent strategies; the concept
languages compared for the 42 x 200 collection in 4.7 show large performance
variations, and this is also true of the controlled languages compared in 4.8
for this collection; abstracts and titles regarded as indexing languages are
compared in 4.9, again for the 42 x 200 collection, showing abstracts inferior.
Section 4.6 covers a secondary variable comparison on the different relevance
grades, for the single term languages.
To obtain an overview of the co-ordination level results some comparisons
are made of performance at specific co-ordination levels: for example, for the
42 x 200 collection at co-ordination level 3 (Figure 6.10T), the various single
term languages range from recall 66.7 per cent with precision 14.8 per cent
for the simplest language to 82.3 per cent and 7.4 per cent for the most
'condensed' hierarchical one. For co-ordination 2 for the concept languages
(Figure 6.12T), deemed comparable with level 3 for the simple terms,
performance ranges from recall 84.8 per cent with precision 6.1 per cent for
the most condensed to recall 14.1 per cent with precision 50.9 per cent for the
given basic indexing language, while for controlled indexing at level 2 (Figure
6.14T) performance ranges from recall 68.7 per cent with precision 12.6 per cent
for the basic to recall 94.4 per cent with precision 5.1 per cent for the most
condensed descriptions. The picture is of low recall and high precision for
concepts, higher recall and lower precision for single terms, and highest
recall and lowest precision for controlled. Comparing the graphs for the most
basic members of the three classes shows single terms and controlled very
similar, with concepts having very much lower recall (Figure 6.1P); however
when the best members of each class are taken performance is very similar,
with single terms probably superior to controlled and definitely superior to
concepts (Figure 6.2P).
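The mechanics of co-ordination level searching can be sketched as follows. The index data here is invented; the point is only that a document is retrieved at level k if it shares at least k index terms with the query, so that lowering k raises recall at the expense of precision, which is the trade-off visible in the figures just cited.

```python
# Hedged sketch of co-ordination level searching (terminology from the text;
# the document index below is invented for illustration).

def retrieve_at_level(query_terms, index, k):
    """Return ids of documents sharing at least k index terms with the query."""
    q = set(query_terms)
    return {doc_id for doc_id, terms in index.items()
            if len(q & set(terms)) >= k}

index = {
    1: ["wing", "flutter", "supersonic"],
    2: ["wing", "boundary", "layer"],
    3: ["heat", "transfer"],
}

query = ["wing", "flutter", "supersonic"]
hits3 = retrieve_at_level(query, index, 3)  # strictest: all three terms must match
hits1 = retrieve_at_level(query, index, 1)  # loosest: any one term suffices
```

Stepping k down from 3 to 1 admits document 2 alongside document 1, enlarging the retrieved set and hence trading precision for recall.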
The main aim of the alternative document output cutoff representation