IRE
Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
The decade 1968-1978
their set of experiments, in one specific comparison on co-ordination
matching (Ref. 49, Figure 1.39), one performance graph ranges from recall 27
and precision 6 to recall 1 and precision 33, while another, far away, ranges
from recall 81 and precision 4 to recall 2 and precision 100, both having the
characteristic sagging shape. This is an enclosed area difference of several
hundred per cent. There are also considerable differences in the relative
locations of graphs for different projects. Thus by comparison with the
Aitchison et al. graphs just mentioned, Cleverdon101 gives co-ordination
matching figures generating graphs for recall 99 with precision 34 to recall 10
and precision 97, without sag.
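The "several hundred per cent" figure can be checked roughly from the endpoints quoted above for the two Aitchison et al. graphs. The sketch below assumes a straight line between each pair of endpoints, which ignores the sag and therefore overstates both areas; the function and variable names are illustrative and do not come from the source.

```python
def trapezoid_area(p_start, p_end):
    """Area under a straight line joining two (recall, precision) points."""
    (r1, p1), (r2, p2) = p_start, p_end
    return abs(r2 - r1) * (p1 + p2) / 2.0


# Endpoints (recall %, precision %) as quoted in the text for the two graphs.
lower_graph = ((27, 6), (1, 33))
upper_graph = ((81, 4), (2, 100))

area_lower = trapezoid_area(*lower_graph)
area_upper = trapezoid_area(*upper_graph)

# Relative difference in enclosed area, expressed as a percentage.
ratio = area_upper / area_lower
percent_difference = (ratio - 1.0) * 100.0

print(area_lower, area_upper, round(percent_difference))
```

Under this straight-line assumption the areas are 507 and 4108, a difference of roughly 700 per cent, which is at least consistent with the order of magnitude stated in the text.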
In interpreting their findings the authors of the various projects tended to
conclude that, other things being equal, different languages perform much
the same. Controlled languages are perhaps slightly superior, but natural
language is very competitive; Klingbiel and Rinker specifically found that
machine-aided indexing could be very successful. The more specific
conclusion drawn was that, especially where costs are concerned, natural
language, and particularly automatically scanned text, is a good bargain,
though absolute performance is not striking. However, the results obtained by
Yates-Mercer for a non-trivial document set, namely recall of 76 per cent
with precision of 77 per cent, apparently show that a much higher level of
performance is attainable than that generally achieved in the comparative
tests, or in service investigations like Lancaster et al.'s (relative recall 48.0
per cent, precision 59.3 per cent)89. Collectively, the implications of these
tests are those of the comparable tests of 1958-1968, namely that, where
dependent variables like exhaustivity are controlled, languages behave
similarly, and it is the other factors like exhaustivity and searching which are
much more important; languages matter only in relation to the system's
specific recall or precision performance objectives. The inverse relation
between recall and precision is again quite clear.
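The inverse relation can be illustrated with co-ordination matching, where a document is retrieved at level k if it shares at least k terms with the request. The following is a minimal sketch on a toy collection invented purely for illustration; the documents, terms, relevance judgements, and the function name are assumptions and are not data from any of the tests discussed.

```python
def coordination_levels(query, docs, relevant):
    """Return (level, recall, precision) for each co-ordination level.

    A document is retrieved at level k if it shares at least k terms
    with the query.
    """
    query_terms = set(query)
    scores = {d: len(query_terms & set(terms)) for d, terms in docs.items()}
    results = []
    for level in range(1, len(query_terms) + 1):
        retrieved = {d for d, s in scores.items() if s >= level}
        if not retrieved:
            break  # no document matches this many terms
        hits = len(retrieved & relevant)
        results.append((level, hits / len(relevant), hits / len(retrieved)))
    return results


# Toy collection, invented for illustration only.
docs = {
    "d1": ["wing", "flutter", "supersonic"],
    "d2": ["wing", "flutter"],
    "d3": ["wing", "boundary"],
    "d4": ["supersonic", "nozzle"],
    "d5": ["flutter", "panel"],
    "d6": ["heat", "transfer"],
}
query = ["wing", "flutter", "supersonic"]
relevant = {"d1", "d2", "d5"}

points = coordination_levels(query, docs, relevant)
for level, recall, precision in points:
    print(level, round(recall, 2), round(precision, 2))
```

In this toy case, raising the co-ordination level from 1 to 3 lowers recall from 1.0 to about 0.33 while precision rises from 0.6 to 1.0, which is the trade-off the tests repeatedly observed.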
Indexing tests
The studies of indexing were, as noted, especially concerned with exhaustivity:
see, for example, Cleverdon100, Keen51,52, and Schumacher et al.55, and
also Sparck Jones94. The general style of these tests was very like that of the
language studies just described, and indeed the two were often closely
connected as, for example, in Cleverdon and Keen. Schumacher et al.'s
experiment tested description exhaustivity over an exceptionally wide range,
his specific aim being to investigate the use of progressively longer sources
for controlled indexing, from titles, through abstracts, to the body of the text,
the sources being associated with increasing exhaustivity of index description.
The topic well illustrates the difficulties of testing since the use of different
sources to provide descriptions of differing exhaustivity may also introduce
quality variations. Schumacher et al.'s test is open to this criticism, as are
Aitchison et al.'s49 and Barker et al.'s53,54 studies of the use of different text
descriptions in machine searching, which may be viewed as exhaustivity
tests. Cleverdon, Keen, and Sparck Jones, for example, were more careful to
test for indexing from the same source. In form the tests were like those of the
main group, most using rather small data: Schumacher used 99 requests and
984 documents, for instance; but they differ too much in their detailed