Information Retrieval Experiment
Chapter: Retrieval system tests 1958-1978
Karen Sparck Jones
Butterworth & Company
of a range of subtests, the figures are an illustrative selection; moreover,
when ranges of performance figures are given, these may, for projects with
rather heterogeneous subtests, be only for the strictly comparable alternatives
of a single subtest.
The variation in individual project findings is well illustrated by
Montague's different experiments, where precision ranged from 4-9 per cent
in one case, with relative recall 83-31 per cent, to 4-74 per cent and 93-31
per cent respectively in another. Considering the tests comparable in type,
i.e. in objective and form, the tests conducted by Schuller, Altmann, and
Shaw and Rothman, with Cranfield 1+, can be considered as a group, along
with those of Sinnett, Cohen et al., Montague, and van Oot et al. on links and
roles. The individual projects report differences in precision ranging from
12.5-41.7 per cent (Schuller), 51.4-96.5 per cent (Cohen et al., for variable
sets), 4-74 per cent (Montague) or 67.3-88.7 per cent (Altmann), and
relative recall from 31-93 per cent (Montague) or 80.7-100 per cent (Cohen
et al.). Taking precision and (relative) recall together for comparable data
runs in multi-test projects we get such variations as 12.5 per cent precision
and 73.1 per cent recall to 41.7 per cent precision and 77.4 per cent recall
(Schuller), 42 per cent precision and 57 per cent recall to 55 and 66 per cent
(Shaw and Rothman), 57.4 per cent precision and 100 per cent recall to 94.0
and 80.7 per cent (Cohen et al., with variable request sets), and 70 per cent
precision and 84 per cent recall to 90 and 77 per cent (van Oot et al.). The
operational non-comparative studies of Herner et al., Melton, and Lancaster
were broadly in the 50-60 per cent range for both recall and precision. For
recall (sensitivity) alone, CWRU results ranged from 16 to 98 per cent, while
the normalized recall results for Cranfield 2 ranged from 44.6 to 65.8 per cent.
It may be noted that the specificity results for CWRU ranged from 12 to 98
per cent. Over all these tests taken together, results range from a low of 4 per
cent (Montague) to a high of 96.5 per cent (Cohen et al.) in precision, and, as
far as the comparison is proper, from 31 to 100 per cent in recall (Montague
and Cohen et al. respectively).
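The measures quoted in these comparisons all derive from a simple retrieval contingency table, and normalized recall (as reported for Cranfield 2) is the usual rank-based variant. A minimal sketch, with hypothetical counts not drawn from any of the projects cited:

```python
# Measures from a 2x2 retrieval contingency table (all names and
# counts here are illustrative, not taken from the studies cited):
#   a = retrieved and relevant      b = retrieved and non-relevant
#   c = missed but relevant         d = missed and non-relevant
def retrieval_measures(a, b, c, d):
    precision = a / (a + b)       # proportion of retrieved items that are relevant
    recall = a / (a + c)          # sensitivity: proportion of relevant items retrieved
    specificity = d / (b + d)     # proportion of non-relevant items correctly rejected
    return precision, recall, specificity

# Rank-based normalized recall of the kind reported for Cranfield 2:
# 1.0 when all relevant documents are ranked at the top of the output,
# 0.0 when they are all ranked at the bottom.
def normalized_recall(relevant_ranks, collection_size):
    n = len(relevant_ranks)
    ideal = sum(range(1, n + 1))  # best possible rank sum: 1 + 2 + ... + n
    return 1 - (sum(relevant_ranks) - ideal) / (n * (collection_size - n))

# Hypothetical run: 30 relevant retrieved, 20 non-relevant retrieved,
# 10 relevant missed, 940 non-relevant correctly rejected.
p, r, s = retrieval_measures(a=30, b=20, c=10, d=940)  # p = 0.6, r = 0.75
```

The contingency-table view also makes clear why precision and recall can vary so independently across the projects above: they are ratios over different margins of the same table.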
These variations can or could, as noted, be accounted for partly by
methodological differences and partly, of course, by the real properties of the
languages being investigated, or the values of dependent variables like
indexing exhaustivity; they might also be attributable to environmental factors
like collection subject area. These points are more fully discussed later. It
must nevertheless be emphasized that the variations are not wholly
explicable: if they were, we would know how to design information retrieval
systems; and the sheer scale of observed performance variation is worth
noticing.
The interpretations of the findings are equally varied, though with the
more limited tests there is a natural tendency for their authors to conclude
that whatever was to be demonstrated has been demonstrated. For example,
Shaw and Rothman conclude that roles and links are not needed, while
Schuller, testing novel Uniterms against UDC, finds Uniterms superior,
though he concedes the complementary utility of UDC. However in some
cases the results, like those of Cranfield 1, were contrary to expectation, and
in the more broadly ranging comparative tests, like Cranfield 2 and CWRU,
the results were surprising: in the first that natural language is competitive,
and in the second that the indexing language is not very important.