Information Retrieval Experiment
Retrieval system tests 1958-1978
Karen Sparck Jones
Butterworth & Company
document retrieval research with much earlier work and with current
research on non-bibliographic databases.
12.6 Conclusion on 1968-1978
Overall, when we look at the evaluation tests of the decade from a substantive
point of view, we can see on the one hand a rounding out of the work done in
the previous decade, and on the other one possible line of non-conventional
performance improvement. The experiments as a whole show simple
indexing to be as good as sophisticated indexing as far as the language is concerned, with
the actual treatment of the query more important as a determiner of
performance than anything else, and performance in any case very difficult
to raise above a broad 50 per cent precision-50 per cent recall level. It
appears, more specifically, that the requirement to be met to reach this level
is that of adequate exhaustivity of indexing, primarily of the query (unless
the user's only concern is with precision). The relevance weighting techniques
studied during the decade may not raise performance above the middling
level apparently at best attainable in practice, but they may provide a very
helpful and cheap way of raising performance from the lower levels likely to
be actually attained in practice. It is however the case that the substantive
remarks that can be made about the results obtained between 1968 and 1978
are, like those of the previous decade, very general, and the more novel ones
are rather tentative.
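The two notions invoked above can be made concrete with a minimal sketch. The precision and recall definitions are standard; the relevance weighting function shown is the 'point-5' formulation of Robertson and Sparck Jones (1976), given here only as one representative of the family of techniques studied in the decade, not as a reconstruction of any particular test.

```python
import math

def precision_recall(retrieved, relevant):
    """Precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def relevance_weight(r, n, R, N):
    """Term relevance weight ('point-5' form, Robertson/Sparck Jones 1976).
    r: relevant documents containing the term, n: documents containing the
    term, R: known relevant documents, N: collection size."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

# Illustration of the 'broad 50 per cent precision-50 per cent recall' level:
# 10 documents retrieved, 5 of them among the 10 known relevant documents.
p, r = precision_recall(range(10), range(5, 15))
# p == 0.5, r == 0.5
```

A term occurring more often in relevant than in non-relevant documents receives a positive weight, which is the mechanism by which relevance feedback can cheaply lift performance from a poor starting level, as the text notes.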
Methodologically, the tests taken together show some improvement over
those of the previous decade, but regrettably not enough. A particular
contribution has been made by the use of individual test collections by more
than one project, and of several collections by individual projects. In the first
case the Cranfield 2 data especially have been widely used, but other test
collections like Keen's ISILT one and a UKCIS one have also been utilized
by more than one project. This does not of course mean that any defects of
the data as created are removed, but at least the results of one project can be
related to others, and equally, particular results supported by related tests,
including those involving different methods of performance representation.
The Smart Project was a pioneer of multi-collection tests, and the importance
alike of those tests showing different results for different collections and those
showing the same result for different collections cannot be overestimated.
Sparck Jones has also used a range of increasingly large test collections in
laboratory experiments.
As noted earlier, a good many of the tests of the decade exhibited more
careful control of both primary and secondary variables, and a concern with
the validity of findings which has led to a wider application of statistical
significance tests, as for example by the Smart Project and by Keen. (It must
nevertheless be admitted that the basis for applying significance tests to
retrieval results is not well established, and it should also be noted that
statistically significant performance differences may be too small to be of
much operational interest.)
Unfortunately, the tests of 1968-1978 still show many methodological
deficiencies. Thus it is to be regretted that in Yates-Mercer's test of relational
indexing indexers and searchers were not sufficiently independent; tests like