likely to represent a persistent underlying reality. The methodological
inadequacies of many of the tests, as illustrated by Sinnett, Cohen et al., or
Montague, for example, are nevertheless very conspicuous. The CWRU
project was indeed specifically intended to constitute a study and development
of retrieval system testing methodology. The defects of many of the tests are
incidentally compounded for subsequent criticism by inadequate reporting,
for example about the number of documents searched.
Many experiments suffered from a general lack of control, both with
respect to the values of the variables of interest and those of other obviously
or possibly related variables. Thus if the variable of interest is the indexing
language, when only one is studied it is not obvious how far the resulting
performance should be attributed to the language itself. Conversely, since for
example indexing depth may be a dependent variable, depth should at least
be held constant, and preferably also systematically changed. A minimal test
would therefore compare languages A and B with respect to indexing depths
I and II. The Cranfield 2 project was deliberately intended to improve on
Cranfield 1 in such respects, and, as just noted, the CWRU project was
designed to make such properly controlled comparisons, and indeed included
subsidiary tests to validate the effectiveness of the controls. Other tests,
notable examples being several of those on links and roles, attempted
reasonably careful comparisons. A number of studies, though perhaps not
involving a high degree of control, included failure analysis. This was done
by van Oot, for instance, and on a large scale by Lancaster. Failure analysis
is not part of an experiment proper, but makes a very important contribution
to the broader study of retrieval system behaviour.
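
The minimal two-by-two design just described can be made concrete with a
short sketch. The following Python fragment is purely illustrative and not
drawn from any of the tests discussed: run_search, queries and relevance
are hypothetical placeholders standing for a particular collection, its
query set and its relevance judgements. What it demonstrates is the control
principle itself: every (language, depth) cell is scored on the same fixed
queries, so that any performance difference can be attributed to the
controlled variables rather than to incidental ones.

    from itertools import product

    def recall(retrieved, relevant):
        # Proportion of the known relevant documents that were retrieved.
        if not relevant:
            return 0.0
        return len(set(retrieved) & set(relevant)) / len(relevant)

    def run_search(language, depth, query):
        # Hypothetical placeholder: search the collection indexed with the
        # given language ('A' or 'B') at the given depth ('I' or 'II'),
        # returning a list of document identifiers.
        raise NotImplementedError

    def compare(queries, relevance):
        # Score every (language, depth) cell on the same query set,
        # holding all other factors constant across cells.
        results = {}
        for language, depth in product(('A', 'B'), ('I', 'II')):
            scores = [recall(run_search(language, depth, q), relevance[q])
                      for q in queries]
            results[(language, depth)] = sum(scores) / len(scores)
        return results

Averaging a single measure over the query set is of course a simplification;
the point about control holds for whatever performance measures a given test
in fact employed.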
Some authors, like Schuller, indeed comment on the problem of testing, or
at least recognize the limitations of their own tests, for example in sample
size. However, some particular methodological inadequacies recur in the tests
of the period, along with the specific failings of individual tests. These
defects can be categorized as, first, those concerned with the propriety of
the way a real system is being modelled; second, those concerned with
statistical aspects of the tests; and, third, those of evaluation.
In the first category the most noticeable deficiency is the wide use of
'bogus' queries, i.e. queries not put to the system in the ordinary way by its
users. In Cranfield 1 and tests influenced by it like that of Herner et al.,
source document questions were used, i.e. questions based on and designed
to retrieve specific documents; and in other tests, like those of Montague and
Cros et al., synthetic, made-up questions, assumed typical of real ones, were
used. The results obtained with real and artificial queries may not differ, but
where this has not been demonstrated, there must be doubts about the
validity of tests with artificial queries. Some tests, like the subset ones with
200 documents in Cranfield 2, or Newell and Goffman's, used specifically
constructed document files, i.e. ones with a high density of related papers.
Others, like van Oot et al.'s, used languages specially constructed for the test
document set.
As far as the statistical aspects of testing are concerned, one of the most
striking features of the tests of the period, taken as a whole, is the small
number of requests used. For example, Sinnett used 22, Shaw and Rothman
9, Cohen et al. 14-33, Montague 29, 33 and 10, Melton 12, Spencer 1
(admittedly not a query in the ordinary sense), and