IRE
Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
Conclusion on 1958-1968
the recall-precision line. These conclusions were endorsed by the multi-
collection tests done in the later part of the decade by the Smart Project.
Methodologically the Cranfield 2 project showed how informative testing
required a more systematic breakdown of a retrieval system into its various
factors than was common earlier. Gross comparisons between distinct
languages were replaced by a much more detailed study of recall and
precision devices generating families of languages. The CWRU project was
similarly directed toward a much more careful treatment of system factors in
a range of comparative experiments than was generally adopted. However it
is interesting to note that the attempt to ensure control in CWRU led to new
difficulties, in this case to a very artificial and perhaps perverted treatment of
queries, i.e. maintaining constant queries for different languages tended to
suppress distinctive features of the languages themselves. The same trend is
well shown in the Smart Project work which by 1968 was well into a very
large range of detailed studies. In this case the emphasis on automatic systems
provided not only new opportunities for system design, for example in
permitting ranked output, but also ones for system testing in the comparative
ease with which grinding tests over ranges of slightly different variable
values could be conducted, and in the application of complex measurement
and statistical evaluation techniques. However the sheer proliferation of
explicit parameter settings served to bring out not only the increasing
numbers of runs needed for proper comparative experiments, but the
difficulties of ensuring a meaningful experimental design.
The Cranfield 2 and CWRU projects in many ways looked backward,
seeking to improve on the initial index language tests of the decade. But they
also, in the challenge implied by the comparative flatness of their findings,
and in their methodological quality, presented a reference point for the work
of the next decade. The Smart Project, while sharing this character to some
extent, is a more genuine pointer to future work in more thoroughly
embracing the possibilities offered by the computer, particularly for
sophisticated search strategies and non-conventional methods of massaging
term descriptions, for example by numerical weighting and feedback
techniques.
But even these projects suffered, as already indicated, from many
limitations; and the general character of the testing done in the field during
the decade is well described by Saracevic:
'At present real and productive testing of total retrieval systems, taking
into account and controlling all inside and environmental factors, is not
feasible and not possible. At present, it seems that generalisable, formal,
quantified results of high validity and reliability on all or even on the
majority of factors affecting the performance of retrieval systems cannot
be attained.
The reasons are fairly evident. There is an absence of a well-formulated
theory taking into account all or a majority of the factors operating on
retrieval systems. There is only an intuitive understanding of objectives of
retrieval systems - thus, the measures indicative of the achievement of
objectives are not totally reflective of the real objectives and not
comprehensive. There is an inadequate knowledge of processes involved
within or outside the IR Systems and, without a thorough understanding of
processes, comprehensive testing is unattainable. There is a lack of
standardised methodologies for experimentation, which precludes testing.