IRE
Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
based, and so are connected with other statistical approaches to retrieval.
The statistical work of 1968-1978 is in turn linked with that of the earlier
decade. As noted, much of the automatic indexing research done between
1958 and 1968 did not progress as far as evaluative performance testing. The
early 1970s saw reports on statistical term cluster evaluation by Vaswani and
Cameron78 and Sparck Jones79,80, and recent experiments by Harper and
van Rijsbergen81 have specifically combined the use of term associations
with that of relevance weights. Other rather crude methods of non-statistical
automatic (or in principle automatic) indexing are represented by O'Connor's
work on passage retrieval82,83 and by Atherton's BOOKS project84.
Klingbiel and Rinker85 and Evans86 report tests of semi-automatic indexing
involving some reference to a dictionary or thesaurus. In general statistical
clustering has proved very disappointing, and the main thrust of statistical
work has been on the more promising weighting. For the purposes of
discussion we can therefore consider two groups of tests: those on automatic
and especially statistical indexing not involving relevance information, and
those on relevance feedback and weighting.
Some of the language, indexing, and searching tests were carried out in the
context of operational services, for example by Aitchison et al., Barker et al.,
Olive et al. and, recently, Cleverdon87. There have also been more restricted
investigations, rather than experiments proper, relating to services, such as
those carried out by Rowlands88, Lancaster, Rapport and Kiffin Penry89,
Leggate et al., Hansen90, Simkins91, and Pollitt92. Such operational tests
were often concerned, and perhaps more than those of the previous decade,
with cost efficiency as well as performance effectiveness, and some studies,
like that of Katzer57, have been wholly devoted to costs. The increasing
volume of information and development of information services have also
been matched by a corresponding growth of user studies, data base coverage
investigations, and so on. There have also been many bibliometric studies,
some of a very academic character.
Overall during this period we can detect two major strands in testing,
reflecting an increasing divergence between the concerns of operational
system managers and those of research workers. Projects under the first head
concentrated initially on indexing languages and then on the related topics
of indexing and searching. Research workers have also concentrated
increasingly on searching, as in the relevance weighting experiments, but
within the framework of theoretical approaches implying sophisticated
procedures like output ranking. It is therefore paradoxical that some research
findings which appear particularly suited to modern computer systems should
have made no impact on the operational scene.
From a more intellectual point of view it will be evident that we can
combine the five topic groups listed to form two broader groups of test which
in fact continue the previous decade's interests in manual and automatic
systems respectively. Thus the work on index languages, on indexing, and on
user searching strategies is all oriented towards manual systems or the human
elements of automatic systems. The work on statistical or other 'mechanical'
forms of indexing, and on statistical and 'mechanical' techniques for query
modification, on the other hand, is a continuation of the automatic indexing
research of the 1960s. In what follows, these two broad groupings should be
borne in mind, though the detailed discussion is more conveniently and,