IRE
Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
The decade 1968-1978
common in the 1970s, for unweighted as well as weighted searching. It is
customary in Smart, for example.
Unfortunately, proper comparative experiments in searching present
many problems. As Keen notes in describing EPSILON103, and discusses
more fully in Chapter 8 in this volume, the difficulties of designing and
conducting satisfactory experiments in manual searching are very great; and
Barraclough et al.'s study of users' search behaviour59, for example, was a
very simple observational one. It is also not clear how radically different
logics, like boolean and ranking ones, should be compared: indeed Evans61,62
asks whether such comparisons are meaningful (see also Chapter 14 in this
volume). It should be noted too, that evaluating recall for very different
strategies using pooled output may introduce dangerous biases.
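One way such a bias can arise is sketched below. This is a minimal illustration only: the document identifiers, the relevance judgements and the two outputs are all hypothetical, and the sketch shows just one possible source of distortion, namely that relevant documents retrieved by no strategy never enter the pool.

```python
def recall(retrieved, relevant):
    """Proportion of the given relevant set that appears in the retrieved set."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical collection: six documents are relevant in fact.
true_relevant = {"d1", "d2", "d3", "d4", "d5", "d6"}

# Hypothetical outputs of two very different strategies.
outputs = {
    "boolean": {"d1", "d2", "d7"},
    "ranked": {"d1", "d2", "d3", "d4", "d8"},
}

# Pooling: only documents retrieved by at least one strategy are judged,
# so d5 and d6 are never identified as relevant.
pool = set().union(*outputs.values())
pooled_relevant = pool & true_relevant

pooled = {name: recall(out, pooled_relevant) for name, out in outputs.items()}
true = {name: recall(out, true_relevant) for name, out in outputs.items()}

for name in outputs:
    print(f"{name}: pooled recall {pooled[name]:.2f}, true recall {true[name]:.2f}")
```

In this made-up case both strategies look better against the pool than against the full relevant set, and the strategy contributing most of the pool appears to reach perfect recall.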
The searching tests done between 1968 and 1978 were chiefly concerned on
the one hand with search logics, and on the other with searcher behaviour,
in both cases in operational environments. The many experiments on query
modification by relevance feedback, like those carried out by the Smart
Project, are, as mentioned earlier, more naturally considered under the
heading of automatic indexing, since the role of the user in detailed decision
making in query modification is very limited, and numerical calculations of
a kind justifying a computer are ordinarily involved in the searching.
At the detailed level of objective, form and result, the tests on manual
searching have rather little in common, not surprisingly considering the
complexity of the topic. Both systematic comparisons and generalizations
about them can therefore only be limited. The tests on search logics included
Evans'61,62 and Miller's68-70 experiments, and also Katzer's cost-oriented
investigation57. Some of Aitchison et al.'s work49 is also relevant, as is that
of Cleverdon101. Studies of searcher behaviour were carried out during the
period by Barber, Barraclough and Gray104 and Barraclough et al.59. Olive
et al.'s test comparing manual scanning for SDI with automated searching
was a 'semi' search test50; and service studies like Lancaster et al.89, Leggate
et al.58, and the UKCIS investigation53 involved examination of search
specifications and searching.
The focus of Evans' experiment (see Chapter 14 in this volume) was to
compare a range of different term or term group weighting schemes involving
ranked output, and also boolean specifications; of Miller's, to compare the use
of weights (in fact relevance weights) with boolean searches; and of Katzer's, to
compare 'grades' of boolean logic, specifically for cost. Cleverdon compared
boolean searching with co-ordination ordering, and Aitchison et al. indirectly
compared boolean and co-ordination-ordered searching, and also different
query formulations (which they called strategies), broad, medium and
narrow. Olive et al. compared automatic with human scanning for current
awareness, and Barber et al. compared users and experts as searchers. Barraclough et al.
simply observed user behaviour in searching online, and the other
investigative projects like the UKCIS one just noted features of searches.
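The contrast between boolean searching and co-ordination ordering that several of these tests examined can be sketched as follows. This is a minimal illustration under stated assumptions, not a reconstruction of any of the cited experiments: the documents, index terms and query are hypothetical, the boolean request is taken to be a simple conjunction of all query terms, and the co-ordination level is taken to be the number of query terms a document shares with the request.

```python
def boolean_and(docs, query_terms):
    """Return the documents indexed by every query term (a strict conjunction)."""
    return sorted(
        doc_id for doc_id, index_terms in docs.items()
        if query_terms <= index_terms
    )


def coordination_order(docs, query_terms):
    """Rank documents by co-ordination level (number of matching query terms).

    Documents matching no term are dropped; ties are broken by document id.
    """
    levels = [
        (len(query_terms & index_terms), doc_id)
        for doc_id, index_terms in docs.items()
    ]
    matched = [(level, doc_id) for level, doc_id in levels if level > 0]
    return sorted(matched, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical documents and their index terms.
docs = {
    "d1": {"retrieval", "evaluation", "recall"},
    "d2": {"retrieval", "indexing"},
    "d3": {"recall", "evaluation"},
    "d4": {"chemistry"},
}
query = {"retrieval", "evaluation", "recall"}

print(boolean_and(docs, query))         # ['d1']
print(coordination_order(docs, query))  # [(3, 'd1'), (2, 'd3'), (1, 'd2')]
```

The conjunction returns a single unordered set, whereas co-ordination ordering returns every partially matching document in decreasing order of match, so the searcher can choose where to stop.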
The motivation for the comparative tests in the first group was to evaluate
the less restrictive, especially weighting, schemes; that of the second group was to evaluate
the simpler, less expensive approaches, like automatic searching in Olive et
al.'s study, or having the user rather than an expert search in Barber et al.'s test.
The common assumption was that the simpler approaches were adequate.
In form the tests followed the general pattern, the main feature of interest