performance of various processing options in SMART. Obtrusiveness may,
however, have some effect on a diagnostic microevaluation since certain
system components may benefit from the spotlight effect while others are
unable to benefit. The obtrusiveness of a study cannot in any way minimize
failures attributable to the database searched (i.e. indexing and vocabulary
failures) but it might reduce failures relating to the exploitation of the
database since a searcher, knowing he is observed, may put more effort into
his interaction with the user and into the construction of the search strategy
itself.
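To make the distinction concrete, a minimal sketch follows of how the failures uncovered in a diagnostic microevaluation might be tallied by presumed cause. The category names and counts are invented for illustration and are not taken from any particular study.

```python
from collections import Counter

# Hypothetical failure analysis: each failure observed in an evaluated search
# is assigned to the system component presumed responsible. The categories
# below are illustrative only; an actual study would define its own.
failures = [
    "indexing",          # item indexed incompletely or too broadly
    "vocabulary",        # controlled vocabulary lacked a needed term
    "search strategy",   # searcher's formulation missed relevant terms
    "user interaction",  # request statement did not reflect the real need
    "indexing",
    "search strategy",
]

tally = Counter(failures)
total = len(failures)
for cause, count in tally.most_common():
    print(f"{cause:16s} {count:3d}  ({count / total:.0%} of failures)")

# Obtrusiveness might depress the last two categories (the searcher tries
# harder when observed) but not the first two, which are fixed properties
# of the database being searched.
```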
The evaluation of an operating information system usually requires many
more compromises than the evaluation of an experimental system. To begin
with, we probably don't want to evaluate all searches conducted (even all
conducted within a restricted time period) but only a sample of these
searches. Ideally we would like to draw these searches completely at random.
But in a national system, with potential users spread over great distances, a
purely random assignment may be impracticable. The difficulties of dealing
remotely with many geographically dispersed users may be too great. Instead
of drawing a completely random sample of users, we may have to be content
with some compromise. In his evaluation of MEDLARS, for example,
Lancaster15 identified a number of organizations whose members, based on
records of searches conducted in the past, might be considered to form a
microcosm of the complete user population. Not only could these organizations, collectively, be expected to generate the required number of searches,
but the distribution of their searches by subject could be expected to resemble
rather closely the subject distribution of all requests from whatever source.
Defining a search to be evaluated as one coming from a selected group of
organizations greatly facilitated the conduct of the evaluation since contacts
with the requesters, including distribution of the necessary evaluation forms
and other materials, could be entrusted to librarians or other information
specialists on the staff of these organizations. Moreover, with a limited
number of organizations involved, it was possible to secure agreement to co-
operate from the executive officer of each organization. This was an
encouragement to the co-operation of the individual staff members without in any way influencing the type of requests they made to the system.
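Whether a candidate group of organizations really forms such a microcosm can be checked in advance by comparing the subject distribution of its past searches with that of all requests. The sketch below is an assumption-laden illustration: the subject categories and counts are invented, and the comparison uses a simple total variation distance rather than any statistic reported by Lancaster.

```python
# Hypothetical check that a candidate group of organizations mirrors the
# subject distribution of the whole user population. Categories and counts
# are invented for illustration.
candidate_counts = {"cardiology": 120, "oncology": 95, "pharmacology": 60, "other": 25}
population_counts = {"cardiology": 4100, "oncology": 3300, "pharmacology": 2000, "other": 900}

def proportions(counts):
    total = sum(counts.values())
    return {subject: n / total for subject, n in counts.items()}

p_candidate = proportions(candidate_counts)
p_population = proportions(population_counts)

# Total variation distance: 0 means identical subject distributions,
# 1 means completely disjoint ones.
tvd = 0.5 * sum(abs(p_candidate[s] - p_population[s]) for s in p_population)
print(f"Total variation distance between distributions: {tvd:.3f}")
```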
The problems involved in securing the co-operation of large numbers of users of information services have encouraged the use of `realistic simulations'
of these services in certain evaluation applications. In such simulations a
`proxy' of a real user is employed. The proxy behaves in a way that is assumed
to be typical of the behaviour of a real user and the performance of the system
in relation to the needs of the proxy is evaluated. One example of such a
simulation is the document delivery test (Orr et al.3). In this test, 300 citations,
presumed representative of the document needs of the users of a particular
centre, are checked against the centre on a particular day to determine (a)
how many of the items are owned, and (b) how available each owned item is
on that day. A similar test has been described by De Prospo et al.17. In essence, the document delivery test simulates 300 users walking into the centre on a particular day, each one seeking a particular document.
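As a rough illustration of the arithmetic such a test involves, the following sketch scores a small document delivery test. The citations, ownership flags and supply-time estimates are invented, and the notion of `immediately available' is only one possible convention, not necessarily that of Orr et al.

```python
# Hypothetical scoring of a document delivery test: for each test citation we
# record whether the centre owns the item and, if owned, how long a user
# arriving on the test day would wait to obtain it. All data are invented.
test_items = [
    {"citation": "item-001", "owned": True,  "days_to_supply": 0},    # on the shelf
    {"citation": "item-002", "owned": True,  "days_to_supply": 7},    # on loan, recallable
    {"citation": "item-003", "owned": False, "days_to_supply": None}, # not held
    {"citation": "item-004", "owned": True,  "days_to_supply": 0},
]

owned = [item for item in test_items if item["owned"]]
ownership_rate = len(owned) / len(test_items)

# Availability on the test day: proportion of owned items obtainable at once.
immediately_available = [item for item in owned if item["days_to_supply"] == 0]
availability_rate = len(immediately_available) / len(owned)

print(f"Owned: {ownership_rate:.0%} of test citations")
print(f"Immediately available: {availability_rate:.0%} of owned items")
```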
Another form of simulation is the use of a set of questions, for which complete and correct answers are known, to test the question-answering ability of an information centre. The set of test questions can be applied to the centre