Information Retrieval Experiment
Evaluation within the environment of an operating information service
F. Wilfrid Lancaster
Edited by Karen Sparck Jones
Butterworth & Company
obtrusively or unobtrusively (Bunge18, Powell19, Crowley and Childers4,
King and Berry20).
On the surface, simulations of this kind may be regarded as imperfect
substitutes for real-life studies. But not all the advantages lie with the real-life
situation. To begin with, simulations do not disturb the users of the system;
they are also likely to be considerably cheaper than a real-life study.
Moreover, it could be argued that a real-life study tells us only how the system
performs in relation to actual demands and tells us nothing about the potential
performance of the system in relation to the latent needs that may never be
converted into demands. There are obvious dangers associated with looking
at demands only (Line21, Lancaster22) since the demands (expressed needs)
of users are likely to be influenced by their expectations of the capabilities of
the system. Evaluation of a service in relation to expressed needs, with no
concern for information needs that are unexpressed, may cause managers of
an information service to move that service further towards the expressed
needs and further away from the unexpressed needs. A simulation such as the
document delivery test, insofar as this simulation can be assumed to reflect
latent needs as well as expressed needs, offers certain advantages over an
evaluation that looks at expressed needs only.
A major difference between the evaluation of an operating information
service and the evaluation of a system in a laboratory environment is that the
latter, lacking real users with real needs, must adopt some 'ideal' as the
appropriate standard for performance. But the real service should not be
evaluated against an ideal but only against the real needs of users, which may
be much less than the ideal. An obvious example is the use of such measures
as recall and precision ratios. In the experimental environment it is
reasonable to evaluate the performance of a system in relation to the ideal of
100 per cent recall and 100 per cent precision. But the ideal is not necessarily
the best measure to use in the evaluation of an operating system. It makes no
sense to regard as a failure a search that achieves only, say, 20 per cent recall
if the requester does not require a high level of recall and is perfectly satisfied
with 'a few good relevant items'. In an operating environment, evaluation
measures must always be related to the precise needs of the users of the
service.
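To make the measures concrete, the following sketch (not part of the original text; the figures are hypothetical) shows how recall and precision ratios are calculated for a single search, assuming we know how many retrieved items the requester judged relevant and how many relevant items exist in the collection.

```python
# Minimal sketch of recall and precision ratios for one search.
# The figures below are hypothetical, chosen to match the 20 per cent
# recall mentioned in the text.

def recall_and_precision(retrieved_relevant, retrieved_total, relevant_total):
    """Return (recall, precision) as percentages for a single search."""
    recall = 100.0 * retrieved_relevant / relevant_total
    precision = 100.0 * retrieved_relevant / retrieved_total
    return recall, precision

# Example: a search retrieves 25 items, 5 of them relevant, out of
# 25 relevant items assumed to exist in the collection.
r, p = recall_and_precision(retrieved_relevant=5,
                            retrieved_total=25,
                            relevant_total=25)
print(f"recall = {r:.0f}%, precision = {p:.0f}%")  # recall = 20%, precision = 20%
```

Whether 20 per cent recall counts as a failure depends, as argued above, on what the requester actually needed, not on the distance from the 100 per cent ideal.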
Moreover, evaluation measures that seem perfectly appropriate in the
experimental situation may be inappropriate to the operational situation
because better (e.g. more direct) measures may exist. The precision ratio is
a good example of this. In effect, the precision ratio is a user cost factor
associated with achieving a particular level of recall (another way of looking
at it is as a form of penalty associated with the attainment of a particular
recall ratio). This ratio may be the only meaningful cost factor to use in an
experimental environment. But in many operating environments there may
be much more direct cost factors, such as the unit cost (in dollars or time) to the
user per relevant item retrieved. Thus, if a user pays $25 for the results of an
online search, and finds five relevant documents in the search output, he is
paying $5 for each relevant item retrieved. If he conducts his own search in
a printed index, and finds six relevant items in 2 hours, we could say that the
unit cost (in time) per relevant item retrieved is 20 minutes. These are much
more direct measures of the cost of achieving a particular level of recall than
is the precision ratio.
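The arithmetic of this more direct measure is simple; a minimal sketch, using the same hypothetical figures as the two examples above, might look as follows.

```python
# Unit cost to the user per relevant item retrieved, in money or time.
# Figures are the hypothetical ones used in the text.

def unit_cost(total_cost, relevant_items_found):
    """Cost (dollars, minutes, etc.) per relevant item retrieved."""
    return total_cost / relevant_items_found

# Online search: $25 paid, 5 relevant documents in the output.
print(unit_cost(25.0, 5))    # 5.0 dollars per relevant item

# Manual search of a printed index: 6 relevant items found in 2 hours.
print(unit_cost(2 * 60, 6))  # 20.0 minutes per relevant item
```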