Information Retrieval Experiment
Evaluation within the environment of an operating information service
F. Wilfrid Lancaster
Edited by Karen Sparck Jones
Butterworth & Company
Some problems of evaluation applied to operating systems
The evaluation of an operating information service is likely to involve
many more compromises than the evaluation of an experimental system. In
the latter case, for example, we might be able to determine the 'true' recall
ratio for a search since it may be possible to have every document in the
collection judged for relevance against every request used in the evaluation.
In a real operating environment, however, it is impossible to establish true
recall (Lancaster5) and we must instead be satisfied with some method of
estimating the recall ratio of the search.
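One common way to estimate recall when the full set of relevant documents cannot be known is to judge a random sample of the collection and project from it. The sketch below illustrates this sample-based estimate with hypothetical figures; it is one standard method, not a procedure prescribed in this text.

```python
# Estimating recall when the total number of relevant documents is unknown.
# All numbers below are hypothetical, for illustration only.

def estimated_recall(retrieved_relevant, sample_size, sample_relevant, collection_size):
    """Estimate recall by judging a random sample of the whole collection.

    retrieved_relevant: relevant documents found by the search
    sample_size:        documents drawn at random from the collection
    sample_relevant:    documents in that sample judged relevant
    collection_size:    total documents in the collection
    """
    # Project the sample's relevance rate onto the whole collection
    estimated_total_relevant = sample_relevant / sample_size * collection_size
    return retrieved_relevant / estimated_total_relevant

# A search retrieves 40 relevant documents; judging a random sample of
# 1,000 documents from a 100,000-document collection finds 1 relevant,
# so roughly 100 relevant documents are projected to exist.
print(estimated_recall(40, 1_000, 1, 100_000))  # -> 0.4
```

The estimate is only as good as the sample: with very low relevance rates, large samples are needed before the projected total is stable.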
There is likely to be a difference, too, between the evaluation standards
appropriate to the experimental and to the operating environments. In the
former, it is impossible to evaluate the results of a literature search against
information needs. Since there are no real users, there are no real information
needs. The best we can do is to evaluate the results of a search against a
request statement (relevance as opposed to pertinence, Lancaster5). But this
is not good enough in the operating environment. The evaluation of a search
against a request statement is an artificial situation. Since we have real users
with real needs we must ask these users to evaluate the results of a search in
terms of the degree to which they contribute to the satisfaction of the
information need that prompted the request to the system.
The problems of controlled experimentation within an operating
environment have already been mentioned. In a purely experimental situation it
should be possible to control all extraneous variables so that one can be quite
sure of what is affecting what. In a real-life environment it is not so easy to
experiment. One finds oneself always compromising between the experimental
design and concern for the needs of the users. For example, if a completely
new information service is introduced, one that promises to be much more
effective than any of its predecessors, it is difficult to explain to members of
a 'control group' why they are denied use of the service. Yet, if we really want
to assess the impact of the service, some type of control group of this kind will
be necessary. There may always be self-selected control groups (e.g. those
people who choose not to use a new service) but a self-selected group is likely
to be quite different from a group that is randomly selected to form a control.
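The distinction between a randomly assigned control group and a self-selected one can be made concrete. The sketch below randomly splits a hypothetical user population into treatment and control groups, the assignment a true experimental design requires; user identifiers and group sizes are invented for illustration.

```python
import random

# Randomly assign a hypothetical user population to the new service
# (treatment) or to a control group denied it for the experiment.
users = [f"user{i:02d}" for i in range(20)]

rng = random.Random(0)          # fixed seed so the split is reproducible
shuffled = users[:]
rng.shuffle(shuffled)

treatment = shuffled[: len(shuffled) // 2]   # offered the new service
control = shuffled[len(shuffled) // 2 :]     # denied it during the study

# Random assignment keeps the two groups comparable on average;
# a self-selected control (users who simply choose not to use the
# service) may differ systematically from users of the service.
assert set(treatment) | set(control) == set(users)
assert not set(treatment) & set(control)
```

The two assertions confirm that the split is a genuine partition: every user is in exactly one group.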
A rare example of a true experimental design in the evaluation of various
approaches to the provision of information services can be found in a recent
paper by Olson23. The design used was a 3 x 3 factorial design, as illustrated
in Figure 6.1. Two levels of 'technical information intervention' and two of
'behavioural intervention' were used, each dimension also including a control
condition, giving the nine cells shown.

                                  Behavioural interventions
                                  Control    Level 1    Level 2
  Technical       Control
  information     Level 1
  interventions   Level 2
Figure 6.1. Factorial design: levels of interventions (from Olson23).
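The nine cells of the factorial design in Figure 6.1 can be enumerated mechanically: each dimension contributes a control plus two intervention levels, and every combination forms one experimental condition. A minimal sketch (the labels follow the figure; variable names are my own):

```python
from itertools import product

# The two dimensions of Olson's 3 x 3 factorial design, each with a
# control condition and two intervention levels (labels from Figure 6.1).
technical = ["Control", "Level 1", "Level 2"]    # technical information interventions
behavioural = ["Control", "Level 1", "Level 2"]  # behavioural interventions

# Every (technical, behavioural) pair is one cell of the design.
cells = list(product(technical, behavioural))
print(len(cells))  # -> 9

for t, b in cells:
    print(f"technical={t}, behavioural={b}")
```

The cell pairing Control with Control is the pure control group; the remaining eight cells let the two kinds of intervention, and their interaction, be assessed separately.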