IRE
Information Retrieval Experiment
Evaluation within the environment of an operating information service
chapter
F. Wilfrid Lancaster
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
recall and precision failures. The failure analysis itself entails an examination
of each document involved, the indexing records for the documents, the
requests that caused the searches to be conducted, the search strategies,
the system vocabulary, and the relevance assessments of the users. Through
the examination of each of these it should be possible to determine which
component of the system was largely responsible for the failures occurring.
In addition to the analysis of the failures occurring in particular searches, the
evaluator can use the recall and precision ratios, or alternative measures of
search performance, as indicators of conditions under which the system
seems to perform well and under which it seems to perform badly. For
example, searches can be grouped by broad subject category, and an average
performance figure, or figures, can be derived for each group. It would then
be possible to identify subject areas in which unusually low scores occur.
By combining performance figures derived in this way with analyses of
failures in particular searches, the evaluator learns a great deal about the
characteristics of the system, its weaknesses and limitations as well as its
strong points. Together, the performance figures and failure analyses
should answer most of the questions identified in the work statement for the
evaluation. The final element in the analysis and interpretation phase is the
presentation of the evaluator's report to the managers of the system,
including recommendations on what might be done to improve
its performance. The fifth and final step of the evaluation programme is the
implementation of some or all of the recommendations, that is, the
step in which the evaluation results are applied to the improvement of the
system.
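The grouping of searches by broad subject category described above can be illustrated with a short sketch. It is an illustration only, not a procedure taken from the text: the record layout, the subject names, the document numbers and the 0.5 threshold are all hypothetical, and it assumes that the full set of relevant documents for each search is known, whereas in an operating service the recall base can usually only be estimated.

```python
from collections import defaultdict
from statistics import mean


def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search, given sets of document ids."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


def average_by_subject(searches):
    """Average the recall and precision ratios within each subject category."""
    grouped = defaultdict(list)
    for search in searches:
        pair = recall_precision(search["retrieved"], search["relevant"])
        grouped[search["subject"]].append(pair)
    return {
        subject: (mean(r for r, _ in pairs), mean(p for _, p in pairs))
        for subject, pairs in grouped.items()
    }


def low_scoring_subjects(averages, threshold):
    """List subjects whose average recall or precision is below the threshold."""
    return sorted(
        subject
        for subject, (recall, precision) in averages.items()
        if recall < threshold or precision < threshold
    )


# Hypothetical search records; every value here is invented for illustration.
searches = [
    {"subject": "cardiology", "retrieved": {1, 2, 3, 4}, "relevant": {1, 2, 5}},
    {"subject": "cardiology", "retrieved": {6, 7}, "relevant": {6, 7}},
    {"subject": "toxicology", "retrieved": {8, 9, 10, 11}, "relevant": {8, 12, 13, 14}},
]

averages = average_by_subject(searches)
print(low_scoring_subjects(averages, threshold=0.5))  # ['toxicology']
```

A subject area flagged in this way is only a pointer. The searches within it would still need the failure analysis described earlier to show which component of the system was responsible.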
Although not specifically mentioned in the discussion above, the value of
a pretest should be recognized. Before the complete evaluation is carried out,
it is important to follow through all the proposed procedures on a small
sample of transactions, to ensure that the procedures are, in fact, viable and
that they are capable of gathering the data needed to complete the study.
One obvious problem associated with an operating system is the fact that,
because a rather high level of user co-operation may be demanded (e.g. in
assessing the relevance of items retrieved), it is virtually impossible to
conduct the evaluation unobtrusively. Not only will the users know they are
participating in a study but, since we may need their co-operation in
collecting and delivering various records, the staff may also be aware that an
evaluation is taking place. How much effect this obtrusiveness is likely to
have on the evaluation results is a matter of some debate. It could be argued
that the overall performance figures for 'observed' searches may exceed, on
the average, overall performance figures for unobserved searches. In this
sense, the evaluation may be considered a study of the system under
somewhat ideal conditions. On the other hand, the effect of the obtrusiveness
may be considered to apply equally to all searches included in the evaluation.
This being so, the obtrusive nature of the study may not have any significant
effect for certain evaluation purposes. For example, the obtrusiveness of a
study is unlikely to affect a comparison of performance across various subject
fields since there is little reason to suppose that the obtrusiveness will affect
one subject area more than another. This is somewhat similar to the findings
of Lesk and Salton¹⁶ that even major differences in relevance decisions, from
one judge to another, had no significant effect on comparison of the