IRE
Information Retrieval Experiment
Laboratory tests of manual systems
chapter
E. Michael Keen
Butterworth & Company
Karen Sparck Jones
Controlling searching in experiments
there was the need to trace out what was being read by a moving finger so
that the film gave more than a head and an open page! All but the last of the
above methods were used in the Off-shelf and EPSILON tests. Darkroom
timers, which ticked obtrusively and were easily misread, were soon replaced
by a continuously running digital minutes-and-seconds display on a television
monitor.
To aid the choice of a recording method, two central questions have to be
considered:
(1) What phenomena in the search need to be recorded? E.g. citations
perceived relevant; citations perceived irrelevant; index terms tried;
cross-references used; index pages consulted; etc.
(2) How much needs to be known about each of the recorded phenomena?
For example, just how many of each, or the actual identity of each one;
the searcher's judgement about how relevant each one was; individual time
for each one; ability to reconstruct the exact order of each event in
search; etc.
Printed index marking with a time record and simple record sheet can
achieve most of this. In EPSILON the text of the query, space for any
searcher's notes, start and finish time, and reasons for termination of the
search were the basic data on the record sheet, and as the search proceeded
each relevant citation was noted by identification number, judgement of
relevance and time. By having the index copy marked with the relevant
citation number circled, and each lead term and cross-reference timed, the
sequence of the search with the pages and content consulted could be
reconstructed by the researcher by putting the elapsed times back into order.
The one phenomenon not captured was precisely which citations were
examined and regarded as irrelevant and which were never examined at all,
though one could identify many cases where a set of index entries had
obviously been examined in their entirety. Audio-recording has the potential
of capturing all the phenomena needed, though the fear is that verbalizing
changes the pattern of search and upsets its progress against time. Analysis
of the tapes is also a problem, though in the one EPSILON use of this
method¹⁵ the searcher herself made transcripts after the searches were
completed. Similar problems are likely to face any attempt to use eye-
movement equipment.
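The record sheet and marked index copy described above can be pictured as a small data structure. What follows is only a minimal sketch under stated assumptions: the class names, field names and event kinds are hypothetical and are not taken from the EPSILON materials, and the example values are invented. It shows how timed entries, once gathered, might be put back into order to reconstruct the sequence of a search.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One timed event noted during a search (hypothetical structure)."""
    elapsed_seconds: int   # time since the start of the search
    kind: str              # e.g. "lead_term", "cross_reference", "relevant_citation"
    label: str             # index term tried, or citation identification number
    relevance: str = ""    # searcher's judgement, used for relevant citations only


@dataclass
class RecordSheet:
    """Basic data for one search, loosely following the description in the text."""
    query_text: str
    start_time: str
    finish_time: str
    termination_reason: str
    notes: str = ""
    events: List[Event] = field(default_factory=list)

    def reconstruct_sequence(self) -> List[Event]:
        """Put the recorded events back into elapsed-time order."""
        return sorted(self.events, key=lambda e: e.elapsed_seconds)

    def count_by_kind(self) -> Counter:
        """Answer the simpler question: just how many of each phenomenon."""
        return Counter(e.kind for e in self.events)


# Invented example values, entered out of order as they might be
# when transcribed from a marked index copy.
sheet = RecordSheet(
    query_text="(text of the query)",
    start_time="10:00",
    finish_time="10:12",
    termination_reason="searcher judged enough had been found",
)
sheet.events = [
    Event(300, "relevant_citation", "citation 17", relevance="high"),
    Event(20, "lead_term", "first lead term"),
    Event(140, "cross_reference", "see-also term"),
]

for event in sheet.reconstruct_sequence():
    print(event.elapsed_seconds, event.kind, event.label)
```

A structure of this kind has the same gap as the method itself: citations examined and judged irrelevant leave no event, so they cannot be told apart from citations never examined.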
Search performance criteria and measures
Laboratory manual tests seem to have concentrated on measuring recall,
precision, time and effort. There has been much debate about the
mathematical properties of measures, and little recognition that even the
matters of computation, aggregation and presentation can cause large
differences³⁰. A good example of the care needed in choosing a valid measure
of a given criterion is the use of the precision ratio in testing browsable-
heuristic systems.
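The point about computation and aggregation can be illustrated with a short sketch. It uses the standard definitions of the recall and precision ratios and compares two common ways of combining precision over several searches: averaging the per-search ratios, and dividing the pooled totals. The function names are hypothetical, the figures are invented for illustration, and treating a search that retrieved nothing as having precision zero is itself an assumption of the kind the text warns about.

```python
from typing import List, Tuple


def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Proportion of retrieved citations that are relevant."""
    # Assumption: an empty retrieved set is scored as 0.0.
    return relevant_retrieved / total_retrieved if total_retrieved else 0.0


def recall(relevant_retrieved: int, total_relevant: int) -> float:
    """Proportion of the relevant citations that were retrieved."""
    return relevant_retrieved / total_relevant if total_relevant else 0.0


def average_of_ratios(searches: List[Tuple[int, int]]) -> float:
    """Mean of the per-search precision ratios."""
    ratios = [precision(rel, ret) for rel, ret in searches]
    return sum(ratios) / len(ratios)


def ratio_of_totals(searches: List[Tuple[int, int]]) -> float:
    """Precision computed once over the pooled totals of all searches."""
    total_relevant_retrieved = sum(rel for rel, _ in searches)
    total_retrieved = sum(ret for _, ret in searches)
    return precision(total_relevant_retrieved, total_retrieved)


# Invented figures: (relevant retrieved, total retrieved) for two searches.
searches = [(1, 2), (9, 90)]

print(f"average of ratios: {average_of_ratios(searches):.2f}")  # 0.30
print(f"ratio of totals:   {ratio_of_totals(searches):.2f}")    # 0.11
```

The same two searches give a precision of roughly 0.30 or 0.11 depending only on the aggregation chosen, since the pooled figure is dominated by the search with the larger output.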
In iterative systems where a stack of document entries is retrieved in toto
the precision ratio is straightforward to calculate and is quite meaningful.
But in the Off-shelf test it was difficult to get the searchers to spend the time
accurately recording all the irrelevant citations they encountered, as they