IRE Information Retrieval Experiment. Laboratory tests of manual systems (chapter). E. Michael Keen. Butterworth & Company. Karen Sparck Jones.

there was the need to trace out what was being read by a moving finger so that the film gave more than a head and an open page! All but the last of the above methods were used in the Off-shelf and EPSILON tests. Darkroom timers, which ticked obtrusively and were easily misread, were soon replaced by a continuously running digital minutes and seconds display via a television monitor.

To aid the choice of a recording method, two central questions have to be considered:

(1) What phenomena in the search need to be recorded? For example: citations perceived relevant; citations perceived irrelevant; index terms tried; cross-references used; index pages consulted; etc.

(2) How much needs to be known about each of the recorded phenomena? For example: just how many of each, or the actual identity of each one; the searcher's judgement about how relevant each one was; the individual time for each one; the ability to reconstruct the exact order of each event in the search; etc.

Printed index marking with a time record and simple record sheet can achieve most of this. In EPSILON the text of the query, space for any searcher's notes, start and finish time, and reasons for termination of the search were the basic data on the record sheet, and as the search proceeded each relevant citation was noted by identification number, judgement of relevance and time.
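As a rough illustration only, a record sheet of the kind described above could be modelled as a small data structure. The class and field names, the relevance labels, the citation identifiers and the "MM:SS" clock format are all assumptions made for this sketch; the source says only that the display showed minutes and seconds and lists the items the sheet carried.

```python
from dataclasses import dataclass, field
from typing import List


def parse_clock(text: str) -> int:
    """Convert an assumed 'MM:SS' display reading into whole seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


@dataclass
class CitationNote:
    citation_id: str      # identification number of the citation
    relevance: str        # searcher's judgement (labels here are hypothetical)
    elapsed_seconds: int  # time since the start of the search


@dataclass
class RecordSheet:
    query_text: str
    start_clock: str          # display reading when the search began
    finish_clock: str         # display reading when the search ended
    termination_reason: str
    searcher_notes: str = ""
    citations: List[CitationNote] = field(default_factory=list)

    def note_citation(self, citation_id: str, relevance: str, clock: str) -> None:
        """Record one relevant citation with its judgement and elapsed time."""
        elapsed = parse_clock(clock) - parse_clock(self.start_clock)
        self.citations.append(CitationNote(citation_id, relevance, elapsed))

    def total_search_seconds(self) -> int:
        """Elapsed time between the start and finish readings."""
        return parse_clock(self.finish_clock) - parse_clock(self.start_clock)

    def in_time_order(self) -> List[CitationNote]:
        """Return the noted citations sorted by elapsed time."""
        return sorted(self.citations, key=lambda note: note.elapsed_seconds)


# Hypothetical example values, not data from the EPSILON test.
sheet = RecordSheet(
    query_text="(text of the query)",
    start_clock="10:00",
    finish_clock="25:00",
    termination_reason="searcher judged that enough had been found",
)
sheet.note_citation("C-0412", "relevant", "15:10")
sheet.note_citation("C-0077", "partially relevant", "12:05")
```

Storing elapsed time rather than the raw display reading keeps entries comparable across searches that started at different clock readings, and lets entries noted out of order be sorted back into sequence.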
By having the index copy marked with the relevant citation number circled, and each lead term and cross-reference timed, the sequence of the search with the pages and content consulted could be reconstructed by the researcher by putting the elapsed times back into order. The one phenomenon not captured was precisely which citations were examined and regarded as irrelevant and which were never examined at all, though one could identify many cases where a set of index entries had obviously been examined in their entirety.

Audio-recording has the potential of capturing all the phenomena needed, though the fear is that verbalizing changes the pattern of search and upsets its progress against time. Analysis of the tapes is also a problem, though in the one EPSILON use of this method¹⁵ the searcher herself made transcripts after the searches were completed. Similar problems are likely to face any attempt to use eye-movement equipment.

Search performance criteria and measures

Laboratory manual tests seem to have concentrated on measuring recall, precision, time and effort. There has been much debate about the mathematical properties of measures, and little recognition that even the matters of computation, aggregation and presentation can cause large differences³⁰. A good example of the care needed in choosing a valid measure of a given criterion is the use of the precision ratio in testing browsable-heuristic systems. In iterative systems where a stack of document entries is retrieved in toto the precision ratio is straightforward to calculate and is quite meaningful. But in the Off-shelf test it was difficult to get the searchers to spend the time accurately recording all the irrelevant citations they encountered, as they