IRE
Information Retrieval Experiment
Laboratory tests of manual systems (chapter)
E. Michael Keen
Butterworth & Company
Karen Sparck Jones

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Gathering searcher preferences after exposure to a number of systems has been practised by researchers. One technique is to pose questions on criteria that are in fact being measured, so as to correlate subjective responses with more objective results. This has yet to be done accurately on the basis of individual responses and scores, rather than averages. This useful technique may eventually be used to see to what extent people's perception of performance matches reality.

Design of experiments

Free strategy searching on several systems faces the problems of many scientific experiments. The approach usually employed in information retrieval has been:

(1) Every request in the set must be searched against every system an equal number of times.
(2) No searcher must process any request more than once.
(3) Each searcher must conduct an equal number of searches on each system in a balanced manner during the test.

This has led to the use of Latin square designs: Cranfield 1 adopted this approach for search round one, though it was not described as such, and in practice there were only three searchers for the 4 × 4 square, with one person repeating the same requests after an interval of at least one month. This careful approach led to a comprehensive statistical appendix which is frequently overlooked².
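As an illustration of how conditions (1) to (3) fit together, the following minimal sketch builds a cyclic Latin square and checks the three balancing conditions. It is not the procedure used in Cranfield 1 or any other test mentioned here: the function names, the searcher, request-group and system labels, and the choice of a cyclic square are all assumptions made for the example. A real experiment would normally also randomize the rows, columns and labels.

```python
from collections import Counter


def latin_square(n):
    """Return an n x n cyclic Latin square; cell [row][col] is (row + col) % n."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]


def build_schedule(searchers, request_groups, systems):
    """Assign one system to every (searcher, request group) pair.

    Rows of the square are searchers, columns are request groups,
    and the cell value selects the system (all labels are hypothetical).
    """
    n = len(systems)
    if not (len(searchers) == len(request_groups) == n):
        raise ValueError("a Latin square needs equal numbers of searchers, "
                         "request groups and systems")
    square = latin_square(n)
    return [
        (searchers[row], request_groups[col], systems[square[row][col]])
        for row in range(n)
        for col in range(n)
    ]


def is_balanced(schedule):
    """Check conditions (1) to (3) on a list of (searcher, group, system) triples."""
    searchers = {s for s, _, _ in schedule}
    groups = {g for _, g, _ in schedule}
    systems = {y for _, _, y in schedule}

    group_system = Counter((g, y) for _, g, y in schedule)
    searcher_group = Counter((s, g) for s, g, _ in schedule)
    searcher_system = Counter((s, y) for s, _, y in schedule)

    # (1) every request group meets every system equally often
    cond1 = (len(group_system) == len(groups) * len(systems)
             and len(set(group_system.values())) == 1)
    # (2) no searcher processes the same request group more than once
    cond2 = max(searcher_group.values()) == 1
    # (3) every searcher uses every system equally often
    cond3 = (len(searcher_system) == len(searchers) * len(systems)
             and len(set(searcher_system.values())) == 1)
    return cond1 and cond2 and cond3


if __name__ == "__main__":
    schedule = build_schedule(
        ["searcher 1", "searcher 2", "searcher 3", "searcher 4"],
        ["requests A", "requests B", "requests C", "requests D"],
        ["system W", "system X", "system Y", "system Z"],
    )
    for searcher, group, system in schedule:
        print(searcher, group, system)
    print("balanced:", is_balanced(schedule))
```

With four searchers, four request groups and four systems this yields sixteen searches. If one person fills two rows of the square, as the text reports for Cranfield 1, that person meets the same requests twice, so condition (2) then holds for rows of the square but not for individuals.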
Recent tests have used a similar approach and have looked for statistical significance using non-parametric tests such as the Sign test and Wilcoxon's Signed Ranks test.

The practicalities of conducting such experiments include the usual warming-up operations to minimize the learning effect. But more marked than this effect has been an end-of-session mixture of perfectionism and fatigue. In the Off-shelf test the last search was given more time and yet retrieved fewer entries: a clear indication that future experiments should include one or two dummy searches with which to terminate. A more serious problem is that the use of within-subjects designs cannot avoid some carry-over effect, thus lessening the real differences in the systems measurements. Separate-subjects designs are often used in experimental psychology, so information retrieval researchers need to be more adventurous in this area. EPSILON made a special study of problems of design and statistics³¹, and discussion of these matters is to be found in Chapter 5.

Search diagnosis techniques

Post-search analyses of the reasons for the performance obtained are a vital part of operational testing and have also proved useful in laboratory testing. Analyses have usually been confined to searches in which failures occurred, divided into recall and precision failures. Success analyses might be an enlightening addition.

Cranfield 1 conducted many analyses of recall failures, and overall results showed 22 per cent due to searching, 67 per cent to indexing and 11 per cent to the index languages. Searching had a larger share of failures in analy...