Information Retrieval Experiment (IRE), edited by Karen Sparck Jones. Butterworth & Company.
Chapter: Laboratory tests of manual systems, by E. Michael Keen.

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Controlling searching in experiments

One SDI experiment found the precision ratio to be maintained at 45 per cent even throughout substantial strategy changes (Medusa17). Another severe performance measurement problem, illustrated by the comparison in Figure 8.6, is the difficulty of validly comparing what are, in this case, different test collections by virtue of there being different indexes. The question is whether the measures used minimize the unavoidable bias caused by the coverage of relevant items differing in each index. The hypothesis could be advanced that, were there to be added an index of the same retrieval efficacy as the best ones but containing a smaller number of relevant items, its performance curve would follow the best ones to begin with but fall away as soon as its recall rose. However, the counter-hypothesis would be that paucity of relevant entries might make retrieval a more spread-out affair, particularly if several index issues had to be consulted, thus giving a worse curve from the beginning on a plot of this type that uses elapsed time. A satisfactory way of comparing dissimilar systems has yet to be found.

Human performance and preferences

It must have been quite a shock to the staff of Cranfield 1 to see their names heading the columns of a results table in which their performance (as indexers in this case) was open to public view.
However, there were no traces of either statistically significant or practically important differences: having survived the indexing of 18 000 documents, the staff could probably index in their sleep! It is important to realize that this result did not suggest that humans are consistent in every detailed decision, but that when judged by average performance outcome (surely the only test that matters) there were no real differences. The measurement of inter- and intra-person consistency in indexing has been a plague and a nuisance because it has been divorced from search outcome yet has been used to indicate quality and even performance.

An example of the validity of taking measured performance as the criterion of consistency was seen in an ISILT test of inter-searcher consistency. If, to be consistent, two searchers had to have the same search terms, combinations and subsearch order, then the average result would have been 0 per cent. If terms and combinations had to agree, but subsearch order need not, consistency would have been 13 per cent. With only term choice as the criterion (any combinations or order) the level would have risen to 32 per cent, a very similar level to many inter-indexer consistency results. But in searching, the identity of the search terms is less important than the outcome: the same documents can be retrieved by different terms. So, with retrieval of identical documents as the criterion, consistency rose to 64 per cent. Still recalculating the same data, one could say that document identity is not as important as amount; with this as the criterion, consistency finally reached a high level of 81 per cent.

Individual searcher performance is important in laboratory manual comparisons of indexes when each searcher sees each index but cannot be asked to repeat the same search request.
In the Off-shelf experiment, with six searchers and six indexes, the variation in performance of the people was less than that of the indexes, thus giving grounds for hope that skill had not overlaid the main variable being studied. Coping with this problem is a part of valid experimental design and statistics, an area rather neglected so far.
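
The progressively looser consistency criteria described for the ISILT inter-searcher test can be made concrete with a small sketch. The text does not say how the percentages were calculated, so everything here is an assumption made for illustration: agreement is scored as shared items divided by all items (a Jaccard-style overlap), "amount" is scored as the smaller retrieved count divided by the larger, and the names `SearchRecord`, `jaccard`, `all_terms` and `consistency` are hypothetical. The two example searches are invented toy data, not ISILT results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchRecord:
    """One searcher's attempt at a request (hypothetical structure)."""
    subsearches: tuple    # ordered tuple of frozensets; each is one term combination
    retrieved: frozenset  # identifiers of the documents retrieved


def jaccard(a, b):
    """Shared members divided by all members; 1.0 if both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def all_terms(record):
    """Every search term used, ignoring combination and order."""
    return frozenset().union(*record.subsearches)


def consistency(x, y):
    """Agreement between two searchers under five progressively looser criteria.

    The scoring formulas are assumptions; the source gives only the criteria.
    """
    longest = max(len(x.subsearches), len(y.subsearches))
    same_position = sum(p == q for p, q in zip(x.subsearches, y.subsearches))
    most_docs = max(len(x.retrieved), len(y.retrieved))
    fewest_docs = min(len(x.retrieved), len(y.retrieved))
    return {
        # Same combination at the same position in the subsearch sequence.
        "terms+combinations+order": same_position / longest if longest else 1.0,
        # Same combinations, in any order.
        "terms+combinations": jaccard(frozenset(x.subsearches), frozenset(y.subsearches)),
        # Same terms, however combined or ordered.
        "terms only": jaccard(all_terms(x), all_terms(y)),
        # Same documents retrieved, whatever terms were used.
        "same documents": jaccard(x.retrieved, y.retrieved),
        # Same number of documents retrieved, whichever they were.
        "same amount": fewest_docs / most_docs if most_docs else 1.0,
    }


# Invented toy data: two searchers working on the same request.
a = SearchRecord(
    subsearches=(frozenset({"corrosion", "steel"}), frozenset({"rust"})),
    retrieved=frozenset({1, 2, 3, 4}),
)
b = SearchRecord(
    subsearches=(
        frozenset({"rust"}),
        frozenset({"corrosion", "iron"}),
        frozenset({"oxidation"}),
    ),
    retrieved=frozenset({2, 3, 4, 5}),
)

for criterion, score in consistency(a, b).items():
    print(f"{criterion}: {score:.0%}")
```

On this toy pair the scores rise from 0 per cent under the strictest criterion to 100 per cent under the loosest. The same pattern appears in the ISILT figures: the measured level of consistency depends heavily on whether search formulation or search outcome is taken as the criterion.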