IRE
Information Retrieval Experiment
Laboratory tests of manual systems
chapter
E. Michael Keen
Butterworth & Company
Karen Sparck Jones
Controlling searching in experiments
One SDI experiment found the precision ratio to be maintained at 45 per cent
even throughout substantial strategy changes (Medusa17).
Another severe performance measurement problem that is illustrated by
the comparison in Figure 8.6 is the difficulty of validly comparing what are,
in this case, different test collections by virtue of there being different indexes.
The question is whether the measures used minimize the unavoidable bias
caused by the coverage of relevant items differing in each index. The
hypothesis could be advanced that were there to be added an index of the
same retrieval efficacy as the best ones, but containing a smaller number of
relevant items, its performance curve would follow the best ones to begin
with but fall away as soon as its recall rose. However, the counter hypothesis
would be that paucity of relevant entries might make retrieval a more spread
out affair, particularly if several index issues had to be consulted, thus giving
a worse curve from the beginning on a plot of this type that uses elapsed time.
A satisfactory way of comparing dissimilar systems has yet to be found.
Human performance and preferences
It must have been quite a shock to the staff of Cranfield 1 to see their names
heading the columns of a results table in which their performance (as indexers
in this case) was open to public view. However, there were no traces either of
statistically significant or practically important differences: having survived
the indexing of 18 000 documents the staff could probably index in their
sleep! It is important to realize that this result does not suggest that humans are
consistent in every detailed decision, but that when judged by average
performance outcome (surely the only test that matters) there were no real
differences. The measurement of inter- and intra-person consistency in
indexing has been a plague and a nuisance because it has been divorced from
search outcome yet has been used to indicate quality and even performance.
An example of the validity of taking measured performance as the criterion
of consistency was seen in an ISILT test of inter-searcher consistency. If to
be consistent two searchers had to have the same search terms, combinations
and subsearch order, then the average result would have been 0 per cent. If
terms and combinations had to agree, but subsearch order need not,
consistency would have been 13 per cent. With only term choice as the
criterion (any combinations or order) the level would have risen to 32 per
cent, a very similar level to many inter-indexer consistency results. But in
searching, the identity of the search terms is less important than the outcome:
the same documents can be retrieved by different terms. So, with retrieval of
identical documents as the criterion, consistency rose to 64 per cent. Still
recalculating the same data, one could say that document identity is not as
important as amount: and if this were the criterion, consistency finally
reached a high level of 81 per cent.
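
The progression of criteria described above can be illustrated with a minimal sketch. The text does not give the formula used in the ISILT test, so everything here is an assumption made for illustration: consistency between two searchers is taken as shared items divided by all items used by either searcher, the strictest criterion compares subsearches position by position, and the "amount" criterion compares only the number of documents retrieved. The searchers, terms and document numbers are hypothetical and are not ISILT data.

```python
def overlap(a, b):
    """Shared items as a fraction of all items used by either searcher."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def consistency(s1, s2):
    """Score one pair of searches under five successively looser criteria.

    Each search is a dict with 'subsearches' (an ordered list of frozensets
    of terms) and 'retrieved' (a set of document numbers). These field names
    are illustrative assumptions, not taken from the ISILT test.
    """
    subs1, subs2 = s1["subsearches"], s2["subsearches"]
    longest = max(len(subs1), len(subs2))
    # Strictest criterion: the same term combination at the same position.
    same_position = sum(x == y for x, y in zip(subs1, subs2))
    docs1, docs2 = s1["retrieved"], s2["retrieved"]
    most = max(len(docs1), len(docs2))
    return {
        "terms, combinations and order": same_position / longest if longest else 1.0,
        "terms and combinations": overlap(subs1, subs2),
        "terms only": overlap(set().union(*subs1), set().union(*subs2)),
        "identical documents": overlap(docs1, docs2),
        "amount retrieved": min(len(docs1), len(docs2)) / most if most else 1.0,
    }


# Hypothetical searches by two searchers on the same request.
searcher_a = {
    "subsearches": [frozenset({"corrosion", "steel"}), frozenset({"rust"})],
    "retrieved": {1, 2, 3, 4, 5, 6, 7},
}
searcher_b = {
    "subsearches": [
        frozenset({"rust", "iron"}),
        frozenset({"corrosion", "steel"}),
        frozenset({"oxidation"}),
    ],
    "retrieved": {1, 2, 3, 4, 5, 6, 8},
}

results = consistency(searcher_a, searcher_b)
for criterion, score in results.items():
    print(f"{criterion}: {score:.0%}")
```

In a test with many searchers the pairwise scores would be averaged; the point of the sketch is only that one pair of searches yields very different consistency figures depending on which criterion is chosen.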
Individual searcher performance is important in laboratory manual
comparisons of indexes when each searcher sees each index, but cannot be
asked to repeat the same search request. In the Off-shelf experiment with six
searchers and six indexes the variation in performance of the people was less
than that of the indexes, thus giving grounds for hope that skill had not
overlaid the main variable being studied. Coping with this problem is a part
of valid experimental design and statistics, an area rather neglected so far.
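
One common way of meeting the constraint that each searcher sees each index but never repeats a request is a Latin square allocation. The sketch below assumes such a design; the text does not say how the Off-shelf experiment allocated requests, and the scores are invented purely to show the comparison of searcher variation with index variation.

```python
from statistics import mean, pvariance


def latin_square(n):
    """Cyclic n x n Latin square: cell [searcher][index] holds a request number.

    Every request appears once in each row (searcher) and once in each
    column (index), so no searcher repeats a request.
    """
    return [[(searcher + index) % n for index in range(n)] for searcher in range(n)]


def spread_of_means(scores):
    """Return (variance of searcher means, variance of index means).

    `scores[searcher][index]` is any performance figure, for example the
    number of relevant items found; the layout is an illustrative assumption.
    """
    searcher_means = [mean(row) for row in scores]
    index_means = [mean(column) for column in zip(*scores)]
    return pvariance(searcher_means), pvariance(index_means)


design = latin_square(6)

# Hypothetical scores: a large index effect and a small searcher effect.
scores = [[10 * index + searcher for index in range(6)] for searcher in range(6)]
searcher_variance, index_variance = spread_of_means(scores)
print("searchers vary less than indexes:", searcher_variance < index_variance)
```

Comparing the two variances is only a first look; a proper analysis of variance on such a design would also account for differences between requests.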