IRE
Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
Methodological and substantive achievements
document indexing options, Evans' investigated a fairly heterogeneous
selection of search strategy options.
More generally, while it might be claimed that multi-collection experiments
comparing search procedures like those of Sparck Jones (Ref. 95) represent some
methodological and substantive advance compared with Dale and Dale's test
of some ten years earlier (Ref. 21), Atherton's BOOKS test (Ref. 84) does not represent
much advance methodologically over Cranfield 1 (Ref. 1), and is indeed
substantively focused on the same system component, document indexing.
Is it nevertheless possible to point to any general methodological and
substantive advances?
To take methodology first: it is possible to point to a general advance in the
quality of retrieval experiments. Specifically, we can say that experiments
now are more likely than they were twenty years ago:
(1) to use real data, for example not requests based on source documents;
(2) to have enough data, for example fifty requests rather than ten, and five
thousand documents rather than five hundred;
(3) to introduce more control in relation to data variables, for example by
using more than one collection;
(4) to discriminate better among mechanism variables, for example by
distinguishing language specificity from indexing specificity;
(5) to utilize more appropriate performance measures, for example by
interpreting recall in relation to sets of documents rather than single
sought documents;
(6) to conduct tests more carefully, for example by utilizing Latin square
designs for assigning tasks to people;
(7) to evaluate findings properly, for example by applying significance tests.
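As an illustration of point (6), the following is a minimal sketch of a cyclic Latin square, one simple way of arranging tasks so that each person meets each treatment exactly once and each treatment occurs exactly once in each session. The function name and the labels for searchers, sessions and system variants are hypothetical and are not taken from any of the tests surveyed. A real design would normally also randomize the order of rows and columns.

```python
from typing import List, Sequence


def latin_square(treatments: Sequence[str]) -> List[List[str]]:
    """Build a cyclic n x n Latin square from n treatment labels.

    Row i, column j holds treatment (i + j) mod n, so every treatment
    appears exactly once in each row and exactly once in each column.
    """
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)] for row in range(n)]


if __name__ == "__main__":
    # Hypothetical setup: three searchers, three sessions, three system variants.
    searchers = ["searcher 1", "searcher 2", "searcher 3"]
    variants = ["system A", "system B", "system C"]
    for person, row in zip(searchers, latin_square(variants)):
        # Each searcher uses a different variant in each session.
        print(person, "->", row)
```

Read row by row, the square gives each searcher's sequence of variants; read column by column, it shows that every session includes every variant, so differences between people and between sessions are balanced across the systems being compared.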
But, as the survey has shown, not all experiments meet these conditions.
The effort of conducting proper experiments, which proposals like those for
the 'ideal test collection' were intended to reduce, remains very great, so
many tests are limited in scope. Again, though tests appear to reflect a
growing consensus, for example in the use of recall and precision, many test
reports suggest that little attempt has been made to learn from the experience
of previous workers. Where such complicated matters as techniques for
deriving recall/precision graphs are concerned the lack of rigour is not
surprising, but it is still unfortunate.
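To show why the derivation of recall/precision graphs leaves room for inconsistency, the sketch below computes one possible averaged graph from ranked output. Every choice in it is an assumption made for illustration, not a procedure prescribed by the tests surveyed: recall is measured against the whole set of relevant documents, precision at a standard recall level is taken as the highest precision observed at that level or beyond, and the graph is averaged over requests rather than over pooled documents. Changing any of these choices yields a different graph from the same data.

```python
from typing import List, Sequence, Set, Tuple


def recall_precision_points(
    ranking: Sequence[str], relevant: Set[str]
) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each relevant document in the ranking."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolated_precision(
    points: Sequence[Tuple[float, float]], levels: Sequence[float]
) -> List[float]:
    """Precision at each standard recall level.

    Assumed rule: take the maximum precision at any observed recall
    greater than or equal to the level, or 0.0 if the level is never reached.
    """
    result = []
    for level in levels:
        candidates = [p for r, p in points if r >= level - 1e-9]
        result.append(max(candidates) if candidates else 0.0)
    return result


def average_over_requests(curves: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-request curves level by level (averaging over requests)."""
    n = len(curves)
    return [sum(column) / n for column in zip(*curves)]


if __name__ == "__main__":
    levels = [0.5, 1.0]
    # Two hypothetical requests with made-up document identifiers.
    curve_1 = interpolated_precision(
        recall_precision_points(["d1", "d2", "d3", "d4"], {"d1", "d3"}), levels
    )
    curve_2 = interpolated_precision(
        recall_precision_points(["d5", "d6", "d7"], {"d6", "d9"}), levels
    )
    print(average_over_requests([curve_1, curve_2]))
```

In the second hypothetical request one relevant document is never retrieved, so full recall is not reached and the assumed rule assigns a precision of zero at that level. Other conventions would omit the request at that level instead, which is one of the ways in which reported graphs can differ.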
These defects of current test methodology are nevertheless really only the
manifestation of deeper problems about the substantive aspects of retrieval
systems. Differences of test design in part reflect genuine differences of test
purpose and emphasis. Thus there is no very good reason why an
efficiency-oriented test to determine indexing speed under different working
conditions should have much in common with an effectiveness-oriented test to
evaluate the utility of not-logic in Boolean searching. However, there are many features
of information retrieval systems which are not sufficiently understood, even
at the level of reliable description, let alone analytical modelling: appropriate
choices of performance measure are an example.
Looking at the substantive side of retrieval systems, what contributions
have twenty years of testing made to system understanding and hence system
design? As pointed out earlier, statements here can only be very general ones.
Thus it appears that the tests which have been carried out show: