Information Retrieval Experiment
Introduction
Karen Sparck Jones
Butterworth & Company
have been obtained and experience gained, a review of this research serves
mainly to show how much more needs to be done. Individual experiments are
still far too often methodologically inadequate, in some cases in obvious
matters of design, in others in less obvious but nevertheless very important
matters of scale.
Over the years, the character of experimental work in information retrieval
has changed. Many early tests focused on the new information processing
methods associated with automation: they were concerned either with the
application of post-coordinate searching using manual indexing (sketched
below), or, more ambitiously, with wholly automatic indexing techniques.
However, the main outcome of this work was the discovery that nobody really
knew, in detail, how document retrieval systems behave or, more importantly,
why they
behave in the way they do. Many test results were unexpected or ambiguous,
and it became evident that this was due to weak experimental design, which
was in turn a consequence of unexamined assumptions about retrieval
systems. The need for a better standard of experiment, preferably informed
by an explicit characterization of system properties, has been slowly
recognized, and the quality of experiments has improved. Some research
workers have in particular been able to capitalize on previous work through
the use of common data, and also through the application of common
techniques for system performance measurement. Thus although specific
tests may have been conducted within the framework of particular operational
systems and their assumptions, much of the research has been devoted to a
painstaking analysis of typical system behaviour over a wide range of factors.
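The post-coordinate searching mentioned above assigns individual terms to documents at indexing time and coordinates them only at search time, typically by Boolean combination. The following is a minimal sketch of that idea, assuming a simple inverted index and AND-coordination; the function names and toy data are hypothetical, not drawn from any system actually tested:

```python
# Illustrative sketch of post-coordinate searching (hypothetical data).
# Terms are assigned to documents independently; coordination happens
# at search time by intersecting the posting sets of the query terms.
from collections import defaultdict

def build_inverted_index(documents):
    """Map each index term to the set of document ids carrying it."""
    index = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def post_coordinate_search(index, query_terms):
    """Retrieve the documents indexed under every query term."""
    postings = [index[term] for term in query_terms]
    return set.intersection(*postings) if postings else set()

docs = {
    1: {"boundary", "layer", "flow"},
    2: {"boundary", "layer", "heat", "transfer"},
    3: {"supersonic", "flow"},
}
index = build_inverted_index(docs)
print(post_coordinate_search(index, ["boundary", "layer"]))  # {1, 2}
```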
These developments are in many ways illustrated by the objectives of the
Cranfield 2 test, its results, and its subsequent influence on information
retrieval research. The test was designed to compare the performance of
different indexing languages, including simple natural language, in a
laboratory environment. It was intended to improve on Cranfield 1 in
experimental design, and was systematically conducted. The most striking
result, the competitive performance of post-coordinate natural language
terms, was not expected, but has been largely supported by subsequent
experiments. The test also indicated that only moderate levels of performance
could be expected of retrieval systems. The inverse relationship between
recall and precision was clearly displayed, and subsequently adopted as a
generalization about retrieval systems, if not as a law.
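For reference, recall and precision are conventionally defined over the sets of relevant and retrieved documents; this is the standard formulation, not a quotation from the Cranfield report:

\[
\text{recall} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{relevant}\,|},
\qquad
\text{precision} = \frac{|\,\text{relevant} \cap \text{retrieved}\,|}{|\,\text{retrieved}\,|}
\]

Broadening a search tends to recover more of the relevant documents, raising recall, but usually admits proportionally more non-relevant documents, lowering precision; hence the trade-off the test displayed.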
The test was criticized for methodological inadequacies, for example in
the way the test data was generated, and for being too small in scale. Further,
while the test's concern with indexing languages and their application was
clearly directed at the core of retrieval systems, important factors, especially
those involving users, were not studied. It has also been argued that recall
and precision have been overvalued as measures and, more generally, that
the bottom-up approach to the understanding of retrieval systems represented
by Cranfield-type experiments may be unproductive or misleading, and that
a top-down approach, guided by a theory or model, is preferable.
The impact of the Cranfield 2 test on later research has nevertheless been
considerable. Specific projects have applied Cranfield procedures and used
the Cranfield test data for comparative purposes. More broadly, much of the
experimental research done since Cranfield 2 has been in the same tradition,