Information Retrieval Experiment
Chapter 9: Laboratory tests: automatic systems
Robert N. Oddy
Editor: Karen Sparck Jones
Butterworth & Company
theoretical considerations which motivated the system design. (I am
thinking, particularly, of relevance feedback techniques8,9.)
(2) Conceptual simplification. Very often, extremely complex phenomena, to
which the behaviour of real-life systems is closely linked, enter the
laboratory test as simple abstractions: for example, an information need
may become a list of index terms, and a relevance judgement a truth-
value. The significance that we place upon test results will depend upon
the appropriateness of these abstractions. This problem is not specific to
automated laboratory testing, but I wish to mention it here because it is
more easily brushed aside when users are not present, reminding one of
the simplifications that have been made.
(3) Extrapolation problems. It is usually necessary to work with small
samples of document collections and queries in the laboratory, so that it
is feasible to assemble complete relevance judgements, and so that the
demands on computer resources are not excessive. The statistical problem
of extrapolating from laboratory results to real-life systems is severe. This
chapter is not directly concerned with the problem (see Chapter 2); but
it should be borne in mind in the present context.
(4) Technical faults. It is remarkably difficult to ensure the correctness of a
computer program of any substantial size, and often it is not at all
obvious from its behaviour that a program is faulty. In principle,
therefore, we have the problem of demonstrating the credibility of test
results.
(5) Communication difficulties. A detailed account of the workings of a
program makes for very heavy reading, and is often unnecessary in
research papers: a summary, or description by analogy, is usually more
illuminating. There is, however, a danger of ambiguity if programs so
described contain unobvious interpretations of the theory, or ad hoc
procedural `theories'.
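The abstraction described in point (2) can be made concrete. The following is a minimal sketch, in Python, of how a laboratory test might reduce an information need to a list of index terms and a relevance judgement to a truth value; all names here are illustrative, not drawn from any particular test system:

```python
from dataclasses import dataclass

# The laboratory abstraction: an information need becomes a
# tuple of index terms; a relevance judgement becomes a truth value.
@dataclass(frozen=True)
class Query:
    query_id: int
    terms: tuple  # e.g. ("indexing", "automatic", "evaluation")

def is_relevant(judgements, query_id, doc_id):
    """Look up the truth-value abstraction of a relevance judgement."""
    return (query_id, doc_id) in judgements

# Hypothetical data: the (query, document) pairs judged relevant.
judgements = {(1, 10), (1, 42)}
q = Query(1, ("indexing", "automatic"))
print(is_relevant(judgements, q.query_id, 42))  # True
print(is_relevant(judgements, q.query_id, 7))   # False
```

The sketch makes the simplification visible: everything about the user's actual need and judgement behaviour that is not captured in `terms` and in the set of judged pairs is lost before the test begins.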
Before I attempt an assessment of the methodology, I shall give a fairly
detailed description of the nature of data and programs used in the
mainstream of automatic laboratory information retrieval testing.
9.2 The test collection
The data used for information retrieval laboratory testing has been the
subject of some discussion recently20-22. Retrieval experiments make use of
what are commonly referred to as `test collections'. Such data consists of a
static collection of document descriptions, queries and relevance judgements.
The numbers of documents and queries are usually small so that the labour
required for setting up the test collection (particularly for obtaining complete
relevance judgements) is kept within reasonable bounds. This has had serious
consequences for the acceptability of the results obtained in laboratory
situations, and it is consideration of this problem that has led Sparck Jones
and van Rijsbergen23 to propose that a large `ideal' test collection be designed
for use by a wide range of experimenters. I shall return to this idea presently,
but should like first to deal with the small test collections traditionally used
in tests of automatic information retrieval systems.
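A test collection of the kind just described — a static set of document descriptions, queries, and complete relevance judgements — can be sketched as a small data structure, together with the precision and recall figures that complete judgements make computable. This is a hypothetical toy illustration, not any of the collections discussed in this chapter:

```python
# A toy 'test collection': static document descriptions, queries,
# and complete relevance judgements (every document judged per query).
documents = {1: ["retrieval", "evaluation"],
             2: ["indexing", "automatic"],
             3: ["retrieval", "indexing"]}
queries = {"q1": ["retrieval"]}
relevant = {"q1": {1, 3}}  # complete judgements for q1

def run_query(terms, docs):
    """Naive retrieval: return documents sharing any query term."""
    return {d for d, desc in docs.items() if set(terms) & set(desc)}

def precision_recall(retrieved, rel):
    """Standard effectiveness measures, computable only because
    the relevance judgements are complete."""
    hits = len(retrieved & rel)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(rel) if rel else 0.0
    return precision, recall

retrieved = run_query(queries["q1"], documents)
print(precision_recall(retrieved, relevant["q1"]))  # (1.0, 1.0)
```

The point of the sketch is the dependence on completeness: recall has a known denominator only because every document has been judged, which is precisely why small collections are attractive and why extrapolation from them is problematic.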
The raw data for a test collection can take a number of forms. The whole