means of these rules to human-assigned indexing, and to a simpler automatic
method. The comparison was intended to be indicative of the possible quality
of the technique, rather than a definitive test of the theory.
Because of the scale of the project, and because of the difficulty of setting
up a test of a complete information storage and retrieval system using
different kinds of indexing, Harter decided to restrict the experiment to the
indexing stage only. He therefore required a collection of documents, indexed
by a human indexer, but no actual requests. The document collection also
had to be in some sense realistic (as collection, not just as individual
documents), and the documents had to be in continuous text form. Harter
chose an existing collection of 650 abstracts of the work of Sigmund Freud.
The whole collection was used for the statistical analysis of term occurrence
on which the automatic indexing rule was based, and the actual comparison
of index terms was carried out on a random sample of 38 of these.
For the purposes of the experiment, the human-assigned indexing was
regarded as the norm, and the object of the two automatic methods was
assumed to be to duplicate as far as possible this norm. Thus the rationale of
the experiment depends heavily on the assumption that the human-assigned
indexing is 'good'. Indeed, one might regard this procedure, in terms of the
archetype, as using the set of human-assigned index terms as artificial single-
term queries, and the human assignments as relevance judgements. Harter's
use of a genuine collection of documents but highly artificial queries is
justified by the aims and circumstances of the test.
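To make the logic of such a comparison concrete, the sketch below shows one way the agreement between automatic and human-assigned terms could be quantified when the human indexing is taken as the norm. It is a minimal illustration, not Harter's actual procedure; the function name and the sample terms are assumptions introduced here for clarity.

```python
# Minimal sketch (not Harter's actual procedure): scoring automatic
# indexing against human-assigned terms treated as the norm.
# All names and the sample data below are illustrative assumptions.

def indexing_agreement(human_terms: set[str], auto_terms: set[str]) -> tuple[float, float]:
    """Return (recall, precision) of the automatic terms against the human norm."""
    if not human_terms or not auto_terms:
        return 0.0, 0.0
    overlap = len(human_terms & auto_terms)
    recall = overlap / len(human_terms)     # share of the norm that was reproduced
    precision = overlap / len(auto_terms)   # share of automatic terms endorsed by the norm
    return recall, precision

# Hypothetical example for one abstract in the sample:
human = {"transference", "dreams", "repression"}
automatic = {"dreams", "repression", "analysis"}
print(indexing_agreement(human, automatic))  # (0.666..., 0.666...)
```

On this view, each human-assigned term plays the part of a single-term query, and agreement with the human assignment stands in for a relevance judgement.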
So Harter and Oddy each chose to make certain aspects of their respective
experiments as realistic as possible, but to allow artificiality in others, in
effect selecting from the archetype in a manner appropriate to their objectives
and resources.
Portable test collections
It will be clear from all that has gone before that any retrieval test involves
a considerable amount of effort, much of which goes into setting up the test
collection: that is, the collection of documents, requests and relevance
judgements. Even in an operational environment, where the document
collection (with indexing) is given, the queries have to be trapped at a
suitable point, and the relevance judgements obtained. Many laboratory
tests also involve some kind of indexing; and in any case, laboratory testers
seldom have easy access to sources of queries and relevance judgements.
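Since a test collection is simply the bundle of documents, requests and relevance judgements, it can be packaged once and re-used. The sketch below shows one plausible representation of such a portable collection; the class and field names are illustrative assumptions, not the format of any particular collection.

```python
# Minimal sketch of a portable test collection: documents, requests and
# relevance judgements packaged together for re-use across experiments.
# Field names are illustrative, not tied to any real collection's format.
from dataclasses import dataclass, field

@dataclass
class TestCollection:
    documents: dict[str, str]   # doc_id -> document text (or its index terms)
    requests: dict[str, str]    # request_id -> query text
    relevance: dict[str, set[str]] = field(default_factory=dict)  # request_id -> relevant doc_ids

    def relevant_docs(self, request_id: str) -> set[str]:
        return self.relevance.get(request_id, set())

# A re-used collection is loaded once and handed to each new experiment,
# so the costly relevance judgements need be gathered only once.
```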
For these reasons, it has become common for complete test collections to
be passed from researcher to researcher, and re-used many times. The best-
known collection to suffer this fate is certainly the collection used in the
second Cranfield experiment, described above; indeed, it would be fair to say
that this collection has been grossly over-used, in the sense that it has been
used for experiments which were far removed from those for which it was
designed. On the other hand, given the existence of such a collection, a
researcher in a laboratory environment is unlikely to feel justified or
motivated to set up a new one.
There are in fact a number of collections which are used in this way:
indeed, there are researchers who have become, de facto, the brokers for such
collections, notably K. Sparck Jones in the UK and G. Salton in the USA.