a major part in creating the archetype as I have described it. Clearly they lie
at opposite poles of the operational-laboratory spectrum, Cranfield 2 being
a highly controlled and artificial experiment, and Medlars being an
investigation (in the sense defined in the editor's introduction) of an
operational system, as far as possible under realistic conditions. However,
between them they illustrate well the main characteristics of the archetype.
In particular, Cranfield illustrates the necessity for having a complete
system, even if only part of it is under test. Both tests used genuine
documents; Medlars used genuine queries and Cranfield artificial (or
reconstructed) ones. Cranfield used an experimental design involving
replicated searches; Medlars could not. Both tests used relevance judgements
by the requester; in both cases this precluded exhaustive scanning of the
collection, though for Cranfield one might assume that the relevance sets are
almost complete; and so on.
The two experiments described below are on a smaller scale, with more
limited objectives (each, in fact, forming part of a PhD project).
Oddy: Thomas
R. N. Oddy developed a program for computer searching, called Thomas,
with a strong interactive facility. The basic idea was that the system should
build up, from its dialogue with the user, an internal image of the user's need.
Oddy conducted a test of the program, designed to establish its feasibility
and some approximate idea of its quality, rather than to measure its
performance in any very refined sense.
For the document collection, a selection of 225 references (complete with
indexing) was made from the Medlars data base; for the queries, 32 searches
resulting from genuine requests put to the Medusa system were used. Since
relevance judgements on the output of Medusa searches were obtained as a
matter of course by the system, these were available to Oddy for the test of
Thomas.
The test itself involved simulating a user interaction with the system, on
the basis of all the information available to Oddy (statement of the request,
record of the search process on Medusa, and relevance judgements on
the output). Clearly this information is incomplete, in respect of both the
search process that the user might have followed with Thomas and the
relevance judgements (the relevance judgements affect the search process as
well as the evaluation of the results). Also the very limited selection (not
sample) of documents makes generalizing from such an experiment even
more dangerous than usual. However, in the context of the limited aims of
the test, Oddy's methods are appropriate. His insistence on using genuine
requests and relevance judgements, while remaining unconcerned with the
artificiality of other aspects of the test, is strictly in keeping with the
philosophy of Thomas, and seems eminently reasonable in the circumstances.
Harter: probabilistic indexing
S. P. Harter has developed a theory which can be used to derive rules for
automatic extraction indexing. In order to subject the theory to test, Harter
performed an experiment comparing the indexing derived automatically by