Information Retrieval Experiment
The methodology of information retrieval experiment
Stephen E. Robertson
Edited by Karen Sparck Jones
Butterworth & Company
The documents were 1400 real documents on the subject of aeronautics,
selected rather than sampled. The 221 requests were obtained by asking the
authors of selected published papers ('base documents') to reconstruct the
questions which originally gave rise to these papers.
The experimental design was quite simple: each query was searched
against every system. Since the searching part of the system was controlled
by simple rules, there was no problem in relation to replicating searches or
the order in which the systems were tried.
Measurements were made of relevance and of a number of explanatory
variables. An attempt was made to obtain complete relevance judgements.
The procedure adopted was as follows: students of the subject searched the
entire document collection (starting with titles but consulting the full
document if there seemed any possibility of relevance) against each of the
requests. Documents selected by them as possibly relevant to any request
were subject to final judgement by the author/requester, together with (a) the
references given in the base documents, and (b) documents retrieved by one
very different kind of retrieval technique.
The analysis was chiefly directed at calculating recall and precision
averages, and relating these to the variables built into the experiment
(concerning the construction of the index language) and to various
explanatory variables such as exhaustivity of indexing and specificity of the
language.
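The set-based measures referred to here can be illustrated with a small sketch. This is not the Cranfield computation itself, and the document sets below are hypothetical; it only shows recall and precision per query, macro-averaged so that each request counts equally.

```python
# Illustrative recall/precision computation (hypothetical data, not the
# Cranfield figures): each query contributes a set of retrieved document
# IDs and a set of documents judged relevant.

def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query as set-based measures."""
    hits = len(retrieved & relevant)          # relevant AND retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Macro-averaging: compute the measures per query, then average them,
# so every request is weighted equally regardless of output size.
queries = [
    ({1, 2, 3, 4}, {2, 3, 5}),   # (retrieved, relevant) for query 1
    ({6, 7},       {6, 7, 8, 9}),  # query 2
]
pairs = [recall_precision(ret, rel) for ret, rel in queries]
avg_recall = sum(r for r, _ in pairs) / len(pairs)
avg_precision = sum(p for _, p in pairs) / len(pairs)
```

Averaging per query first (rather than pooling all counts) matches the usual practice of reporting performance over a request set, since a single query with a very large output cannot then dominate the average.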
Medlars
The object of the Medlars test was to evaluate the existing Medlars system
and to find out ways in which it could be improved. A few variables were
built into the experiment, notably the form of interaction between the user
and the system, so that the results obtained with different forms of interaction
could be compared; but the main feature of the test was a detailed analysis
of the reasons for failure.
The document collection was that currently available on the Medlars
service, and consisted of about 700 000 items. A total of 302 genuine queries were
obtained by a form of stratified sampling. Because the requests were real
ones, it was not possible to replicate searches with different forms of
interaction between system and user. Hence the comparisons in relation to
this variable had to be based on different request sets.
Relevance judgements were provided by the requesters. Since in such a
situation there could be no question of scanning the entire collection, the
testers went to considerable effort to discover some relevant documents that
had not been found by the system (such documents were necessary for the
failure analysis). The sources for these documents were (a) those already
known to the requester, and (b) documents found by Medlars staff through
sources other than Medlars or Index Medicus. Thus each requester was
asked to judge a sample of the output from the Medlars search, together with
selected documents from other sources.
After the relevance judgements had been obtained, the measurement
process continued with an analysis of failures (non-relevant retrieved and
relevant not retrieved). A classification of reasons for failure was devised.
Cranfield 2 and Medlars are two of the classic experiments, both playing