Information Retrieval Experiment
Edited by Karen Sparck Jones
Butterworth & Company

The methodology of information retrieval experiment
Stephen E. Robertson
need, rather than simply to present documents in response to a formal
request.
The documents and requests
Almost all information retrieval system tests use genuine documents.
Are there alternatives? It is possible to do some simulation experiments
using pseudo-documents which are generated in some fashion (perhaps
involving Monte Carlo techniques) so as to imitate real ones in some specific
sense (they may, for example, imitate only the index sets associated with
documents, rather than the documents themselves). There is certainly a role
for such experiments, to answer specific research questions; but for most
purposes it is both better and easier to use the real thing. And there is
no shortage of documents around!
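To make the idea concrete, here is a minimal sketch (an illustration added here, not the chapter's own method) of Monte Carlo generation of pseudo-documents that imitate only the index sets associated with documents. The vocabulary size, index depth, and Zipf-like term distribution are all assumptions chosen for the sketch, in Python:

    import random

    VOCAB_SIZE = 1000    # assumed size of the index-term vocabulary
    INDEX_DEPTH = 12     # assumed number of index terms per document

    # Zipf-like weights: the term of rank r is drawn with weight 1/r.
    weights = [1.0 / r for r in range(1, VOCAB_SIZE + 1)]

    def pseudo_document():
        # Draw terms one at a time until INDEX_DEPTH distinct terms
        # accumulate; the result imitates a document's index set,
        # not the document itself.
        terms = set()
        while len(terms) < INDEX_DEPTH:
            term, = random.choices(range(VOCAB_SIZE), weights=weights)
            terms.add(term)
        return sorted(terms)

    collection = [pseudo_document() for _ in range(500)]

Such a simulated collection lets the experimenter vary indexing depth or term distribution at will, which is precisely the kind of specific research question to which simulation is suited.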
The only remaining question, then, is: Which documents? Here we run
into the vexed question of sampling, about which more below.
With regard to requests, the situation is considerably more problematic.
To be sure, there is (in principle, at least) no shortage of requests; but the
problems of obtaining real requests are several. First, we have to
catch them! Requests (in the sense of acts of requesting information) exist
only for a short space of time, and have to be trapped at that time. Second,
the actual locations of these request-acts are usually dispersed; a mechanism
for trapping them that is located at one place may take a long time to trap a
reasonable number or range of requests. Third, most test designs require, to
a greater or lesser extent, the co-operation of the requester. This co-operation
may be needed in connection with the operation of the system itself; it may
also be needed for the measurement of system output (as discussed below).
Fourth, all three of the previous difficulties combine to exacerbate the problem
of obtaining a sample of requests that is representative of anything.
These problems have prompted many testers to construct artificial queries.
Such artificial queries may vary in their degree of realism. Some examples
are discussed below.
Experimental design
Broadly, experimental design is concerned with arranging matters so that the
experiment does answer the question(s) it is intended to answer. Obviously
all the other components discussed above also come under this broad
heading. But the phrase is used in a somewhat narrower sense, referring to
those aspects of the design that determine whether, from a logical or
statistical point of view, appropriate inferences can be drawn. The simplest
question one can ask in this context is: How large a sample of documents
(say) do I need for this experiment? A more detailed example would be: How
should I use the different (human) searchers available, with the different
requests and alternative systems, so as to separate the effects I want to
measure (the difference between the systems) from those that I don't (any
differences between searchers or between requests, or interactions between
searchers and/or requests and/or systems)?
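One classical answer to this last question (offered here as a sketch, not as the chapter's prescription) is the Latin-square arrangement, in which each searcher handles each block of requests once and uses each system equally often, so that searcher and request effects can be separated from the system effect. The following Python fragment generates such a square; the searcher and system names are hypothetical:

    searchers = ["searcher 1", "searcher 2", "searcher 3"]   # hypothetical
    systems = ["system A", "system B", "system C"]           # hypothetical

    # Cyclic Latin square: rows are searchers, columns are blocks of
    # requests, and each cell names the system used. Every searcher
    # meets every system exactly once, and within each request block
    # all systems are in use.
    for i, searcher in enumerate(searchers):
        row = [systems[(i + j) % len(systems)] for j in range(len(systems))]
        print(searcher, "->", row)

Running it prints one row per searcher; an analysis of variance over a design of this kind can then attribute the observed variation to searchers, requests, and systems separately.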
Even within this narrower scope, experimental design has many