now or in the future. Is there any way of ensuring that this set of artificial
queries is representative of a real situation?
Some ways of obtaining artificial queries are: to ask some actual or
potential users for examples of queries they have put to any system in the
recent past; to ask intermediaries (librarians or information officers) for
examples of the sorts of queries put to them; to construct queries by some
random choice of index terms, in such a way as to duplicate known statistical
characteristics of real queries. Any such method obviously has at its heart the
intention to produce artificial queries which in some sense 'look like' real
ones. The problem is that we do not really know which characteristics of real
queries are the important ones to reproduce.
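By way of illustration only (the chapter prescribes no particular procedure), the following sketch shows one way the third approach might be mechanized: artificial queries are built by random choice of index terms, weighted so as to reproduce two statistical characteristics of a given set of real queries, namely the distribution of query lengths and the relative frequencies of the individual terms. The representation of queries as lists of index terms, and the choice of just these two characteristics, are assumptions of the sketch.

import random
from collections import Counter

def generate_artificial_queries(real_queries, n_queries, seed=0):
    # real_queries: a list of real queries, each a list of index terms.
    # The artificial queries duplicate two statistical characteristics of
    # the real set: the query-length distribution and the relative
    # frequency with which individual index terms occur.
    rng = random.Random(seed)
    lengths = [len(q) for q in real_queries]
    term_counts = Counter(term for q in real_queries for term in q)
    terms = list(term_counts)
    weights = [term_counts[t] for t in terms]

    artificial = []
    for _ in range(n_queries):
        # Sample a length from the empirical length distribution, then
        # draw that many distinct terms with frequency-proportional weights.
        length = min(rng.choice(lengths), len(terms))
        query = set()
        while len(query) < length:
            query.add(rng.choices(terms, weights=weights, k=1)[0])
        artificial.append(sorted(query))
    return artificial

# Hypothetical real queries, each a list of index terms.
real = [["retrieval", "evaluation"],
        ["indexing", "thesaurus", "recall"],
        ["relevance", "feedback"]]
print(generate_artificial_queries(real, n_queries=5))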
To a limited extent, one may test the representativeness of artificial
queries, provided some real queries are available in some form, by comparing
various measurable characteristics of the real and artificial sets
(any characteristics that one can think of). If a bias is detected in this way,
it may be possible to allow for it in the analysis of results. But such procedures
are of only limited value.
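A minimal sketch of such a check, for a single measurable characteristic, is given below. It compares the distribution of the characteristic (here, query length, though any measurable characteristic would do) in the real and artificial sets with a two-sample Kolmogorov-Smirnov test; the choice of that particular test, the use of the SciPy library, and the significance level are assumptions of the sketch, not recommendations of the chapter.

from scipy.stats import ks_2samp

def detect_bias(real_values, artificial_values, alpha=0.05):
    # real_values / artificial_values: the same measurable characteristic
    # (for example, query length) computed for each real and each
    # artificial query. A small p-value suggests the artificial set
    # differs from the real set on this characteristic, i.e. a bias
    # that may need to be allowed for in the analysis of results.
    statistic, p_value = ks_2samp(real_values, artificial_values)
    return {"statistic": statistic,
            "p_value": p_value,
            "bias_detected": p_value < alpha}

# Hypothetical query lengths for the two sets.
real_lengths = [2, 3, 2, 5, 4, 3, 2, 6]
artificial_lengths = [3, 3, 4, 4, 5, 3, 4, 5]
print(detect_bias(real_lengths, artificial_lengths))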
One possible statistical justification for using artificial queries is that we
could, in principle, generate queries of a range of different types to order
(that is, we could directly control some of the variables associated with
queries). This would be a stronger justification if we possessed a reasonable
typology of queries; at the moment, no such typology exists.
Laboratory tests
Again, in the foregoing discussion I have assumed that there is a specific
situation about which we wish to make inferences (even if it is a postulated
future situation). In laboratory tests, where we wish to make generalizations
about system design, this is not the case. How then can we begin to make
inferences?
If we have two alternative general hypotheses, then we can test them
against each other by the usual scientific methods. That is, we have to devise
an experiment from which the two hypotheses would predict different results.
Because of the vagaries of individual documents and queries, almost any
general hypothesis is bound to include some (explicit or implicit) statistical
element in its specific predictions: that is, no general hypothesis in
information retrieval (of any importance, at least) can be expected to make
deterministic predictions. (To take an extreme example, we would not expect
to be able to support or disprove a general hypothesis on the basis of a test on
two documents and one query.) Thus we can expect statistical considerations
to play a part in such hypothesis testing.
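As one concrete illustration of that statistical element (the chapter itself prescribes no method), the sketch below applies a paired sign test: each query on which the two competing system designs are run contributes one 'win' to whichever scores higher on some effectiveness measure, and the binomial distribution tells us how surprising the observed split would be if the two were in fact equivalent. The per-query scores, the choice of the sign test, and the use of SciPy are assumptions of the sketch.

from scipy.stats import binomtest

def sign_test(scores_a, scores_b, alpha=0.05):
    # scores_a / scores_b: one effectiveness figure per query for each of
    # the two systems being compared (the queries are paired).
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b                     # ties carry no information
    result = binomtest(wins_a, n, p=0.5)    # two-sided by default
    return {"wins_a": wins_a,
            "wins_b": wins_b,
            "p_value": result.pvalue,
            "difference_detected": result.pvalue < alpha}

# Hypothetical per-query scores for two alternative system designs.
a = [0.40, 0.55, 0.31, 0.62, 0.48, 0.50, 0.29, 0.44]
b = [0.35, 0.50, 0.33, 0.57, 0.41, 0.45, 0.27, 0.38]
print(sign_test(a, b))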
The problem is, how should we think about the experimental set-up in
statistical terms? We have to regard the documents and queries as a sample
from something (whether or not they actually are). The only way we can do
so in general is to consider the criteria by which they were selected (or
constructed), and define notional populations of all the (actual or potential)
documents or queries satisfying these criteria. Then we can hope to make
inferences about whether or not any general hypothesis holds for these
notional populations.
As in any scientific field, although we might possibly reject a general