Information Retrieval Experiment
Chapter 9: Laboratory tests: automatic systems
Robert N. Oddy
Edited by Karen Sparck Jones
Butterworth & Company
First, there are collection parameters: of what does the raw data consist, and
how is it initially processed to form the document descriptions, queries, and
relevance judgements? Indexing can be manual or automatic, and based on
titles or abstracts; exhaustivity and specificity can vary; thesauri and
stemming procedures can be used to normalize vocabulary; weights can be
assigned to index terms; various degrees of relevance may be taken into
account. Second, there are variations in retrieval system features. Queries
and document descriptions can be matched in a number of ways; relevance
judgements relating to retrieved documents can be used in a variety of ways
to improve queries; complex structures derived from the simple collection,
such as term classes and document clusters, can be exploited in retrieval. All
of these factors can, in principle, be controlled by the experimenter, and
another way to categorize them is according to the practical difficulties of
controlling them. In an automatic laboratory information retrieval system,
any single parameter or feature, or any combination of parameters and
features, may be varied independently of all other features, whether it makes
sense to do so, or not! Factors which can be changed by small to moderate
amounts of computer programming, of course, are the ones presenting least
difficulty. These include retrieval system features and some of the indexing
parameters. Not only is it practically straightforward to vary these factors,
but it can also be done very precisely; in fact, it must be so done because the
'values' are coded into programs. Difficulties arise when controlled variability
is desired in the characteristics of any data which is generated intellectually.
Instances are the use of (conventional) thesaural relations in manual indexing,
and the judges and scales used for relevance data. Precise control of these
factors is not possible in the same way as for computational factors, because
they are not so easily quantifiable or, in the case of procedural factors,
specifiable. In addition, alternative forms of the test collection are required
for different values of these variables, involving the experimenter in
considerable effort and expense. Consequently, very little experimentation
has been done with such variables.
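To make the contrast concrete, the following sketch (in Python; all names and the toy data are invented for illustration and are not drawn from any system discussed here) shows how the computationally controlled factors behave as ordinary program parameters, stemming and term weighting being switchable independently of each other, while the intellectually generated relevance judgements simply enter as fixed input that the experimenter cannot vary by programming alone.

```python
# A minimal, hypothetical sketch: computationally controlled factors
# (stemming, term weighting) are ordinary parameters that can be varied
# independently, whereas the relevance judgements are fixed input data.
import math
from collections import Counter

def tokenize(text, stem=False):
    """Crude tokenizer; 'stemming' here is just plural-suffix stripping."""
    words = [w.lower() for w in text.split()]
    if stem:
        words = [w[:-1] if w.endswith('s') else w for w in words]
    return words

def weight(tf, df, n_docs, scheme='tf-idf'):
    """Term weighting is a switchable experimental parameter."""
    if scheme == 'binary':
        return 1.0
    if scheme == 'tf':
        return float(tf)
    return tf * math.log(n_docs / df)          # 'tf-idf'

def run_experiment(docs, query, relevant, stem=False, scheme='tf-idf'):
    """One laboratory run; each keyword argument is an experimental factor."""
    n = len(docs)
    doc_terms = [Counter(tokenize(d, stem)) for d in docs]
    df = Counter(t for terms in doc_terms for t in terms)  # document frequency
    q_terms = Counter(tokenize(query, stem))
    scores = []
    for i, terms in enumerate(doc_terms):
        s = sum(weight(terms[t], df[t], n, scheme)
                for t in q_terms if t in terms)
        scores.append((s, i))
    ranking = [i for s, i in sorted(scores, reverse=True) if s > 0]
    hits = len([i for i in ranking if i in relevant])
    precision = hits / len(ranking) if ranking else 0.0
    return ranking, precision

docs = ["retrieval experiments with document clusters",
        "manual indexing and thesaurus construction",
        "automatic indexing experiments"]
relevant = {0, 2}                 # fixed, intellectually produced data
for stem in (False, True):        # factors varied independently of each other
    for scheme in ('binary', 'tf-idf'):
        print(stem, scheme,
              run_experiment(docs, "indexing experiments",
                             relevant, stem=stem, scheme=scheme))
```

The point of the sketch is only that the inner loop enumerates factor combinations mechanically; producing an alternative set of relevance judgements, by contrast, would require repeating the intellectual work itself.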
It is the hope of the experimenter, whether engineer or theoretician, that
results obtained in the laboratory would also be obtained in real life, should
an equivalent experiment be conducted. If that were so, he would be in a
position to make very precise recommendations concerning the organization
of information within the system and the retrieval algorithms which would
optimize performance. Unfortunately, this extrapolation is highly
problematic. In the first place, the scale of most test collections is very different
from that of operational databases. I shall not dwell on the very difficult
statistical problem of extrapolation here, but refer the reader to Robertson's
chapter (2) in the present volume, and to the illuminating discussion in his
thesis[22]. It is true that some laboratory tests have used collections of the order
of 10 000 documents[5, 6, 36], but these are deficient in relevance information,
and therefore difficult to use for retrieval tests.
9.5 Realism
Other problems have to do with realism: there are aspects of real life
information retrieval activities, mostly to do with user behaviour, which