Information Retrieval Experiment
Edited by Karen Sparck Jones
Chapter: Laboratory tests: automatic systems
Robert N. Oddy
Butterworth & Company
raw data of the test collection. Full text retrieval systems, such as STATUS30
and STAIRS31, fall into this category, although they do not appear to have
been tested in laboratory conditions. Very recently, this sort of data has been
used by Belkin and Oddy19 (who are investigating the computer modelling of
anomalous states of knowledge) to generate associative structures from
individual texts. Simple files of term postings, weighted or unweighted, are
not adequate for this purpose.
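For concreteness, a term-postings file of the simple kind just mentioned can be sketched as a mapping from terms to weighted postings; the term and document identifiers below are invented for illustration and are not taken from any of the systems cited:

```python
# A minimal sketch of a weighted term-postings file: each term maps to
# the documents it indexes, with a weight per posting. All identifiers
# and weights here are invented for illustration.
postings = {
    "retrieval": [("d1", 0.8), ("d3", 0.5)],
    "experiment": [("d2", 1.0), ("d3", 0.4)],
}

def docs_for(term):
    """Return the documents posted under a term (empty list if unseen)."""
    return [doc for doc, _weight in postings.get(term, [])]
```

Such a file supports term matching and weighting, but it preserves nothing of a document's internal text structure, which is why it cannot support the associative processing of individual texts described above.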
For the most part, information retrieval test collections are small: typical
numbers of documents are 200, 424, 800, 1400, 11 518; and queries number
42, 24, 63, 221, 193. Robertson has discussed, in the present volume and
elsewhere22, the difficulties of extrapolating results obtained on such small
samples. At this point, I shall merely mention some reasons for this state
of affairs. First, data collection often involves a great deal of drudgery and
cost for the researcher or his assistants. Suitable documents and queries have
to be selected, perhaps indexed manually, and prepared in a convenient
machine-readable form. Relevance judgements must be made, either by the
originator of the query, or by a subject expert. For exhaustive data, the
number of decisions required is the product of the number of documents and
the number of queries. The largest collection that I am aware of for which
exhaustive relevance judgements have been made is the Cranfield 2 test
collection assembled by Cleverdon, Mills and Keen14. There are 1400
documents and 221 queries; thus 309 400 relevance decisions were made. An
experimenter will naturally prefer exhaustive data to simplify the evaluation
methodology, particularly if he is concerned with a relevance feedback
mechanism. Other reasons for the use of small test collections are related to
the computing aspects. Many processes to which the data are subjected
(classification procedures, matching and ranking, for example) consume
quantities of computer time which depend on collection size factors (numbers
of documents, terms and queries) in a worse than linear fashion1,2. The
experimenter is therefore obliged to pay some attention to computational
efficiency. The most productive experimenters have made use of large, fast
computers. With such equipment it is possible to hold in core storage
substantial parts (or extensive derived structures) of a test collection of
several hundred documents. A program will run very much faster in this
circumstance than if it must make frequent reference to several disk files, and
so the experimenter will obtain a speedy job turnaround from what is often
an over-subscribed university computer service. Paradoxically, an
experimenter with a small computer will tend to write programs which can cope
with larger files, because for him, even a small test collection is a large data
structure. (But his experimental progress will, of course, be slower.)
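The worse-than-linear behaviour mentioned above is easy to see in the simplest classification setting: computing every document-document similarity, as many clustering procedures require, takes N(N-1)/2 comparisons for N documents. A minimal sketch, with invented data and a Dice coefficient chosen purely for illustration:

```python
# Building the full document-document similarity matrix, as many
# classification (clustering) procedures require, is a worse-than-linear
# process: N documents give N*(N-1)/2 unordered pairs to compare.
# Documents are represented as term sets; all data are invented.

def similarity(a, b):
    """Dice coefficient between two term sets (one of many possible measures)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def similarity_pairs(docs):
    """Similarity for every unordered pair of documents: O(N^2) comparisons."""
    ids = sorted(docs)
    return {
        (i, j): similarity(docs[i], docs[j])
        for x, i in enumerate(ids)
        for j in ids[x + 1:]
    }

docs = {
    "d1": {"index", "term", "query"},
    "d2": {"term", "query", "relevance"},
    "d3": {"cluster", "rank"},
}
pairs = similarity_pairs(docs)
# 3 documents yield 3 pairs; doubling N roughly quadruples the pair count.
```

Doubling the collection roughly quadruples the number of comparisons, which is one reason why an experimenter on modest hardware is forced to attend to file organization and algorithmic efficiency.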
9.3 Laboratory programs
The programs used in experimental work differ in a number of ways from
those which would be used in an operational environment to achieve the
equivalent processes. Reasons for this are that no interface is required for a
human searcher, the goals of experimental programs are not the same as
operational ones, and the test collection is assumed to be relatively small and
static, with known bounds for all its dimensions. It is common for an