Information Retrieval Experiment
Laboratory tests: automatic systems (chapter by Robert N. Oddy)
Edited by Karen Sparck Jones
Butterworth & Company

raw data of the test collection. Full text retrieval systems, such as STATUS30 and STAIRS31, fall into this category, although they do not appear to have been tested in laboratory conditions. Very recently, this sort of data has been used by Belkin and Oddy19 (who are investigating the computer modelling of anomalous states of knowledge) to generate associative structures from individual texts. Simple files of term postings, weighted or unweighted, are not adequate for this purpose.

For the most part, information retrieval test collections are small: typical numbers of documents are 200, 424, 800, 1400 and 11 518; and queries number 42, 24, 63, 221 and 193. Robertson has discussed the difficulties of extrapolating results obtained on such small samples in the present volume, and elsewhere22. At this point, I shall merely mention some reasons for this state of affairs. First, data collection often involves a great deal of drudgery and cost for the researcher or his assistants. Suitable documents and queries have to be selected, perhaps indexed manually, and prepared in a convenient machine-readable form. Relevance judgements must be made, either by the originator of the query or by a subject expert. For exhaustive data, the number of decisions required is the product of the number of documents and the number of queries.
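The scale of this burden is easy to compute. As a minimal illustration (the function name is mine, not the author's), the count of exhaustive judgements is simply the product of the two collection dimensions:

```python
# Exhaustive relevance assessment requires one judgement for every
# (document, query) pair, i.e. documents x queries decisions in all.
def exhaustive_judgements(n_documents: int, n_queries: int) -> int:
    return n_documents * n_queries

# Cranfield 2: 1400 documents and 221 queries.
print(exhaustive_judgements(1400, 221))  # 309400
```

Even a modest collection of 1400 documents and 221 queries thus demands over 300 000 human decisions, which explains why exhaustively judged collections are rare.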
The largest collection I am aware of for which exhaustive relevance judgements have been made is the Cranfield 2 test collection assembled by Cleverdon, Mills and Keen14. There are 1400 documents and 221 queries; thus 309 400 relevance decisions were made. An experimenter will naturally prefer exhaustive data to simplify the evaluation methodology, particularly if he is concerned with a relevance feedback mechanism.

Other reasons for the use of small test collections are related to the computing aspects. Many processes to which the data are subjected (classification procedures, matching and ranking, for example) consume quantities of computer time which depend on collection size factors (numbers of documents, terms and queries) in a worse than linear fashion12. The experimenter is therefore obliged to pay some attention to computational efficiency. The most productive experimenters have made use of large, fast computers. With such equipment it is possible to hold in core storage substantial parts (or extensive derived structures) of a test collection of several hundred documents. A program will run very much faster in this circumstance than if it must make frequent reference to several disk files, and so the experimenter will obtain a speedy job turnaround from what is often an over-subscribed university computer service. Paradoxically, an experimenter with a small computer will tend to write programs which can cope with larger files, because for him even a small test collection is a large data structure. (But his experimental progress will, of course, be slower.)

9.3 Laboratory programs

The programs used in experimental work differ in a number of ways from those which would be used in an operational environment to achieve the equivalent processes.
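The dependence of running time on collection size can be sketched as follows. This is not any particular experimenter's program, merely a toy exhaustive ranking run (with invented documents and queries) showing that even the simplest matching strategy performs work proportional to the product of the document and query counts, before any costlier classification procedure is considered:

```python
# Naive exhaustive ranking: score every document against every query.
# The comparison count grows as documents x queries, which is one reason
# collection size bears so directly on experimental computing cost.
from collections import Counter

documents = {                       # hypothetical toy collection
    "d1": "information retrieval test collection",
    "d2": "relevance feedback in retrieval systems",
    "d3": "computer modelling of knowledge",
}
queries = {
    "q1": "retrieval test",
    "q2": "relevance feedback",
}

def score(query: str, doc: str) -> int:
    """Simple coordination-level match: count terms shared by query and document."""
    return sum((Counter(query.split()) & Counter(doc.split())).values())

comparisons = 0
for qid, q in queries.items():
    ranking = []
    for did, d in documents.items():
        comparisons += 1
        ranking.append((score(q, d), did))
    ranking.sort(reverse=True)
    print(qid, [did for s, did in ranking if s > 0])

print(comparisons)  # 2 queries x 3 documents = 6 comparisons
```

Doubling both dimensions quadruples the comparison count, and procedures such as document classification are costlier still, so the pressure toward small collections on the machines of the day is evident.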
The reasons are that no interface is required for a human searcher, that the goals of experimental programs differ from those of operational ones, and that the test collection is assumed to be relatively small and static, with known bounds for all its dimensions. It is common for an