First, there are collection parameters: of what does the raw data consist, and how is it initially processed to form the document descriptions, queries, and relevance judgements? Indexing can be manual or automatic, and based on titles or abstracts; exhaustivity and specificity can vary; thesauri and stemming procedures can be used to normalize vocabulary; weights can be assigned to index terms; various degrees of relevance may be taken into account.

Second, there are variations in retrieval system features. Queries and document descriptions can be matched in a number of ways; relevance judgements relating to retrieved documents can be used in a variety of ways to improve queries; complex structures derived from the simple collection, such as term classes and document clusters, can be exploited in retrieval.

All of these factors can, in principle, be controlled by the experimenter, and another way to categorize them is according to the practical difficulties of controlling them. In an automatic laboratory information retrieval system, any single parameter or feature, or any combination of parameters and features, may be varied independently of all other features, whether it makes sense to do so or not! Factors which can be changed by small to moderate amounts of computer programming are, of course, the ones presenting least difficulty. These include retrieval system features and some of the indexing parameters. Not only is it practically straightforward to vary these factors, but it can also be done very precisely; indeed, it must be, because the 'values' are coded into programs.

Difficulties arise when controlled variability is desired in the characteristics of any data which is generated intellectually. Instances are the use of (conventional) thesaural relations in manual indexing, and the judges and scales used for relevance data. Precise control of these factors is not possible in the same way as for computational factors, because they are not so easily quantifiable or, in the case of procedural factors, specifiable. In addition, alternative forms of the test collection are required for different values of these variables, involving the experimenter in considerable effort and expense. Consequently, very little experimentation has been done with such variables.
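To make the contrast concrete, here is a minimal sketch, in Python, of how the computational factors appear as program parameters; the function names, the crude suffix-stripping stemmer and the two term-weighting options are illustrative assumptions, not features of any particular test system discussed here.

    from collections import Counter
    from math import log

    def stem(term):
        # Crude illustrative stemmer: strip a few common English suffixes.
        for suffix in ("ing", "ed", "es", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[:-len(suffix)]
        return term

    def describe(text, use_stemming=False):
        # Form a document or query description as a bag of index terms.
        terms = text.lower().split()
        if use_stemming:
            terms = [stem(t) for t in terms]
        return Counter(terms)

    def score(query, doc, n_docs, doc_freq, weighting="binary"):
        # Match a query description against a document description.
        # 'binary' counts shared terms (coordination-level matching);
        # 'idf' weights each shared term by its rarity in the collection.
        shared = set(query) & set(doc)
        if weighting == "binary":
            return len(shared)
        if weighting == "idf":
            return sum(log(n_docs / doc_freq[t]) for t in shared)
        raise ValueError("unknown weighting: " + weighting)

    # Every factor is an argument, so any combination of factors can be
    # varied independently of all the others, and always precisely.
    docs = ["automatic indexing of document titles",
            "manual indexing with a thesaurus",
            "clustering documents for retrieval"]
    described = [describe(d, use_stemming=True) for d in docs]
    doc_freq = Counter(t for d in described for t in set(d))
    query = describe("automatic document indexing", use_stemming=True)
    ranked = sorted(range(len(docs)),
                    key=lambda i: score(query, described[i], len(docs),
                                        doc_freq, weighting="idf"),
                    reverse=True)

Changing the matching function or switching stemming on and off is a matter of a changed argument or a little reprogramming; changing intellectually generated data, such as the relevance judgements, cannot be expressed in this way at all.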
It is the hope of the experimenter, whether engineer or theoretician, that results obtained in the laboratory would also be obtained in real life, should an equivalent experiment be conducted. If that were so, he would be in a position to make very precise recommendations concerning the organization of information within the system and the retrieval algorithms which would optimize performance. Unfortunately, this extrapolation is highly problematic. In the first place, the scale of most test collections is very different from that of operational databases. I shall not dwell on the very difficult statistical problem of extrapolation here, but refer the reader to Robertson's chapter (2) in the present volume, and to the illuminating discussion in his thesis[22]. It is true that some laboratory tests have used collections of the order of 10 000 documents[5, 6, 36], but these are deficient in relevance information, and therefore difficult to use for retrieval tests.

9.5 Realism

Other problems have to do with realism: there are aspects of real life information retrieval activities, mostly to do with user behaviour, which