IRE: Information Retrieval Experiment. Chapter: Laboratory tests: automatic systems. Robert N. Oddy. Butterworth & Company. Karen Sparck Jones.

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

cannot easily be mirrored in laboratory systems, but which may have considerable impact on perceived performance. I should like to point out that these problems fall into two categories, although the categories are subtly interrelated and I shall be forced to discuss them together: some relate to what may be called parameters (environmental factors, system design features, charging algorithms, for instance) and their effect upon the retrieval effectiveness obtainable, and others relate to the goals of the user and the system and how effectiveness should be measured.

The debate on what comprises the effectiveness of an information retrieval system is long and involved. Notable contributions have been made by Cleverdon14, 37, Lancaster38, Cooper39 and a number of others. Van Rijsbergen12 restricts the term 'effectiveness' to refer to 'the ability of the system to retrieve relevant documents while at the same time holding back non-relevant ones' (p.145), and it is this type of effectiveness, and only this type, that is measured by very nearly all laboratory tests of automatic systems (one exception is a test by Oddy33 of a browsing mechanism in which measurements related to user effort were made). Relevance-based effectiveness measures are also used, inter alia, in real life experiments.
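For readers who want a concrete picture of relevance-based effectiveness, it is conventionally summarized by precision and recall. The following is only a minimal sketch in Python, assuming binary relevance judgements are available for a single query; the function name `precision_recall` and the document identifiers are hypothetical and are not drawn from any of the tests cited.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: identifiers of the documents the system returned
    relevant:  identifiers of the documents judged relevant
    Both are treated as sets, so duplicates and ranking are ignored.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents actually retrieved

    # Precision: fraction of the retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: fraction of the relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data for a single query.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.50 recall=0.67"
```

Nothing in this calculation refers to user effort, response time, or cost: it takes the set of relevance judgements as given.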
Now, in order to establish a fruitful relationship between the laboratory tests and their hypothetical real life analogues, we must ask two questions:

(1) Is relevance-based effectiveness safely separable from other performance characteristics for experimental purposes?
(2) Is relevance in real life the same as relevance in laboratory tests?

Aspects of performance which may be regarded as important by users include the effort that they must expend, the response speed of the system, and the cost-effectiveness14, 40, 41. If a system is poor in any of these respects then, clearly, its achievements in the recall/precision domain may simply not be appreciated by the users. However, I think the connection between the different components of performance is deeper than that. System parameters such as the types and powers of storage devices, computer processors, and communication equipment, the complexity of algorithms, the ergonomics of terminal design, and the user interface facilities42 are all factors which strongly influence performance and which are not usually investigated in information retrieval tests. The assumption made is that the relevance of a document to a query does not depend on such aspects of performance. Relevance in tests is a simple abstract entity, a relation between queries and documents: any links between its real life correlate and characteristics like user effort and response time are disregarded. Of course, such links do exist; they are complex, and have yet to be investigated properly. They arise out of the cognitive activity of the user during the searching process. The user will normally be trying to fulfil some purpose, which will determine the use he makes of the system's output. His progress towards his objective, and thus his attitudes towards the search output, will vary as the search itself proceeds (of which more will be said presently).
Therefore, we must expect every apparent aspect of system behaviour to have some influence on relevance-based effectiveness measurements. I am aware of no experiment which attempts to quantify any of this class of effects, although the effects are widely acknowledged43, so I am unable to answer question (1), above. I have said that in laboratory tests, simple abstractions of the phenomenon