Information Retrieval Experiment, edited by Karen Sparck Jones. Chapter: Laboratory tests: automatic systems, by Robert N. Oddy. Butterworth & Company.

of relevance are adopted for evaluation. Recently, theories for information retrieval have emerged out of the background of experimental work, and they are founded upon the same abstractions. Robertson and Belkin [44] have made a distinction between two principles for ranking documents in response to a query: one can rank according to probability of relevance or according to degree of relevance. Probabilistic theories assume that relevance is a boolean variable; that is, it can take one of two values: relevant or non-relevant. Systems based upon matching functions or similarity measures, e.g. co-ordination level and cosine correlation, would appear to be estimating the degree of relevance, although evaluation is usually done with dichotomous relevance judgements. Other assumptions typically made about relevance are that, for any document-query pair, the relevance judgement is independent of time and of the other relevance judgements. The idea of relevance in the context of real information needs is complex and poorly understood (information retrieval research can be viewed as our attempt to understand it) and has been the subject of a substantial literature, to which I refer the reader through Saracevic's excellent review article [45].

A document retrieval system user generally makes a series of decisions about documents.
First, he may make a note (perhaps mental) of the existence of the document; then he may decide to look at the document's contents; finally, he may decide to make use of those contents in his own work. All of these decisions can be regarded as relevance judgements, and the outcome of each obviously depends upon the enquirer's perception of the document, the purposes of the enquirer, and his existing knowledge. By his perception of the document, I mean which aspects of its description or content the enquirer sees (title, abstract, index terms, for example), and in what circumstances he sees them (online or in a batch printout). The 'cognitive view' of perception [46] is that perceived objects are interpreted through the knowledge, or world model, of the perceiver. Online systems are often provided with so-called browsing facilities, presumably to encourage the interleaving of mechanical and intellectual effort (recommended by Doyle [47], for instance). Unfortunately, the high cost of using today's online services discourages many users from taking the time to contribute significant intellectual effort during the search. (Let us hope that this is a temporary situation.) Nevertheless, even under these circumstances, a user's state of knowledge relating to his purpose changes during a search. The purpose itself may also undergo change, if Belkin's [48] analysis of the information retrieval situation is to be accepted. A user comes to an information retrieval system because his state of knowledge is, in some way, anomalous; that is, he has recognized that his mental world model cannot cope with his problem in hand. It must be assumed that he may not be able to specify what information is needed to resolve the anomaly. So his conceptualization of his purpose in searching the literature is subject to modification as his knowledge, and thus the anomaly in his knowledge, changes.
The consequence of all this is that relevance is dependent upon three factors related to the user (perception, purpose and knowledge), which are causally closely related to each other, and subject to variation in the course of an interactive search. The picture of relevance decisions that we are obtaining is very different in nature from the relevance judgements included in test collections. Thus my answer to question (2) is clearly 'No'. What implication does this argument have for the results of laboratory