IRE Information Retrieval Experiment The Smart environment for retrieval system evaluation-advantages and problem areas chapter Gerard Salton Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

(a) the use of recall and precision measures to evaluate retrieval systems is objectionable because the user is not interested in merely retrieving relevant items, but rather wants useful items that were previously unknown to him;

(b) in an iterative feedback environment where search results obtained with earlier query formulations are used to generate improved query statements, the new formulations may retrieve items already seen by the user in an earlier search operation; this circumstance falsifies the evaluation measurements unless special precautions are taken [25];

(c) a number of different strategies may be used to produce evaluation measurements valid for a collection of different users: each user query may be given the same importance regardless of the number of relevant documents the user wishes to retrieve (macroevaluation); on the other hand, each relevant document may be weighted equally, so that a complete response to a query with twenty relevant items would be worth twenty times as much as the response to a query with one relevant item (microevaluation) [26].

The list of evaluation problems can be extended, and in principle each objection exhibits merit.
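The distinction drawn in item (c) between macroevaluation and microevaluation can be made concrete with a small calculation. The following is a minimal sketch, not code from the Smart system: the function names `macro_recall` and `micro_recall`, the tuple layout, and the two sample queries are all assumptions chosen for illustration. It uses the text's own case of one query with twenty relevant items and one query with a single relevant item.

```python
from typing import List, Tuple

# Hypothetical layout: (relevant items retrieved, total relevant items) per query.
QueryResult = Tuple[int, int]


def macro_recall(results: List[QueryResult]) -> float:
    """Macroevaluation: average the per-query recall, so each query counts equally."""
    per_query = [hits / total for hits, total in results if total > 0]
    if not per_query:
        raise ValueError("no query with at least one relevant item")
    return sum(per_query) / len(per_query)


def micro_recall(results: List[QueryResult]) -> float:
    """Microevaluation: pool all queries, so each relevant document counts equally."""
    hits = sum(h for h, _ in results)
    total = sum(t for _, t in results)
    if total == 0:
        raise ValueError("no relevant items in any query")
    return hits / total


# Illustrative data: a complete response to a query with twenty relevant items,
# and a failed response to a query with one relevant item.
sample = [(20, 20), (0, 1)]

print(macro_recall(sample))            # 0.5
print(round(micro_recall(sample), 3))  # 0.952
```

Under macroevaluation the failed single-item query pulls the average down to one half, whereas under microevaluation the twenty-item query dominates because it contributes twenty of the twenty-one pooled relevant documents.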
In some cases, precautions can be taken to avoid the more obvious pitfalls, and sometimes specific tests can be performed to resolve a particular question, such as the one relating to the variability of the relevance assessments obtained from different user groups. In the case of the Smart environment, many test results are available, obtained under differing circumstances with document collections in diverse subject areas and widely differing user populations, and on the whole the results fall into well-defined patterns. By and large, the results do not vary between different document collections and user groups, and the simpler, better understood methodologies generally prove more effective than more refined procedures that may be difficult to carry out in practice. The methodological objections (other than the obvious ones relating to the restricted collection sizes used in the laboratory) appear to cover second-order effects that are unlikely to invalidate the overall conclusions drawn from the experiments.

15.4 Theoretical insights

The practical effects of the Smart experiments on the operations of most commercial retrieval services may have been relatively small. One can nevertheless point to a number of second-order developments in operational environments: the introduction of global retrieval evaluation measures such as normalized recall and normalized precision [26]; the adoption of relevance feedback-like procedures in some operational situations [27]; and the use of automatic document classification [28] and automatic term classification methods [29,30] as an enhancement of the more conventional retrieval methods.

The Smart system work has been more influential in creating a new framework for examining the retrieval process. The introduction of the vector processing model, in particular, has led to a re-examination of certain well-established tenets in information and document processing. Consider, for example, the automatic indexing task. Indexing consists in