The Smart environment for retrieval system evaluation - advantages and problem areas
Gerard Salton
In: Information Retrieval Experiment, edited by Karen Sparck Jones. Butterworth & Company.

Suffice it to say that a large number of fully automatic retrieval techniques were identified which appeared to be competitive with the manual indexing procedures and inverted file technologies in conventional use. Large-scale improvements appeared possible by using the iterative relevance feedback process to reformulate the search requests, and no substantial deterioration resulted from extending the operations to alien environments such as foreign-language materials.

When the automatic procedures incorporated into the Smart system were compared with the manual analysis methodologies used by the Medlars retrieval service at the National Library of Medicine, it was found that the Smart indexing process based on the use of a stored thesaurus produced retrieval results approximately equivalent in recall and precision to those obtainable with Medlars. Using a variety of enhancements such as the automatic relevance feedback procedure, advantages of about 30 per cent could be produced for the automatic Smart system compared with the conventional Medlars process.

Those results turned out to have little immediate impact on operational information retrieval, largely because of the difficulty of making test results believable when they are obtained with sample collections of a few hundred documents while the operational environments include several million items. Additional problems are posed by the enormous investments already made in the available commercial systems, which make it impossible to contemplate a complete retooling of the kind involved in introducing language analysis methods based on the availability of document abstracts, together with new file organization methods.

More fundamental complaints were also voiced about the methodologies incorporated into the Smart evaluation system. One of these concerned the necessity of utilizing relevance assessments of documents with respect to queries in order to compute recall and precision values. Large-scale studies were made of the relevance assessment process, leading to the conclusion that relevance assessments of given documents with respect to particular queries were generally unreliable and not extendible to different system users. Hence it was argued that recall and precision values obtained by averaging the search results over 40 user queries were valid only for the 40 users whose relevance judgements were actually involved22,23. Eventually it became necessary to perform a complete study of the question by using a variety of different user populations rendering relevance assessments for the same document collections with respect to the same user queries. It then became clear that the recall-precision results could be expected to remain reasonably invariant with different user populations even though the individual assessments would differ widely24.
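To make the disputed measures concrete: recall is the proportion of all relevant documents that a search retrieves, and precision is the proportion of retrieved documents that are relevant. The following Python sketch, a modern illustration rather than any part of the Smart system, with invented document identifiers and judgements, computes these values per query against one assessor population's judgements and averages them over the query set; re-running it with a second population's judgements would alter the individual assessments while, as reported above, leaving the averaged figures largely stable.

    # Minimal sketch: per-query recall and precision, macro-averaged over a
    # query set, as in the 40-query averages discussed above. All document
    # identifiers and relevance judgements below are invented for illustration.

    def recall_precision(retrieved, relevant):
        """Recall and precision for one query, given the judged-relevant set."""
        hits = len(set(retrieved) & relevant)
        recall = hits / len(relevant) if relevant else 0.0
        precision = hits / len(retrieved) if retrieved else 0.0
        return recall, precision

    # Hypothetical judgements from one assessor population for two queries.
    judgements = {
        "q1": {"d1", "d3", "d7"},
        "q2": {"d2", "d4"},
    }

    # Ranked search output truncated at a fixed cut-off (here, four documents).
    results = {
        "q1": ["d1", "d3", "d5", "d9"],
        "q2": ["d2", "d6", "d4", "d8"],
    }

    pairs = [recall_precision(results[q], judgements[q]) for q in results]
    avg_recall = sum(r for r, _ in pairs) / len(pairs)
    avg_precision = sum(p for _, p in pairs) / len(pairs)
    print(f"average recall {avg_recall:.2f}, average precision {avg_precision:.2f}")

With the sample data the script prints an average recall of 0.83 and an average precision of 0.50; only the judgement sets need to be exchanged to repeat the computation for a different assessor group.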
Substantial agreement was found to exist among groups of assessors for documents retrieved early in a given search, which exhibit strong similarities with the user queries. Those documents are precisely the ones that largely control the shape of the recall-precision curve. There is little agreement for items that are less similar to the queries and therefore appear low down on the output lists; but these documents carry little weight in overall system performance. Many other objections can be raised about laboratory tests of retrieval systems: