<DOC> <DOCNO> IRE </DOCNO> <TITLE> Information Retrieval Experiment </TITLE> <SUBTITLE> The Smart environment for retrieval system evaluation-advantages and problem areas </SUBTITLE> <TYPE> chapter </TYPE> <PAGE CHAPTER="15" NUMBER="320"> <AUTHOR1> Gerard Salton </AUTHOR1> <PUBLISHER> Butterworth & Company </PUBLISHER> <EDITOR1> Karen Sparck Jones </EDITOR1> <COPYRIGHT MTH="" DAY="" YEAR="1981" BY="Butterworth & Company"> All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature. </COPYRIGHT> <BODY> there is thus no need to choose a retrieval threshold to distinguish the retrieved from the non-retrieved items. Instead, recall-precision values can be computed for all possible retrieval thresholds, that is, after retrieving one, two, and eventually n documents in decreasing order of their similarity with the query, and the results can be plotted in a composite recall-precision graph. The experiments can then be carried out using a very small number of variable parameters, such as collection size, number of queries, relevance assessments of documents with respect to queries, interpolation procedures for calculating precision values at fixed recall intervals, and methods for averaging the results over a number of different user queries6. The Smart experiments have thus come close to achieving the conditions often assumed for ideal retrieval test environments. The artificial collection environment does, however, have implications for the conclusions derivable from the experiments.
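The threshold-free evaluation just described can be sketched in a few lines of Python. This is a modern illustration of the general technique, not code from the Smart system itself; the function name and the toy document ids are hypothetical. Given a ranked list of documents and the set of documents judged relevant, recall and precision are computed after each cutoff of one, two, and eventually n retrieved documents, yielding the points of the composite recall-precision curve.

```python
def recall_precision_curve(ranked_docs, relevant):
    """Compute (recall, precision) after retrieving 1, 2, ..., n documents.

    ranked_docs: document ids in decreasing order of similarity with the query.
    relevant: set of ids judged relevant to the query.
    """
    points = []
    hits = 0
    for k, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
        # Recall: fraction of all relevant documents retrieved so far.
        # Precision: fraction of the k retrieved documents that are relevant.
        points.append((hits / len(relevant), hits / k))
    return points

# Hypothetical ranking of five documents; d1, d3, d4 judged relevant.
curve = recall_precision_curve(["d1", "d2", "d3", "d4", "d5"],
                               {"d1", "d3", "d4"})
```

Because every rank cutoff produces one (recall, precision) point, no single retrieval threshold ever has to be fixed in advance, exactly as the text observes.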
Thus it is difficult to obtain really believable efficiency (as opposed to effectiveness) criteria, such as response time, processing cost, and user effort needed to submit queries and to obtain results, because no obvious procedure is available for extrapolating these efficiency measures to large, operational retrieval situations. Furthermore, when a restricted number of user queries is used to evaluate retrieval effectiveness, the implicit assumption is that these queries and the corresponding users are representative of the general user population at large. For the Smart experiments, no attempts were made to generate efficiency data, and the requirements for a representative user population were met by extending the experiments to many different collections in different subject areas, and using many kinds of user queries. When two given processing methods are compared and the retrieval results for several different collections in distinct subject areas indicate that method A furnishes better retrieval output than method B, the indications are that these results reflect real differences in retrieval effectiveness. The repetition of a given experiment using several different test collections may also be useful in overcoming some of the sampling problems which arise when test collections with satisfactory statistical properties must be chosen. Furthermore, when a number of parallel results are obtained with different collections, the relative performance of the various processing methods may be measured with reasonable confidence. Absolute performance values, on the other hand, are always difficult to use and interpret. Thus a precision performance of 0.20, indicating that one out of five retrieved documents appears relevant to the user's interests, may be acceptable when the recall is high and the number of retrieved documents is small.
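The interpolation at fixed recall intervals and the averaging over user queries mentioned among the variable parameters can also be sketched. This is again only an illustration under stated assumptions: it uses the common ceiling interpolation (precision at a recall level is the best precision achieved at that recall or any higher recall) and a simple macro-average over queries; the exact procedures used in the Smart experiments may differ in detail, and all names and numbers below are hypothetical.

```python
def interpolated_precision(curve, recall_levels):
    """For each fixed recall level R, take the maximum precision achieved
    at any recall >= R (ceiling interpolation; an assumption, not
    necessarily Smart's own procedure)."""
    return [max((p for r, p in curve if r >= level), default=0.0)
            for level in recall_levels]

def average_over_queries(curves, recall_levels):
    """Macro-average interpolated precision over several user queries,
    one recall-precision curve per query."""
    per_query = [interpolated_precision(c, recall_levels) for c in curves]
    return [sum(vals) / len(per_query) for vals in zip(*per_query)]

# Hypothetical curve for one query: (recall, precision) points.
curve_q1 = [(1/3, 1.0), (2/3, 2/3), (1.0, 0.75)]
avg = average_over_queries([curve_q1], [0.5, 1.0])
```

Averaging interpolated values at common recall levels is what makes curves from different queries, with different numbers of relevant documents, comparable on one composite graph.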
On the other hand, a larger precision of 0.50 may prove unsatisfactory in practice when the number of retrieved documents becomes too large or the recall is too low. The first test results obtained with the Smart system in 1964 and early 1965 proved to be quite different from what had been expected. Invariably they showed that the more complicated linguistic methodologies, which were believed essential to attain reasonable retrieval effectiveness, were not useful in raising performance. In particular, the use of syntactic analysis procedures to construct syntactic content phrases, and the utilization of concept hierarchies, could not be proven effective under any circumstances. The most helpful content analysis process seemed to be the extraction of weighted </BODY> </PAGE> </DOC>