<DOC> 
<DOCNO> IRE </DOCNO>         
<TITLE> Information Retrieval Experiment </TITLE>         
<SUBTITLE> The Smart environment for retrieval system evaluation-advantages and problem areas </SUBTITLE>         
<TYPE> chapter </TYPE>         
<PAGE CHAPTER="15" NUMBER="320">                   
<AUTHOR1> Gerard Salton </AUTHOR1>  
<PUBLISHER> Butterworth & Company </PUBLISHER> 
<EDITOR1> Karen Sparck Jones </EDITOR1> 
<COPYRIGHT MTH="" DAY="" YEAR="1981" BY="Butterworth & Company">   
All rights reserved.  No part of this publication may be reproduced 
or transmitted in any form or by any means, including photocopying 
and recording, without the written permission of the copyright holder, 
application for which should be addressed to the Publishers.  Such 
written permission must also be obtained before any part of this 
publication is stored in a retrieval system of any nature. 
</COPYRIGHT> 
<BODY> 
there is thus no need to choose a retrieval threshold to distinguish the
retrieved from the non-retrieved items. Instead, recall-precision values can
be computed for all possible retrieval thresholds, that is, after retrieving
one, two, and eventually n documents in decreasing order of the similarity
with the query, and the results can be plotted in a composite recall-precision
graph. The experiments can then be carried out using a very small number of
variable parameters such as collection size, number of queries, relevance
assessments of documents with respect to queries, interpolation procedures
for calculating precision values at fixed recall intervals, and methods for
averaging the results over a number of different user queries6. The Smart
experiments have thus come close to achieving the conditions often assumed
for ideal retrieval test environments.
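
The procedure just described is easy to state in code. The sketch below is
purely illustrative, not the Smart system's actual implementation: it assumes
the common eleven-point recall levels (0.0, 0.1, ..., 1.0) and the usual
ceiling-interpolation rule, under which the precision at a fixed recall level
is taken as the highest precision observed at that recall or beyond.

    # Minimal sketch of the evaluation loop described above (illustrative
    # conventions, not the Smart system's actual code).

    def recall_precision_curve(ranked_ids, relevant_ids):
        """Recall/precision pairs after retrieving 1, 2, ..., n documents
        in decreasing order of query-document similarity."""
        relevant = set(relevant_ids)          # assumed non-empty
        hits, points = 0, []
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant:
                hits += 1
            points.append((hits / len(relevant), hits / rank))
        return points

    def interpolate(points, levels):
        """Precision at each fixed recall level, taken as the maximum
        precision observed at any equal or higher recall."""
        return [max((p for r, p in points if r >= level), default=0.0)
                for level in levels]

    def average_curve(runs, levels=tuple(i / 10 for i in range(11))):
        """Average the interpolated values over a set of user queries,
        yielding the points of a composite recall-precision graph."""
        curves = [interpolate(recall_precision_curve(ranked, rel), levels)
                  for ranked, rel in runs]
        return [sum(col) / len(curves) for col in zip(*curves)]

    # Two hypothetical queries: ranked output plus relevance assessments.
    runs = [(["d3", "d7", "d1", "d9", "d2"], {"d3", "d1"}),
            (["d5", "d2", "d8", "d4", "d6"], {"d2", "d4", "d6"})]
    print(average_curve(runs))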
     The artificial collection environment does, however, have implications
for the conclusions derivable from the experiments. Thus it is difficult to
obtain really believable efficiency (as opposed to effectiveness) criteria, such
as response time, processing cost, and user effort needed to submit queries
and to obtain results, because no obvious procedure is available for
extrapolating these efficiency measures to large, operational retrieval
situations. Furthermore, when a restricted number of user queries is used to
evaluate retrieval effectiveness, the implicit assumption is that these queries
and the corresponding users are representative of the user population at
large.
     For the Smart experiments, no attempts were made to generate efficiency
data, and the requirements for a representative user population were met by
extending the experiments to many different collections in different subject
areas, and using many kinds of user queries. When two given processing
methods are compared and the retrieval results for several different collections
in distinct subject areas indicate that method A furnishes better retrieval
output than method B, the indications are that these results reflect real
differences in retrieval effectiveness. The repetition of a given experiment
using several different test collections may also be useful in overcoming some
of the sampling problems which arise when test collections with satisfactory
statistical properties must be chosen. Furthermore, when a number of
parallel results are obtained with different collections, the relative
performance of the various processing methods can be measured with
reasonable confidence. Absolute performance values, on the other hand, are
always difficult to use and interpret. Thus a precision performance of 0.20,
indicating that one out of five retrieved documents appears relevant to the
user's interests, may be acceptable when the recall is high and the number
of retrieved documents is small; on the other hand, a larger precision of
0.50 may prove unsatisfactory in practice when the number of retrieved
documents becomes too large or the recall is too low.
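
A pair of hypothetical figures (not drawn from the Smart tests) makes the
point concrete. Writing precision as relevant-retrieved over retrieved, and
recall as relevant-retrieved over total-relevant:

    # Illustrative numbers only; not results from the Smart experiments.
    def precision(rel_ret, ret):   return rel_ret / ret
    def recall(rel_ret, rel_all):  return rel_ret / rel_all

    # Precision 0.20 can be acceptable: the single relevant document in the
    # collection is found among only five retrieved items (recall 1.0).
    print(precision(1, 5), recall(1, 1))              # 0.2  1.0

    # Precision 0.50 can be unsatisfactory: the user must scan 2000 items,
    # and four-fifths of the relevant material is still missed (recall 0.2).
    print(precision(1000, 2000), recall(1000, 5000))  # 0.5  0.2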
     The first test results obtained with the Smart system in 1964 and early 1965
proved to be quite different from what had been expected. Invariably they
showed that the more complicated linguistic methodologies which were
believed essential to attain reasonable retrieval effectiveness were not useful
in raising performance. In particular, the use of syntactic analysis procedures
to construct syntactic content phrases and the utilization of concept
hierarchies could not be proven effective under any circumstances. The most
helpful content analysis process seemed to be the extraction of weighted

</BODY>                  
</PAGE>                  
</DOC>