<DOC> <DOCNO> IRE </DOCNO> <TITLE> Information Retrieval Experiment </TITLE> <SUBTITLE> The Cranfield tests </SUBTITLE> <TYPE> chapter </TYPE> <PAGE CHAPTER="13" NUMBER="263"> <AUTHOR1> Karen Sparck Jones </AUTHOR1> <PUBLISHER> Butterworth & Company </PUBLISHER> <EDITOR1> Karen Sparck Jones </EDITOR1> <COPYRIGHT MTH="" DAY="" YEAR="1981" BY="Butterworth & Company"> All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature. </COPYRIGHT> <BODY>

`a first task was to find out exactly what was being measured, exactly what was implied when it was said that Uniterm, for instance, had an efficiency of 85% [sic].' (p.51) It was argued that this meant that searches were retrieving X per cent of all documents at least as relevant as the source. But against this it could be maintained that the relation of question and source was unnaturally close. To test the interpretation of efficiency, searches were made for documents independently supplied as a bibliography for some 41 questions. Source documents were excluded, and the bibliography items were assigned to three grades of relevance: as useful as the source, somewhat useful, and not in fact useful. Searches for the new relevant documents showed success rates for the highly relevant of 74 per cent, 75 per cent, 60 per cent and 75 per cent respectively for UDC, alphabetical, facet and Uniterm. Thus efficiency was reduced compared with the main test.

This suggested that the operating conditions for searching were important, and specifically that the success rate in the main test would have been lower if less strategy relaxation had been permitted: as suggested for Uniterms, an inverse relationship of recall and precision (`relevance') ratios applies. Thus, the Report claims, `there is the possibility of quoting three different performance figures, those with Uniterm as an example being: 65% when all concepts are required, 85% when one less concept than the required is accepted, 97% when a single Uniterm is accepted.' (p.55) Further, `the only practical method of showing these various points is by plotting them against relevance [i.e. precision] ratio, that is the percentage of retrieved documents which have an agreed relevance.' (p.55) Then `as the recall figure (i.e. the percentage of potentially relevant documents in the collection) rises, the relevance ratio (i.e. the percentage of relevant documents amongst the total of those retrieved) must fall and conversely as the recall figure drops, so the relevance ratio will improve.' (p.55) (The two ratios are restated as simple proportions below.)

A study of precision for 79 questions, assessing a sample of retrieved documents and extrapolating, showed precision for highly relevant documents ranging from 7 per cent for UDC, via 7.5 per cent for facet and 12 per cent for Uniterm, to 12.5 per cent for alphabetical. However, checks suggested that quite different figures could be derived, and, more importantly, that searching beyond the point of retrieving the source document might well retrieve more relevant documents, and so improve precision.
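Read in modern terms, the two ratios quoted from the Report (p.55) are simple proportions. The notation here is illustrative and not the Report's own: for a given search, let R be the number of relevant documents in the collection, r the number of relevant documents retrieved, and n the total number of documents retrieved. Then

    recall ratio = (r / R) x 100%
    relevance (precision) ratio = (r / n) x 100%

On this reading the claimed inverse relationship follows naturally: relaxing a search strategy increases n, and typically increases it faster than r, so the relevance ratio falls even as the recall ratio, whose denominator R is fixed, rises.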
But, as Cleverdon points out, `this somewhat tortuous analysis serves to emphasise nothing more than the extreme danger of placing too much credence on any of the figures which are not otherwise corroborated.' (p.58) He nevertheless concludes that the claim that efficiency levels are the same for all relevant documents as for source documents is probably true. </BODY> </PAGE> </DOC>