<DOC> <DOCNO> IRE </DOCNO> <TITLE> Information Retrieval Experiment </TITLE> <SUBTITLE> The Cranfield tests </SUBTITLE> <TYPE> chapter </TYPE> <PAGE CHAPTER="13" NUMBER="263"> <AUTHOR1> Karen Sparck Jones </AUTHOR1> <PUBLISHER> Butterworth & Company </PUBLISHER> <EDITOR1> Karen Sparck Jones </EDITOR1> <COPYRIGHT MTH="" DAY="" YEAR="1981" BY="Butterworth & Company"> All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature. </COPYRIGHT> <BODY>

`a first task was to find out exactly what was being measured, exactly what was implied when it was said that Uniterm, for instance, had an efficiency of 85% [sic].' (p.51) It was argued that this meant that searches were retrieving X per cent of all documents at least as relevant as the source. But against this it could be maintained that the relation of question and source was unnaturally close. To test the interpretation of efficiency, searches were made for documents independently supplied as a bibliography for some 41 questions. Source documents were excluded, and the bibliography items were assigned to three grades of relevance: as useful as the source, somewhat useful, and not in fact useful. Searches for the new relevant documents showed success rates for the highly relevant of 74 per cent, 75 per cent, 60 per cent and 75 per cent respectively for UDC, alphabetical, facet and Uniterm. Thus efficiency was reduced compared with the main test.

This suggested that the operating conditions for searching were important, and specifically that the success rate in the main test would have been lower if less strategy relaxation had been permitted: as suggested for Uniterms, an inverse relationship of recall and precision (`relevance') ratios applies. Thus, the Report claims, `there is the possibility of quoting three different performance figures, those with Uniterm as an example being: 65% when all concepts are required, 85% when one less concept than the required is accepted, 97% when a single Uniterm is accepted.' (p.55) Further, `the only practical method of showing these various points is by plotting them against relevance [i.e. precision] ratio, that is the percentage of retrieved documents which have an agreed relevance.' (p.55) Then `as the recall figure (i.e. the percentage of potentially relevant documents in the collection) rises, the relevance ratio (i.e. the percentage of relevant documents amongst the total of those retrieved) must fall and conversely as the recall figure drops, so the relevance ratio will improve.' (p.55) (The two ratios are restated as simple proportions below.)

A study of precision for 79 questions, assessing a sample of retrieved documents and extrapolating, showed precision for highly relevant documents ranging from 7 per cent for UDC, via 7.5 per cent for facet and 12 per cent for Uniterm, to 12.5 per cent for alphabetical. However, checks suggested that quite different figures could be derived, and, more importantly, that searching beyond the point of retrieving the source document might well retrieve more relevant documents, and so improve precision.
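Read in modern terms, the two ratios quoted from the Report (p.55) are simple proportions. The notation here is illustrative and not the Report's own: for a given search, let R be the number of relevant documents in the collection, r the number of relevant documents retrieved, and n the total number of documents retrieved. Then

    recall ratio = (r / R) x 100%
    relevance (precision) ratio = (r / n) x 100%

On this reading the claimed inverse relationship follows naturally: relaxing a search strategy increases n, and typically increases it faster than r, so the relevance ratio falls even as the recall ratio, whose denominator R is fixed, rises.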
But, as Cleverdon points out, `this somewhat tortuous analysis serves to emphasise nothing more than the extreme danger of placing too much credence on any of the figures which are not otherwise corroborated.' (p.58) He nevertheless concludes that the claim that efficiency levels are the same for all relevant documents as for source documents is probably true. </BODY> </PAGE> </DOC>