IRE Information Retrieval Experiment An experiment: search strategy variations in SDI profiles chapter Lynn Evans Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

the performance measures based on relevance. In particular, the measures first quantified by Cleverdon as recall ratio (proportion retrieved of the total number of relevant documents in the collection), precision or relevance ratio (proportion relevant of the number of retrieved documents), and fallout ratio (proportion retrieved of the total number of non-relevant documents in the collection) have been increasingly questioned.

Despite the reservations that have been expressed about them and despite the fact that theoretically more rigorous alternative measures may have been suggested, it was considered that recall and precision were still the most useful and usable measures of retrieval performance. They are easily understood and do provide answers to two of the most important questions asked of bibliographic retrieval services, viz. 'What proportion of the relevant documents have been retrieved?' and 'What proportion of the documents retrieved are relevant?'.

The recall figures established in this experiment were strictly measures of the relative (or matched) recall rather than the true recall. Matched recall is the percentage retrieved (by a particular strategy) of the total relevant documents found by all the searches for a query.
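The three ratios and the matched recall defined above can be illustrated with a minimal sketch. The function names, the use of document identifiers held in Python sets, and the sample data are assumptions made for illustration only; they are not taken from the original experiment.

```python
def retrieval_ratios(retrieved, relevant, collection_size):
    """Return (recall, precision, fallout) for one search.

    retrieved       -- ids of the documents retrieved (hypothetical input)
    relevant        -- ids of all relevant documents in the collection
    collection_size -- total number of documents searched
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)              # relevant and retrieved
    non_relevant = collection_size - len(relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    fallout = (len(retrieved) - hits) / non_relevant if non_relevant else 0.0
    return recall, precision, fallout


def matched_recall(outputs_by_strategy, judged_relevant):
    """Percentage, per strategy, of the relevant documents found by ALL
    the searches for a query (relative recall, not true recall)."""
    judged_relevant = set(judged_relevant)
    # Pool of relevant documents found by any strategy for this query.
    pool = set().union(*outputs_by_strategy.values()) & judged_relevant
    if not pool:
        return {name: 0.0 for name in outputs_by_strategy}
    return {
        name: 100.0 * len(set(output) & pool) / len(pool)
        for name, output in outputs_by_strategy.items()
    }
```

In this sketch a relevant document that no strategy retrieved never enters the pool, which is why matched recall can only equal or overstate the true recall.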
With document collections averaging about 2500 documents per run and with a 'non-captive' user group, it was not considered sensible to try to obtain relevance assessments for all the documents actually searched. However, given the very loose filtering process utilized in the experiment to control the number of notifications sent to the user for assessment (see p. 297 above), it is probable that the recall values obtained were quite close to the true recall.

In the main experiment 10 search strategies were being evaluated. Of these, 8 produced a ranked output of (effectively) unlimited size, thus allowing free choice of cutoff points at which the relative retrieval performances could be compared. The other 2 strategies, B and BW, being boolean-type, produced a strictly limited output which of course varied from user to user depending on the subject interest covered. The problem was that there did not seem to be any basis for comparing the boolean-type strategies with the rest other than at one point, viz. the number of items retrieved by the boolean search. It was decided to make two types of comparison: (1) a comparison involving all the non-boolean strategies, based on Salton's rank-order cutoff-point procedure¹⁴, and (2) a profile-by-profile comparison, involving all the strategies, in which the basis for comparison was the boolean output.

Ranked-output comparison

The raw retrieval data included in the original report need not be reproduced here. Three of the eight search runs were evaluated, viz. runs 1, 5 and 6. The consistency in the relative retrieval performances of the search strategies over these three runs indicated that analysis of the remaining runs was unlikely to yield any different information. For runs 1 (46 queries), 5 (45 queries) and 6 (46 queries) the cumulative totals of relevant documents retrieved by the different search strategies were aggregated at the following 9 ranked-output positions: 5, 10, 15, 20, 25, 30, 35, 45 and 55 notifications.
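The aggregation of cumulative relevant totals at fixed ranked-output positions can be sketched as follows. Only the nine cutoff positions come from the text; the function names, the data layout (one ranked list and one set of relevant ids per query) and the plain summation over queries are assumptions for illustration, not a reconstruction of the procedure actually used.

```python
# The nine ranked-output positions named in the text.
CUTOFFS = (5, 10, 15, 20, 25, 30, 35, 45, 55)


def relevant_at_cutoffs(ranking, relevant, cutoffs=CUTOFFS):
    """Cumulative number of relevant documents in the top k of one
    ranked output, for each cutoff k."""
    relevant = set(relevant)
    return [sum(1 for doc in ranking[:k] if doc in relevant) for k in cutoffs]


def aggregate_over_queries(queries, cutoffs=CUTOFFS):
    """Sum the cumulative totals over all queries of a run.

    queries -- iterable of (ranking, relevant) pairs, one per query
               (hypothetical layout).
    """
    totals = [0] * len(cutoffs)
    for ranking, relevant in queries:
        counts = relevant_at_cutoffs(ranking, relevant, cutoffs)
        for i, count in enumerate(counts):
            totals[i] += count
    return totals
```

A ranked output shorter than a given cutoff simply contributes all the relevant documents it contains at that position.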
The corresponding recall and precision figures