IRE
Information Retrieval Experiment
An experiment: search strategy variations in SDI profiles
chapter
Lynn Evans
Butterworth & Company
Karen Sparck Jones
the performance measures based on relevance. In particular the measures
first quantified by Cleverdon as recall ratio (proportion retrieved of the total
number of relevant documents in the collection), precision or relevance ratio
(proportion relevant of the number of retrieved documents), and fallout ratio
(proportion retrieved of the total number of non-relevant documents in the
collection) have been increasingly questioned.
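The three ratios defined above can be illustrated with a minimal sketch. The function name, its arguments and the example document identifiers are hypothetical and are not taken from the original experiment; the sketch assumes that every relevant document belongs to the collection.

```python
def retrieval_ratios(retrieved, relevant, collection_size):
    """Return (recall, precision, fallout) for a single query.

    retrieved, relevant: iterables of document identifiers (hypothetical).
    collection_size: total number of documents searched.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)                  # relevant and retrieved
    non_relevant_total = collection_size - len(relevant)

    # Recall: proportion retrieved of all relevant documents.
    recall = hits / len(relevant) if relevant else 0.0
    # Precision: proportion relevant of all retrieved documents.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Fallout: proportion retrieved of all non-relevant documents.
    fallout = ((len(retrieved) - hits) / non_relevant_total
               if non_relevant_total else 0.0)
    return recall, precision, fallout


# Invented example: 10 documents, 4 relevant, 5 retrieved, 3 of them relevant.
example = retrieval_ratios({1, 2, 3, 4, 5}, {1, 2, 3, 9}, collection_size=10)
```

In this invented example the recall is 3/4, the precision 3/5 and the fallout 2/6.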
Despite the reservations that have been expressed about them and despite
the fact that theoretically more rigorous alternative measures may have been
suggested, it was considered that recall and precision were still the most
useful and usable measures of retrieval performance. They are easily
understood and do provide answers to two of the most important questions
asked of bibliographic retrieval services, viz. 'What proportion of the
relevant documents have been retrieved?' and 'What proportion of the
documents retrieved are relevant?'.
The recall figures established in this experiment were strictly measures of
the relative (or matched) recall rather than the true recall. Matched recall is
the percentage retrieved (by a particular strategy) of the total relevant
documents found by all the searches for a query. With document collections
averaging about 2500 per run and with a 'non-captive' user group it was not
considered sensible to try to obtain relevance assessments for all the
documents actually searched. However, given the very loose filtering process
utilized in the experiment to control the number of notifications sent to the
user for assessment (see p.297 above), it is probable that the recall values
obtained were quite close to the true recall.
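Matched recall as described above might be computed along the following lines. This is a sketch only: the function name, the strategy labels and the document identifiers are assumptions made for illustration, and the relevance judgements are taken to cover only documents actually notified to the user.

```python
def matched_recall(retrieved_by_strategy, judged_relevant):
    """Return matched (relative) recall per strategy for one query.

    retrieved_by_strategy: dict mapping a strategy label to the set of
        documents it retrieved (labels here are hypothetical).
    judged_relevant: set of documents the user assessed as relevant.
    """
    judged_relevant = set(judged_relevant)

    # Pool: relevant documents found by any of the searches for the query.
    pool = set()
    for docs in retrieved_by_strategy.values():
        pool |= set(docs) & judged_relevant

    # Each strategy's share of the pooled relevant documents.
    return {
        name: (len(set(docs) & judged_relevant) / len(pool) if pool else 0.0)
        for name, docs in retrieved_by_strategy.items()
    }


# Invented example with two strategies.
scores = matched_recall(
    {"S1": {1, 2, 3, 5}, "B": {2, 4}},
    judged_relevant={1, 2, 4, 5},
)
```

Because the denominator counts only relevant documents found by at least one search, matched recall can never be lower than true recall; the two coincide when no relevant document is missed by every strategy.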
In the main experiment 10 search strategies were being evaluated. Of
these, 8 produced a ranked output of (effectively) unlimited size, thus
allowing free choice of cutoff points at which the relative retrieval
performances could be compared. The other 2 strategies, B and BW, being of
boolean type, produced a strictly limited output which of course varied from
user to user depending on the subject interest covered. The problem was that
there did not seem to be any basis for comparing the boolean-type strategies
with the rest other than at one point, viz. the number of items retrieved by
the boolean.
It was decided to make two types of comparison:
(1) a comparison involving all the non-boolean strategies, based on Salton's
rank-order cutoff-point procedure [14], and
(2) a profile-by-profile comparison, involving all the strategies, in which the
basis for comparison was the boolean output.
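The two types of comparison can be sketched as follows, under stated assumptions: the function names, strategy labels, cutoff values and document identifiers are all hypothetical, and the sketch is not a reconstruction of the procedure in reference 14. In the first comparison each ranked output is cut at a series of fixed positions; in the second, every ranked output is cut at the number of items the boolean search retrieved for that profile.

```python
def relevant_at_cutoff(ranked_docs, relevant, cutoff):
    """Count relevant documents among the first `cutoff` ranked items."""
    return sum(1 for doc in ranked_docs[:cutoff] if doc in relevant)


def compare_at_fixed_cutoffs(ranked_outputs, relevant, cutoffs):
    """Comparison (1): relevant retrieved by each ranked strategy per cutoff."""
    return {
        name: [relevant_at_cutoff(ranked, relevant, c) for c in cutoffs]
        for name, ranked in ranked_outputs.items()
    }


def compare_at_boolean_cutoff(ranked_outputs, boolean_output, relevant):
    """Comparison (2): cut every ranked output at the boolean output size."""
    cutoff = len(boolean_output)
    results = {"boolean": sum(1 for doc in boolean_output if doc in relevant)}
    for name, ranked in ranked_outputs.items():
        results[name] = relevant_at_cutoff(ranked, relevant, cutoff)
    return cutoff, results


# Invented data for a single profile.
ranked = {"R1": [3, 1, 8, 2, 9], "R2": [8, 9, 1, 3, 2]}
relevant = {1, 2, 3}
fixed = compare_at_fixed_cutoffs(ranked, relevant, cutoffs=[1, 3, 5])
cutoff, by_profile = compare_at_boolean_cutoff(ranked, [1, 2, 7], relevant)
```

Dividing each count by the cutoff gives precision at that point, and dividing by the pooled number of relevant documents gives matched recall.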
Ranked-output comparison
The raw retrieval data included in the original report need not be reproduced
here. Three of the eight search runs were evaluated, viz. runs 1, 5 and 6. The
consistency in the relative retrieval performances of the search strategies
over these three runs indicated that analysis of the remaining runs was
unlikely to yield any different information.
For runs 1 (46 queries), 5 (45 queries) and 6 (46 queries) the cumulative
totals of relevant documents retrieved by the different search strategies were
aggregated at the following 9 ranked-output positions: 5, 10, 15, 20, 25, 30,
35, 45 and 55 notifications. The corresponding recall and precision figures