CRANV2
Aslib Cranfield Research Project: Factors Determining the Performance of Indexing Systems: Volume 2
Methods for presentation of results
chapter
Cyril Cleverdon
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
G (Generality Number) = 1000(a + c) / N
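For reference, in the 2 x 2 notation of the preceding pages (a = relevant documents retrieved, b = non-relevant documents retrieved, c = relevant documents not retrieved, N = total documents in the collection), the three ratios are:

Recall = a / (a + c)
Precision = a / (a + b)
Fallout = b / (N - (a + c))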
Thus equation (i) shows how, given the fallout and precision ratios
together with the generality number, the recall ratio can be determined
by calculation, and the other three equations show the other combinations
possible. Because of this relationship, it has been possible to prepare,
by computer, the figures for a series of situations where the generality
number ranges from 1 to 50, recall from 5% to 100%, and precision from
0.5% to 100%. In Appendix 3.3 is given this full set of tables for F
(fallout) at varying generality numbers. From this set of tables, it is
possible to plot, on a recall/precision graph, the curves for fallout, or, on
a recall/fallout graph, the curves for precision at all levels for any given
generality number. For the example being considered, Fig. 3.6P shows
the former, while Fig. 3.7P shows the precision curves on a recall/
fallout graph. From either of these graphs it can be seen, for instance,
that for search Y (the dotted line), at a recall ratio of 40%, the precision
ratio was 20% and the fallout ratio 0.8%. As the generality number for
this set of searches is 5, the above figures can be confirmed from the
sheet in Appendix 3.3 for generality number 5. In the column for recall
of 40% and in the line for precision of 20%, fallout is 0.803%.
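The figures for search Y can also be checked directly from the definitions above. The short sketch below is illustrative only (the names fallout_from and recall_from are ours, and it is not the program used to compute the appendix tables); it simply expresses fallout and recall in terms of the other two ratios and the generality number.

```python
# Illustrative check of the recall/precision/fallout/generality relationship,
# using the 2 x 2 notation above:
#   Recall     R = a / (a + c)
#   Precision  P = a / (a + b)
#   Fallout    F = b / (N - (a + c))
#   Generality G = 1000 (a + c) / N
# Eliminating a, b and N gives F in terms of R, P and G (and vice versa).

def fallout_from(recall, precision, generality):
    """Fallout implied by given recall and precision (fractions) and generality (per 1000)."""
    return recall * generality * (1.0 - precision) / (precision * (1000.0 - generality))

def recall_from(fallout, precision, generality):
    """Recall implied by given fallout and precision (fractions) and generality (per 1000)."""
    return fallout * precision * (1000.0 - generality) / (generality * (1.0 - precision))

# Search Y: generality number 5, recall 40%, precision 20%.
f = fallout_from(recall=0.40, precision=0.20, generality=5)
print(f"fallout = {100 * f:.3f}%")                        # ~0.804%, i.e. the 0.8% read from
                                                          # the graphs (Appendix 3.3
                                                          # tabulates 0.803%)
print(f"recall  = {100 * recall_from(f, 0.20, 5):.1f}%")  # recovers 40.0%
```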
In a large number of situations arising in this test, comparison
is made between various systems where everything is held constant with
one exception, such as the index language. In these
circumstances the generality number remains constant (so that fallout is
fully determined by the recall and precision ratios) and therefore the fallout
measure does not contribute anything further to the presentation of the results.
Although some comparative results are presented where the testing has been
done on collections of different sizes (and therefore with different generality
numbers), the decision has been taken, as previously stated, to present the
main sets of results on recall/precision graphs. The positive reason for doing
this is that
discussions with a number of people have led to the conclusion that such
a graph can be more readily understood than a recall/fallout graph in
that it more closely reflects the required performance aspects of a
system. This may, of course, be due to the fact that recall/fallout
graphs are unfamiliar compared with recall/precision graphs, and our
decision is certainly not intended to imply that the latter are, in
experimental work, basically superior to recall/fallout graphs.
In the course of this project, we have also considered a number of
'composite' measures which have been suggested. Swets (Ref. 4) argued
that twin variable measures (e.g. recall/precision) were 'an unnecessarily
weak procedure', but qualified this by assuming that a real retrieval
system has a constant effectiveness, independent of the various forms of
queries it will handle. He admitted that such an assumption is open to
question, and it is clearly incorrect in an experimental situation where
major variables are being changed with the result that new systems are
being formed. In such tests, the twin variables are necessary to see the