Aslib Cranfield Research Project
Factors Determining the Performance of Indexing Systems
Volume 2: Methods for Presentation of Results
Cyril Cleverdon and Michael Keen, Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

G (Generality Number) = 1000(a + c) / N

Thus equation (1) shows how, given the fallout and precision ratios together with the generality number, the recall ratio can be determined by calculation, and the other three equations show the other combinations possible. Because of this relationship, it has been possible to prepare, by computer, the figures for a series of situations where the generality number ranges from 1 to 50, recall from 5% to 100%, and precision from 0.5% to 100%. Appendix 3.3 gives this full set of tables for F (fallout) at varying generality numbers. From this set of tables it is possible to plot, on a recall/precision graph, the curves for fallout, or, on a recall/fallout graph, the curves for precision at all levels for any given generality number. For the example being considered, Fig. 3.6P shows the former, while Fig. 3.7P shows the precision curves on a recall/fallout graph. From either of these graphs it can be seen, for instance, that for search Y (the dotted line) at a recall ratio of 40%, the precision ratio was 20% and the fallout ratio 0.803%. As the generality number for this set of searches is 5, the above figures can be confirmed from the sheet in Appendix 3.3 for generality number 5: in the column for recall of 40% and in the line for precision of 20%, fallout is 0.803%. In a large number of situations arising in this test, comparison is made between various systems where everything is held constant with one exception such as, for instance, the index language.
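The four ratios and the relationship between them can be sketched in a few lines of Python (a minimal illustration, not the project's own computer program; the counts a, b, c, d follow the usual 2x2 retrieval table, with a = relevant documents retrieved, b = non-relevant retrieved, c = relevant not retrieved, d = non-relevant not retrieved, and the function names are ours):

```python
def measures(a, b, c, d):
    """Recall, precision, fallout (as percentages) and generality
    (per 1000 documents) from the 2x2 retrieval table."""
    n = a + b + c + d                       # N, the collection size
    recall = 100.0 * a / (a + c)            # relevant retrieved / all relevant
    precision = 100.0 * a / (a + b)         # relevant retrieved / all retrieved
    fallout = 100.0 * b / (b + d)           # non-relevant retrieved / all non-relevant
    generality = 1000.0 * (a + c) / n       # G = 1000(a + c) / N
    return recall, precision, fallout, generality

def fallout_from_rpg(recall, precision, generality):
    """Fallout derived from recall, precision and generality alone,
    by rearranging the identities above (R, P, F in %, G per 1000)."""
    return (recall * generality * (100.0 - precision)
            / (precision * (1000.0 - generality)))
```

Taking, for illustration, a = 2, b = 8, c = 3, d = 987 (N = 1000) reproduces the worked example: recall 40%, precision 20%, generality 5, and a fallout of about 0.8%, in line with the figure read from the tables for search Y.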
In these circumstances the generality number remains constant, and therefore the fallout measure does not contribute to the presentation of the results. Although there are some situations where comparative results are presented when the testing has been done on collections of different sizes (and therefore different generality numbers), the decision has been taken, as previously stated, to present the main sets of results on recall/precision graphs. The positive reason for doing this is that discussions with a number of people have led to the conclusion that such a graph can be more readily understood than a recall/fallout graph, in that it more closely reflects the required performance aspects of a system. This may, of course, be due to the fact that recall/fallout graphs are unfamiliar compared with recall/precision graphs, and our decision is certainly not intended to imply that the latter are, in experimental work, basically superior to recall/fallout graphs. In the course of this project, we have also considered a number of 'composite' measures which have been suggested. Swets (Ref. 4) argued that twin variable measures (e.g. recall/precision) were 'an unnecessarily weak procedure', but qualified this by assuming that a real retrieval system has a constant effectiveness, independent of the various forms of queries it will handle. He admitted that such an assumption is open to question, and it is clearly incorrect in an experimental situation where major variables are being changed with the result that new systems are being formed. In such tests, the twin variables are necessary to see the