CRANV2 Aslib Cranfield Research Project: Factors Determining the Performance of Indexing Systems: Volume 2 Methods for presentation of results chapter Cyril Cleverdon Michael Keen Cranfield An investigation supported by a grant to Aslib by the National Science Foundation. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

gives data on a population of 77 people, showing the numbers that were inoculated and not inoculated, and the numbers that were infected and not infected. The usual purpose of such a table is to ask a question of the kind, 'Is there really some degree of association between the events?' or in this particular case, 'Is the proportion of people that were not inoculated and became infected significantly different from the proportion of people that were inoculated and were infected?' In this situation, certain tests for the reality or existence of the association can be used (e.g. the chi square test), and other tests to determine the intensity of the association (e.g. the Q formula) can be applied.

Neither the form in which the question is posed nor the tests of the reality of association fit the retrieval case. A question such as 'Is the proportion of relevant documents in the retrieved set significantly different from the proportion in the set not retrieved?' does not make any sense in the retrieval situation. In the retrieval situation it is two sets of ratios from the table that are to be compared with one another, by observing the relative changes in the ratios as conditions are changed. The actual comparative proportions do not need any test of significance. The tests of intensity of association do reflect the situation when the retrieval case is perfect, and when it is at its worst, and therefore provide one scale between the two extremes.
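The behaviour of the two kinds of test at the extremes can be checked with a small sketch. It assumes that 'the Q formula' is Yule's coefficient of association, Q = (ad - bc) / (ad + bc), and it uses the uncorrected Pearson chi square statistic; the cell layout and all counts are hypothetical and are not taken from the table discussed above.

```python
def yules_q(a, b, c, d):
    """Yule's Q for the 2x2 table [[a, b], [c, d]] (assumed to be 'the Q formula')."""
    numerator = a * d - b * c
    denominator = a * d + b * c
    if denominator == 0:
        raise ValueError("Q is undefined when ad + bc == 0")
    return numerator / denominator


def chi_square_2x2(a, b, c, d):
    """Pearson chi square statistic for a 2x2 table, without continuity correction."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("chi square is undefined when a marginal total is zero")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical retrieval tables (counts invented for illustration only):
# a = relevant retrieved,      b = non-relevant retrieved,
# c = relevant not retrieved,  d = non-relevant not retrieved
perfect = (5, 0, 0, 95)   # every relevant document retrieved, nothing else
worst = (0, 10, 5, 85)    # no relevant document retrieved

q_perfect = yules_q(*perfect)            # upper end of the scale
q_worst = yules_q(*worst)                # lower end of the scale
chi_perfect = chi_square_2x2(*perfect)

print(q_perfect, q_worst, chi_perfect)
```

Under these assumptions Q runs from +1 for the perfect case to -1 for the worst case, which is the single scale between the two extremes referred to in the text.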
But the deficiencies of the composite measures have been noted, and these tests seem to give no assistance to, or confirmation of, the twin variable measures being used. The conclusion is that statistics does not help at all at this point.

Averaging sets of results

To present reliable results of performance, the figures from a set of questions must be averaged in some way. The size of the question set required to give reliable results will not be considered here, since many standard statistical tests are available to determine the significance level of a set of results. It is obvious that the results of individual questions will vary considerably, and some idea of the magnitude of this variation may be gained from Figs. 3.16P and 3.17P. These recall/precision plots show the individual results from a set of questions, where single term natural language indexing is being tested. Fig. 3.16P shows the points that result when any three out of a possible total of seven of the search terms in each of thirty-one questions are demanded in 'logical product' coordination. Fig. 3.17P shows points from thirty-five questions when the level of search terms demanded in coordination is varied from two to seven. The scatter is quite wide, ranging from 11% recall at 1% precision in the bottom left corner to 100% recall at 100% precision in the top right corner. However, a trend is clearly present down the left side of the plot and at the bottom right corner, with a tendency for results at a high coordination level to give high precision and low recall, and for lower coordination levels to give the inverse result.

Two different methods of averaging these results, at each of the 'coordination levels', may be used.
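The search and averaging steps described above can be sketched as follows. The sketch treats a search at coordination level k as retrieving every document that contains at least k of the question's search terms. The two averaging functions are an assumption: they show two common conventions (pooling the counts over all questions before forming the ratios, and averaging the per-question ratios), which may or may not be the two methods the authors have in mind. All names, terms and counts are hypothetical.

```python
def coordination_search(question_terms, doc_index, level):
    """Return ids of documents containing at least `level` of the question terms."""
    terms = set(question_terms)
    return {
        doc_id
        for doc_id, doc_terms in doc_index.items()
        if len(terms & set(doc_terms)) >= level
    }


def average_of_totals(results):
    """Pool the counts over all questions, then form recall and precision once.

    `results` is a list of (hits, n_retrieved, n_relevant), one per question.
    """
    hits = sum(h for h, _, _ in results)
    retrieved = sum(ret for _, ret, _ in results)
    relevant = sum(rel for _, _, rel in results)
    return hits / relevant, hits / retrieved


def average_of_ratios(results):
    """Form recall and precision per question, then take the arithmetic mean.

    Assumption: questions that retrieve nothing are left out of the
    precision mean, since their precision is undefined.
    """
    recalls = [h / rel for h, _, rel in results]
    precisions = [h / ret for h, ret, _ in results if ret > 0]
    return sum(recalls) / len(recalls), sum(precisions) / len(precisions)


# Hypothetical index and question, for illustration only.
doc_index = {
    "d1": {"boundary", "layer", "flow"},
    "d2": {"boundary", "layer"},
    "d3": {"flow"},
}
question = ["boundary", "layer", "flow", "heat"]
level_2 = coordination_search(question, doc_index, 2)
level_3 = coordination_search(question, doc_index, 3)

# Hypothetical per-question results at one coordination level:
# (relevant retrieved, total retrieved, total relevant)
results = [(2, 4, 4), (1, 1, 2)]
pooled = average_of_totals(results)
per_question = average_of_ratios(results)

print(sorted(level_2), sorted(level_3), pooled, per_question)
```

With these invented counts the two conventions agree on recall but not on precision, which illustrates why the choice of averaging method has to be stated alongside any averaged figures.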