CRANV2
Aslib Cranfield Research Project: Factors Determining the Performance of Indexing Systems: Volume 2
Methods for presentation of results
chapter
Cyril Cleverdon
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
gives data on a population of 77 people, showing the numbers that were
both inoculated and not inoculated, and the numbers that were infected
and not infected. The usual purpose of such a table is to ask a question
of the kind, 'Is there really some degree of association between the events?'
or in this particular case, 'Is the proportion of people that were not
inoculated and became infected significantly different from the proportion
of people that were inoculated and were infected?' In this situation,
certain tests for the reality or existence of the association can be used
(e.g. the chi square test), and other tests to determine the intensity of the
association (e.g. the Q formula) can be applied. The form in which the
question is posed, and the tests of the reality of association do not fit
the retrieval case. Any question such as 'Is the proportion of relevant
documents in the retrieved set significantly different from the proportion
in the set not retrieved?' does not make any sense in the retrieval situation.
In the retrieval situation it is two sets of ratios from the table that are
to be compared with one another by observing the relative changes in
the ratios as conditions are changed. The actual comparative proportions
do not need any test of significance. The tests of intensity of association
do reflect the situation when the retrieval case is perfect, and when it is
at its worst, and therefore provide one scale between the two extremes.
But the deficiencies of the composite measures have been noted, and the
tests seem to give no assistance to, or confirmation of, the twin variable
measures being used. The conclusion is that statistics does not help at all at this
point.
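
The two kinds of test mentioned above, and the pair of ratios used in the retrieval case, can be illustrated on a single 2 x 2 table. The following is only a sketch under stated assumptions: the chi square statistic is taken to be the ordinary Pearson form without continuity correction, 'the Q formula' is assumed to mean Yule's coefficient of association, Q = (ad - bc) / (ad + bc), and the cell labelling and the class and function names are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table2x2:
    """A 2 x 2 retrieval table (cell labelling is an assumption)."""
    a: int  # relevant and retrieved
    b: int  # non-relevant and retrieved
    c: int  # relevant and not retrieved
    d: int  # non-relevant and not retrieved


def chi_square(t: Table2x2) -> float:
    """Pearson chi square for a 2 x 2 table, no continuity correction."""
    n = t.a + t.b + t.c + t.d
    margins = (t.a + t.b) * (t.c + t.d) * (t.a + t.c) * (t.b + t.d)
    if margins == 0:
        raise ValueError("a row or column total is zero")
    return n * (t.a * t.d - t.b * t.c) ** 2 / margins


def yules_q(t: Table2x2) -> float:
    """Yule's Q, assumed here to be the 'Q formula' of the text."""
    ad, bc = t.a * t.d, t.b * t.c
    if ad + bc == 0:
        raise ValueError("Q is undefined when ad + bc is zero")
    return (ad - bc) / (ad + bc)


def recall(t: Table2x2) -> float:
    """Proportion of the relevant documents that were retrieved."""
    return t.a / (t.a + t.c)


def precision(t: Table2x2) -> float:
    """Proportion of the retrieved documents that were relevant."""
    return t.a / (t.a + t.b)


# Hypothetical counts, not taken from the report.
perfect = Table2x2(a=5, b=0, c=0, d=95)   # every relevant document, nothing else
worst = Table2x2(a=0, b=10, c=5, d=85)    # no relevant document retrieved
typical = Table2x2(a=4, b=6, c=1, d=89)

for name, table in (("perfect", perfect), ("worst", worst), ("typical", typical)):
    print(name, yules_q(table), recall(table), precision(table))
```

With these counts Q takes its two extreme values for the perfect and the worst case, which is the single scale referred to above, whereas recall and precision remain two separate ratios to be compared as conditions change.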
Averaging sets of results
To present reliable results of performance, the figures from a set of
questions must be averaged in some way. The size of the question set
required in order to give reliable results will not be considered here,
since there are many standard statistical tests to use in order to determine
the significance level of a set of results. It is obvious that the results
of individual questions will vary considerably, and some idea of the
magnitude of this variation may be gained from Figs. 3.16P and 3.17P.
In these plots of recall/precision, the individual results from a set of
questions are plotted, where single term natural language indexing is
being tested. Fig. 3.16P shows the points that result when any three out of
a possible total of seven of the search terms in each of thirty-one questions
are demanded in 'logical product' coordination. Fig. 3.17P shows points
from thirty-five questions when the level of search terms demanded in
coordination is varied from two to seven. The scatter is quite wide,
ranging from 11% recall at 1% precision in the bottom left corner, to
100% recall at 100% precision at the top right corner. However, a trend
is clearly present down the left side of the plot and at the bottom right
corner, with a tendency for results at a high coordination level to give
high precision and low recall, and with lower coordination levels giving
the reverse. Two different methods of averaging these results,
at each of the 'coordination levels', may be used.
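
The effect of the coordination level on a single question can be shown with a small sketch. It assumes that a document is retrieved when it has been indexed by at least the demanded number of the question's search terms; the function names, terms, documents and relevance decisions below are all hypothetical and are not data from the project.

```python
def coordination_match(query_terms, doc_terms, level):
    """True when the document carries at least `level` of the query terms."""
    return len(set(query_terms) & set(doc_terms)) >= level


def recall_precision(query_terms, docs, relevant, level):
    """Return (recall, precision) for one question at one coordination level.

    A ratio is returned as None when its denominator is zero.
    """
    retrieved = {
        doc_id
        for doc_id, terms in docs.items()
        if coordination_match(query_terms, terms, level)
    }
    hits = len(retrieved & relevant)
    rec = hits / len(relevant) if relevant else None
    prec = hits / len(retrieved) if retrieved else None
    return rec, prec


# Hypothetical question with four search terms and five indexed documents.
QUERY = ["boundary", "layer", "supersonic", "flow"]
DOCS = {
    "d1": {"boundary", "layer", "supersonic", "flow"},
    "d2": {"boundary", "layer"},
    "d3": {"boundary", "layer", "wing"},
    "d4": {"boundary", "panel"},
    "d5": {"flutter"},
}
RELEVANT = {"d1", "d2"}

for level in range(1, len(QUERY) + 1):
    print(level, recall_precision(QUERY, DOCS, RELEVANT, level))
```

In this invented case, raising the level from one to three moves the result from full recall at low precision to low recall at full precision, which is the direction of the trend described for the plots.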