CRANV2
Aslib Cranfield Research Project: Factors Determining the Performance of Indexing Systems: Volume 2
Simulated ranking and document output cut-off
chapter
Cyril Cleverdon
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
normalised recall ratio is shown for each index language by Search E and
Search A. It will be seen that there is an improvement with each language
of from 1 to 2 points.
Fig. 5.22T shows the ranking score sheet for Index Language I.1.a
with the 42 questions on the 200 document collection, but with the lowest
level of exhaustivity of indexing. Fig. 5.23P compares these results with those
obtained under similar conditions except that exhaustivity was at its highest
level (as Fig. 5.3T).
Four grades of document relevance were used in the tests, and the
effect on performance of each of the relevance grades has been considered
in Section 6 of Chapter 4. An alternative method of scoring performance
from that so far used would be to take account of these relevance gradings
by giving each document a weighting related to its relevance grading. The
use of the document output cut-off method and normalised recall permits this
to be done in what might be considered to be a meaningful manner. A simple
form of weighting is to give a score of 4 to those documents rated relevance
1, a score of 3 for documents of relevance 2, a score of 2 for documents
of relevance 3 and a score of 1 for documents rated relevance 4. The effect
of this would be that question 119, for instance, which has two documents
(1378 and 1667) rated relevance 2 and four documents (1324, 1666, 1670 and
2391) rated relevance 3 would now have a total "retrieval score" of
(2 x 3) + (4 x 2) = 14.
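The weighting scheme above can be sketched as follows. This is a minimal illustration, not code from the report; the weight table (4, 3, 2, 1 for relevance grades 1 to 4) and the document numbers for question 119 are taken from the text, while the function and variable names are invented for the example.

```python
# Relevance grade -> points, as described in the text.
WEIGHTS = {1: 4, 2: 3, 3: 2, 4: 1}

def retrieval_score(graded_docs):
    """Total weighted "retrieval score" for a question.

    graded_docs maps a document number to its relevance grade (1-4).
    """
    return sum(WEIGHTS[grade] for grade in graded_docs.values())

# Question 119: documents 1378 and 1667 at relevance 2,
# documents 1324, 1666, 1670 and 2391 at relevance 3.
q119 = {1378: 2, 1667: 2, 1324: 3, 1666: 3, 1670: 3, 2391: 3}
print(retrieval_score(q119))  # (2 x 3) + (4 x 2) = 14
```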
Referring back to Fig. 5.3T, the score sheet for this question would be
amended to show the weighting of each relevant document according to the
order in which the documents of the two levels of relevance were retrieved.
This was done for the 42 questions by Index Language I.1.a and the amended
score sheet is given as Fig. 5.24T. The recall ratio is now determined on
the total "points" score for the set of questions, which is 421. At a
document cut-off of 1, the recall ratio is therefore shown to be 14%,
and the recall ratios are similarly calculated for the other sixteen cut-off
groups. The normalised recall ratio is then calculated as being 67.12.
This procedure was repeated for five other index languages to find
whether the effect of a weighting score made any difference to their
comparative performance. As can be seen from Fig. 5.25T, there was in
each case an increase of approximately two points in the normalised recall,
so it does not appear that this method of weighting makes any significant
difference to the overall comparison.
The exercise was repeated using different weightings, with a score of 10
for documents rated relevance 1, a score of 5 for documents rated relevance
2, a score of 3 for documents rated relevance 3 and a score of 1 for
documents rated relevance 4. This resulted in a further small increase in
the normalised recall ratios, but made no significant difference in the
comparison between systems. This is not to say that some form of weighting
might not be useful in certain circumstances, but it would seem that it
does not have any particular value in this test.
In connection with the normalised recall ratio, it is obvious that there
is what could be considered a minimum figure which is based on the random
retrieval of the whole collection for every question. For instance, the three
relevant documents of Question 79 would, with random retrieval, be ranked
50, 100 and 150, while the seven relevant documents of Question 190 would
be ranked 25, 50, 75, 100, 125, 150 and 175. With this particular
document/question set, the normalised recall ratio based on this random