ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Test Design
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
The theoretical ideal is (1A)(1Aa), that is the use of actual questions with a relevance
assessment made at the time by the questioner from complete texts. This
cannot be achieved in an experimental situation, since there is no body of users who
can ask questions, nor would the experimental collection normally be of sufficient
size to justify actual searches. For this project, it was considered that the nearest
to the ideal would be the combination (2B)(1B/9+), that is questions which had been
asked, with a relevance assessment being made by the questioner, who would be a
scientist. (/9+) implies that nothing less than abstracts would be used; the expectation
would be that full texts would also be used. The wisdom and implications of
this choice will be considered in relation to the test results. What can be stated here
is that the operational performance characteristics of the system being tested will
almost certainly change depending on the combination of questioner and relevance
assessor used, and care should be taken in interpreting figures which do not define
how they have been obtained in this respect. A few illustrations of what can happen
may help to clear up this point. In the Documentation Inc. example previously quoted,
the precision ratio of 86.5% is very high. A probable reason is that it is based on
the relevance assessment of a member of the information staff; when the set of documents
is sent to the questioner, his relevance standards may be such that he will
grade the large majority as non-relevant, and the precision ratio would then drop
considerably.
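The effect described above can be sketched numerically. In this hypothetical illustration (all document counts and judgments are invented for the purpose, not taken from the Documentation Inc. test), the same retrieved set is scored against two different assessors' relevance judgments:

```python
# Hypothetical illustration: the precision ratio of one fixed retrieved set,
# computed under two different relevance assessments. All numbers are invented.

def precision(retrieved, judged_relevant):
    """Precision ratio: retrieved documents judged relevant, as a percentage
    of all retrieved documents."""
    hits = len(retrieved & judged_relevant)
    return 100.0 * hits / len(retrieved)

retrieved = set(range(52))            # 52 documents delivered to the questioner

staff_relevant = set(range(45))       # information staff accept 45 of them
questioner_relevant = set(range(12))  # the questioner accepts only 12

print(f"{precision(retrieved, staff_relevant):.1f}%")       # 86.5%
print(f"{precision(retrieved, questioner_relevant):.1f}%")  # 23.1%
```

Nothing about the search has changed between the two figures; only the relevance standard applied to the output differs, which is why a quoted precision ratio is uninterpretable without knowing who made the assessment.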
As another example, in a report of the evaluation of the EURATOM information
retrieval system (ref. 13), a precision ratio of 65% is given. The key to this high figure
is in the following sentence taken from the text of the paper. "Finally, the computer's
answers have to be checked, since it would be unreasonable to expect them to be 100%
complete and correct".
What has happened in this case is something rather different. The precision ratio
is not being calculated on the actual search output but on the search output after techni-
cal information staff have rejected the documents which they considered non-relevant.
A somewhat similar reason was the cause of some confusion at the NATO Advanced
Study Institute on evaluation of information retrieval systems, when Altmann, in
presenting the results of a test on the information retrieval system of the Harry
Diamond Research Laboratories (ref. 17), gave a figure of 80% for the precision ratio. In
this case, it appeared that the procedure was for the questioners, who were also making
the searches, to eliminate documents which, from title or abstract, appeared to be
non-relevant; this may give interesting information about the ability of users to
eliminate non-relevant information on the basis of the title but, as with the EURATOM
test, gives no information at all on the performance of the system in regard to precision.
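The inflation produced by screening the output before measurement can be made concrete. In this sketch the figures are invented (not drawn from the EURATOM report) purely to show the arithmetic of the distortion:

```python
# Hypothetical illustration: why computing precision after staff have rejected
# obviously non-relevant documents inflates the figure. Numbers are invented.

def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Precision ratio: relevant retrieved / total retrieved, as a percentage."""
    return 100.0 * relevant_retrieved / total_retrieved

# Suppose the raw search output is 200 documents, 52 of them relevant.
raw = precision(52, 200)        # 26.0% -- the system's actual precision

# If staff discard 120 non-relevant documents before the count is taken,
# the same 52 relevant documents now sit in an output of only 80.
screened = precision(52, 80)    # 65.0% -- the inflated figure

print(f"raw output:      {raw:.1f}%")
print(f"after screening: {screened:.1f}%")
```

The screened figure measures the combined performance of the system and the human filter, not the system alone, which is why such results say nothing about the precision of the retrieval system itself.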
The discussion so far has been dealing with precision ratios; while there is
still considerable doubt as to the most useful way, in an experimental situation, of obtain-
ing relevance assessments, once that assessment has been made the determination of
precision ratio is a straightforward matter. The same is not, however, true of recall
ratio, because this is dependent on the number of relevant documents which have not
been retrieved. This problem was effectively side-tracked in Cranfield I by the use
of source-document questions; since this method had been ruled out for the present
test, there was only one apparent alternative, namely to look at every document in
relation to every question. This decision automatically placed a restriction on the
size of the test collection and the number of questions to be searched. This was not
considered a serious handicap, since the W.R.U. test had shown that a collection of
only one thousand documents was sufficient to provide a considerable amount of data
for analysis. There seemed to be some advantage in having a larger number of questions