CRANV2
Aslib Cranfield Research Project: Factors Determining the Performance of Indexing Systems: Volume 2
Conclusions
chapter
Cyril Cleverdon
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
In Chapter 6, the results were given for a set of questions dealing with
aircraft structures, where, it has been earlier suggested, the subject language
is less mushy. The results are not easy to interpret, but it appears probable
that the assumption that aerodynamics represents the mushier language was
unjustified. In the final chapter of Volume 1, we said that "It would seem,
that next to the question of relevance assessments, the determination of the
effect of subject language precision is the most important problem to be
tackled". This opinion still holds, and we find it impossible to say
categorically that the subject area of the test collection did not have an influence on
the comparative test results.
Undoubtedly the size of the test collection (on which the normalised
recall ratios are based) is smaller than one would have liked. The test
results presented in Chapter 4, Section 1, show that the smaller sets of
documents and questions were representative of the complete document
collection and question set, but these tests were only concerned with the
Single Term index languages, and it will be necessary to await confirmation
on this point from the tests being carried out using the complete collection
with the SMART system. However, there appears to be no justification for
suggesting that the size of the test collection could have significantly affected
the comparison between systems.
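For reference, the normalised recall ratio mentioned above can be sketched as follows. This is one standard formulation of the measure (one minus the scaled area between the actual and ideal recall curves); the exact definition used in the Cranfield tests is that given in Volume 1, and may differ in detail. The function name is our own illustration, not the report's.

```python
def normalised_recall(relevant_ranks, collection_size):
    """One standard formulation of normalised recall.

    relevant_ranks  -- ranks (1-based) at which the n relevant documents
                       were retrieved in the system's ordered output
    collection_size -- total number of documents N in the test collection

    Returns 1.0 for a perfect ranking (all relevant documents at the top)
    and approaches 0.0 as the relevant documents sink to the bottom.
    """
    n = len(relevant_ranks)
    ideal_sum = sum(range(1, n + 1))        # relevant docs ranked 1..n
    actual_sum = sum(relevant_ranks)
    return 1.0 - (actual_sum - ideal_sum) / (n * (collection_size - n))

# Perfect ranking in a 200-document collection:
print(normalised_recall([1, 2, 3], 200))        # 1.0
# Relevant documents scattered down the ranked list:
print(normalised_recall([5, 40, 120], 200))
```

Because the denominator involves the collection size N, the measure is sensitive to how large the test collection is, which bears directly on the concern raised in this paragraph.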
A matter that has already been raised in reviews of Volume 1
(e.g. Ref. 14), and will undoubtedly be argued again, is that of the
relevance decisions used in this test. It was in fact considered in the earlier
volume, and the reader is referred in particular to the table on page 14 of
Volume 1. However, since that section was written, the matter of relevance
has become the object of research and investigation in its own right, and it
may be worth reopening and expanding the argument in the hope that some of
the complexities introduced by psychological overtones might be clarified.
Consider first the matter of the evaluation of an operational information
retrieval system, which we have earlier described as covering all stages from
the first receipt of an enquiry to the stage of supplying the requester with the
references to the set of documents (or, if the system is so designed, to an
actual set of documents) which represent the system's answer to his enquiry.
It is particularly stressed that the process starts with the first receipt of
an enquiry. This enquiry is expressed in the form of a "stated requirement";
anyone with practical experience of information work will know that quite
often the stated requirement is far removed from the real needs of the
questioner. The greater the expertise of the information staff concerned, the
greater the probability that it will be possible, before commencing a search,
to reduce the gap between the real and stated needs of the enquirer.
However, in such a situation, namely the evaluation of an operational
system, it is essential that the relevance assessments should be based on the
real needs of the questioner; it therefore follows that the questioner must
make the relevance judgements. Only if this is done can it be found whether
there are any errors (i.e. the retrieval of non-relevant documents, or the
non-retrieval of relevant documents) which are due to a failure to bridge the
gap between the real and the stated needs. At the same time, however, it is
necessary to determine the relevance of documents in relation to the stated
needs. With these two sets of relevance judgements, it is possible to
pinpoint the reasons for the failures in the complete system.
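The way the two sets of relevance judgements pinpoint failures can be sketched as a simple set comparison. The sketch below is our own illustration of the reasoning in this paragraph; the function and category names are hypothetical and do not come from the report.

```python
def failure_analysis(retrieved, user_relevant, stated_relevant):
    """Classify retrieval failures against two relevance judgement sets.

    retrieved       -- documents returned by the system for the enquiry
    user_relevant   -- documents relevant to the questioner's real need
    stated_relevant -- documents relevant to the stated requirement

    Checking each failure against the stated-relevance set shows whether
    it stems from the searching system itself or from the gap between
    the real and the stated needs of the enquirer.
    """
    retrieved = set(retrieved)
    user_relevant = set(user_relevant)
    stated_relevant = set(stated_relevant)
    return {
        # Retrieved, not wanted, but matching the stated request:
        # the statement of the need, not the system, is at fault.
        "noise_from_statement_gap": (retrieved - user_relevant) & stated_relevant,
        # Retrieved and relevant to neither need: a system failure.
        "noise_from_system": (retrieved - user_relevant) - stated_relevant,
        # Wanted, covered by the stated request, yet missed: a system failure.
        "missed_despite_statement": (user_relevant - retrieved) & stated_relevant,
        # Wanted but outside the stated request: the statement gap
        # made retrieval impossible.
        "missed_from_statement_gap": (user_relevant - retrieved) - stated_relevant,
    }
```

A usage example: with `retrieved = {"d1", "d2", "d3"}`, real needs met by `{"d3", "d4", "d5"}`, and a stated requirement matching `{"d1", "d3", "d4"}`, the analysis attributes the noise document `d1` and the missed document `d5` to the statement gap, and `d2` and `d4` to the searching system.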
These two types of relevance are called "user relevance" and "stated
relevance". The former can only be decided by the questioner himself, but
"stated relevance" can be determined (as has been argued in the table on