CRANV2 Aslib Cranfield Research Project: Factors Determining the Performance of Indexing Systems: Volume 2
Conclusions chapter
Cyril Cleverdon
Michael Keen
Cranfield

An investigation supported by a grant to Aslib by the National Science Foundation. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

In Chapter 6, the results were given for a set of questions dealing with aircraft structures, where, it has been earlier suggested, the subject language is less mushy. The results are not easy to interpret, but it appears probable that the assumption that aerodynamics represents the mushier language was unjustified. In the final chapter of Volume 1, we said that "It would seem that, next to the question of relevance assessments, the determination of the effect of subject language precision is the most important problem to be tackled". This opinion still holds, and we find it impossible to say categorically that the subject area of the test collection did not have an influence on the comparative test results.

Undoubtedly the size of the test collection (on which the normalised recall ratios are based) is smaller than one would have liked. The test results presented in Chapter 4, Section 1, show that the smaller sets of documents and questions were representative of the complete document collection and question set, but these tests were only concerned with the Single Term index languages, and it will be necessary to await confirmation on this point from the tests being carried out using the complete collection with the SMART system. However, there appears to be no justification for suggesting that the size of the test collection could have significantly affected the comparison between systems.

A matter that has already been raised in reviews of Volume 1 (e.g. Ref. 14), and will undoubtedly be argued again, is the matter of relevance decisions used in this test.
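For reference, the normalised recall ratios mentioned above are not restated in this chapter; the sketch below follows the usual SMART-style definition, which may differ in detail from the exact form used in the test. With n relevant documents retrieved at ranks r_1 ... r_n in a collection of N documents, normalised recall is 1 - (sum of actual ranks - sum of ideal ranks) / (n(N - n)).

```python
# Illustrative sketch (not taken from the report): the usual definition
# of normalised recall for a single question.

def normalised_recall(relevant_ranks, collection_size):
    """Normalised recall for one question.

    relevant_ranks  -- 1-based ranks at which the n relevant documents
                       were retrieved
    collection_size -- N, the total number of documents in the collection
    """
    n = len(relevant_ranks)
    ideal = n * (n + 1) // 2  # best case: relevant documents at ranks 1..n
    return 1.0 - (sum(relevant_ranks) - ideal) / (n * (collection_size - n))
```

A perfect ranking (all relevant documents at the top) scores 1.0, and the measure degrades toward 0 as the relevant documents sink in the ranking; the dependence on N is why collection size matters to the ratios.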
It was in fact considered in the earlier volume, and the reader is referred in particular to the table on page 14 of Volume 1. However, since that section was written, the matter of relevance has become the object of research and investigation in its own right, and it may be worth reopening and expanding the argument in the hope that some of the complexities introduced by psychological overtones might be clarified.

Consider first the matter of the evaluation of an operational information retrieval system, which we have earlier described as covering all stages from the first receipt of an enquiry to the stage of supplying the requester with the references to the set of documents (or, if the system is so designed, to an actual set of documents) which represent the system's answer to his enquiry. It is particularly stressed that the process starts with the first receipt of an enquiry. This enquiry is expressed in the form of a "stated requirement"; anyone with practical experience of information work will know that quite often the stated requirement is far removed from the real needs of the questioner. The greater the expertise of the information staff concerned, the greater the probability that it will be possible, before commencing a search, to reduce the gap between the real and stated needs of the enquirer.

However, in such a situation, namely the evaluation of an operational system, it is essential that the relevance assessments should be based on the real needs of the questioner; it therefore follows that the questioner must make the relevance judgements. Only if this is done can it be found whether there are any errors (i.e. the retrieval of non-relevant documents, or the non-retrieval of relevant documents) which are due to a failure to bridge the gap between the real and the stated needs. At the same time, however, it is necessary to determine the relevance of documents in relation to the stated needs.
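The two-level analysis described above, judging each retrieved set once against the real need and once against the stated requirement, can be sketched as follows; all names and document identifiers here are illustrative, not taken from the test.

```python
# Illustrative sketch (not from the report): classifying retrieval
# failures for one question under two relevance judgement sets,
# "user relevance" (real need) and "stated relevance" (stated requirement).

def error_analysis(retrieved, stated_relevant, user_relevant):
    """Classify failures for one question under both relevance sets."""
    retrieved = set(retrieved)
    stated = set(stated_relevant)
    user = set(user_relevant)
    return {
        # relevant to the real need but missed by the search
        "missed_user_relevant": user - retrieved,
        # relevant to the stated requirement but missed by the search
        "missed_stated_relevant": stated - retrieved,
        # missed documents the user needed that the stated requirement
        # never covered: the gap between real and stated needs
        "real_vs_stated_gap": (user - stated) - retrieved,
        # retrieved but relevant under neither formulation
        "noise": retrieved - (stated | user),
    }

result = error_analysis(
    retrieved=["d1", "d2", "d5"],
    stated_relevant=["d1", "d3"],
    user_relevant=["d1", "d3", "d4"],
)
```

In this example, "d4" appears under "real_vs_stated_gap": it met the real need but not the stated requirement, so its non-retrieval is a failure of question formulation rather than of the search itself, which is exactly the distinction the two judgement sets are meant to expose.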
With these two sets of relevance judgements, it is possible to pinpoint the reasons for the failures in the complete system. These two types of relevance are called "user relevance" and "stated relevance". The former can only be decided by the questioner himself, but "stated relevance" can be determined (as has been argued in the table on