Aslib Cranfield Research Project: Factors Determining the Performance of Indexing Systems: Volume 2

Conclusions chapter
Cyril Cleverdon, Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

page 14 of Volume 1) by anybody with reasonable knowledge of the subject field. On the other hand, if the evaluation is only intended to cover a sub-system of the complete operational system, such as the index language, then there is not the same necessity of having "user relevance" decisions; in fact, such decisions could introduce an additional variable which might militate against the interpretation of the test results, and a set of "stated relevance" decisions could be more satisfactory.
So far the argument has been concerned with the evaluation of operational systems. All the tests of experimental systems have been or are being conducted in artificial, created environments. Under such circumstances, "user relevance" decisions cannot be obtained, and in the few tests so far carried out, "stated relevance" decisions of one kind or another have been used. However, in this particular project, as explained in the first Volume (pages 21 - 23), an endeavour was made to simulate "user relevance" decisions. At the same time (and contrary to what was done in Cranfield I), we deliberately eschewed any effort to interpret the stated needs; in all cases the search terms were based solely on the terminology of the question.
Whether the original decision to simulate user relevance decisions was correct has already been considered (Vol. 1, page 114), and the tentative conclusion reached there was that it might have assisted the interpretation of the test results if, instead, stated relevance decisions had been used. On the whole, this is a view to which we would still subscribe but for one fact.
If stated relevance decisions had been used, and assuming the test results had shown a similar superiority of Single Term Natural Language, then it would have been virtually impossible to refute an argument that the results were unduly influenced by the relevance decisions.
In the artificial situation, a person - or a group of persons - is presented with a search question (which may have been devised by someone else) and a set of documents (or their surrogates in the form of titles or abstracts) and told to make a series of decisions as to which documents are relevant. He can be given specific instructions, such as the type of person that he is supposed to be or the purpose for which he is supposed to require the information. Whatever such instructions he may receive, he is ultimately faced with a sequence of words which make up the question, and other sequences of words which make up the documents, and by the intensity with which the words and the meaning of the question appear to match the words and the meaning of a document, he must decide that a given document is or is not relevant to a given question.
In this artificial situation it seems reasonable to assume - and such experimental evidence as is available bears out the assumption - that there will be a closer direct match between the actual words of a question and a relevant document than is the case in the natural situation of a questioner making user relevance decisions. Conversely, and just as important, there will, in the artificial situation, be a lower match between the question and a non-relevant document than will often be the case with user relevance judgements. Under such circumstances, it is highly probable that system performance will be better with stated relevance decisions than with user relevance decisions, since a source of possible error in the complete system has been eliminated.
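The effect described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the question, the four documents, the two relevance sets, the simple word-overlap matching rule and the function names are all invented here and are not taken from the Cranfield test collection or its procedures. The "stated" set is chosen by word match with the question, while the "user" set includes one document that shares no words with it.

```python
# Toy illustration only: the question, documents and relevance sets are
# invented for this sketch and are NOT data from the Cranfield tests.

def term_overlap(question, document):
    """Count the distinct question words that also occur in the document."""
    return len(set(question.lower().split()) & set(document.lower().split()))

def retrieve(question, documents, min_overlap):
    """Return ids of documents sharing at least min_overlap words with the question."""
    return {doc_id for doc_id, text in documents.items()
            if term_overlap(question, text) >= min_overlap}

def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a single question."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

QUESTION = "boundary layer transition on swept wings"

DOCUMENTS = {
    "d1": "transition of the boundary layer on swept wings at low speed",
    "d2": "boundary layer transition measurements on a swept wing model",
    "d3": "laminar flow control by suction for aircraft drag reduction",
    "d4": "boundary layer transition on flat plates in supersonic flow",
}

# Hypothetical "stated relevance": a judge matching words and meaning.
STATED_RELEVANT = {"d1", "d2"}
# Hypothetical "user relevance": the questioner also values d3, which
# shares no words with the question, and has no use for d2.
USER_RELEVANT = {"d1", "d3"}

retrieved = retrieve(QUESTION, DOCUMENTS, min_overlap=4)
stated = recall_precision(retrieved, STATED_RELEVANT)
user = recall_precision(retrieved, USER_RELEVANT)

print("retrieved:", sorted(retrieved))
print("stated relevance (recall, precision):", stated)
print("user relevance   (recall, precision):", user)
```

In this invented case the same search output scores higher on both recall and precision against the stated relevance set than against the user relevance set, simply because the stated set was formed by the same kind of word matching that the search relies on.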
This is not an important factor in the present investigation, since the objective is not to obtain maximum performance per se but to compare the performance of different index languages. The important point is that stated relevance decisions which can