ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
scientific community - at any rate in the field of aeronautical engineering - is interested
in documentation problems, and is willing to co-operate in helping to find an answer
to these problems. The selection of the comments from the authors (given in Chapter
3) is only a sample, illustrating various points, of the many interesting and encouraging
letters which were received.
Tied in to this method of obtaining the set of documents and questions was also
the matter of obtaining relevance assessments, and here some reservations have to be
admitted concerning the method adopted. This is not to suggest that there is any
experimental evidence of a better or more satisfactory technique, but
rather to say that the matter of relevance assessment is, without any doubt, the most
difficult intellectual problem - in fact, one of the very few remaining problems - in
the evaluation of information retrieval systems.
In the evaluation of operational systems, there will be many occasions when the
only satisfactory technique will be that of using actual questions for test searches,
with the questioners assessing the relevance of the documents retrieved at the time
when the information is required. Such would be the case if it was desired, for instance,
to investigate the effect of different levels of questioner participation in the search
programme. As soon as any deviation is made from this technique of operating in a
real-life and real-time situation, a less realistic method is being used, although there
will frequently be situations where this could be justified for economic or other practical
reasons. This latter point is certainly true of an evaluation of an operational system,
and it is equally true of the test of an experimental system, where no real user group
can be said to exist. A possible weakness of the method adopted in this test lies in
the fact that the subjectivity of the relevance assessments might have been such that
it would mask the variation in performance of the various devices which were being tested.
There is no experimental evidence of any kind at present available that makes it possible
to affirm that this is so, but the possibility is such that it requires investigation.
As stated earlier, the problem of relevance decisions is presently the most
serious in the field of evaluation, and is attracting the attention of many groups. There
is the very interesting work of Katter (ref. 33) in which a large number of people will
be asked to make 'distance' judgements between small sets of documents. In this work
the important aspect of the test design is to find which type of document surrogates
result in distance judgements which match most closely those judgements made by
assessing the complete documents. Then there is the work of Cuadra (ref. 34) where
up to one hundred individuals will be asked to assess a set of documents in the field of
information dissemination, storage and retrieval. Here the attempt will be to identify
and investigate the variables which influence an individual's response, and a somewhat
similar investigation is being directed by Rees at Western Reserve University (ref. 35).
More empirical is an investigation proposed by Cleverdon which is to be undertaken by
ASLIB. This is intended to identify the reasons why individuals reject documents which
apparently meet their requirements and, alternatively, why they accept as relevant
documents which, to a third person, seem no more acceptable than those rejected. This
investigation will be carried out on some 600 individuals in twelve different organisations
and, unlike the other three projects, the relevance assessments will be made in actual
operational conditions.
However, none of these investigations into relevance applies to the problem raised
in this test. Here the situation is that a series of tests on various index languages
has been carried out, where the scoring for each test is based on the relevance decisions
of individuals simulating, as far as possible, a real-life situation, with individual