ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Test Design
chapter
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
in relation to the number of documents in the collection than had previously been
used, and the decision was to aim at 1,200 documents with 300 questions.
There was no readily available collection of questions which had actually been
used on some previous occasion. Even if there had been, it would not have been
possible to have the originators of the questions check the documents for relevance.
The method adopted, therefore, to obtain the documents and the questions was to
select a number of recently published research papers, mainly dealing with high
speed aerodynamics, but about 20% of which covered aircraft structures. The author
of each paper was to be requested to provide the basic problem, in the form of a
search question, which had been the reason for the research being undertaken, and
also to give some additional problems which had arisen in the course of his work.
At the same time he would be asked to state which papers in his list of references
were relevant to the various questions he had provided. It was intended that the docu-
ment collection would be made up of the papers that had been included as references.
'Relevance' is obviously a matter of degree. The problem in arranging for rele-
vance assessments to be made is to decide how many degrees of relevance can be
consistently recognised. In the test of the index of Western Reserve University, two
levels of relevance were used; previously, Swanson (ref. 18) had attempted ten
levels. The decision in this test was to use four levels of relevance; details of this
and the whole procedure of obtaining the questions and document collection are given
in chapter 3.
The references in any given paper might be expected to give a high proportion
of relevant documents to any question arising in connection with that paper, but at
the same time there was the probability that other documents in the test collection
would also be relevant. The author might have known about these documents but
have decided not to use them. Alternatively, he might not have been aware of
their existence; possibly they might have been published after he had finished his
work. While it was essential that there should be a complete cross-check of every
document and of every question, it was impracticable to send 1,200 documents to
each of 200 or so authors for them to make the assessments individually, so a
screening process was first necessary. This was to be done by recruiting a num-
ber of postgraduate students who would (hopefully) be able to eliminate most of the
non-relevant documents for each question. Then it would only be necessary to send
to each author those papers which had a reasonable possibility of being relevant,
so that each author could make a final decision concerning relevance.
We would forestall criticism of the method outlined above by admitting immediately
that it includes nothing which overcomes the basic problems of the meaning and deter-
mination of relevance. No-one is more aware than we are that relevance is a shifting notion,
certainly between individuals and often for the same individual at different times. Is there,
then, justification for the comments by Taube that any attempt to measure system performance
is useless, since such measurement must be based on relevance decisions? We would
strongly argue against this, for it is the very situation which an information retrieval
system has to face. Users do ask questions and then accept or reject the search output
in what might seem an arbitrary manner. The objective of the methods used in this test
was to get as near as possible in an experimental test to a real-life situation in
relation to relevance decisions. While they certainly represented an advance on the
methods in Cranfield I, it is not intended to suggest that the design was perfect; again it
is necessary to go back to the time when the test was designed, and say that in 1961 it
appeared to be the best technique that could be adopted for the particular requirements.
The experience of this test has shown not only its advantages, but also some disadvantages,
and these are briefly discussed in Chapter 8.