CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Documents and Questions
chapter
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
The third, and probably most significant, reason was the greater care taken
with the questions for the W.R.U. test. There appears to be no reason to apologise
for the fact that it was not possible to exercise such close control over the question
compilers when we had to obtain some 1,600 questions for Cranfield I, but by the
time of the W.R.U. test the importance of the matter had been accepted, and the
question compilers were personally selected and more adequately instructed.
In the W.R.U. test, an analysis was made of all documents in the collection
against each question and, as given in Appendix 3C of Ref. 3, 42 other documents
were assessed as being as relevant as the source documents. As a further check on
source document questions, the titles of these documents have also been matched
against the appropriate questions, using the list of terms generated with the original
114 source documents. Fourteen of these documents had a single term match with the
questions, so again the recall ratio was 33%, the same as with the source documents.
This appears to show fairly conclusively that, in the W.R.U. test, there was no
unnatural relationship between the terminology of questions and source document
titles, and lends support to the strongly-held view of the Aslib-Cranfield staff that
questions based on source documents can still be considered as being, in the right
circumstances, a convenient and economic device for testing I.R. systems.
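The recall figure quoted above follows directly from the counts given in the text; a minimal sketch of the calculation (the function name is illustrative, not part of the original report) is:

```python
# Recall check sketch, using only the figures stated in the text:
# of the 42 additional documents judged relevant, 14 had a
# single-term title match with their question.

def recall_ratio(matched: int, relevant: int) -> float:
    """Recall = matched relevant documents / total relevant documents."""
    return matched / relevant

additional_relevant = 42   # from Appendix 3C of Ref. 3
title_matched = 14         # documents with a single-term title match

print(f"{recall_ratio(title_matched, additional_relevant):.0%}")  # prints 33%
```

This reproduces the 33% recall ratio reported for the source documents themselves.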
Some unnatural relationship was clearly present in Cranfield I, but it is wrong
to conclude from this that whenever there is a substantial match between question
and title, the relationship is necessarily unnatural. Some proportion of questions
in a real life situation are bound to have some relevant documents with a close
question-title match, and if this were not the case then all Permuted Title or
K.W.I.C. indexes would be useless. However, although as explained earlier source-
document questions are not used in the present test, Swanson still expresses doubt
and comments on the present test method:- 'This is some improvement (since the
title-question correlation is probably diminished); but it is still dubious in principle
- a 'biased' or 'special' relationship between questions and relevant articles
persists' (Ref. 4). Although no evidence is presented to justify this statement, an
examination of some of the questions and their relevant documents has been made,
to find out the extent, if any, of the suggested biased relationship.
Using 35 of the questions*, and their associated 287 relevant documents, we
first examined the correlation between the questions and document titles. The words
and phrases of the questions were examined for a 'match' with the words and phrases
in the titles; generally only an identical word or phrase was considered a match,
except that synonymous word-ending variants were accepted. In terms of the
whole question, two levels of matching were distinguished:-
Level A: Strong Match. Two or more concepts or important subject words were
demanded. A single concept was accepted only if it was one of the vital ones in the
question, and in a few cases a single word was accepted as a vital or 'key' term
provided it was used fewer than twenty times in indexing.
Level B: Weak Match. These rules accepted any match down to a single word, provided
it was a subject content word. General descriptive words such as Problem,
System, Solution, Parameters, High, Large, etc. were not accepted.
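The two matching levels can be expressed as a simple procedure. The sketch below is an illustration only, not the project's actual working method: the function names, the stop list, and the vital/rare term sets are hypothetical stand-ins for the rules described above.

```python
# Illustrative sketch of the two matching levels described above.
# The rules are paraphrased from the text; all names and data are hypothetical.

GENERAL_WORDS = {"problem", "system", "solution", "parameters", "high", "large"}

def content_words(text):
    """Crude tokeniser: lower-cased words minus the general descriptive words."""
    return {w.lower().strip(".,?") for w in text.split()} - GENERAL_WORDS

def match_level(question, title, vital_terms=frozenset(), rare_terms=frozenset()):
    """Return 'A' (strong match), 'B' (weak match) or 'none'.

    Level A: two or more matching subject words, or a single match that is
    either a vital concept of the question or a 'key' term used fewer than
    twenty times in indexing (supplied here as rare_terms).
    Level B: any single subject-content word match.
    """
    common = content_words(question) & content_words(title)
    if not common:
        return "none"
    if len(common) >= 2 or common & vital_terms or common & rare_terms:
        return "A"
    return "B"
```

For example, a question and title sharing the words 'boundary', 'layer' and 'control' would score Level A, while a title sharing only the single non-vital word 'flutter' would score Level B.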
*These questions are the 7 search-term questions and appear as Question Set 1 in
the Appendices.