CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Additional Tests
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
CHAPTER 7
Additional Tests
The first year of the project coincided with a time when a number of groups,
who had been investigating various methods of statistical association, were becoming
interested in the possibility of putting their methods to the test, and we received
some enquiries regarding the possibility of the project test collection being used as
a 'common sample'. All such groups were, of course, working with computers, so
with the agreement of the National Science Foundation, it was arranged that a tape
should be prepared of the indexing for the 1400 documents in the test collection.
This was done by I.B.M. (U.K.) Ltd., and an example of the printout for document
1420 is given in Appendix 6.1.
In the end, for various reasons, none of the groups in America was able to
make use of these tapes. However, in England, Drs. Roger and Karen Needham
decided to use the Cranfield collection for a test of the 'clumping' technique developed
at the Cambridge Language Research Unit (ref. 29). Since the computer to be used
was the Atlas, it was necessary to prepare a set of paper tapes from the punched
cards. The problems involved in this are not for us to relate, but the indexing has
now been completed, and a copy of the printout for document 1420 is also given in
Appendix 6.1.
At a later stage in the project, when the results were coming through, a meeting
with Professor Salton made it clear that the research which he had been undertaking at
Harvard was basically along similar lines to the work at Cranfield, in that both groups
were concerned with comparing the performance of various index language devices.
The difference lay in the methods adopted for the clerical processes of the testing, and
the SMART programme (ref. 30) gave the flexibility of rapid testing of any set of
documents for which the necessary relevance assessments had been made in relation
to a set of questions, so long as these were in a subject field for which suitable
vocabularies had been prepared. The original testing of the SMART programme had
been carried out on a collection of abstracts dealing with computers, and for both
groups the prospect of using the programme to test the subset of documents taken from
the Cranfield project was very attractive. For Professor Salton, it gave the opportunity
of testing his programme in a different subject area; for us it opened up a completely
new field. There would be the opportunity for directly comparing the results of the
devices being investigated at Cranfield with the similar, but more complexly calculated,
devices used at Harvard. Secondly, there was the possibility that it would assist in
solving some of the interesting problems involved in the presentation of results. The
recall-precision curves, based on a series of cutoffs, were producing at Cranfield
quite different figures from the normalised recall and normalised precision based on
the ranked output at Harvard. This was only to be expected, since the method of
calculation was so different, but it was important to be able to find how to equate the
different sets of figures. The final point of interest was that though the Harvard
searching was normally done on document abstracts, the flexibility of the SMART
programme made it practical for the searches to be carried out on both the abstracts
and the indexing which had been done at Cranfield, thus providing for the first time
a comparison between searches based on abstracts and on indexing.
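The contrast between the two ways of presenting results can be illustrated with a
short sketch in Python. It is not taken from either project. The function names and
the toy ranking are hypothetical, and the formulae for normalised recall and
normalised precision are an assumption: they follow the definitions usually given in
the SMART literature, where r_i are the ranks of the n relevant documents in a ranked
collection of N documents. The sketch also assumes that every relevant document
appears somewhere in the ranking.

```python
import math


def cutoff_recall_precision(ranking, relevant, cutoff):
    """Recall and precision when only the first `cutoff` documents are retrieved."""
    retrieved = ranking[:cutoff]
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(relevant), hits / len(retrieved)


def relevant_ranks(ranking, relevant):
    """1-based ranks at which the relevant documents occur in the ranked output."""
    return [rank for rank, doc in enumerate(ranking, start=1) if doc in relevant]


def normalised_recall(ranking, relevant):
    """Assumed form: 1 - (sum r_i - sum i) / (n (N - n)), with i = 1..n."""
    total, n = len(ranking), len(relevant)
    ranks = relevant_ranks(ranking, relevant)
    ideal = range(1, n + 1)  # ranks the relevant documents would hold in a perfect ranking
    return 1 - (sum(ranks) - sum(ideal)) / (n * (total - n))


def normalised_precision(ranking, relevant):
    """Assumed form: 1 - (sum log r_i - sum log i) / log(N! / ((N - n)! n!))."""
    total, n = len(ranking), len(relevant)
    ranks = relevant_ranks(ranking, relevant)
    ideal = range(1, n + 1)
    numerator = sum(math.log(r) for r in ranks) - sum(math.log(i) for i in ideal)
    return 1 - numerator / math.log(math.comb(total, n))


if __name__ == "__main__":
    # Toy data: ten ranked documents, two of them relevant (at ranks 1 and 4).
    ranking = [f"d{i}" for i in range(1, 11)]
    relevant = {"d1", "d4"}
    for cutoff in (2, 4, 10):
        print(cutoff, cutoff_recall_precision(ranking, relevant, cutoff))
    print("normalised recall   ", normalised_recall(ranking, relevant))
    print("normalised precision", normalised_precision(ranking, relevant))
```

The cutoff method yields one recall-precision pair for each cutoff chosen, whereas
each normalised measure condenses the whole ranked output into a single figure
between 0 and 1, so the two sets of figures cannot be read against each other
without some such conversion.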
A member of the Cranfield group spent a week at Harvard, and as a result of
the visit, it was arranged that a subset of the collection, consisting of 200 documents
and 42 questions, should be processed at Harvard, and that searches should be made