CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Formation of Index Languages
chapter
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
High recall was obtained only at a low level of precision, and as soon as the latter
was improved, a precipitate drop in recall ensued.
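The trade-off described above can be illustrated with the two ratios in their usual form: recall as the proportion of relevant documents that are retrieved, and precision as the proportion of retrieved documents that are relevant. The following is a minimal sketch only; the function name and the document numbers are hypothetical and are not taken from the test collection.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    retrieved: iterable of document identifiers returned by the search
    relevant:  iterable of document identifiers judged relevant to the question
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents actually retrieved
    # Guard against empty sets so that a null search does not divide by zero.
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical question with four relevant documents (numbers are illustrative).
relevant_docs = {1, 2, 3, 4}

# A broad search retrieves all four, together with sixteen non-relevant documents.
broad = recall_precision(range(1, 21), relevant_docs)

# A tighter search retrieves only two documents, one of them relevant.
narrow = recall_precision({1, 5}, relevant_docs)

print("broad search  (recall, precision):", broad)
print("narrow search (recall, precision):", narrow)
```

In this made-up case the broad search has full recall at low precision, and the tighter search raises precision only by losing three of the four relevant documents.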
A number of contributory causes of this were suspected. The ambiguities and
inconsistencies of the language of aerodynamics suggested one. Another was the match
between the terms of the questions and those of the relevant documents, which was, in
some cases at least, very poor. The possibility of defective indexing was not thought
to be very serious in the sense that exhaustive selection of keywords and phrases
and the organization of these into concepts and themes appeared to be reasonable.
But a failure to recognize fully the connectivity between the terms of the languages
so far established undoubtedly caused some of the failures.
Another possible factor was the unusual route by which the initial concept
indexing had been translated into the different languages. In a real-life situation,
this translation is done concurrently with the indexing itself, which is channelled
into the controlled language as the first stage. The central elements in the test
languages had so far been applied almost entirely retrospectively. Although there
appeared to be no reason why this should have affected index performance, it seemed
that validation of it as a method (by comparing it with a normally produced index)
would be useful.
One way in which improvements in performance were thought to be possible
was by putting more sophistication into the search programmes (by distinguishing
between terms of different potency, between different combinations of these, and
so on). It was thought that maximum discrimination and control in searching implied
the need for maximum discrimination and control in the indexing if optimum
performances were to result. Again, although it was probable that the controls effected
retrospectively were as valid as those imposed concurrently (as in indexing by a
recognized, pre-established control language), the slight element of doubt suggested
that it would be wise to demonstrate this.
These considerations led to a decision to set up a conventional index with a
different set of connectives based on a predetermined list of terms and to compare
its operation with that of the natural language with retrospective controls already
tested. For this, the Engineers' Joint Council Thesaurus of engineering terms
(Ref. 28) was chosen as providing an up-to-date control language in the field of
physical science and engineering, which contained clearly defined connectives grouped
in a manner allowing convenient comparison with a number of the hierarchical
searches described in the last section. A second subset of 350 documents was
selected; this included the 200 documents from the first subset that was used in testing
the concept hierarchies, thus allowing direct comparison with all previous programmes.
As in the case of the simple concept languages, no reindexing was contemplated,
only another translation of the indexing done originally, since reindexing would have
introduced an immeasurable variable; but the production of the new indexing language
simulated the normal indexing situation. In this, each document is subjected first
to 'concept-analysis' when it is decided what the document is about, what its significant
terms are and how these are related in concepts and themes. This is followed by
the translation of this information into a particular index language, with pre-established
controls as to the level of specificity to be allowed and the recognition of synonyms
and of other connectives between terms and between concepts.
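The second stage described above, the translation of the concept analysis into a controlled language, can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure used in the project: the two control tables, the terms in them, and the names USE, BROADER and translate are all hypothetical and are not taken from the E.J.C. Thesaurus.

```python
# Hypothetical synonym control: natural-language term -> preferred term.
USE = {
    "aerofoil": "airfoils",
    "airfoil": "airfoils",
    "wing section": "airfoils",
}

# Hypothetical specificity control: preferred term -> broader term.
BROADER = {
    "airfoils": "aerodynamic configurations",
}


def translate(concept_terms, use=USE, broader=BROADER, generalize=False):
    """Translate the terms of one concept into the controlled language.

    Synonyms are merged into their preferred term. If generalize is True,
    each preferred term is also replaced by its broader term where one is
    listed, which lowers the level of specificity. Terms with no entry in
    the tables are kept as they stand (lower-cased).
    """
    translated = []
    for term in concept_terms:
        key = term.strip().lower()
        preferred = use.get(key, key)
        if generalize:
            preferred = broader.get(preferred, preferred)
        if preferred not in translated:  # drop duplicates created by merging
            translated.append(preferred)
    return translated


# Terms as they might come out of the concept analysis of one document.
concept = ["Aerofoil", "wing section", "boundary layer"]
print(translate(concept))
print(translate(concept, generalize=True))
```

The point of the sketch is that the original concept indexing is left untouched; only the tables change from one index language to another, which is why a further translation introduces no reindexing variable.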
Production of controlled index language using E.J.C.
The main problem raised by the use of E.J.C. was due to the fact that a