CRANV1P1 ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text Formation of Index Languages chapter Cyril Cleverdon Jack Mills Michael Keen Cranfield An investigation supported by a grant to Aslib by the National Science Foundation. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
High recall was obtained only at a low level of precision, and as soon as the latter was improved, a precipitate drop in recall ensued. A number of contributory causes of this were suspected. The ambiguities and inconsistencies of the language of aerodynamics suggested one. The match between the terms of the questions and those of the relevant documents, which was very poor in at least some cases, was another. The possibility of defective indexing was not thought to be very serious, in the sense that the exhaustive selection of keywords and phrases and the organization of these into concepts and themes appeared to be reasonable. But a failure to recognize fully the connectivity between the terms of the languages so far established undoubtedly caused some of the failures. Another possible factor was the unusual route by which the initial concept indexing had been translated into the different languages. In a real-life situation, this translation is done concurrently with the indexing itself, which is channelled into the controlled language as the first stage. The central elements in the test languages had so far been applied almost entirely retrospectively. Although there appeared to be no reason why this should have affected index performance, it seemed that validation of it as a method (by comparing it with a normally produced index) would be useful.
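The trade-off between recall and precision noted above can be illustrated with a small sketch. It is not the project's search programme: the toy index, the question, the relevance judgements, and the rule of retrieving on a minimum number of shared terms are all assumptions made for illustration.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search, as fractions in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


def search(index, question_terms, min_matches):
    """Retrieve documents sharing at least `min_matches` terms with the question."""
    question = set(question_terms)
    return {doc for doc, terms in index.items()
            if len(question & set(terms)) >= min_matches}


# Hypothetical toy index: document id -> index terms (not Cranfield data).
index = {
    "d1": {"boundary layer", "laminar", "flat plate"},
    "d2": {"boundary layer", "turbulent", "pipe"},
    "d3": {"shock wave", "laminar", "flat plate"},
    "d4": {"boundary layer", "heat transfer", "cone"},
}
question = ["boundary layer", "laminar", "flat plate"]
relevant = {"d1", "d3"}  # hypothetical relevance judgements

# A broad search (any one shared term) against a narrow one (all three terms).
broad = recall_precision(search(index, question, 1), relevant)
narrow = recall_precision(search(index, question, 3), relevant)
print("broad  (recall, precision):", broad)
print("narrow (recall, precision):", narrow)
```

In this toy case the broad search retrieves every relevant document along with as many irrelevant ones, while the narrow search retrieves only relevant material but misses half of it.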
One way in which improvements in performance were thought to be possible was by putting more sophistication into the search programmes (by distinguishing between terms of different potency, between different combinations of these, and so on). It was thought that maximum discrimination and control in searching implied the need for maximum discrimination and control in the indexing if optimum performances were to result. Again, although it was probable that the controls effected retrospectively were as valid as those imposed concurrently (as in indexing by a recognized, pre-established control language), the slight element of doubt suggested that it would be wise to demonstrate this. These considerations led to a decision to set up a conventional index with a different set of connectives based on a predetermined list of terms and to compare its operation with that of the natural language with retrospective controls already tested. For this, the Engineers' Joint Council Thesaurus of Engineering Terms (Ref. 28) was chosen as providing an up-to-date control language in the field of physical science and engineering, which contained clearly defined connectives grouped in a manner allowing convenient comparison with a number of the hierarchical searches described in the last section. A second subset of 350 documents was selected; this included the 200 documents from the first subset that was used in testing the concept hierarchies, thus allowing direct comparison with all previous programmes. As in the case of the simple concept languages, no reindexing was contemplated, only another translation of the indexing done originally, since reindexing would have introduced an immeasurable variable; but the production of the new indexing language simulated the normal indexing situation.
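The idea of translating existing indexing into a predetermined list of terms, rather than reindexing, can be sketched as follows. This is a minimal illustration under stated assumptions: the `USE` references, the `PREFERRED` list, and the function `translate` are hypothetical, and none of the entries are taken from the E.J.C. Thesaurus.

```python
# Hypothetical control list with illustrative "use" references.
# None of these entries are taken from the E.J.C. Thesaurus.
USE = {  # non-preferred term -> preferred term
    "airfoils": "aerofoils",
    "wing sections": "aerofoils",
}
PREFERRED = {"aerofoils", "boundary layer", "pressure distribution"}


def translate(natural_terms):
    """Map natural-language index terms onto the control list.

    Returns (controlled_terms, unmatched_terms). Unmatched terms are
    reported rather than guessed at.
    """
    controlled, unmatched = [], []
    for term in natural_terms:
        key = term.strip().lower()
        key = USE.get(key, key)          # follow a "use" reference if one exists
        if key in PREFERRED:
            if key not in controlled:    # synonyms collapse to one entry
                controlled.append(key)
        else:
            unmatched.append(term)
    return controlled, unmatched


# The original indexing is translated, not redone.
original_indexing = ["Airfoils", "wing sections", "boundary layer", "flutter"]
controlled, unmatched = translate(original_indexing)
print(controlled)
print(unmatched)
```

Terms left in `unmatched` would need an indexer's decision. The sketch covers only the translation step, not the concept analysis that precedes it in the normal indexing situation.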
In this situation, each document is subjected first to 'concept-analysis', when it is decided what the document is about, what its significant terms are and how these are related in concepts and themes. This is followed by the translation of this information into a particular index language, with pre-established controls as to the level of specificity to be allowed and the recognition of synonyms and of other connectives between terms and between concepts.

Production of controlled index language using E.J.C.

The main problem raised by the use of E.J.C. was due to the fact that a