ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text

Cyril Cleverdon, Jack Mills, Michael Keen
Cranfield

An investigation supported by a grant to Aslib by the National Science Foundation. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

CHAPTER 5

Formation of Index Languages

The indexing described in the last chapter provided a number of different index languages: first, one consisting of single terms in the natural language of the documents indexed; second, this initial language made more precise by the recognition of 'concepts', reflecting a first level of interfixed relations; third, a yet more precise language recognizing a further level of interfixed relations in the form of 'themes'; fourth, a language in which the relative importance of the terms was recognized (in the form of weights). Combinations of these provided still more precise languages; e.g., combination of the third and fourth.

Insofar as the indexing recognized substantive or lexical elements primarily and lacked the relational, or syntactic, device of role indicators, it was something less than completely exhaustive. But apart from this, all the precision devices had been accommodated, and the next step was to establish facilities for expanding the elementary classes and forming further index languages, i.e., to construct recall devices. This chapter deals with this activity in relation to

1. Single terms
2. Simple concepts
3. A pre-established thesaurus

1. Single-term classes

A preliminary task was to prune the natural language indexing of certain minor inconsistencies and variants which had inevitably crept in and which were not in themselves regarded as sufficiently serious methods of defining classes to warrant separate measurement.
These initial controls involved the following:

(1) Singular and plural forms were confounded;
(2) American and English and other variant spellings were confounded; e.g., gage and gauge, fiber and fibre, Von Karman and Karman;
(3) Certain qualifiers of terms (affixes, hyphenated forms which were sometimes separated, etc.) were disregarded; e.g., built-up, pitch-up, rolled-up, etc. were treated as built, pitch, rolled; ellipse-like, jetlike, etc. were treated as ellipse, jet;
(4) Numbers as qualifiers were separated and treated as separate terms; e.g., Mach 6 became 'Mach' and '6', and N.P.L. 18 x 4 (a wind tunnel) became 'N.P.L.' and '18 x 4'.

Table 5.1 gives the basic data regarding the number of single terms and their frequency of use after the above preliminary controls had been imposed. The full set of indexing terms is given in Appendix 5.1.

Salient points are: for a collection of 1,400 documents the total vocabulary was 3,094 terms, with reductions to 2,668 and 1,816 for the less exhaustive vocabularies (the reduction being based on the weights assigned to each term). The average number of terms used to index a document was 31.3, reduced to 25.2 and 12.9 respectively for the less exhaustive vocabularies. (A discussion of the problem of reduced vocabularies appears below.)

As to the use of different terms, whilst the average number of times a term was used was 14.2, this is not a very significant figure in view of the wide scatter. Of the 3,094 terms, 1,169 were used only once; one term (Flow) was used 942 times, another (Pressure) 720. The distribution curve for word-use is shown in Table 5.2, where it is compared with three other indexes with larger vocabularies. It can be seen that the distribution behaves as expected in view of the fact that it reflects a smaller vocabulary than the other three.
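
The figures quoted above for the full vocabulary can be cross-checked with a little arithmetic. The sketch below is not part of the original report; it only recombines the numbers given in the text (1,400 documents, 3,094 terms, 31.3 terms per document, 1,169 terms used once). The function name `summarize` and the constant names are hypothetical, chosen for illustration.

```python
# Figures quoted in the text for the full (most exhaustive) vocabulary.
N_DOCUMENTS = 1400             # documents in the collection
VOCABULARY_SIZE = 3094         # distinct single terms after the preliminary controls
AVG_TERMS_PER_DOCUMENT = 31.3  # average number of terms used to index a document
TERMS_USED_ONCE = 1169         # terms that were used only once


def summarize():
    """Recombine the quoted figures (illustrative helper, not from the report)."""
    # Total term assignments implied by the per-document average.
    total_assignments = N_DOCUMENTS * AVG_TERMS_PER_DOCUMENT
    # Mean number of times each vocabulary term was used.
    mean_uses_per_term = total_assignments / VOCABULARY_SIZE
    # Fraction of the vocabulary that was used exactly once.
    single_use_share = TERMS_USED_ONCE / VOCABULARY_SIZE
    return total_assignments, mean_uses_per_term, single_use_share


if __name__ == "__main__":
    total, mean_uses, share = summarize()
    print(f"total term assignments: about {total:.0f}")
    print(f"mean uses per term:     about {mean_uses:.1f}")
    print(f"terms used only once:   about {share:.1%} of the vocabulary")
```

The mean of roughly 14.2 uses per term agrees with the figure in the text, while more than a third of the vocabulary was used only once. This is why the average says little on its own: it is pulled upward by a small number of heavily used terms such as Flow and Pressure.
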
In fact, the frequency of use proved to be remarkably consistent with the well-known Zipf distribution of words according to their frequency of use in natural language texts. It will be seen that some 10% of the terms (the most