CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Formation of Index Languages
chapter
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
CHAPTER 5
Formation of Index Languages
The indexing described in the last chapter provided a number of different index
languages: first, one consisting of single terms in the natural language of the
documents indexed; second, this initial language made more precise by the recognition
of 'concepts', reflecting a first level of interfixed relations; third, a yet more precise
language recognizing a further level of interfixed relations in the form of 'themes';
fourth, a language in which the relative importance of the terms was recognized (in
the form of weights). Combinations of these provided still more precise languages,
e.g., a combination of the third and fourth.
Insofar as the indexing recognized substantive or lexical elements primarily
and lacked the relational, or syntactic, device of role indicators, it was something
less than completely exhaustive. But apart from this, all the precision devices had
been accommodated, and the next step was to establish facilities for expanding the
elementary classes and forming further index languages, i.e., to construct recall
devices. This chapter deals with this activity in relation to:
1. Single terms
2. Simple concepts
3. A pre-established thesaurus
1. Single-term classes
A preliminary task was to prune the natural language indexing of certain minor
inconsistencies and variants which had inevitably crept in and which were not in
themselves regarded as sufficiently serious methods of defining classes to warrant
separate measurement. These initial controls involved the following:
(1) Singular and plural forms were confounded;
(2) American and English and other variant spellings were confounded; e.g. gage and
gauge, fiber and fibre, Von Karman and Karman.
(3) Certain qualifiers of terms (affixes, hyphenated-forms which were sometimes
separated, etc.) were disregarded; e.g., built-up, pitch-up, rolled-up, etc. were
treated as built, pitch, rolled; ellipse-like, jetlike, etc. were treated as ellipse, jet.
(4) Numbers as qualifiers were separated and treated as separate terms; e.g., Mach 6
became 'Mach' and '6', N.P.L. 18 x 4 (a wind tunnel) became 'N.P.L.' and '18 x 4'.
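The four controls above can be sketched as a small normalization routine. This is only an illustration under stated assumptions, not the procedure used in the project, where the controls were applied editorially. The names `VARIANTS`, `singular` and `normalize` are hypothetical, as are the choice of preferred spelling, the lower-casing of terms, and the crude suffix rules for plurals and qualifiers.

```python
import re

# Hypothetical variant table built from the examples in rule (2).
# Which spelling is treated as preferred is an assumption.
VARIANTS = {"gage": "gauge", "fiber": "fibre", "von karman": "karman"}

# Qualifier suffixes taken from the examples in rule (3): "-up", "-like", "like".
# A crude pattern: it would also truncate unrelated words ending in "like".
QUALIFIER_SUFFIX = re.compile(r"(-up|-?like)$")

# A trailing numeric qualifier such as "6" or "18 x 4" (rule 4).
NUMERIC_TAIL = re.compile(r"^(.*?\D)\s+(\d+(?:\s*x\s*\d+)*)$")


def singular(word):
    """Rule (1), approximated: strip a final 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def normalize(term):
    """Return the list of index terms that one raw term reduces to."""
    term = term.strip().lower()  # lower-casing is an assumption
    numbers = []
    match = NUMERIC_TAIL.match(term)
    if match:  # rule (4): split off the numeric qualifier as its own term
        term, number = match.group(1).strip(), match.group(2)
        numbers.append(number)
    term = VARIANTS.get(term, term)          # rule (2), multi-word variants
    term = QUALIFIER_SUFFIX.sub("", term)    # rule (3)
    term = singular(term)                    # rule (1)
    term = VARIANTS.get(term, term)          # rule (2), after de-pluralizing
    return [term] + numbers


for raw in ["Mach 6", "N.P.L. 18 x 4", "built-up", "jetlike", "fibers", "Von Karman"]:
    print(raw, "->", normalize(raw))
```

A lookup table handles spelling variants because they are irregular, whereas plurals and qualifiers are handled by pattern. A real vocabulary would need exception lists for both.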
Table 5.1 gives the basic data regarding the number of single terms and their frequency
of use after the above preliminary controls had been imposed. The full set of indexing
terms is given in Appendix 5.1.
Salient points are: for a collection of 1,400 documents the total vocabulary
was 3,094 terms, with reductions to 2,668 and 1,816 for the less exhaustive
vocabularies (the reduction being based on the weights assigned to each term).
The average number of terms used to index a document was 31.3, reduced to
25.2 and 12.9 respectively for the less exhaustive vocabularies. (A discussion
of the problem of reduced vocabularies appears below.)
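These figures also determine how often, on average, each term in the full vocabulary was used. The short calculation below is a sketch that assumes the average is simply total term assignments divided by vocabulary size. The variable names are illustrative, and the numbers are those quoted above.

```python
documents = 1400           # size of the collection
vocabulary = 3094          # distinct single terms, full vocabulary
avg_terms_per_doc = 31.3   # average terms assigned per document

# Total term assignments across the collection (about 43,820).
total_assignments = documents * avg_terms_per_doc

# Average number of times each vocabulary term was used.
avg_uses_per_term = total_assignments / vocabulary
print(round(avg_uses_per_term, 1))  # 14.2
```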
As to the use of different terms, whilst the average number of times a term was
used was 14.2, this is not a very significant figure in view of the wide scatter. Of the
3,094 terms, 1,169 were used only once; one term (Flow) was used 942 times, another
(Pressure) 720. The distribution curve for word-use is shown in Table 5.2 where it
is compared with three other indexes, with larger vocabularies. It can be seen that
the distribution behaves as expected in view of the fact that it reflects a smaller
vocabulary than the other three. In fact, the frequency of use proved to be remarkably
consistent with the well-known Zipf distribution of words according to their frequency
of use in natural language texts. It will be seen that some 10% of the terms (the most