CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Formation of Index Languages
chapter
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
used ones) accounted for 68% of the indexing postings, and 30% accounted for 92% of
the postings, after which the curve flattens out.
Reduced vocabularies
Some explanation of the problem of vocabulary reductions referred to above
seems desirable. Generally speaking, all recall devices imply a smaller vocabulary
(with bigger classes), and precision devices imply a larger vocabulary (with smaller
classes). A class is enlarged by confounding two or more classes which previously
had a separate existence; contraction is the reverse process. By 'vocabulary', we
mean the total number of discrete indexing elements, lexical and syntactic (i.e.,
substantives and relational terms) provided in an index language. It may seem
surprising that links are included in a statement of vocabulary size, since they are
not discrete devices in the sense that they are countable in the way lexical terms
and roles are, but vary with the number of documents indexed. However, by the
fundamental criterion of whether they define particular classes which would not be
distinguished without them, they must be regarded as part of vocabulary size.
It should be noted that vocabulary size, under normal indexing conditions, is not
necessarily a determinant of the specificity possible in an index language. This is
because increased specificity is always obtainable by coordination; e.g., if the
vocabulary contains the terms Flow and Supersonic, the class Supersonic flow is
specifiable by coordinating these two terms. Theoretically it is possible to specify
almost anything in this way; e.g., Air x Cushion x Vehicle is a simple conjunction of
the separate terms normally used to name this thing. Even where a name in no way
defines the nature of the thing it represents, the thing may be specified uniquely by
a contrived analytical 'definition'; e.g., in the W.R.U. Semantic Code, Tempering is
represented by Process x Metal x Heat x (number), where the number is an arbitrary
code symbol distinguishing this particular heat process on metal from any other.
Perhaps the extreme example of the use of reduced vocabularies, with precise
description resting on the various conjunctions of a few fundamental terms, was the
Malvern experiment (Ref. 25).
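The coordination described above can be illustrated with a minimal sketch, assuming
each term carries a set of document numbers posted under it. The postings, the
document numbers and the function name coordinate are hypothetical and are not
taken from the test collection.

```python
# Hypothetical postings: term -> set of document numbers indexed by that term.
postings = {
    "Flow": {1, 2, 3, 5},
    "Supersonic": {2, 3, 7},
    "Air": {4, 8},
    "Cushion": {4, 9},
    "Vehicle": {4, 6, 9},
}


def coordinate(postings, *terms):
    """Return the documents posted under every one of the given terms."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


# The class 'Supersonic flow' is not in the vocabulary, but it is specifiable
# as the conjunction Flow x Supersonic.
supersonic_flow = coordinate(postings, "Flow", "Supersonic")

# A three-term conjunction, Air x Cushion x Vehicle.
air_cushion_vehicle = coordinate(postings, "Air", "Cushion", "Vehicle")
```

In this sketch the vocabulary holds five terms, yet classes more specific than any
single term are obtained by intersecting postings at search time.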
In the case of single-term classes without coordination, however, a reduced
vocabulary can be an absolute bar on the specificity possible. If no coordination is
used, a single-term vocabulary of 1,500 specifies only half the classes specified by
a 3,000-term vocabulary. So far as the testing of devices is concerned, there are two
different ways of effecting the expansion of classes. One is by an absolute reduction
of vocabulary, whereby the reduction is obligatory for all searches; the other is by
selective search programmes, whereby the effective reduction is permissive and may
or may not be utilized in a particular search. In the first case the reduction is
measurable (i.e., in terms of the number of discrete classes distinguishable) and in
the second it is not.
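The difference between the two ways of expanding classes may be easier to see in a
small sketch. It assumes a single-term index without coordination; the index, the
grouping table and all function names are hypothetical illustrations, not the
procedure used in the project.

```python
# Hypothetical index: document number -> set of specific index terms.
index = {
    1: {"Laminar flow"},
    2: {"Turbulent flow"},
    3: {"Heat transfer"},
}

# Hypothetical grouping: specific class -> broader class that absorbs it.
group = {"Laminar flow": "Flow", "Turbulent flow": "Flow"}


def class_count(index):
    """Number of discrete classes distinguishable in the index."""
    return len(set().union(*index.values()))


def reduce_obligatory(index, group):
    """Rewrite the index itself, so the merged classes cannot be separated again."""
    return {doc: {group.get(t, t) for t in terms} for doc, terms in index.items()}


def search(index, terms):
    """Return the documents posted under any of the given terms."""
    wanted = set(terms)
    return {doc for doc, posted in index.items() if posted & wanted}


def search_permissive(index, group, term, widen):
    """Leave the index intact and widen the class only if the searcher asks."""
    if not widen:
        return search(index, [term])
    broader = group.get(term, term)
    members = {t for t, g in group.items() if g == broader}
    return search(index, members | {term, broader})


reduced = reduce_obligatory(index, group)
```

With the obligatory reduction the loss is countable: class_count falls from 3 to 2.
With the permissive form the index still distinguishes three classes, and whether
any of them are confounded depends on the individual search.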
Obligatory reduction of vocabulary
Here, there is an absolute 'block reduction' (a block of classes being condensed
into one) in the number of classes recognized, and the indexer and searcher have no
option but to accept the confounding of more specific classes which is implied. This
was the case with reduction by synonym-control and by confounding of word forms. It
was also the case with the single-term hierarchies, although reduction by hierarchy
may be achieved permissively and was in fact done this way in the testing of 'concept'
hierarchies. This point is explained later on.
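As a rough sketch of block reduction by synonym-control and by confounding of word
forms, the following assumes two lookup tables that map each term to the single
class replacing its block. The tables, the sample vocabulary and the function name
condense are hypothetical; the actual groupings are not given in this section.

```python
# Hypothetical tables: each key is condensed into the class given as its value.
SYNONYMS = {"aerofoil": "airfoil"}
WORD_FORMS = {"heated": "heat", "heating": "heat", "flows": "flow"}


def condense(term):
    """Map a term to the one class that stands for its whole block."""
    term = term.lower()
    term = WORD_FORMS.get(term, term)   # confound word forms first
    return SYNONYMS.get(term, term)     # then apply synonym control


# Hypothetical unreduced vocabulary of seven terms.
vocabulary = ["Heat", "Heated", "Heating", "Flow", "Flows", "Airfoil", "Aerofoil"]

# After reduction, every term in a block is represented by the same class.
reduced_vocabulary = sorted({condense(term) for term in vocabulary})
```

Because the mapping is applied to every term at indexing and at searching, neither
the indexer nor the searcher can recover the distinctions that were condensed.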