CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Formation of Index Languages
chapter
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Nevertheless, some of them were still exceedingly large and detailed, e.g., those
reflecting spatial and shape characteristics. For these, and for the common
categories of Properties, Processes, Operations, etc., the Thesaurus and Code Dictionary
FROLIC produced at the David Taylor Model Basin (Ref. 28) proved very useful.
Interpretation of the various word forms, etc. referred to above was assisted
by the file, compiled during indexing, of synonyms, definitions, decisions, etc., by
the word frequency list, and by reference to the indexing sheets of individual docu-
ments where necessary.
Sample excerpts from the single-term hierarchies are given in Fig. 5.3. (The
complete schedules appear as Appendix 5.3.) It must be emphasised that only those
terms appear which were used in the indexing of the test collection. Whilst this
resulted in very detailed schedules in some areas, these still cannot be regarded as
exhaustive of the terms in the particular area. Sometimes, if they did not happen
to occur in the test collection, quite important terms will be missing.
2. Simple concepts
In the previous section we described the establishment of index languages
based entirely on single words, and indicated the limitations on the performance of
synonyms and hierarchies imposed by this restriction. These limitations were ac-
cepted in order to allow the examination of the performance of the different devices
applied to single terms, in the absence of any element of precoordination. The
next step was to accept a degree of precoordination from the outset.
Examples have already been given of the sort of simple linking necessary if
the meanings of some expressions in the natural language are not to be quite lost;
e.g., 'Ground effect machine' must be retained as a single concept if loss of meaning
is not to be suffered. The original indexing had, of course, included a statement of
the 'concepts' in each document - it was in fact the first step taken in the actual pro-
cedure of indexing a document. These concepts were now taken as the basis for the
production of new synonym and hierarchy languages.
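
The difference between single-term matching and matching on precoordinated
concepts can be illustrated with a minimal sketch. The two documents and their
index entries below are invented for illustration and are not taken from the test
collection; the function names are likewise hypothetical.

```python
# Illustrative only: the documents and index entries are invented,
# not drawn from the Cranfield test collection.

def single_terms(phrases):
    """Break every phrase into its lower-cased single words."""
    return {word for phrase in phrases for word in phrase.lower().split()}

def concepts(phrases):
    """Keep each phrase whole, as one precoordinated concept."""
    return {phrase.lower() for phrase in phrases}

def matches_single(doc, query):
    """True if every single word of the query occurs somewhere in the document."""
    return single_terms(query) <= single_terms(doc)

def matches_concept(doc, query):
    """True if every query concept occurs as a whole concept in the document."""
    return concepts(query) <= concepts(doc)

doc_a = ["Ground effect machine"]
doc_b = ["Ground run", "Machine tool", "Effect of heating"]  # unrelated subject
query = ["Ground effect machine"]

print(matches_single(doc_a, query), matches_concept(doc_a, query))
print(matches_single(doc_b, query), matches_concept(doc_b, query))
```

With single terms both documents match the question, although the second has
nothing to do with ground effect machines; with the phrase retained as one concept
only the first document matches.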
'Concept' languages
In order to reduce the task of preparing these to reasonable proportions it was
decided to take a substantial subset of the full collection of 1400 documents and to
make a detailed classification schedule for all the terms appearing in it. The subset
consisted of some 200 documents, containing all the documents relevant to some
40 questions. In order to make the new collection reasonably homogeneous, only
aerodynamics documents were included.
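
One possible reading of this selection can be sketched as follows. The sketch
assumes that the relevance judgments are held as a mapping from question to the set
of relevant document numbers, and that each document carries a subject label; both
structures, and all names and values used, are assumptions made for illustration
and do not describe the procedure actually followed.

```python
# A sketch under stated assumptions: 'relevance' maps each question to the
# set of documents judged relevant to it, and 'subject_of' maps each document
# to a subject label. Neither structure is described in the report.

def select_subset(relevance, questions, subject_of, subject="aerodynamics"):
    """Collect every document relevant to the chosen questions,
    keeping only those in the given subject field."""
    selected = set()
    for question in questions:
        selected |= relevance[question]
    return {doc for doc in selected if subject_of.get(doc) == subject}

# Invented toy data, not the real judgments.
relevance = {"q1": {1, 2}, "q2": {2, 3}, "q3": {4}}
subject_of = {1: "aerodynamics", 2: "aerodynamics",
              3: "structures", 4: "aerodynamics"}

subset = select_subset(relevance, ["q1", "q2"], subject_of)
print(sorted(subset))
```

In this toy case document 3 is dropped because it lies outside aerodynamics, and
document 4 because it is relevant only to a question that was not chosen.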
The performance of the index languages in this same subset was subsequently
measured separately for a controlled language (based on a thesaurus) and for the
'options' investigated by G. Salton and his colleagues at the Harvard Computation
Laboratory (the SMART system). Figures for the single-term languages for the subset
had already been obtained; they had simply to be extracted from the figures for
the full collection.
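
How such an extraction might be carried out is not stated. One plausible method,
given here only as a sketch, is to restrict both the retrieved and the relevant
documents for a question to those lying in the subset, and then to recompute
recall and precision ratios. The function name and the toy figures are assumptions.

```python
# A sketch of one plausible method; the report does not describe how the
# subset figures were extracted. All values below are invented.

def subset_ratios(retrieved, relevant, subset):
    """Recall and precision for one question, counting only documents
    that belong to the subset. A ratio is None when its base is empty."""
    retrieved_in = retrieved & subset
    relevant_in = relevant & subset
    hits = len(retrieved_in & relevant_in)
    recall = hits / len(relevant_in) if relevant_in else None
    precision = hits / len(retrieved_in) if retrieved_in else None
    return recall, precision

retrieved = {1, 2, 3, 9}      # retrieved from the full collection
relevant = {1, 2, 4, 8}       # judged relevant in the full collection
subset_docs = {1, 2, 3, 4}    # documents belonging to the subset

recall, precision = subset_ratios(retrieved, relevant, subset_docs)
print(recall, precision)
```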
No reindexing was attempted, of course, since this would have invalidated
comparisons with the single-term tests. One adjustment was made, however; the