CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems
Volume 1. Design, Part 1. Text
Chapter: Formation of Index Languages
Cyril Cleverdon, Jack Mills, Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

Nevertheless, some of them were still exceedingly large and detailed - e.g., those reflecting spatial and shape characteristics. For these, and for the common categories of Properties, Processes, Operations, etc., the Thesaurus and Code Dictionary FROLIC produced at the David Taylor Model Basin (Ref. 28) proved very useful.

Interpretation of the various word forms, etc. referred to above was assisted by the file, compiled during indexing, of synonyms, definitions, decisions, etc., by the word frequency list, and by reference to the indexing sheets of individual documents where necessary.

Sample excerpts from the single-term hierarchies are given in Fig. 5.3. (The complete schedules appear as Appendix 5.3.) It must be emphasised that only those terms appear which were used in the indexing of the test collection. Whilst this resulted in very detailed schedules in some areas, these still cannot be regarded as exhaustive of the terms in the particular area. Sometimes, if they did not happen to occur in the test collection, quite important terms will be missing.

2. Simple concepts

In the previous section we described the establishment of index languages based entirely on single words, and indicated the limitations on the performance of synonyms and hierarchies imposed by this restriction. These limitations were accepted in order to allow the examination of the performance of the different devices applied to single terms, in the absence of any element of precoordination. The next step was to accept a degree of precoordination from the outset.
Examples have already been given of the sort of simple linking necessary if the meanings of some expressions in the natural language are not to be quite lost; e.g., 'Ground effect machine' must be retained as a single concept if loss of meaning is not to be suffered. The original indexing had, of course, included a statement of the 'concepts' in each document - it was in fact the first step taken in the actual procedure of indexing a document. These concepts were now taken as the basis for the production of new synonym and hierarchy languages.

'Concept' languages

In order to reduce the task of preparing these to reasonable proportions it was decided to take a substantial subset of the full collection of 1400 documents and to make a detailed classification schedule for all the terms appearing in it. The subset consisted of some 200 documents, containing all the documents relevant to some 40 questions. In order to make the new collection reasonably homogeneous, only aerodynamics documents were included.

The performance of the index languages in this same subset was subsequently measured separately for a controlled language (based on a thesaurus) and for the 'options' investigated by G. Salton and his colleagues at the Harvard Computation Laboratory (the SMART system). Figures for the single term languages for the subset had already been obtained - they had simply to be extracted from the figures for the full collection.

No reindexing was attempted, of course, since this would have invalidated comparisons with the single-term tests. One adjustment was made, however; the