CRANV2
Aslib Cranfield Research Project: Factors Determining the Performance of Indexing Systems: Volume 2
Conclusions
chapter
Cyril Cleverdon
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
Before considering some of the particularly striking aspects of the
ranked order of effectiveness as given in Fig. 5.15T, there are certain
points to be noted about this table. The normalised recall ratios range from
65.82% to 44.64% and this range encompasses some 33 different index languages
plus 14 languages (or options) of the SMART system. It is impossible to
state here what is a significant difference; most people who have been
consulted agree that anything less than 1% is probably of doubtful significance,
but that a difference of 3% or 4% almost certainly represents a significant
change in performance. Rather than try to postulate on this point, we would
prefer to rely on the consistency with which certain actions have certain
effects.
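
As a rough illustration only, the informal rule of thumb quoted above can be sketched in Python. The function name, the three labels, and the treatment of differences between 1% and 3% as 'unclear' are assumptions made for this sketch and are not taken from the report. The differences are read here as absolute percentage points, which is also an assumption.

```python
def classify_difference(ratio_a, ratio_b, doubtful_below=1.0, significant_from=3.0):
    """Label the gap between two normalised recall ratios (both in percent).

    The thresholds mirror the informal opinion quoted in the text:
    under 1% is of doubtful significance, 3% or more is almost
    certainly significant. The middle band is an assumption of this sketch.
    """
    gap = abs(ratio_a - ratio_b)
    if gap < doubtful_below:
        return "doubtful"
    if gap >= significant_from:
        return "likely significant"
    return "unclear"  # hypothetical label for the band the text leaves open


if __name__ == "__main__":
    # The two extremes of the range quoted in the text.
    print(classify_difference(65.82, 44.64))  # likely significant
```

Such a rule is no substitute for the approach preferred in the text, which is to look for actions that have consistent effects across index languages.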
For convenience of discussion, the normalised recall table, with the
SMART results deleted, is reprinted as Fig. 8.1. It can be seen that the
Single Term index languages rank 1, 2, 3, 4, 5, 6, 7= and 12 with the
normalised recall ratio ranging from 65.82% to 61.17%. Starting from the
base of natural language (with a score of 65.00%), the use of synonyms and
word forms shows a slight improvement, whereas an enlargement of the
classes by quasi-synonyms and hierarchical grouping detracts from the
performance.
Of the six Controlled Term index languages, that using only the basic
terms gave the best performance, with a ranking of 10= and a normalised
recall ratio of 61.76%, this being a slight improvement on the lowest score
with a Single Term index language. As narrower, broader and related terms
are brought in, ranking orders for the other five Controlled Term index
languages are 10=, 15, 17, 18 and 19, with the lowest score being 59.17%.
The searches on abstracts and titles gave four languages which ranked
13, 14, 16 and 20, the range being from 60.94% to 58.94%. The abstracts
(which included titles) seem to be marginally better than the titles on their own.
It is interesting that, with the abstracts, the confounding of word forms
results in a slightly lower score, whereas the reverse is true with the titles.
The highest rank of the Simple Concept index languages is 7=, with a
normalised recall ratio of 63.05%. Another language in this group is ranked
9, but the other thirteen Simple Concept index languages occupy the final
ranks from 21 to 33. The two Simple Concept index languages which perform
reasonably well are - surprisingly - those where the selection of additional
related terms is based not on the classification schedules but on the rotated
alphabetical index (see Vol. 1, Appendix 5.5).
In Fig. 8.1 it is significant that Single Term Natural Language I.1.a has a
score of 65.00%, while Simple Concept Natural Language II.1.a has the lowest score
of 44.64%. There is only one difference between these two index languages.
In the former, the single terms are free; in the latter exactly the same single
terms are interfixed into concepts. Index Language II.1.a represents the
concept taken directly from the terminology of the document, e.g. 'conical
afterbody', 'centrifugal compressor'; Index Language I.1.a uses exactly the
same words, but they are broken down to the single terms, i.e. 'conical',
'afterbody', 'centrifugal', 'compressor'. It would therefore seem that
interfixing is such a powerful device that it can severely depress the performance
when calculated by the normalised recall ratio. Even when one considers the
performance by coordination level cut-off, it can be seen from Fig. 4.700T
and from the composite graph in Fig. 4.715P, that the Simple Concept Natural
Language II.1.a has a very low maximum recall ratio, which is not compensated for
by a particularly good precision ratio. Because it is so relatively inefficient,
one finds that, for the Simple Concept index languages, the broadening of