Aslib Cranfield Research Project: Factors Determining the Performance of Indexing Systems: Volume 2

Conclusions chapter
Cyril Cleverdon, Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

search on abstracts. Intermediary were the three levels of indexing done by the project staff. Figure 8.2T shows the normalised recall ratios obtained in these five cases, all using natural language terms.

Index Language                          Average No. of Terms   Normalised Recall Ratio
Titles                                  7                      59.76%
Level 1 Single Term Natural Language    14                     62.88%
Level 2 Single Term Natural Language    22                     63.57%
Level 3 Single Term Natural Language    33                     65.00%
Abstracts                               Approx. 60             60.94%

FIGURE 8.2T NORMALISED RECALL RATIOS FOR FIVE LEVELS OF EXHAUSTIVITY

There is the possibility that the selection of terms by the indexer was more descriptive of the document content than those terms used for the titles and the abstracts, but the main variable in these five results concerns the level of indexing exhaustivity. It would seem that while the titles were at too low a level of exhaustivity, the gradual increase in the level, up to an average of 33 terms, brought about an improvement in performance. However, the higher level of exhaustivity represented by the abstracts (probably about 60 terms per document) was too high, resulting in the retrieval of large numbers of additional non-relevant documents, so that the performance only represented a slight improvement on that obtained with titles.

This hypothesis is supported by the effect with titles and abstracts of enlarging the classes by the use of word forms. With titles, where it has been shown that the level of exhaustivity is too low, the use of word forms improves the normalised recall ratio from 58.94% to 59.76%.
With abstracts, however, no such improvement is noted; already there are too many terms and the use of word forms results in a fall from 60.94% to 60.82%. Admittedly this in itself cannot be considered a significant change, but taken in the context of the other results, appears to be of some importance.

The compilation of the dictionaries or schedules was done, in the main, by Mr. Jack Mills. Although there can be few people more competent in such work, there can obviously be no guarantee but that different classes in the Single Term index languages might have given an improved performance as compared to natural language. However, it seems unlikely that the classes prepared for the Simple Concept index languages could have been solely responsible for the relatively poor performance as compared to the Single Term index languages. With the Controlled Term index languages, the classes of terms were formed on the basis of groupings given in the Thesaurus of Engineering Terms of the Engineers Joint Council, yet the use of any groupings except Narrower Terms (Index Language III.2.a) resulted in a loss of performance.

In Chapter 3, the statement was made that for any given question, the total number of postings of the search terms of that question must be equal to the total number of retrievals at the various coordination levels. To explain this point with a simple example, assume the search programme is