ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text

CRANV1P1. Testing Techniques chapter.
Cyril Cleverdon, Jack Mills, Michael Keen. Cranfield.
An investigation supported by a grant to Aslib by the National Science Foundation. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

It can be seen that for the search term Flow, the appropriate information which was first posted on the sheet shown in Fig. 6.1 for documents 1931-1992 has now been included in the second column of Fig. 6.6. The information relating to the other starting terms would have come from similar strips. As an example, the search sheet reveals that in document 1966 Nature did not appear, but the quasi-synonym Property (coded K) was indexed at a weight of 7. Flow was indexed at a weight of 9. Compressible did not appear, but it was present in the variant word form Compressibility (F) with a weight of 10, while Channel was indexed at a weight of 10. The remaining three starting terms did not appear in any way in this document.

When the search sheets had been printed, the 'boards' were dismantled, and the strips were sorted into order and redistributed into the beehive ready for further use with another search question. The boards finally used were of rigid hardboard, together with 'bulldog' type clips; earlier trials with cardboard sheets and perspex covers had failed because the strips moved out of position too easily.

The time taken to mount a question on to the boards varied with the number of starting terms, but was usually between thirty and sixty minutes. The xeroxing and checking took ten to fifteen minutes, and redistribution of the strips a further ten to fifteen minutes. A minority of questions had more than eleven starting terms, and therefore needed two sets of sheets.
It was usually possible to pick two questions with quite different sets of starting terms, so that both questions could be prepared at the same time. A system of double checking the search sheets was used to correct any errors which occurred; these were usually due to misfiling of individual strips in the redistribution stage.

While this method might seem cumbersome, it appears to have been justified by results, since it gave the flexibility that was required and, although expensive in man-hours, was relatively cheap compared with what any form of machine searching would have cost.

The end result of this exercise was that we had 361 sets of search sheets, 23 sheets in each set, posted with all the occurrences of the terms to be used in searching each question; there were, in fact, 361 question-indexes, and it was now possible to carry out the first series of searches. These were performed on single terms, and investigated three variables.

1. The recall devices of synonyms, word endings and quasi-synonyms, tested in six aggregations (known as 'index languages').
2. The precision device of simple coordination without any linking in the indexing, where the search rules allowed any combination of terms to be accepted, and every level of matching to be recorded.
3. The three levels of indexing exhaustivity, indicated by the weights (5-6, 7-8 and 9-10).

The six index languages investigated in the first series of tests were as follows:

Index Language
1    Natural language terms (code 1)
2    Natural language terms + synonyms (codes 1 and A-D)
3    Natural language terms + word forms (codes 1 and E-J)
4    Natural language terms + synonyms + word forms (codes 1, A-D and E-J)
5    Natural language terms + synonyms + quasi-synonyms (codes 1, A-D and K-Z)
6    Natural language terms + synonyms + word forms + quasi-synonyms (codes 1, A-D, E-J and K-Z)
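
As a rough illustration of how a single question-index could be searched under the six index languages, the following Python sketch counts the number of starting terms matched in one document (its coordination level). It is not the procedure used in the project, which was carried out by hand on the search sheets. The postings are those quoted above for document 1966; treating Flow and Channel as code 1 (natural language) postings is an assumption, as are the names INDEX_LANGUAGES, DOC_1966 and coordination_level and the min_weight cut-off used to stand in for the exhaustivity levels. The three starting terms with no posting in this document are simply omitted.

```python
# Codes accepted by each index language, following the list of six languages.
NATURAL = {"1"}
SYNONYMS = set("ABCD")                     # codes A-D
WORD_FORMS = set("EFGHIJ")                 # codes E-J
QUASI_SYNONYMS = set("KLMNOPQRSTUVWXYZ")   # codes K-Z

INDEX_LANGUAGES = {
    1: NATURAL,
    2: NATURAL | SYNONYMS,
    3: NATURAL | WORD_FORMS,
    4: NATURAL | SYNONYMS | WORD_FORMS,
    5: NATURAL | SYNONYMS | QUASI_SYNONYMS,
    6: NATURAL | SYNONYMS | WORD_FORMS | QUASI_SYNONYMS,
}

# Postings for document 1966: starting term -> list of (code, weight).
# Assumption: Flow and Channel were posted under code 1.
DOC_1966 = {
    "Nature": [("K", 7)],         # via quasi-synonym Property
    "Flow": [("1", 9)],
    "Compressible": [("F", 10)],  # via word form Compressibility
    "Channel": [("1", 10)],
}


def coordination_level(postings, language, min_weight=5):
    """Count starting terms with at least one posting whose code is
    accepted by the index language and whose weight is >= min_weight
    (a hypothetical cut-off standing in for an exhaustivity level)."""
    accepted = INDEX_LANGUAGES[language]
    return sum(
        1
        for entries in postings.values()
        if any(code in accepted and weight >= min_weight
               for code, weight in entries)
    )


for language in sorted(INDEX_LANGUAGES):
    print(language, coordination_level(DOC_1966, language))
```

Under these assumptions the document matches two starting terms in language 1 and four in language 6, which shows how each added recall device can raise the level of matching for the same document.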