CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Testing Techniques
chapter
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
It can be seen that for the search term Flow, the appropriate information which was
first posted on the sheet shown in Fig. 6.1 for documents 1931-1992 has now been
included in the second column of Fig. 6.6. The information relating to the other
starting terms would have come from similar strips. As an example, the search
sheet reveals that in document 1966 Nature did not appear, but the quasi-synonym
Property (coded K) was indexed at a weight of 7. Flow was indexed at a weight of 9.
Compressible did not appear, but it was present in the variant word form Compressibility
(F) with a weight of 10, while Channel was indexed at a weight of 10. The remaining
three starting terms did not appear in any way in this document.
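The reading of one column of a search sheet can be sketched as a small data structure. This is an illustrative Python sketch only; the project's procedure was entirely manual. The names `Posting`, `doc_1966` and `describe` are hypothetical, code "1" is taken to mark the natural language term itself, and the three starting terms not named in the text are omitted rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    """One cell of a search sheet: how a starting term appears in a document."""
    indexed_as: str  # the term actually present in the document's indexing
    code: str        # "1" = the starting term itself; a letter = a related term
    weight: int      # indexing weight recorded for that term


# Column for document 1966, transcribed from the example in the text.
# Starting terms with no posting are simply absent from the mapping.
doc_1966 = {
    "Nature": Posting("Property", "K", 7),
    "Flow": Posting("Flow", "1", 9),
    "Compressible": Posting("Compressibility", "F", 10),
    "Channel": Posting("Channel", "1", 10),
}


def describe(term: str, column: dict) -> str:
    """Return a one-line reading of a starting term's entry in a column."""
    posting = column.get(term)
    if posting is None:
        return f"{term}: not posted"
    if posting.code == "1":
        return f"{term}: indexed directly, weight {posting.weight}"
    return (f"{term}: present as {posting.indexed_as} "
            f"(code {posting.code}), weight {posting.weight}")


for term in doc_1966:
    print(describe(term, doc_1966))
```

Keeping the code letter alongside the weight means the same column can later be read under different search rules without re-posting anything.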
When the search sheets had been printed, the 'boards' were dismantled, the strips
sorted into order and redistributed into the beehive ready for further use with another
search question. The boards finally used were of rigid hardboard, together with
'bulldog' type clips; earlier trials with cardboard sheets and perspex covers had failed
because the strips moved out of position too easily. The time taken to mount a question
on to the boards varied with the number of starting terms, but usually took between
thirty and sixty minutes. The xeroxing and checking took ten to fifteen minutes, and
redistribution of the strips a further ten to fifteen minutes. A minority of questions
had more than eleven starting terms, and therefore needed two sets of sheets. It
was usually possible to pick two questions with quite different sets of starting terms,
so that both questions could be prepared at the same time. A system of double checking
the search sheets was used to correct any errors which occurred; these were usually
due to misfiling of individual strips in the redistribution stage. While this method
might seem cumbersome, it appears to have been justified by results, since it gave the
flexibility that was required, and although expensive in man-hours, was relatively cheap
compared with what any form of machine searching would have cost.
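The timings quoted above can be added up to give a rough range per question. The short Python sketch below is only that arithmetic; it assumes a question needing a single set of sheets and ignores both the saving from preparing two questions together and the extra work for questions with more than eleven starting terms. The name `stages` is hypothetical.

```python
# Minutes per question as (low, high), as quoted in the text.
stages = {
    "mounting strips on boards": (30, 60),
    "xeroxing and checking": (10, 15),
    "redistributing strips": (10, 15),
}

low = sum(lo for lo, _ in stages.values())
high = sum(hi for _, hi in stages.values())

print(f"{low}-{high} minutes per question")  # prints "50-90 minutes per question"
```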
The end result of this exercise was that we had 361 sets of search sheets, 23 sheets
in each set, posted with all the occurrences of the terms to be used in searching each
question; there were, in fact, 361 question-indexes, and it was now possible to carry
out the first series of searches. These were performed on single terms, and
investigated three variables.
1. The recall devices of synonyms, word endings and quasi-synonyms,
tested in six aggregations (known as 'index languages').
2. The precision device of simple coordination without any linking in the
indexing, where the search rules allowed any combination of terms to be
accepted, and every level of matching to be recorded.
3. The three levels of indexing exhaustivity, indicated by the weights (5-6,
7-8 and 9-10).
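Simple coordination with every level of matching recorded can be sketched as a count of the starting terms posted for a document. The Python sketch below makes two assumptions that the text here does not state: that a level of exhaustivity is applied as a minimum weight, so that the bands are cumulative, and that every recorded posting is accepted whatever its code. The names `coordination_level` and `weights_1966` are hypothetical; the weights are those given for document 1966 in the example above.

```python
from typing import Dict


def coordination_level(weights: Dict[str, int], min_weight: int) -> int:
    """Count the starting terms posted at or above the given weight."""
    return sum(1 for w in weights.values() if w >= min_weight)


# Weights recorded for document 1966 against four of the starting terms.
weights_1966 = {"Nature": 7, "Flow": 9, "Compressible": 10, "Channel": 10}

# Assumed cumulative thresholds, one per weight band in the text.
for min_weight in (9, 7, 5):
    level = coordination_level(weights_1966, min_weight)
    print(f"weights {min_weight}-10: {level} starting terms matched")
```

Under these assumptions the document matches three starting terms when only weights 9-10 are admitted, and four once weights of 7 and above are admitted.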
The six index languages investigated in the first series of tests were as follows:
Index
Language
   1      Natural language terms (code 1)
   2      Natural language terms + synonyms (codes 1 and A-D)
   3      Natural language terms + word forms (codes 1 and E-J)
   4      Natural language terms + synonyms + word forms (codes 1, A-D and E-J)
   5      Natural language terms + synonyms + quasi-synonyms (codes 1, A-D and
          K-Z)
   6      Natural language terms + synonyms + word forms + quasi-synonyms
          (codes 1, A-D, E-J and K-Z)
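The six index languages amount to six sets of acceptable codes, which can be sketched as follows. This Python sketch is illustrative only; the names `INDEX_LANGUAGES` and `languages_accepting` are hypothetical, and the code ranges are those listed in the table.

```python
import string

NATURAL = {"1"}
SYNONYMS = set("ABCD")                              # codes A-D
WORD_FORMS = set("EFGHIJ")                          # codes E-J
QUASI_SYNONYMS = set(string.ascii_uppercase[10:])   # codes K-Z

# Codes accepted as a match under each index language.
INDEX_LANGUAGES = {
    1: NATURAL,
    2: NATURAL | SYNONYMS,
    3: NATURAL | WORD_FORMS,
    4: NATURAL | SYNONYMS | WORD_FORMS,
    5: NATURAL | SYNONYMS | QUASI_SYNONYMS,
    6: NATURAL | SYNONYMS | WORD_FORMS | QUASI_SYNONYMS,
}


def languages_accepting(code: str) -> list:
    """Return the index languages under which a posting with this code counts."""
    return [n for n, codes in sorted(INDEX_LANGUAGES.items()) if code in codes]


print(languages_accepting("K"))  # a quasi-synonym, e.g. Property for Nature
print(languages_accepting("F"))  # a word form, e.g. Compressibility
```

A posting coded K therefore counts only under languages 5 and 6, while one coded F counts under languages 3, 4 and 6.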