CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Indexing Procedures
chapter
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
weight of ten. If it failed to appear in any one of these positions its weight was reduced
by one point for each failure to appear. Since the indexing was strictly according to the
natural language of the document this meant that all terms received some weight insofar
as no terms were used which did not appear somewhere in the document.
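The deduction rule above can be sketched in Python. This is only an illustration: the positions a term was checked against are not listed in this passage, so they are passed in as a parameter, and the labels P1 to P3 and the function name are hypothetical placeholders rather than the project's own categories.

```python
def positional_weight(term_positions, required_positions, top_weight=10):
    """Start from the top weight and subtract one point for each
    required position in which the term fails to appear."""
    failures = sum(1 for p in required_positions if p not in term_positions)
    return top_weight - failures


# Hypothetical position labels, for illustration only.
POSITIONS = ["P1", "P2", "P3"]

print(positional_weight({"P1", "P2", "P3"}, POSITIONS))  # 10
print(positional_weight({"P2"}, POSITIONS))              # 8
```

Because every indexed term came from the document itself, every term ends up with some weight, even one that misses all the listed positions.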
Secondly, for the rest of the collection, weights were assigned subjectively. In
most of the literature on weights, notably the pioneering articles by Maron and others
(Ref. 24), weights were assigned subjectively to individual terms. This proved
unsatisfactory; e.g. in a document in which 'Low aspect ratio wings' constitutes a
central theme, can the single terms Low or Aspect or Ratio possibly be regarded as
crucial in themselves? The significance of a term in a document is very often lost if
the term is robbed of its context. So weights were assigned to concepts and the indi-
vidual terms within a concept received the weight given to that concept.
A range of six different weights was again adopted. This time it attempted to
combine a measure of the importance of the concept in relation to the total message
of the document (which is another way of referring to the document's probable rele-
vance to a question entailing the concept) with an assessment of its significance in
retrieval terms, i.e., its potency as a retrieval handle in the particular collection
indexed. The subject significance was measured by reference to a trio of values:
these assume that a document can normally be regarded as consisting of its integrated
subject as a whole (its main theme), together with one or more subsidiary themes,
which may vary in importance from quite major component themes to quite minor,
marginal themes. This assumption is a simple extension of the analysis into themes
already described as 'partitioning'. According to the status of the theme in this scheme
it received weights of 9/10, 7/8 or 5/6:
Weights
9/10    For concepts in the main general theme of the document
7/8     For concepts in a major subsidiary theme
5/6     For concepts in a minor subsidiary theme.
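As a sketch, the three weight pairs can be held in a small lookup table keyed by theme status. The key strings and the function name are illustrative assumptions; only the numeric pairs come from the text.

```python
# Weight pairs (lower, higher) by status of the theme a concept belongs to.
THEME_WEIGHTS = {
    "main": (9, 10),
    "major subsidiary": (7, 8),
    "minor subsidiary": (5, 6),
}


def weight_pair(theme_status):
    """Return the (lower, higher) weight pair for a theme status."""
    return THEME_WEIGHTS[theme_status]


print(weight_pair("major subsidiary"))  # (7, 8)
```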
When assigning the weights to the individual terms, the higher weight of the pair
assigned to the concept concerned was used if the term was considered to be a very
potent one; potency here was regarded as a mixture of word frequency in the total
collection (indicating roughly the generality or otherwise of the request in the context
of the particular collection), and 'concreteness', whether the term was likely to be
requested as the focal point of a subject or whether it was too vague to be the object of
a direct and separate request. For example, in Document 1590 (Figure 4.1) the
concepts of Themes A and B were each allocated the top weight; the individual terms
which were considered potent (e.g. Stage, Matching, Compressor, etc.) received a
weight of 10; Flow (on the grounds that it was a very common term) and Test, Data, Analysis
(on the grounds of vagueness) received the lower top weight of 9.
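The choice between the two weights of a pair can be sketched as follows. Potency was a subjective judgement by the indexer, so here it is simply supplied as a true/false flag per term; the function name and the data layout are assumptions, while the terms and weights are those quoted for Document 1590.

```python
def term_weight(pair, potent):
    """Give a potent term the higher weight of the concept's pair,
    and any other term the lower weight."""
    return max(pair) if potent else min(pair)


TOP_PAIR = (9, 10)  # pair allocated to the concepts of Themes A and B

# Indexer's potency judgement for each term (True = potent).
potency = {
    "Stage": True, "Matching": True, "Compressor": True,
    "Flow": False,                               # very common term
    "Test": False, "Data": False, "Analysis": False,  # vague terms
}

weights = {term: term_weight(TOP_PAIR, p) for term, p in potency.items()}
print(weights["Compressor"], weights["Flow"])  # 10 9
```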
One small point to be noticed is that an individual term was always given the
weighting of the more heavily weighted concept if it appeared in more than one con-
cept. In the example, concept m is Idealised compressor, and the concept is given
a weight of 8. However, the term Compressor has previously occurred in concept c,
Axial flow compressor, for which it received a weight of 10, and this it retains.
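This rule amounts to keeping, for each term, the highest weight it received in any concept. A minimal sketch, assuming the per-concept weights are held as dictionaries (a hypothetical layout) and including only the weights actually stated in the example:

```python
def merge_term_weights(per_concept):
    """Combine per-concept term weights, keeping for each term the
    highest weight it received in any concept."""
    merged = {}
    for concept_terms in per_concept.values():
        for term, weight in concept_terms.items():
            merged[term] = max(weight, merged.get(term, weight))
    return merged


per_concept = {
    "c": {"Compressor": 10},  # Axial flow compressor
    "m": {"Compressor": 8},   # Idealised compressor
}

print(merge_term_weights(per_concept))  # {'Compressor': 10}
```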
The fourth step in the indexing was to write out the terms individually and attach
the weights to them as explained above.