CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Indexing Procedures
chapter
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
the document indexed would be of some relevance to it. It might be argued that all
indexing which involves a selection of terms to describe a document (and that is all
indexing - even the use of full text by a computer, with deletion of articles, prepositions,
etc.) implies weighting in that the use of a subject term indicates a reasonable
probability that the document is of some relevance to questions on that subject whereas
the rejection of a term indicates that the probability is low or non-existent. Weighted
indexing extends the range of values from two (worth using, not worth using) to whatever
number of different weights are recognized. For example, suppose an index description
contains the terms a b c d e, and weighting is assigned to each term on the scale:
3. Most important terms
2. Less important terms
1. Least important terms
If a, b and c are now each weighted 3, while d and e are each weighted 1, the implication
is that the probability of the document being relevant to a question a b c is that
much greater than the probability of its being relevant to a question on d e.
It has already been noted that the weights given to terms could be used as the
basis for measuring exhaustivity, i.e., when a figure for a high level of exhaustive
indexing was required, weights could be ignored and all terms regarded; when a
figure for less exhaustive indexing was required, those terms which had low weights
could be ignored and treated as though they were not indexed. It should be noted,
however, that the use of weighting as a measure of exhaustivity is purely an evaluation
technique and plays no part in normal indexing, for in the latter, weighting only
comes in as a device when it is applied also to the question. Then a question term
with a given weight will accept as a match only those index terms which have the same
(or a higher) weight. This procedure inevitably alters the boundaries of the classes
defined in searching and proves weighting to be an independent index language device
and not just a reflection of exhaustivity.
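The two uses of weights described above can be sketched in Python. This is a minimal illustration rather than the project's own procedure: the function names `index_at_level` and `accepts` are hypothetical, and the sample description reuses the terms a to e with the weights 3, 3, 3, 1, 1 from the earlier example.

```python
# Hypothetical sketch: an index description as a mapping from term to
# weight on the 1-3 scale (3 = most important, 1 = least important).
index = {"a": 3, "b": 3, "c": 3, "d": 1, "e": 1}


def index_at_level(index, min_weight):
    """Evaluation use: treat terms weighted below min_weight as not indexed."""
    return {term for term, weight in index.items() if weight >= min_weight}


def accepts(question_weight, index_weight):
    """Search use: a question term accepts an index term of the same or higher weight."""
    return index_weight >= question_weight


full = index_at_level(index, 1)      # all weights regarded
reduced = index_at_level(index, 2)   # terms with low weights ignored
print(sorted(full), sorted(reduced))
```

The first function changes only how much of the indexing is counted; the second depends on the weight attached to each question term, which is why the two can select different classes of documents.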
The rejection of an index term as irrelevant is now performed at the search stage,
whereas exhaustivity of indexing is decided, of course, at the indexing stage. For
example, suppose a question containing terms a b c d e, and a relevant document which
has been indexed with weights as a3 b3 d1 e2 g2 etc. If we were simply measuring the
effect of exhaustivity, we would say there was a match of four terms (a b d e) when
indexing was fully exhaustive (all weights accepted). If terms with the lowest weight
(1) are now ignored, the match is reduced to three terms (a, b, e); if only the highest
weight of terms is accepted (i.e. the lowest level of exhaustive indexing), then the
match is reduced to two terms (a b). Here we have been using weighting purely as
a measure of exhaustivity of indexing, but to consider its effect as a precision device,
assume that each term in the question is weighted, and that the search specification
is now a[OCRerr] b[OCRerr] c2 d1 e3. The term c is rejected because it does not appear in the indexing;
e is rejected because the search requirement is for a term with a minimum
weighting of 3, whereas in the document e has been indexed with a weight of 2. This
now means the class accepted has altered to a b d, whereas with variations of exhaustivity
it was respectively a b d e, a b e or a b.
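The worked example can be checked with a short Python sketch. The document weights used here (a3 b3 d1 e2 g2) are the ones implied by the match counts quoted in the text. The question weights c2 and e3 are taken from the text, while those for a, b and d are assumptions chosen to be consistent with the stated result (any weight up to 3 for a and b gives the same outcome). The helper names are hypothetical.

```python
# Document index weights, as implied by the match counts in the text.
document = {"a": 3, "b": 3, "d": 1, "e": 2, "g": 2}

# Question weights: c2 and e3 come from the text; a, b and d are assumed values.
question = {"a": 3, "b": 3, "c": 2, "d": 1, "e": 3}


def match_by_exhaustivity(question, document, min_weight):
    """Match on terms alone, ignoring document terms weighted below min_weight."""
    return sorted(t for t in question if document.get(t, 0) >= min_weight)


def match_by_question_weight(question, document):
    """Accept a term only if its index weight is at least the question weight."""
    return sorted(t for t, w in question.items() if document.get(t, 0) >= w)


for level in (1, 2, 3):
    print(level, match_by_exhaustivity(question, document, level))
print(match_by_question_weight(question, document))
```

Under these assumptions the three exhaustivity levels give a b d e, a b e and a b, while the weighted question gives a b d, a class that none of the exhaustivity levels produces.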
As to the problem of how to weight, two general approaches seemed possible and
both were made. Firstly, for three hundred documents, weights were assigned on a
quasi-statistical basis, dependent on the sections of a document in which a term appeared.
If a term appeared in the title, the summary or abstract, the introduction,
the body of the text, the conclusion, and the list of cited references, it received a