CRANV1P1 ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text Indexing Procedures. Cyril Cleverdon, Jack Mills, Michael Keen. Cranfield. An investigation supported by a grant to Aslib by the National Science Foundation. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
the document indexed would be of some relevance to it. It might be argued that all indexing which involves a selection of terms to describe a document (and that is all indexing, even the use of full text by a computer, with deletion of articles, prepositions, etc.) implies weighting, in that the use of a subject term indicates a reasonable probability that the document is of some relevance to questions on that subject, whereas the rejection of a term indicates that the probability is low or non-existent. Weighted indexing extends the range of values from two (worth using, not worth using) to whatever number of different weights are recognized. For example, suppose an index description contains the terms a b c d e, and a weight is assigned to each term on the scale
3. Most important terms
2. Less important terms
1. Least important terms
If a, b and c are now each weighted 3, while d and e are each weighted 1, the implication is that the probability of the document being relevant to a question on a b c is that much greater than the probability of its being relevant to a question on d e.
It has already been noted that the weights given to terms could be used as the basis for measuring exhaustivity, i.e., when a figure for a high level of exhaustive indexing was required, weights could be ignored and all terms regarded; when a figure for less exhaustive indexing was required, those terms which had low weights could be ignored and treated as though they were not indexed.
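The use of weights as a measure of exhaustivity amounts to a simple threshold on the index description, and may be sketched as follows. This Python fragment is an illustration only: the function name `terms_at_level` and the representation of an index description as a dictionary of term weights are assumptions made for the sketch, not part of the project's procedures.

```python
def terms_at_level(index, min_weight):
    """Return the terms regarded as indexed when weights below min_weight are ignored."""
    return {term for term, weight in index.items() if weight >= min_weight}


# Index description from the example in the text:
# a, b and c weighted 3; d and e weighted 1.
index = {"a": 3, "b": 3, "c": 3, "d": 1, "e": 1}

# High exhaustivity: weights are ignored and all terms are regarded.
full = terms_at_level(index, 1)

# Lower exhaustivity: terms with the lowest weight are treated as not indexed.
reduced = terms_at_level(index, 2)

print(sorted(full))
print(sorted(reduced))
```

Raising the threshold never adds a term, so each lower level of exhaustivity yields a subset of the level above it.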
It should be noted, however, that the use of weighting as a measure of exhaustivity is purely an evaluation technique and plays no part in normal indexing, for in the latter, weighting only comes in as a device when it is applied also to the question. Then a question term with a given weight will accept as a match only those index terms which have the same (or a higher) weight. This procedure inevitably alters the boundaries of the classes defined in searching and proves weighting to be an independent index language device and not just a reflection of exhaustivity. The rejection of an index term as irrelevant is now performed at the search stage, whereas exhaustivity of indexing is decided, of course, at the indexing stage.
For example, suppose a question containing terms a b c d e, and a relevant document which has been indexed with weights as a3 b3 d1 e2 g2 etc. If we were simply measuring the effect of exhaustivity, we would say there was a match of four terms (a b d e) when indexing was fully exhaustive (all weights accepted). If terms with the lowest weight (1) are now ignored, the match is reduced to three terms (a b e); if only the highest weight of terms is accepted (i.e. the lowest level of exhaustive indexing), then the match is reduced to two terms (a b). Here we have been using weighting purely as a measure of exhaustivity of indexing, but to consider its effect as a precision device, assume that each term in the question is weighted, and that the search specification is now a[?] b[?] c2 d1 e3. The term c is rejected because it does not appear in the indexing; e is rejected because the search requirement is for a term with a minimum weighting of 3, whereas in the document e has been indexed with a weight of 2. This now means the class accepted has altered to a b d, whereas with variations of exhaustivity it was respectively a b d e, a b e or a b.
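The difference between the two uses of weighting in this example can be checked with a short Python sketch. It is illustrative only. The function names are hypothetical, the document weights are those implied by the three exhaustivity levels described above, and the question weights for a and b are not legible in the source, so the values used for them here are assumptions chosen so that both terms are accepted, as the text states.

```python
def exhaustivity_match(question_terms, index, min_weight):
    """Match unweighted question terms against index terms of at least min_weight."""
    return {t for t in question_terms if index.get(t, 0) >= min_weight}


def weighted_match(question, index):
    """Accept a question term only if the index weight is the same or higher."""
    return {t for t, q_weight in question.items() if index.get(t, 0) >= q_weight}


# Document indexing, as implied by the exhaustivity levels in the text.
document = {"a": 3, "b": 3, "d": 1, "e": 2, "g": 2}

# Weighting used purely as a measure of exhaustivity.
question_terms = ["a", "b", "c", "d", "e"]
for level in (1, 2, 3):
    print(level, sorted(exhaustivity_match(question_terms, document, level)))

# Weighting used as a precision device. The weights for c, d and e follow the
# text; the weights for a and b are ASSUMED (illegible in the source).
question = {"a": 3, "b": 3, "c": 2, "d": 1, "e": 3}
print(sorted(weighted_match(question, document)))
```

The weighted search accepts a b d, a class that none of the three exhaustivity thresholds produces, which is the sense in which weighting acts as an independent device.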
As to the problem of how to weight, two general approaches seemed possible and both were tried. Firstly, for three hundred documents, weights were assigned on a quasi-statistical basis, dependent on the sections of a document in which a term appeared. If a term appeared in the title, the summary or abstract, the introduction, the body of the text, the conclusion, and the list of cited references, it received a