CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Indexing Procedures
chapter
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
CHAPTER 4
Indexing Procedures
The function of indexing in libraries and information retrieval systems is to
indicate the whereabouts or absence of items relevant to a request. It is essentially
a time-saving mechanism. Theoretically, we can always find the relevant items by
an exhaustive search through the whole collection (assuming that we can recognize
what is relevant when we see it). Since this is economically impossible, the size
of the store to be examined is reduced by classification, using this term in its very
broadest sense, i.e., as the recognition of useful similarities between documents
and the establishment of useful document groups based on these similarities. So
documents, or document surrogates, are assigned to a limited number of classes
according to certain criteria, in particular, their subject content (although in machine
indexing, utilizing complete text scanning, this 'limited number' can become very
large - as large as the number of significant words used in the text). Search for
relevant items is made via these classes (which are classes of documents); only
those with a probability of containing relevant items are examined, and the rest
(hopefully the vast majority) are ignored. Clearly, we need to know as much as
possible of the nature of the classes to be recognised, and the degree to which they
allow reliable predictions to be made as to the probability of relevant items being
included in them.
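The mechanism described above can be illustrated with a minimal sketch. It assumes a toy collection in which each document has already been assigned to a few classes; the document numbers, class labels, and function names are all invented for illustration and are not taken from the test collection. Only documents in classes named by the request are examined, and the rest are ignored.

```python
from collections import defaultdict

# Hypothetical toy collection: document number -> assigned class labels.
# Numbers and labels are invented for illustration only.
documents = {
    1: {"boundary layer", "heat transfer"},
    2: {"magnetic field", "ionized gas"},
    3: {"heat transfer", "ionized gas"},
    4: {"wing", "flutter"},
}

def build_classes(docs):
    """Invert the assignments: class label -> set of document numbers."""
    classes = defaultdict(set)
    for doc_id, labels in docs.items():
        for label in labels:
            classes[label].add(doc_id)
    return dict(classes)

def candidates(classes, request_labels):
    """Return the documents that fall in at least one requested class."""
    found = set()
    for label in request_labels:
        found |= classes.get(label, set())
    return found

classes = build_classes(documents)
to_examine = candidates(classes, {"ionized gas"})  # the reduced store
ignored = set(documents) - to_examine              # never looked at
```

In this sketch the search touches only the documents filed under the requested class. Whether those documents are in fact relevant is a separate question, which is why the predictive reliability of the classes matters.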
Most library indexes, other than those to imaginative works (novels, music
scores, etc.) are aimed ultimately at the retrieval of subject information. Even
the great Author-Title catalogues, on which so much care has been lavished, serve
for the most part the function of a diagnostic classification, i.e., an author's works
are sought in the first place because they are about a certain subject and his name
is a clue to locating it. The popularity of the author-title catalogue rests partly
on its precision in retrieval. Classes determined by authorship or title are mutually
exclusive; there is almost no overlapping, no ambiguity about them and requests can
be met with 100% recall and precision. But they are useless if the author or title
is not known and it is this situation with which IR is mainly concerned. So the
classes investigated by this project are those designed for searching by subject
prescription only.
There is one exception to this. Bibliographic coupling (including Citation
indexing) establishes classes for much the same reason as author-title catalogues,
as an oblique way of getting at subject content. Papers which have cited item x are
assumed to have some connection with the subject of x. This particular device is
dealt with separately in Chapter 7.
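As a rough illustration of how such classes arise, the sketch below builds them from citation data. It assumes a hypothetical mapping from each paper to the items it cites; the paper labels and function names are invented. Two assumptions go beyond the text: a citation index is modelled as one class per cited item, and coupling is measured in the usual way, by counting the references two papers share.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical citation data: paper -> items it cites. Labels are invented.
cites = {
    "A": {"x", "y"},
    "B": {"x"},
    "C": {"y", "z"},
    "D": {"w"},
}

def citation_index(cites):
    """Cited item -> set of papers citing it (one class per cited item)."""
    index = defaultdict(set)
    for paper, refs in cites.items():
        for ref in refs:
            index[ref].add(paper)
    return dict(index)

def coupling_strength(cites):
    """Pair of papers -> number of references they share, if any."""
    strength = {}
    for a, b in combinations(sorted(cites), 2):
        shared = len(cites[a] & cites[b])
        if shared:
            strength[(a, b)] = shared
    return strength

index = citation_index(cites)
coupled = coupling_strength(cites)
```

Here `index["x"]` is the class of papers that have cited item x, and `coupled` lists the pairs of papers linked through at least one common reference. Neither structure looks at subject terms at all, which is what makes the approach oblique.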
The terms which are used to express a request or a search prescription rarely
coincide exactly with the terms used to describe a particular relevant document;
this is likely to happen only at a relatively broad level, when a request may be
answered by a treatise or monograph on the subject. For example, in the test Q. 93
read 'What investigations have been made on the flow field about a body moving
through a rarefied, partially ionized gas in the presence of a magnetic field?' Two
documents relevant to this question were:
1296 'Waves through gases at pressures small compared with magnetic pressure'
1446 'Waves of a satellite traversing the atmosphere'
In both cases the match is very imperfect. It is made only by recognising that the