Information Retrieval Experiment
The Cranfield tests
Karen Sparck Jones
Butterworth & Company
scheme, neatly described as based on `literary warrant', the main problem
was in splitting terms into elementary units. Taken together, these details
illustrate both how familiar problems of indexing recur in any environment,
and how they can perhaps be handled for medium-sized systems.
The Report also gives statistical details about the indexing which are both
of interest in themselves and relevant to the indexing performance observed.
For example, for the indexing subprogramme covering the last 6000 documents
there were 2350 different notational elements for UDC, 2684 main
alphabetical headings, 1686 facet notational elements, and 3174 Uniterm
terms. For individual documents, variations in the numbers of terms assigned
by the different indexers were noted, and the average numbers fell as the
indexing time allowed was reduced. For example, average postings for journal
articles were 7.6 for UDC at 16 minutes as opposed to 2.3 at 2 minutes, 4.7
and 2.2 respectively for alphabetical subject headings, 1.7 and 1.2 for the
facetted classification, and 11.0 and 6.0 for Uniterms (Table 6).
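The arithmetic behind such figures is simply a mean of the postings assigned per document, taken within each indexing language and each allotted indexing time. A minimal Python sketch of that tabulation is given below; the per-document counts and the grouping are invented for illustration and are not taken from the Report.

    # Illustrative only: hypothetical per-document posting counts, not figures
    # or procedures taken from the Cranfield Report.
    from collections import defaultdict
    from statistics import mean

    # (indexing language, allotted indexing time in minutes) -> postings per document
    postings = defaultdict(list)

    sample = [  # hypothetical journal articles
        ("UDC", 16, 8), ("UDC", 16, 7), ("UDC", 2, 2), ("UDC", 2, 3),
        ("Uniterm", 16, 11), ("Uniterm", 16, 12), ("Uniterm", 2, 6), ("Uniterm", 2, 5),
    ]
    for language, minutes, count in sample:
        postings[(language, minutes)].append(count)

    # Average postings per document for each language/time combination,
    # the kind of figure quoted above from Table 6.
    for (language, minutes), counts in sorted(postings.items()):
        print(f"{language:8s} at {minutes:2d} min: {mean(counts):.1f} postings per document")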
To check the main indexing, alternative, independent (`supplementary')
index descriptions were supplied by people outside the project. These
suggested that the project indexing was well enough organized, but also
tended to show a lack of agreement in the indexing done by different people
for the same document.
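The Report does not quantify this disagreement; as a purely illustrative sketch, the following Python fragment computes a simple term-overlap (Jaccard) score between two independent index descriptions of the same document. Both the term sets and the measure itself are assumptions made for the example, not anything specified in the Report.

    # The Report gives no numerical agreement measure; the Jaccard overlap used
    # here is an assumption made purely for illustration, as are the terms.
    def jaccard_agreement(terms_a: set[str], terms_b: set[str]) -> float:
        """Proportion of index terms shared by two descriptions of one document."""
        if not terms_a and not terms_b:
            return 1.0
        return len(terms_a & terms_b) / len(terms_a | terms_b)

    project_terms = {"boundary layer", "turbulence", "heat transfer", "flat plate"}
    supplementary_terms = {"boundary layer", "skin friction", "heat transfer"}

    print(f"agreement = {jaccard_agreement(project_terms, supplementary_terms):.2f}")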
In the course of the indexing a good many low-level administrative
decisions had to be taken, for example about index formats; and while the
project was, as is noted in the Report, primarily concerned with intellectual
questions, the account of the various low-level devices and procedures is very
useful in illustrating the large amount of nitty-gritty involved in any
substantial test, and the need for care in these aspects of testing. These
processes incidentally provided useful information about the time taken in
clerical operations, and also emphasized the scale of this early project: 100 000
cards were punched for the Uniterm indexing, for example.
In introducing the account of the searching, which constituted the test proper
of the four indexing systems, in Volume 2 of the Report3, Cleverdon comments
on the need for
`a method which would enable an assessment to be made of the effects of
the variables which had been built into the indexing.' (p. 7)
It was decided that, though each group of 100 documents had distinctive
characteristics and so should be studied individually, testing over a
6000-document subprogramme set, and more specifically over that of the final
subprogramme, which reflected the indexers' established indexing experience,
would be sufficient. As Cleverdon notes, there was very little guidance
available on the conduct of tests, the ASTIA Uniterm test of 1953 being both
inadequately reported and evidently unsound methodologically in lacking
controls, especially with respect to relevance decisions. The organization of
the search testing was thus strongly motivated by a desire to avoid getting
`bogged down in the quagmire of arguments concerning relevancy' (p. 8),
and also by the need for statistically valid results. The first requirement was
met by using questions based on source documents and searching for these,
the second by complicated (and, it must be said, very opaquely described)
sampling procedures. A set of 1200 questions was therefore obtained by