Information Retrieval Experiment
Retrieval system tests 1958-1978
Karen Sparck Jones
Butterworth & Company
Index language test overview
The most fruitful way of looking at the results obtained in the tests of the
period is to see how the specific findings, and the interpretations given them
by those concerned, show common trends when the tests are taken together,
with implications for our understanding of information retrieval system
behaviour in general. What emerges from the tests of 1958-1968 is unlikely
to be novel to those familiar with the work, but it is worth emphasizing
that these very broad conclusions are supported by the results obtained over
a range of projects, and are not based simply on single tests.
Thus the tests comparing different languages, like Spencer's, Shaw and
Rothman's, the various Cranfield tests, and the CWRU project, show that,
other things being equal, different languages achieve comparable levels of
performance though they may retrieve different sets of documents and
especially relevant documents. A natural corollary is that fancy indexing has
no especial merit, as Cranfield 2, Montague, and Melton show, or, to put it
the other way round, that simple indexing has merit, as Blagden, Cranfield
2 and Shaw and Rothman indicate. A related conclusion is that supported by
Cranfield 1 and CWRU, that the indexing subsystem is not the
overwhelmingly important factor in determining system performance. In Lancaster's
and Saracevic's view, based on the Medlars and CWRU studies respectively,
the treatment of the question, and specifically its proper development,
emerges as much more important. Saracevic's general conclusion is that
human factors are the most important ones. The related system factors most
affecting language performance seem to be the exhaustivity or depth of
indexing, noted for Cranfield 1 and 2 and CWRU, and also, according to
Cranfield 2, the specificity of the indexing language.
The tests taken together indeed support the statement that there is an
inverse relationship between recall and precision, explicitly studied in the
Cranfield 2 experiments, which is influenced both by indexing policies for
documents or requests, determining exhaustivity, and by indexing resources
in languages, determining specificity. The more detailed studies of
the link/role test subgroup provide particular evidence here, showing that
both links and roles are precision devices, with roles especially restrictive;
and other studies of indexing with relations, like the Syntol test and Cranfield
2, show a similar restrictiveness. Thus the statement that, other things being
equal, languages perform the same has to be read as meaning that languages
perform the same if document-dependent factors are held constant and the
languages are not explicitly oriented in opposite directions with respect to
recall and precision: if good levels of both recall and precision are required,
then when document variables are held constant, languages representing, for
example, rather different classificatory philosophies do not differ materially
in behaviour.
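
For reference, the trade-off discussed in the preceding paragraph is stated in terms of two measures whose standard definitions are assumed rather than given in this excerpt; for a single request, with relevance judged over the whole collection,

\[
\text{recall} = \frac{\text{relevant documents retrieved}}{\text{relevant documents in the collection}},
\qquad
\text{precision} = \frac{\text{relevant documents retrieved}}{\text{documents retrieved}} .
\]

On these definitions the inverse relationship noted above can be read as follows: devices that broaden retrieval, such as more exhaustive indexing, tend to raise recall at the cost of precision, while restrictive devices such as links and roles favour precision at the cost of recall.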
Of course, these observations can only be taken as very broad
generalizations, given the great variations in the details of the tests of the period, and
also their many methodological deficiencies. The latter might indeed be
regarded as sufficiently gross in many cases to undermine any conclusions to
be drawn from the tests, but an alternative view is that the tests, however
defective, were sufficiently varied that any common result can be regarded as