IRE
Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
Karen Sparck Jones
Other tests
Outside the two groups of tests discussed were a few others concerned with
what we earlier referred to as the retrieval system core: Tague's study of the
role of question terms in matching relevant documents is an example42.
Further, supporting the evaluation experiments involving the retrieval
system core were some non-evaluative studies, usually of an investigative
rather than experimental character, concerned with such topics as the
character of indexing vocabularies or properties of document sets. Their
importance in automatic indexing has already been mentioned: in connection
with manual indexing such studies as those of Houston and Wall43 and
Heald44 can be mentioned.
Round these core tests we can then group studies in other more peripheral
areas. Among these are two large subgroups, of user studies and bibliometric
studies. User studies naturally began to appear accompanying the develop-
ment of novel, large, or automated systems in the 1960s, and a great many
have been carried out. Early studies were mostly based on questionnaires.
Unfortunately, as such reviewers as Menzel45 and Herner and Herner46
noted, many of these studies suffered from methodological failings like poor
sampling or the use of ill-designed questionnaires. Bibliometric studies also
became popular in the 1960s, boosted by the Science Citation Index, but
these too often exhibited methodological failings, especially in the assump-
tions made about the propriety of the clustering techniques used.
Finally, it should be noted that alongside the work discussed so far, which
was explicitly or implicitly concerned with effectiveness, went studies of
system efficiency, i.e. cost. Some of the evaluation tests already mentioned,
like van Oot et al.'s, included cost analyses, but other studies only of costs
were carried out in the period (see King47). The development of techniques
for conducting cost analyses is of course relevant to that of testing in general.
12.4 Conclusion on 1958-1968
Looking at the decade 1958-1968 as a whole, it is possible to detect some
consolidation of actual findings, and some development of testing methods
and improvement in experimental standards. The main findings were those
mentioned earlier as conclusions to be drawn from the indexing language
tests, with the tentative rider from the automatic indexing work that the
simple indexing found competitive in the manual tests can be provided
automatically.
The main findings of the decade were strikingly exemplified by the
Cranfield 22,3 and CWRU14,15 results, and are well expressed by Saracevic's
comments on the latter. Thus in his conclusion to the CWRU Report14
Saracevic notes, as overall observations about information retrieval systems,
the importance of human factors in maintaining adequate performance (a
comment endorsed by Lancaster in calling for quality control for Medlars12);
the fact that system performance can nevertheless only reach a middling
level; and that an inverse relationship holds for getting relevant documents
and avoiding non-relevant ones. The inverse relation of recall and precision
was emphasized by Cleverdon, and, as Lancaster and Mills noted48, as there
is an inverse relation, one should design a system for a particular point along