Schumacher et al. did not, as noted, distinguish closely related factors like
indexing source and quality and index description source and quality, and
Svenonius' experiment suffered from similar defects. In evaluation, Cleverdon
indulged in some not very well justified performance extrapolation. In a good
many tests it is difficult to attribute any significance to the absolute values of
relative recall obtained, and indeed, as noted in connection with searching,
perhaps to comparative values.
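For reference, and as a conventional formulation rather than one taken from the studies themselves, relative recall is usually computed against the relevant documents actually identified for a request (for example by pooling the output of the searches compared), not against all relevant documents in the collection:

\[
\text{relative recall} = \frac{\lvert\,\text{relevant documents retrieved}\,\rvert}{\lvert\,\text{relevant documents identified for the request}\,\rvert}
\]

Since the denominator depends on how the relevant set was assembled, the absolute values obtained are difficult to compare across tests.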
The most welcome feature of the decade has been the increase in test
collection size, though reports of operational tests in particular tend not to
indicate precise size. There are nevertheless far too many tests using quite
small document sets and, much more importantly, small numbers of requests.
For example, Miller used only 25 requests, Cleverdon has often used fewer than
20, Katzer used 18, and Cameron 12. It is very doubtful whether the smaller
sets produce results which can be regarded as more than suggestive for other
contexts. The largest request sets used were UKCIS' 193 and Leggate's 160.
It is particularly regrettable that the many Smart Project tests have typically
used small collections, in recent years three consisting of some 25 requests
and 450 documents each. Several studies, like UKCIS', have unfortunately
also used different sizes of request set in individual, closely related
experiments.
To complete the detailed discussion, we may note that as in the previous
decade, the main body of evaluation tests on retrieval system core factors
discussed so far was surrounded by other studies of different types. These
have also followed the changing trends of the decade. Non-evaluation studies
in the core area include a number, especially in the earlier part of the decade,
in the area of automatic indexing, like those of Artandi and Wolf110, 111, Carroll
and Roeloffs112, Williams113, Harter114, 115, and Field116, all concerned
with terms, and of Litofsky117 and Schiminovich118 on document clustering.
They all claimed some degree of plausibility or merit in the devices studied.
Outside the five groups discussed, or at least on a higher and more
comprehensive level, have been service oriented tests and studies like those
of Rowlands on SDI88, Hansen on Chemical Abstracts costs90, and Simkins91
and Pollitt92 comparing services for particular user communities. Investiga-
tions like those of Lancaster et al.89 and Leggate et al.58, briefly mentioned
earlier in connection with searching, really fall into this category. These
studies naturally reflect consumer interest in the increasing range of
competing services but incidentally, as is shown by Pollitt's study, provide
valuable raw data. In the more peripheral areas there have again been many
studies of users, and a whole range of bibliometric investigations, for example
of citation patterns. There has also been an increasing interest in data base
coverage and overlap.
12.7 The outcome of 20 years' testing
What conclusions can be drawn about the state of information retrieval
research from such a survey? More specifically, what progress has been made
over the last 20 years in obtaining substantively valuable results from
methodologically sound experiments?