Information Retrieval Experiment
Chapter: Retrieval system tests 1958-1978
Karen Sparck Jones
Butterworth & Company
like Cranfield 1 or Cranfield 2 could be described as neutral with respect to
the specific type of language, while assuming that some sophisticated language
was required. (In Cranfield 2 simple natural language was taken as a base for
improvement in sophistication by the application of various devices.)
In form, these tests had much in common, perhaps not altogether
surprisingly in view of the influence of Cranfield both organizationally, as in
the link with CWRU, and intellectually, as in the application of Cranfield
methods in Herner et al.'s Bureau of Ships test. The data for the tests
generally consisted of fewer than 100 requests, and some hundreds or a few
thousand documents. Lancaster's Medlars test was quite exceptional in scale
with over 300 requests and over half a million documents. Otherwise only
Cranfield 1, Sinnett, and Cohen et al. used more than 5000 documents
(though several experimental reports do not indicate how many documents
were used). Queries varied, in some cases being genuine user queries, in
others pseudo queries, and in yet others, following Cranfield 1, ones
specifically based on a source document. On the whole there does not seem
to have been much negotiation about the query with the user. Relevance
assessments were usually made by requesters, and were normally of search
output, perhaps pooled from several alternative searches; one or perhaps two
grades of relevance were typical. Evaluation of any particular match output
was commonly by precision, and by recall (or sensitivity as CWRU called it),
though the CWRU tests substituted specificity for precision, and Sinnett
noise. Blagden used noise alone, while Lancaster added novelty to recall and
precision. In some cases simple numbers of relevant and non-relevant
documents retrieved were used. CWRU combined sensitivity and specificity
in a single measure of effectiveness. Recall was normally calculated relative
to some subset of the possible relevant documents, say those identified by
assessing some or all of the pooled output of alternative searches, or by
assessing an independently obtained collection subset. With very few
exceptions, like most of the Cranfield 2 performance characterizations,
performance was calculated for simple sets of retrieved documents, giving
one figure for each measure and so, for example, a pair of precision and recall
values for each particular test option.
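To make these set-based measures concrete, the following minimal sketch in Python (my own illustration, not taken from any of the projects discussed; the function and variable names are hypothetical) computes single-figure precision, relative recall, noise and specificity for one search output, treating a pooled set of relevance judgements as the known relevant documents:

    # Hypothetical inputs: 'retrieved' is the set of document identifiers returned
    # by one search option, 'relevant' is the pooled set judged relevant for the
    # request, and 'collection_size' is the number of documents in the collection.
    def set_measures(retrieved, relevant, collection_size):
        hits = len(retrieved & relevant)          # relevant documents retrieved
        false_drops = len(retrieved - relevant)   # non-relevant documents retrieved
        non_relevant = collection_size - len(relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        # "Relative" recall: the denominator is only the known (pooled) relevant
        # set, not the unknowable total of relevant documents in the collection.
        recall = hits / len(relevant) if relevant else 0.0
        # Noise (Sinnett, Blagden): the proportion of the output that is non-relevant.
        noise = false_drops / len(retrieved) if retrieved else 0.0
        # Specificity (CWRU): the proportion of non-relevant documents not retrieved.
        specificity = (non_relevant - false_drops) / non_relevant if non_relevant else 0.0
        return {"precision": precision, "recall": recall,
                "noise": noise, "specificity": specificity}

    # One search for one request yields one figure per measure, for example a
    # single precision-recall pair for the test option being evaluated.
    print(set_measures({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d7"}, 1400))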
With respect to the test results, again looking at the tests substantively
rather than methodologically, the most striking feature of the actual findings
for the comparative tests was the very wide variation in performance. This
is true both of individual studies, and, insofar as such cross comparisons are
legitimate, of groups of similar tests. Variations in the findings obtained by
different projects have to be treated with reserve, since they may be attributed
as much to specific measurement procedures or data statistics as to the
system factors being studied, especially languages and their application, or
to their environment. In particular, variations in relative recall for different
projects, especially those using pooled output, are only of real significance
within the context of individual projects.
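As a purely hypothetical illustration of this point: suppose two projects evaluate identical search output retrieving eight relevant documents for a request, but one project's pooled assessments identify ten relevant documents in all while the other's deeper pool identifies sixteen.

    # Hypothetical figures only: the same retrieval behaviour scored against
    # relevance pools of different depth gives different relative recall.
    retrieved_relevant = 8
    known_relevant_project_a = 10   # shallower pool
    known_relevant_project_b = 16   # deeper pool
    print(retrieved_relevant / known_relevant_project_a)   # project A reports 0.8
    print(retrieved_relevant / known_relevant_project_b)   # project B reports 0.5

The difference in reported relative recall here reflects the pools, not the systems, which is why such figures are only comparable within a single project.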
The fact that even the more plausibly grouped tests may differ in detail, for
example by using average of numbers rather than average of ratios, or by
using an external sample rather than pooled output for relative recall, and
that in addition I have worked out some figures myself, might suggest that there
is no point in giving specific findings. But this is worth doing, to give the real
flavour of the tests. It should however be noted that as many tests consisted