Information Retrieval Experiment
Retrieval system tests 1958-1978
Karen Sparck Jones
Butterworth & Company
Herner et al., for calculating recall10. Document samples were better, but
were sometimes small: for example Cranfield 2 used 200 documents in many
experiments, Newell and Goffman 210.
In the measurement of performance, the most pervasive methodological
inadequacy is the arbitrary treatment of recall, or at any rate the rather
particular interpretation given to it, without any awareness of possible bias.
Newell and Goffman, and Melton, for instance, define recall as the retrieval
of cited documents; Cranfield 1 (Warburton and Cleverdon) recall
relative to a source document; the Syntol group recall for automatic abstracts
relative to manual ones; and Lancaster recall relative to an independently obtained
set of relevant documents. In other cases recall is measured relative to the
pooled output of alternative searches, but then performance for an individual
language being tested depends on the character of the different pool
contributions, which may not be strictly comparable.
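To make the pooling point concrete, the following is a minimal sketch in Python, with invented document identifiers and retrieved sets not taken from any of the tests discussed: relative recall for each hypothetical indexing language is computed against the union of the two languages' outputs, so that each figure depends on what the other language happens to contribute to the pool.

    # Minimal sketch of relative recall measured against a pooled base.
    # Document identifiers and retrieved sets are invented for illustration.

    def relative_recall(retrieved_relevant, pooled_relevant):
        """Relevant documents one language retrieves, as a fraction of all
        relevant documents found by any of the languages being compared."""
        return len(retrieved_relevant & pooled_relevant) / len(pooled_relevant)

    # Relevant documents retrieved by each hypothetical indexing language.
    language_a = {"d1", "d2", "d3"}
    language_b = {"d1", "d2", "d4", "d5"}

    pool = language_a | language_b            # base = union of all outputs
    print(relative_recall(language_a, pool))  # 3/5 = 0.60
    print(relative_recall(language_b, pool))  # 4/5 = 0.80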
Individual tests moreover reveal a variety of other dubious procedures: for
instance Cohen et al. compare various link/role combinations using different
numbers of queries, and even in the CWRU tests9, in many ways a model,
results are lumped together oddly, for example those for different indexing
languages are combined to provide performance figures for different indexing
sources.
Overall, the tests taken together can only support the broadest and most
tentative conclusions: the variation in data was vast, and the performance
measures used were not only directly incomparable, for instance where one
project uses precision, another opts for specificity, but incomparable in more
subtle ways, for example in averaging technique. Moreover, as relative recall
depends on an 'arbitrary' base, it can give very different results according to
base: specifically, values will be absolutely higher for languages with a similar
performance than for those with a different but complementary1
performance, since heavily overlapping outputs yield a smaller pooled base
than complementary ones.
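A small worked example, again with invented figures, may make this base dependence concrete: two languages that each retrieve three relevant documents score a higher relative recall when their outputs largely overlap, giving a small pool, than when their outputs are complementary, giving a large one.

    # Two hypothetical scenarios; each language retrieves three relevant
    # documents, and only the degree of overlap between the outputs differs.

    def relative_recall(retrieved, pool):
        return len(retrieved & pool) / len(pool)

    # Similar performance: outputs overlap heavily, so the pool is small.
    a_similar, b_similar = {"d1", "d2", "d3"}, {"d1", "d2", "d4"}
    pool_similar = a_similar | b_similar        # 4 documents in the base
    # relative recall is 3/4 = 0.75 for each language

    # Complementary performance: outputs are disjoint, so the pool is large.
    a_comp, b_comp = {"d1", "d2", "d3"}, {"d4", "d5", "d6"}
    pool_comp = a_comp | b_comp                 # 6 documents in the base
    # relative recall falls to 3/6 = 0.50 for each language

    print(relative_recall(a_similar, pool_similar),
          relative_recall(b_similar, pool_similar))
    print(relative_recall(a_comp, pool_comp),
          relative_recall(b_comp, pool_comp))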
Automatic indexing tests
As noted earlier, the character of work in automatic indexing was rather
different from that done on manual indexing. The many theoretical and
computational problems involved meant that more work had to be put into
simply establishing the feasibility of procedures and prima facie plausibility
of results. There were therefore more studies of a non-evaluative kind, and
fewer evaluative ones.
This is not the place to review work on automatic indexing in detail.
Briefly, it was primarily concerned, on the one hand, with statistical methods
of identifying, by extraction from text, words representing individual
documents or sets of documents, and on the other with statistical methods of
recognizing relations between words to supply substitute or additional search
keys. The work was very much within the framework of post-coordinate
indexing, and was chiefly devoted to the examination and programming of
statistical methods. Related research was concentrated on simpler methods
of keyword selection, for example by text location, as in O'Connor's work221,
or on the application of statistical techniques in assigning items from a
manual indexing vocabulary, as in Gotlieb and Kumar's test39. Closely
related ideas studied were those of term weighting and output ranking.
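To give a flavour of the kind of statistical procedure involved, the sketch below is a minimal, hypothetical Python illustration, not the method of any particular project discussed here: keywords are extracted from text with a stoplist, weighted by their frequency in a document discounted by their spread across the collection, and then used to rank documents for a single-term query.

    # Minimal sketch of statistical keyword extraction, term weighting and
    # output ranking. Texts, stoplist and weighting scheme are invented for
    # illustration; projects of the period used a variety of frequency criteria.
    from collections import Counter
    import math

    documents = {
        "doc1": "retrieval tests measure recall and precision of indexing",
        "doc2": "automatic indexing extracts keywords from document text",
        "doc3": "statistical methods weight keywords for ranked retrieval",
    }
    stopwords = {"and", "of", "for", "from", "the"}

    def keywords(text):
        """Extract candidate index terms: simple word extraction with a stoplist."""
        return [w for w in text.lower().split() if w not in stopwords]

    # Collection frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for text in documents.values():
        doc_freq.update(set(keywords(text)))

    def weight(term, text, n_docs=len(documents)):
        """Weight a term by its frequency in the document, discounted by how
        widespread it is in the collection (an idf-like factor)."""
        tf = keywords(text).count(term)
        return tf * math.log(n_docs / doc_freq[term])

    # Rank documents for a one-term query by the term's weight in each.
    query_term = "indexing"
    ranking = sorted(documents,
                     key=lambda d: weight(query_term, documents[d]),
                     reverse=True)
    print(ranking)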