IRE
Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
conduct for systematic comparisons. However, taken together the results
show that large differences of exhaustivity do affect performance, with
greater exhaustivity typically gaining recall at the expense of precision.
Schumacher et al.'s findings (assuming constant indexing quality) show this
very clearly: with increasing exhaustivity they obtained a substantial gain
in recall, with a gradual, though not enormous, decline in precision. Thus
recall relative to the relevant documents retrieved by full text searching
progressed from 25 per cent for titles to 72 per cent for titles plus abstracts,
contents lists and author keys, while precision dropped from 65 to 56 per
cent. Keen, using controlled language document indexing at two levels of
exhaustivity, found that recall rose from 74.7 to 85.8 per cent, but at the
cost of an increase in the median number of non-relevant documents
retrieved from 18.9 to 24.4. Cleverdon found that for
varying natural language exhaustivity for both requests and documents,
performance ranged from 70.5 per cent recall (relative to an independent
sample) and 32.2 per cent precision to 80.6 per cent recall and 18.1 per cent
precision. However, as Sparck Jones suggests, small differences are not
important and exhaustivity in document indexing can be consciously
counterbalanced by the treatment of requests. This is indeed implicit in the
use of extended profiles for title searching in operational services. Cleverdon's
results also suggest the possibility of trade-offs, as do Aitchison et al.'s tests
of broad, medium and narrow query formulations.
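The trade-off described above can be made concrete with a minimal sketch. The toy collection, the term assignments and the function names below are invented purely for illustration and are not taken from any of the tests cited; the sketch assumes the usual set-based definitions of recall (relevant retrieved over all relevant) and precision (relevant retrieved over all retrieved), and a simple match on any shared term.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


def search(index, query_terms):
    """Retrieve every document whose index terms share at least one query term."""
    query_terms = set(query_terms)
    return {doc for doc, terms in index.items() if terms & query_terms}


# Hypothetical data: four documents indexed at two levels of exhaustivity.
low_exhaustivity = {
    "d1": {"retrieval"},
    "d2": {"indexing"},
    "d3": {"libraries"},
    "d4": {"chemistry"},
}
high_exhaustivity = {
    "d1": {"retrieval", "evaluation"},
    "d2": {"indexing", "retrieval"},
    "d3": {"libraries", "retrieval"},
    "d4": {"chemistry", "evaluation"},
}
relevant = {"d1", "d2"}  # assumed relevance judgements for the query
query = ["retrieval"]

low = recall_precision(search(low_exhaustivity, query), relevant)
high = recall_precision(search(high_exhaustivity, query), relevant)
print("low exhaustivity:  recall=%.2f precision=%.2f" % low)
print("high exhaustivity: recall=%.2f precision=%.2f" % high)
```

In this invented case the more exhaustive indexing retrieves both relevant documents instead of one, but also admits a non-relevant document, so recall rises while precision falls. The direction of the effect matches the pattern reported above, though the magnitudes are of course arbitrary.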
Searching tests
The evaluation tests on searching include some of the most interesting of the
decade. It is, however, difficult to give a coherent account of them: the
searching subcomponent of a retrieval system is extremely complicated and
not well understood, and the tests that have been done are scattered over
the large area of searching as a whole.
Searching refers both to the entire interaction in which a user seeks
documents relevant to a need from a document file, and to any particular
expression of this need used to scan some or all of the file. The latter includes
the treatment of individual terms and that of the logical structure of the
query, and the complex relationship between the two. This is not the place
for a detailed discussion of searching, and in the summary account which
follows its different aspects will be referred to very crudely. We will
therefore simply use the term 'strategy' for the searching process for
a query as a whole, 'specification' for any individual matching prescription,
'logic' for the formal structure of such a prescription, and 'formulation' for
the broad or narrow scope of a specification. With respect to logic, the great
majority of experiments and investigations have, following operational
practice, been concerned with Boolean queries, and hence with the
measurement of performance for simple sets of retrieved documents.
However, the idea of subsearches (especially broadening a search) naturally
allows for an ordering of output, and some approaches to indexing, notably
those involving weighting, can only be properly, or at any rate sensibly,
interpreted as generating a ranked, i.e. ordered, output. (It should be
emphasized that this has nothing to do with the representation of Boolean
structure by weights, which is merely a matter of notation.) The Cranfield 2
experiments provided an ordered output, and as noted earlier, it became