being the large document sets used, for example Olive et al.'s of 12 765.
Recall and precision evaluation with retrieved sets was usual, even for Miller,
who applied a threshold to the ordered weighted output. However Evans and
Aitchison et al. evaluated over rankings.
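To make the distinction between these two evaluation styles concrete, the following is a minimal sketch in Python, not drawn from the tests cited: set-based recall and precision for an unordered boolean retrieved set, and the same measures computed after truncating an ordered weighted output at a rank cutoff, broadly in the manner described for Miller's thresholded output. The document identifiers and relevance judgements are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for an unordered retrieved set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def precision_recall_at_cutoff(ranking, relevant, cutoff):
    """Precision and recall over the top `cutoff` documents of a ranking:
    the ordered output is truncated and then treated as a retrieved set."""
    return precision_recall(ranking[:cutoff], relevant)

if __name__ == "__main__":
    relevant = {"d3", "d7", "d9", "d12"}                   # judged relevant documents
    boolean_set = ["d1", "d3", "d7", "d20"]                # output of a boolean search
    ranking = ["d3", "d5", "d7", "d9", "d2", "d12", "d8"]  # weighted, ordered output

    print(precision_recall(boolean_set, relevant))           # (0.5, 0.5)
    print(precision_recall_at_cutoff(ranking, relevant, 5))  # (0.6, 0.75)
```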
The findings for the search logic tests showed competitive, and even superior,
performance for the ordering methods. For example, Evans found relative
recall for ordering methods at a cutoff of 25 ranged from 45.2 to 49.6 per cent,
compared with that for weighted boolean matching of 40.8 per cent, while an
alternative method of comparison found the ordering strategies and simple
boolean searching similar. Miller had precision of 17 per cent and relative
recall of 46 per cent for boolean compared with 15 and 64 per cent for cutoff
ordering. Cleverdon had boolean performance for different languages ranging
from 27 to 51 per cent precision for 52-30 per cent recall (by external
sample), with non-boolean at a middling co-ordination level ranging from 69
to 68 per cent precision for 25-39 per cent recall. It is interesting to note the
very low recall levels of Aitchison et al.'s boolean searches. It is also of
interest that Barraclough et al.'s Medusa observations show users most often
starting with narrow searches and broadening them, effectively following the
co-ordination strategy of Cranfield 2. Aitchison et al.'s comparisons of
formulations show large performance differences, but these are in part due to
variations in document indexing exhaustivity. For example, Aitchison et al.'s
searches ranged from 28-67 per cent precision with 50-11 per cent recall for
broad formulations to 50-75 per cent precision with 9[OCRerr] per cent recall for
narrow, though graph comparisons show rather smaller differences. The
findings for the user studies show the competing alternatives to be very similar, for
example the performance for users and experts respectively in Barber et al.'s test.
The authors generally interpret the findings as showing that the various
simpler approaches advocated are justified. The more general implication of
the tests, taken together, is that different strategies of the same general type
produce very similar results, and even that strategies of quite different types
may do so. However the tests all support the proposition that it is worth
taking some trouble about the search specification: the `simpler' approaches
tested were by no means crude.
The tests so far mentioned dealt with systems which were wholly or
essentially manual, i.e. the document indexing might be manual and the
request indexing certainly was, even if the searching process was executed
mechanically. Thus in systems with automatic scanning of titles or abstracts,
like those discussed by the UKCIS workers, the real work was done by the
manually constructed profiles. Automatic systems at their most exigent, like
those studied by Smart, involve automatic indexing of documents and of
requests, represented by initial user need statement texts, or at least
substantial automatic modification of given manually-constructed document
and request descriptions. The phrase `automatic indexing', while loosely
applicable to automatic scanning, is more properly applied to more extensive
automatic processing, which in the 1970s was focused largely on the treatment
of requests, compared with the earlier concern with document indexing.
Automatic indexing tests
As noted, the work on automatic indexing and searching of the 1970s [OCRerr]