…being the large document sets used, for example Olive et al.'s of 12 765. Recall and precision evaluation with retrieved sets was usual, even for Miller, who applied a threshold to the ordered weighted output. However, Evans and Aitchison et al. evaluated over rankings.

The findings of the search logic tests showed competitive, and even superior, performance for the ordering methods. For example, Evans found relative recall for ordering methods at a cutoff of 25 ranging from 45.2 to 49.6 per cent, compared with 40.8 per cent for weighted boolean matching, while an alternative method of comparison found the ordering strategies and simple boolean searching similar. Miller had precision of 17 per cent and relative recall of 46 per cent for boolean searching, compared with 15 and 64 per cent for cutoff ordering. Cleverdon had boolean performance for different languages ranging from 27 to 51 per cent precision at 52-30 per cent recall (by external sample), with non-boolean searching at a middling co-ordination level ranging from 69 to 68 per cent precision at 25-39 per cent recall. It is interesting to note the very low recall levels of Aitchison et al.'s boolean searches. It is also of interest that Barraclough et al.'s Medusa observations show users most often starting with narrow searches and broadening them, effectively following the co-ordination strategy of Cranfield 2.

Aitchison et al.'s comparisons of formulations show large performance differences, but these are in part due to variations in document indexing exhaustivity. For example, Aitchison et al.'s searches ranged from 28-67 per cent precision with 50-11 per cent recall for broad formulations to 50-75 per cent precision with 9[?] per cent recall for narrow ones, though graph comparisons show rather smaller differences.

The findings of the user studies show the competing alternatives to be very similar, for example the performance of users and of experts respectively in Barber et al.'s test. The authors generally interpret the findings as showing that the various simpler approaches advocated are justified. The more general implication of the tests, taken together, is that different strategies of the same general type produce very similar results, and even that strategies of quite different types may do so. However, the tests all support the proposition that it is worth taking some trouble over the search specification: the 'simpler' approaches tested were by no means crude.

The tests so far mentioned dealt with systems which were wholly or essentially manual, i.e. the document indexing might be manual and the request indexing certainly was, even if the searching process was executed mechanically. Thus in systems with automatic scanning of titles or abstracts, like those discussed by the UKCIS workers, the real work was done by the manually constructed profiles.
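The figures quoted above rest on two evaluation conventions: precision and relative recall computed over an unordered boolean retrieved set, and the same measures computed over a ranked output truncated at a fixed cutoff (as in Evans's cutoff of 25 or Miller's threshold on weighted output). The following is a minimal illustrative sketch of that distinction; the document identifiers, relevance judgements and cutoff are invented for the example and are not taken from any of the tests discussed.

```python
# Illustrative sketch only: toy data, not figures from the tests above.
# Contrast (a) an unordered boolean retrieved set with (b) a ranking
# (e.g. by co-ordination level or term weight) cut off at a fixed rank,
# both scored by precision and relative recall.

def precision_and_relative_recall(retrieved, relevant):
    """Precision = relevant retrieved / retrieved;
    relative recall = relevant retrieved / known relevant."""
    retrieved = set(retrieved)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical relevance judgements for one request.
relevant = {"d02", "d05", "d09", "d11", "d17"}

# (a) Boolean search: one unordered set, hence one figure per search.
boolean_set = ["d02", "d05", "d21", "d34", "d40", "d09"]

# (b) Ranked output evaluated at a chosen cutoff (here the top 4).
ranking = ["d05", "d11", "d40", "d02", "d21", "d09", "d34"]
cutoff = 4

print(precision_and_relative_recall(boolean_set, relevant))       # (0.5, 0.6)
print(precision_and_relative_recall(ranking[:cutoff], relevant))  # (0.75, 0.6)
```

The contrast matters because a boolean search yields a single precision/recall point per formulation, whereas a co-ordination or weighted ranking can be evaluated at any chosen cutoff, trading precision against recall.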
Automatic systems at their most exigent, like those studied by Smart, involve automatic indexing of documents and of requests, represented by initial user need statement texts, or at least substantial automatic modification of given manually constructed document and request descriptions. The phrase 'automatic indexing', while loosely applicable to automatic scanning, is more properly applied to more extensive automatic processing, which in the 1970s was focused largely on the treatment of requests, compared with the earlier concern with document indexing.

Automatic indexing tests

As noted, the work on automatic indexing and searching of the 1970s …