Information Retrieval Experiment (IRE)
Chapter: Laboratory tests of manual systems
Author: E. Michael Keen
Editor: Karen Sparck Jones
Publisher: Butterworth & Company

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

conditions can provide much valuable data even in the realm of retrieval failure analysis. They also rode the storm caused by the unexpected and unwelcome outcome of the comparison, and made people face the possibility that complexity and intelligence at input may not yield a superior result at retrieval.

Though Cranfield 2 used machine-like search procedures in testing 29 index language devices³,⁴, we may say that the findings about the effectiveness of natural language (either indexing or titles) apply to manual systems as well, unless practical considerations of file or vocabulary size inhibit this. The subset of 63 requests and 200 documents from the Cranfield 2 aerodynamics collection has probably become the most heavily used test collection.

The ISILT experiments⁹,¹⁰, like Cranfield 1, took the untested debates of the day (minimum vocabulary post-coordinate systems, for example) and once again tried to provide measured results to replace unmeasured opinion. Two large index language comparisons that utilized manual indexing and search formulation, but machine searching, were the Case-Western Reserve University test¹¹ and Tom Aitchison's INSPEC work¹². Many small-scale tests were carried out on the need for syntactic devices (e.g. links and roles) in index languages, and these culminated in tests of the relational indexing system carried out by Jason Farradane¹³ and in ISILT.
Indexing and searching experiments

Index language testing has dominated the main thrust of laboratory investigations, in spite of the evidence of Cranfield 1 that it is the operations of indexing and searching that matter most. No large-scale laboratory experiments have tackled these two processes as primary variables, though many tests have experimented with them as secondary variables: all the large index language tests mentioned did so. Cranfield 1 is a classic in this respect. The 18 000 documents, which were indexed by four languages, were built up from batches of carefully selected components. There were different types of document (articles, reports, book sections, etc.), general or specialist subject areas, and five time limits allowed for indexing, and individual performance in indexing and searching was related to level of experience and to the use of subject specialists versus librarians.

Clearly defined parameters of exhaustivity and specificity, as they affect both indexing and searching, were explored in Cranfield 2 and ISILT. The comparison between pre- and post-coordinate search files was systematically tackled in ISILT, and the phenomenon of 'preserving the context' by multiple specific pre-coordinate entry was carried over from ISILT to the printed index experiments known as EPSILON¹⁴⁻¹⁶. There have also been a number of smaller projects in which just one of the two processes has been studied, but most such studies of indexing quality or consistency have not reached the status of valid evaluation testing.

Turning to tests of the search process, many laboratory experiments have employed very strict controls. It is true that in operational tests manual search formulation and strategy can vary dramatically from person to person, as was clearly seen in the Medusa current awareness work¹⁷. One experimental method is to obtain results by progressively broadening the