Information Retrieval Experiment
Laboratory tests of manual systems, by E. Michael Keen
(from Information Retrieval Experiment, edited by Karen Sparck Jones, published by Butterworth & Company)

conducted in other tests, often from 45 to 60 per cent. Only the Cranfield analyses revealed the extent to which failure is avoidable or not: 51 per cent of the failures were avoidable, and of these 22 per cent were due to lack of time allowed in indexing and 17 per cent to question misunderstanding. The effort and subjectivity of conducting failure analyses are the problems. Too many laboratory tests have had to spend most of their time building their test indexes, and have had to cut back on search testing, or even omit diagnostics altogether. Assignment of reasons to cases of failure can be complex, but the use of multiple reasons seems a reasonable technique. No wonder operational testing is a rarity, when even laboratory work poses such severe practical difficulties.

8.5 Test reliability and the future

As a new decade of evaluation testing is reached, a candid look at the past is needed. Tests have not yet covered all the variables we know about, let alone those remaining undisclosed. We do not know enough about the effects of experimental scale, or about the continued use of dated test materials. All too often we set up an investigation or experiment first, and only afterwards explore the variables or pose the hypotheses. But, on the other hand, we now have plenty of test evidence to argue about, and we do have a clearer view of the design parameters in information retrieval. It could be argued that only the scientific purist could expect a cleaner state of affairs, and such lack of progress is by no means confined to our own field.

Information retrieval testing has frequently proceeded in a series of loosely linked investigations, with the inconclusive end to one piece of research providing the impetus for the next. For example, Table 8.1 presents a set of results that are not yet understood and remain an anomaly: three test methods (A, B and C) were used to look at printed index entry processing speed. Methods A and B provide similar results: the four different entry types, though they have considerably different lengths of entry, are processed at very similar speeds. But why does method C conflict? Why don't the indexes with the longer entries take longer to process? Why don't the entries that prompt greater amounts of grammatical transformation take longer to process? (A small worked illustration of the length puzzle follows the table.) Hopefully, future work will explore and eventually explain these anomalies.

TABLE 8.1. Results of three tests from EPSILON, taken from Tables C/2, C/4 and C/6 in Keen

                               Entry      Entries per minute processed
                               length     Total search   Grammatically transformed subset
   Index entry                 (terms)         A             A          B          C
   Rotated term                  4.6          3.50           NA        7.29       47
   Rotated string                7.9          3.44         10.35       7.95        4[?]
   Articulated prepositional     6.9          3.41         11.20       8.04       32
   Shunted relational            4.6          3.29          9.81       8.17        4[?]

A: Search test; B: Scanning test; C: Audio test. All data are arithmetic means. NA: Not available.
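To make the length puzzle concrete, here is a minimal sketch (an illustration added for this discussion, not part of the original EPSILON analysis) that takes the method A 'total search' figures from Table 8.1 and derives the implied number of terms scanned per minute. If entry length governed processing speed, entries per minute would fall roughly in proportion to length; instead the per-entry rate stays nearly constant.

```python
# Illustrative arithmetic only, using the method A (search test, total search)
# column of Table 8.1: entry length in terms and entries processed per minute.
table_8_1_method_a = {
    "Rotated term":              (4.6, 3.50),
    "Rotated string":            (7.9, 3.44),
    "Articulated prepositional": (6.9, 3.41),
    "Shunted relational":        (4.6, 3.29),
}

for entry_type, (length_terms, entries_per_min) in table_8_1_method_a.items():
    # Implied reading rate in terms per minute if every term were actually read.
    terms_per_min = length_terms * entries_per_min
    print(f"{entry_type:27s}  {entries_per_min:4.2f} entries/min  "
          f"~{terms_per_min:4.1f} terms/min")
```

Across the four entry types the per-entry rate varies by only about 6 per cent (3.29 to 3.50 entries per minute), while the implied terms-per-minute figure varies by roughly 80 per cent (about 15 to 27). That is the anomaly raised above: searchers do not slow down in proportion to entry length.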