Information Retrieval Experiment
Chapter: Laboratory tests of manual systems
E. Michael Keen
Edited by Karen Sparck Jones
Butterworth & Company
conducted in other tests, often from 45 to 60 per cent. Only the Cranfield
analyses revealed the extent to which failure is avoidable or not: 51 per cent
of the failures were avoidable, and of these 22 per cent were due to lack of time allowed in
indexing and 17 per cent due to question misunderstanding. The effort and
subjectivity of conducting failure analyses are the problems. Too many
laboratory tests have had to spend most of their time building their test
indexes, and have had to cut back on search testing, and even miss out
diagnostics altogether. Assignment of reasons to cases of failure can be
complex, but the use of multiple reasons seems a reasonable technique. No
wonder operational testing is a rarity, when even laboratory work poses such
severe practical difficulties.
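To make the bookkeeping behind such a failure analysis concrete, the sketch below tallies failure cases when each case may carry several reasons, in the spirit of the multiple-reason technique mentioned above. The reason categories, counts and the notion of an "avoidable" flag are purely illustrative assumptions, not figures from any actual test.

```python
from collections import Counter

# Hypothetical failure log: each relevant-but-unretrieved item (a "failure")
# is assigned one or more reasons, as in a Cranfield-style failure analysis.
# The entries below are invented for illustration only.
failures = [
    {"reasons": ["indexing time too short"], "avoidable": True},
    {"reasons": ["question misunderstood", "indexing time too short"], "avoidable": True},
    {"reasons": ["inherent term ambiguity"], "avoidable": False},
    {"reasons": ["question misunderstood"], "avoidable": True},
]

reason_counts = Counter(r for f in failures for r in f["reasons"])
avoidable = sum(f["avoidable"] for f in failures)

print(f"avoidable failures: {avoidable}/{len(failures)} "
      f"({100 * avoidable / len(failures):.0f} per cent)")
for reason, n in reason_counts.most_common():
    # Percentages are taken over all failure cases; with multiple reasons
    # per case they need not sum to 100.
    print(f"  {reason}: {100 * n / len(failures):.0f} per cent of cases")
```

The point of allowing multiple reasons per case is visible in the output: the per-reason percentages overlap, so they describe the mix of causes rather than partitioning the failures.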
8.5 Test reliability and the future
As a new decade of evaluation testing is reached, a candid look at the past is
needed. Tests have not yet covered all the variables we know about, let alone
those remaining undisclosed. We don't know enough about effects of
experimental scale, or the continued use of dated test materials. All too
often we set up an investigation or experiment first, and only afterwards explore the
variables or pose the hypotheses. But, on the other hand, we now have plenty
of test evidence to argue about, and we do have a clearer view of the design
parameters in information retrieval.
It could be argued that only the scientific purist could expect a cleaner state
of affairs, and such lack of progress is by no means confined to our own field.
Information retrieval testing has frequently proceeded in a series of loosely
linked investigations, with the inconclusive end to one piece of research
providing the impetus for the next. For example, Table 8.1 presents a set of
results that are not understood and are a current anomaly: three test methods
(A, B and C) were used to look at printed index entry processing speed.
Methods A and B provide similar results: the four different entry types,
though they have considerably different lengths of entry, are processed at
very similar speeds. But why does method C conflict? Why don't the indexes
with the longer entries take longer to process? Why don't the entries that
prompt greater amounts of grammatical transformation take longer to
process? Hopefully, future work will explore and eventually explain these
anomalies.
TABLE 8.1. Results of three tests from EPSILON, taken from Tables C/2, C/4 and C/6 in Keen

Index entry type             Entry length   Entries per minute                          Entries
                             (terms)        Total search   Full subset   Processed      grammatically
                                            (A)            (B)           (C)            transformed
Rotated term                 4.6            3.50           NA            7.29           47
Rotated string               7.9            3.44           10.35         7.95           4
Articulated prepositional    6.9            3.41           11.20         8.04           32
Shunted relational           4.6            3.29           9.81          8.17           4

A: Search test; B: Scanning test; C: Audio test. All data are arithmetic means. NA: Not available.
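The anomaly can be inspected directly from the published figures: if processing time were governed by entry length, entries per minute would fall as entry length rises, and seconds per term would be roughly constant. The sketch below uses only the entries-per-minute values printed in Table 8.1 and converts them into seconds per entry and per term; the calculation is an illustration of the check, not part of the original analysis, and the column keys A, B, C simply follow the table footnote.

```python
# Figures copied from Table 8.1 (entries per minute); the NA cell is omitted.
# A: search test, B: scanning test, C: audio test.
table = {
    "Rotated term":              {"length": 4.6, "A": 3.50, "C": 7.29},
    "Rotated string":            {"length": 7.9, "A": 3.44, "B": 10.35, "C": 7.95},
    "Articulated prepositional": {"length": 6.9, "A": 3.41, "B": 11.20, "C": 8.04},
    "Shunted relational":        {"length": 4.6, "A": 3.29, "B": 9.81,  "C": 8.17},
}

for method in ("A", "B", "C"):
    print(f"Method {method}:")
    for entry, row in table.items():
        if method not in row:
            continue  # NA in the published table
        secs_per_entry = 60.0 / row[method]          # seconds to process one entry
        secs_per_term = secs_per_entry / row["length"]  # normalized by entry length
        # If speed depended on entry length alone, seconds per term would be
        # nearly constant within a method; the spread shows how far that holds.
        print(f"  {entry:28s} {secs_per_entry:5.1f} s/entry  "
              f"{secs_per_term:5.2f} s/term")
```

Run over the table, the per-term figures within each method differ widely across entry types, which restates the puzzle in the text: the longer entries are not taking proportionally longer to process.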