only require measures of hits and waste, as Cranfield 2, so this criterion for
test validity is to be related to the objectives of a given test.
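In Cranfield terms the 'hits' are the relevant documents a search retrieves and the 'waste' the non-relevant ones, from which the familiar recall and precision ratios follow. A minimal sketch of that computation, using illustrative names and figures that are not drawn from any of the tests discussed here:

def recall_precision(retrieved, relevant):
    """Score one search result against one set of relevance decisions."""
    hits = len(retrieved & relevant)    # relevant documents retrieved
    waste = len(retrieved - relevant)   # non-relevant documents retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / (hits + waste) if retrieved else 0.0
    return recall, precision

# Hypothetical search: 10 documents retrieved, 4 of the 8 relevant ones found
r, p = recall_precision(set(range(10)), {0, 1, 2, 3, 10, 11, 12, 13})
print(f"recall={r:.2f}  precision={p:.2f}")   # recall=0.50  precision=0.40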
A second desideratum applies to tests in which performance is being
compared under different circumstances, and is that the comparison be so
controlled that the cause of any performance difference may be determined.
This may not always require that only a single variable be altered at
any one time, but in the present state of information retrieval theory this is
usually the safest procedure. Another aspect of this requirement is that where
possible the test environment factors should be held constant; for example,
a comparison of manual Boolean logic search results with those of a search
path from a ranked search output would be best made on one test collection
using the same set of search requests and relevance decisions. It is recognized
that some comparisons cannot be made without some change in environment
factors, such as the comparison of general and specific requests requiring
different request sets, but then there would be an advantage in keeping the
document collection unchanged.
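By way of illustration, the sketch below makes the kind of controlled comparison just described: a Boolean result set and a ranked output, cut to the same size, are scored against one fixed set of relevance decisions on one collection, so that any performance difference is attributable to the search method alone. All identifiers and data are hypothetical:

# One collection, one request, one set of relevance decisions; only the
# search method is varied (hypothetical data throughout).
relevant = {"d02", "d05", "d09", "d14"}
boolean_set = {"d02", "d05", "d07", "d21"}                 # Boolean search output
ranked_list = ["d05", "d02", "d31", "d09", "d40", "d14"]   # ranked search output

def score(retrieved, relevant):
    hits = len(set(retrieved) & relevant)
    return hits / len(relevant), hits / len(retrieved)

# Cut the ranked list at the Boolean output size, holding output volume constant too
cutoff = len(boolean_set)
for name, result in (("Boolean", boolean_set), ("Ranked", ranked_list[:cutoff])):
    recall, precision = score(result, relevant)
    print(f"{name:8s} recall={recall:.2f}  precision={precision:.2f}")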
A third criterion for acceptability is the practical matter of the availability
of a full report describing the test in sufficient detail. Minimum lists of
matters to be included in reporting evaluation tests have been suggested,
but none has gained acceptance. If the method used for some vital
part of an experiment cannot be determined, then its results are really as
suspect as those from tests known to be inadequate.
8.2 Test types
The history of laboratory manual testing seems to consist of only a few large
studies, each one looking at a number of the basic parameters that govern the
behaviour of information retrieval systems. Few hypotheses have been
clearly formulated, but these tests constitute a host of quite tight experiments
that have given us most of the light we have on index languages, indexing
and searching. Examples of tests will now be given, categorized for
convenience into index language comparisons, indexing and searching
experiments and printed index comparisons. The writer's own work will
often be used as the main illustrations of these distinctive test types, so other
studies would need to be added for a comprehensive picture. Some of the
findings and conclusions of this testing activity will be given in the next
section.
Index language comparisons
Cranfield 1 remains a classic set of experiments in objectives, details and
procedures1,2, which provided all the necessary and sufficient circumstances for
testing. All subsequent tests have, knowingly or unknowingly, faced the
same problems, but rarely with the common sense and ingenuity of Cyril
Cleverdon. What was tested has already been briefly described. Overlapping
it in time was a test of a faceted classification used manually, versus a
complex semantic code and role operator system with machine searching,
known as the Western Reserve University test8. Here Cyril Cleverdon and
Jean Aitchison showed that a small test collection in laboratory search