IRE Information Retrieval Experiment Retrieval system tests 1958-1978 chapter Karen Sparck Jones Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.
of effectiveness, somehow defined. The difference between experiment and investigation is therefore that in experiment explicit comparative measurements are required for different values of the test variables; in investigation, comparison may be only implicit. Further, since retrieval systems have a function, evaluation experiments relating specifically to performance effectiveness, i.e. the ability of the system to retrieve relevant documents and to suppress non-relevant ones, have a special status as the most important kind of experiment. Unfortunately, the distinction between experiment and investigation just summarized is an ideal which is very difficult to maintain when discussing actual system tests. Much of the work done cannot be described as unequivocally experimental or investigative, especially where studies of operational systems are concerned. The problem is really that information retrieval systems are so complicated, and so little understood, and there is such a lack of solid theory about them, that really high-class experiment can hardly be expected. In a way a review of information retrieval experiment is a review of the inadequacy of information retrieval experiment. The work discussed in this chapter thus ranges from experiments proper to better-conducted and relatively systematic investigations.
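The idea of explicit comparative measurement of performance effectiveness can be illustrated with a minimal sketch. It assumes the familiar recall and precision ratios as the measures of retrieving relevant documents and suppressing non-relevant ones; the passage above does not itself name these measures, and the document identifiers, the two settings of the test variable, and the function name are all hypothetical.

```python
from typing import Dict, Set, Tuple


def recall_precision(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (recall, precision) for a single request.

    recall    = relevant documents retrieved / all relevant documents
    precision = relevant documents retrieved / all documents retrieved
    """
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical environment, held constant: one request and its relevant set.
relevant = {"d1", "d2", "d3", "d4"}

# Hypothetical output for two values of a single test variable.
retrieved_by_setting: Dict[str, Set[str]] = {
    "setting_a": {"d1", "d2", "d5"},
    "setting_b": {"d1", "d2", "d3", "d6", "d7", "d8"},
}

# An experiment, in the sense used above, measures each value explicitly
# so that the values can be compared directly.
results = {
    name: recall_precision(docs, relevant)
    for name, docs in retrieved_by_setting.items()
}

for name, (recall, precision) in results.items():
    print(f"{name}: recall={recall:.2f} precision={precision:.2f}")
```

In this toy case the second setting retrieves more of the relevant documents but suppresses fewer non-relevant ones, which is the kind of explicit comparison an investigation would leave implicit.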
A particular problem is that while both experiment and investigation can in principle refer to operational system studies, in practice there have been few thoroughly controlled operational system tests, and experiment and investigation typically imply laboratory and operational environments respectively. There are indeed, as is noted in other chapters, considerable difficulties about conducting rigorous operational system experiments.
Within the area of information retrieval experiment we can then sort the tests done according to the degree of control they involved, and according to the type of hypothesis they invoked. Control is exhibited by comparison, and the degree of control corresponds largely to the scope or level of the factor being studied as the primary experimental variable, i.e. the variable on which the experimenter's interest is focused. Thus at the highest level we may compare whole indexing and searching subsystems within the fixed environment represented by a certain body of users and of literature; at the medium level we may compare different indexing thesauri; and at the lowest level we may vary indexing exhaustivity using a given thesaurus. As long as the environment parameters are held constant, all of these are comparisons implying some degree of control, but in the case of whole indexing or searching subsystems control will be minimal. The consequence is that any observed differences (or similarities) in system performance will not be explicable in any detail, since the indexing and searching subsystem as a whole subsumes many lower-level variables. The problem most severely felt by research workers has been that of identifying useful, meaningful unit variables in retrieval systems, i.e. those variables capable of affecting performance for determinable reasons. A closely related problem is that of managing secondary, related variables, since their identification and manipulation are associated with the treatment of the primary variables.
The treatment of indexing exhaustivity and specificity in relation to an index language is a good example of this problem. Similar points can be made about the hypotheses underlying information retrieval experiment. Some hypotheses are rather general, for instance that