5 The pragmatics of information retrieval experimentation

Jean M. Tague

The novice information scientist, though he or she may have thoroughly studied the design and results of previous information retrieval tests and clearly described the purpose of his or her own test, may still, when faced with its implementation, have great difficulty in proceeding. Early information retrieval experiments were of necessity ad hoc, and it is only in recent years that a body of practice, based on the experiences of Cleverdon and later investigators, has made possible a few recommendations on the pragmatics of conducting information retrieval experiments. The following remarks, though based to some extent on a study of the major tests, including those described in later chapters of this book, are heavily dependent on the author's own trials, tribulations, and mistakes. If there is one lesson to be learned from experience, it is that the theoretically optimum design can never be achieved, and the art of information retrieval experimentation is to make the compromises that will least detract from the usefulness of the results.

In determining experimental procedures, three aspects must be kept in mind:

(1) The validity of the procedure: does it determine what the experimenter wishes to determine? If a study is being made of the relation of document scope to user satisfaction, does the use of number of citations as a measure of scope and number of references marked 'relevant' as a measure of satisfaction really fulfil this purpose?

(2) The reliability of the procedure: can it be replicated by another experimenter? If one is addressing the problem of inter-indexer consistency, will a test of the consistency of two indexers indexing 10 documents from a single journal provide results which can be replicated elsewhere? A procedure may be reliable without being valid, i.e. it may give consistent results but be measuring something else.

(3) The efficiency of the test procedures: how long will the test take, how many resources (people, computing, supplies, equipment) will it require, and how much will it cost? Is it sensible, for example, to assess the absolute recall of searches when this means each user will have to peruse the entire database? What limitations will this place on the size of the database?
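Two conventional formulations may make points (2) and (3) more concrete. They are not spelled out in this chapter, so what follows is only a hedged sketch: the consistency measure is the overlap measure commonly used in indexer-consistency studies (often attributed to Hooper), and the recall definition is the standard one; the symbols A, M, and N are illustrative labels, not quantities taken from any particular test described here.

% A sketch of standard measures, not drawn from the chapter itself.
% Inter-indexer consistency (point 2), for two indexers assigning terms
% to the same document:
%   A = number of index terms assigned by both indexers
%   M = number of terms assigned only by the first indexer
%   N = number of terms assigned only by the second indexer
\[
  \text{consistency} \;=\; \frac{A}{A + M + N}
\]
% Absolute recall (point 3): the denominator counts every relevant
% document in the database, not merely those retrieved.
\[
  \text{recall} \;=\;
  \frac{\text{number of relevant documents retrieved}}
       {\text{number of relevant documents in the database}}
\]

On this formulation, a consistency of 1 means the two indexers agree completely, and the recall denominator makes explicit why, in the absence of sampling or pooling, absolute recall can only be assessed by examining the entire collection for relevant items.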