IRE Information Retrieval Experiment The pragmatics of information retrieval experimentation chapter Jean M. Tague Butterworth & Company Karen Sparck Jones

least two command language modes: learner mode and experienced or abbreviated mode. A 'help' feature, which permits users to access online explanations of the various commands in the retrieval system, is very useful.

(4) Retrieval systems should provide facilities for automatic collection of data needed by the experimenter, for example, number of search statements entered, number of documents retrieved by a search statement, number of postings for any term, search time (both connect time and CPU time).

(5) Retrieval systems to be used in experiments should provide a variety of outputs: title, full citation, abstracts, term or term combination causing retrieval and so on. Facilities should be provided for offline printing of large output sets. If output is to be evaluated, the system should provide it in a suitable form. Users may want to retain the output and so it should be duplicated, one set for the user, one for the experimenter.

5.7 Decision 7: How will treatments be assigned to experimental units?

A complete information retrieval experiment is concerned with assessing the effects of one or more classes of treatments or factors on one or more criterion measures, where the criterion measure is determined for each of a sample of experimental units.
For example, the treatments might be different degrees of vocabulary control, the experimental units searches of queries on a database, and the criterion measures recall and precision. Or the treatments might be degree of search delegation to an intermediary, the experimental units online searches of queries on several systems, and the criterion measure total search time. In a partial test, the treatments might be represented by levels of indexer training, the experimental units documents to be indexed, and the criterion measure indexing time.

In multi-factor experiments, there is more than a single set of treatments or factors. For example, in complete retrieval experiments, frequently the type of indexing language and the searcher are varied over the query set, giving a two-factor experiment. Or in an online experiment, three factors might be degree of delegation, online system, and searcher.

A source of experimental units which is not expected to interact with the factors is called a block. In information retrieval experiments, sources of queries or users such as different libraries might be considered blocks.

In a completely crossed factorial experiment, at least one experimental unit is assigned to each possible combination of factors. Thus, in a completely crossed version of the two-factor design, indexing language by searcher, unique queries would be assigned to each combination of searcher and indexing language. If we let g1, g2, g3 represent the three languages, s1, s2, s3, s4 the four searchers, and Q1, Q2, . . . , Q12 twelve query sets, a completely crossed design would be represented as follows:

        g1    g2    g3
s1      Q1    Q2    Q3
s2      Q4    Q5    Q6
s3      Q7    Q8    Q9
s4      Q10   Q11   Q12
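
The completely crossed layout can also be generated programmatically. The following is a minimal Python sketch, not part of the original chapter. It assumes three indexing languages and four searchers, which is what twelve query sets imply. The function name crossed_design and the labels are hypothetical, chosen only for illustration, and query sets are assigned in row order (searcher by language).

```python
from itertools import product


def crossed_design(languages, searchers, query_sets):
    """Assign one query set to every (searcher, language) combination.

    Hypothetical helper: cells are filled row by row, one row per searcher.
    """
    cells = list(product(searchers, languages))
    if len(query_sets) != len(cells):
        raise ValueError("need exactly one query set per factor combination")
    return dict(zip(cells, query_sets))


# Illustrative labels only; they mirror the notation used in the text.
languages = ["g1", "g2", "g3"]
searchers = ["s1", "s2", "s3", "s4"]
query_sets = [f"Q{i}" for i in range(1, 13)]

design = crossed_design(languages, searchers, query_sets)

# Print the design as a searcher-by-language table.
for s in searchers:
    print(s, *(design[(s, g)] for g in languages))
```

Each (searcher, language) cell receives its own query set, so no query set is shared between two factor combinations. In a real experiment the query sets would normally be allocated to cells at random rather than in sequence.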