IRE: Information Retrieval Experiment, edited by Karen Sparck Jones (Butterworth & Company). Chapter: The pragmatics of information retrieval experimentation, by Jean M. Tague.

Decision 7: How will treatments be assigned to experimental units?

The problem with this approach is that queries vary so widely with respect to recall, precision, and other criterion measures of interest to experimenters that these variations may mask the variations caused by the indexing language, which the experiment is supposed to determine.

Another approach is to use a design with repeated measures. As the name implies, this means that the same experimental unit is subjected to all the treatments of interest, i.e. each query is searched using all three indexing languages. Such designs permit control over individual differences. Thus, using the same notation as in the previous example, a two-factor experiment (language by searcher) with repeated measures would look like this:

        g1    g2    g3
  s1    Q1    Q1    Q1
  s2    Q2    Q2    Q2
  s3    Q3    Q3    Q3
  s4    Q4    Q4    Q4

where Q1, Q2, Q3, Q4 are sets of n/4 queries, n being the total number of queries available in the experiment. If, instead of assigning a different query set to each searcher, one assigns the same set to all searchers, then the query has in effect become a third factor in a language by searcher by query experiment.

Repeated measures designs have the advantage that fewer queries are needed for the same reliability.
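The repeated measures layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `repeated_measures_layout`, the query labels q1 to q12, and the choice of splitting the queries in their listed order are all hypothetical and do not come from the chapter.

```python
def repeated_measures_layout(queries, searchers, languages):
    """Give each searcher one query set and search that set under every language.

    Returns a dict keyed by (searcher, language) whose value is the list of
    queries in that cell. Assumes (hypothetically) that the queries are split
    in listed order into equal sets, one per searcher.
    """
    k = len(searchers)
    if len(queries) % k != 0:
        raise ValueError("number of queries must be divisible by number of searchers")
    size = len(queries) // k
    layout = {}
    for i, searcher in enumerate(searchers):
        # Query set Q(i+1): the i-th block of n/k queries.
        query_set = queries[i * size:(i + 1) * size]
        for language in languages:
            # Repeated measures: the same set appears in every language column.
            layout[(searcher, language)] = list(query_set)
    return layout


# Illustrative data: n = 12 queries, 4 searchers, 3 indexing languages.
queries = [f"q{i}" for i in range(1, 13)]
searchers = ["s1", "s2", "s3", "s4"]
languages = ["g1", "g2", "g3"]
layout = repeated_measures_layout(queries, searchers, languages)

for s in searchers:
    print(s, [layout[(s, g)] for g in languages])
```

Each row of the printed output corresponds to one searcher, and the three identical lists in a row show that the same n/4 queries are searched under g1, g2, and g3.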
However, repeated measures designs have the drawback of introducing possible 'sequence' effects: the effects of practice, training, and learning carried over from a search in one indexing language to a search of the same query in another. In his standard text on experimental design, Winer [18] says: 'In experiments where sequence effects are likely to be marked, a repeated measure design should be avoided. In cases where sequence effects are likely to be small relative to treatment effects, repeated measure designs can be used. Randomizing the order of administration tends to prevent confounding of treatment and sequence effects.' The experimenter must judge for himself the magnitude of sequence effects on searchers. One would expect them to be greater with novice than with experienced personnel.

Another way to control sequence effects is by using a Latin square design. A Latin square is an n by n table or array whose entries are n distinct symbols, assigned so that each symbol appears exactly once in each row and once in each column. For example, here are two different 3 by 3 Latin squares:

  1 2 3      1 3 2
  2 3 1      3 2 1
  3 1 2      2 1 3

In experimental design, the rows and columns represent levels of two factors (for example, indexing language and search order). The entries in the body of the table represent experimental units or sets of randomly assembled experimental units (for example, sets of queries). Note that for a Latin square to be used as an experimental design one must have N(R) = N(C) = N(Q)