IRE
Information Retrieval Experiment
The pragmatics of information retrieval experimentation
chapter
Jean M. Tague
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

Decision 7: How will treatments be assigned to experimental units?

measure among treatments, one must be able to assume that all interaction effects are negligible. If rows represent searchers and columns languages, there may be an interaction between searcher and language, because certain searchers may find certain languages particularly sympathetic or difficult. In this case, a Latin square should not be used.

However, when Latin squares are repeated as part of a larger design, interactions may in part be tested. In the following design, three balanced Latin squares are used. Search order is indicated by the symbols o1, o2, and o3.

               s1    s2    s3
    o1   g1    Q1    Q2    Q3
         g2    Q2    Q3    Q1
         g3    Q3    Q1    Q2
    o2   g1    Q2    Q3    Q1
         g2    Q3    Q1    Q2
         g3    Q1    Q2    Q3
    o3   g1    Q3    Q1    Q2
         g2    Q1    Q2    Q3
         g3    Q2    Q3    Q1

In this design, it is assumed that there are no interactions with the order factor. However, other interactions, such as those involving the searcher, can be tested.

Another example of a repeated Latin square design is given by Keen and Wheatley3. Here (see Figure 5.2) an incomplete block design is used, in which each block is a Latin square: searcher by order by language. The blocks are incomplete because only a subset of the queries and of the searches occur in each block.

Another kind of sequence effect is involved in the fatigue factor, which can occur in either indexing or searching.
Randomizing or otherwise controlling the order in which treatments are applied within a specified time period, for example a day, will reduce this problem. The point at which fatigue sets in can sometimes be determined during preliminary practice sessions. During the actual experiments, scheduling should then terminate activities at this point.

The number of queries is another aspect of experimental design. The more factors under experimental control, the larger the query set must be. Some replication is desirable within each cell (combination of factors). Thus, the more factors to be studied and/or controlled, the larger the sample size required.

For classical F-test analysis of variance (to be discussed in the next section), Winer18 provides a method of determining the sample size per cell which will detect a stated minimum difference d among k treatment means at a specified significance level, α, and power, p. For example, for α = 0.05, p = 0.9, d = s/4, where s is the sample standard deviation obtained from a previous sample, and k = 5, the sample size per cell is approximately 56. If d is doubled to s/2, the sample size is approximately 14, i.e. it is reduced by a factor of 4. In general, to double the discrimination power of a test, that is, to halve the minimum difference it can detect, one needs to quadruple the sample size.
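
The scaling rule in the last sentence can be illustrated with a short sketch. It does not reproduce Winer's procedure, which starts from the power function of the F-test. It only assumes that, with significance level, power, and number of treatments held fixed, the required sample size per cell is inversely proportional to the square of the minimum detectable difference d. The function name `rescale_sample_size` and its parameters are hypothetical, introduced here for illustration only.

```python
import math


def rescale_sample_size(n_ref, d_ref, d_new):
    """Rescale a known per-cell sample size to a new minimum difference.

    Assumption (not Winer's full method): with alpha, power, and k fixed,
    the required n is proportional to 1 / d**2.

    n_ref -- per-cell sample size already known for difference d_ref
    d_ref -- minimum detectable difference for n_ref, in units of s
    d_new -- new minimum detectable difference, in the same units
    """
    if n_ref <= 0 or d_ref <= 0 or d_new <= 0:
        raise ValueError("all arguments must be positive")
    # Round up: a fractional observation cannot be collected.
    return math.ceil(n_ref * (d_ref / d_new) ** 2)


# The example from the text: about 56 per cell at d = s/4.
n_quarter = 56
# Doubling d to s/2 divides the requirement by 4.
n_half = rescale_sample_size(n_quarter, d_ref=0.25, d_new=0.5)
print(n_half)  # 14
```

Going the other way, `rescale_sample_size(14, d_ref=0.5, d_new=0.25)` returns 56, which is the quadrupling described above. A reference value such as 56 still has to come from a proper power calculation; the sketch only rescales it.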