IRE
Information Retrieval Experiment
The pragmatics of information retrieval experimentation
chapter
Jean M. Tague
Butterworth & Company
Karen Sparck Jones
Decision 7: How will treatments be assigned to experimental units?
measure among treatments, one must be able to assume that all interaction
effects are negligible. If rows represent searchers and columns languages,
there may be an interaction between searcher and language resulting from
the fact that certain searchers may find certain languages particularly
sympathetic or difficult. In this case, a Latin square should not be used.
However, when Latin squares are repeated as part of a larger design,
interactions may in part be tested.
In the following design, three balanced Latin squares are used. Search
order is indicated by the symbols o1, o2, and o3.

            s1   s2   s3
  o1   g1   Q1   Q2   Q3
       g2   Q2   Q3   Q1
       g3   Q3   Q1   Q2
  o2   g1   Q2   Q3   Q1
       g2   Q3   Q1   Q2
       g3   Q1   Q2   Q3
  o3   g1   Q3   Q1   Q2
       g2   Q1   Q2   Q3
       g3   Q2   Q3   Q1

In this design, it is assumed there are no interactions with the order factor.
However, other interactions, such as those involving searcher, can be tested.
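
The three squares in the design above follow a cyclic pattern, which the following minimal Python sketch reproduces and checks. The function names (cyclic_square, is_latin) and the zero-based coding of Q1 to Q3 as 0 to 2 are assumptions made for illustration only; they do not come from the original text.

```python
def cyclic_square(n, shift):
    """Build an n x n square whose entry at (row, col) is (row + col + shift) mod n."""
    return [[(row + col + shift) % n for col in range(n)] for row in range(n)]


def is_latin(square):
    """Return True if every symbol occurs exactly once in each row and each column."""
    n = len(square)
    symbols = set(range(n))
    rows_ok = all(set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return rows_ok and cols_ok


n = 3
# One square per search order: o1, o2, o3 (shift 0, 1, 2).
design = {f"o{k + 1}": cyclic_square(n, k) for k in range(n)}

for order, square in design.items():
    for g, row in enumerate(square, start=1):
        # Columns correspond to s1, s2, s3; entries are printed as Q1..Q3.
        print(order, f"g{g}", " ".join(f"Q{q + 1}" for q in row))
```

Each square is Latin on its own, and across the three orders every (g, s) cell receives each of Q1, Q2, and Q3 exactly once.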
Another example of a repeated Latin square design is given by Keen and
Wheatley³. Here (see Figure 5.2) an incomplete block design is used, in
which each block is a Latin square: searcher by order by language. The
blocks are incomplete because only a subset of the queries and of the searches
occur in each block.
Another kind of sequence effect is involved in the fatigue factor, which can
occur in either indexing or searching. Randomizing or otherwise controlling
the order in which treatments are applied within a specified time period, for
example a day, will reduce this problem. The point at which fatigue sets in
can sometimes be determined during preliminary practice sessions. During
the actual experiments, scheduling should then terminate activities at this
point.
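
One way to combine these two precautions, randomized order and a cut-off at the fatigue point, is sketched below in Python. The function name schedule_sessions, the parameter max_per_session (standing in for the fatigue point found in practice sessions), and the task labels are hypothetical.

```python
import random


def schedule_sessions(tasks, max_per_session, seed=None):
    """Shuffle the tasks, then split them into sessions no longer than max_per_session."""
    rng = random.Random(seed)  # a fixed seed makes the schedule reproducible
    order = list(tasks)
    rng.shuffle(order)
    return [order[i:i + max_per_session]
            for i in range(0, len(order), max_per_session)]


# Hypothetical example: ten search tasks, fatigue assumed to set in after four.
tasks = [f"task{i}" for i in range(1, 11)]
for day, session in enumerate(schedule_sessions(tasks, max_per_session=4, seed=1), start=1):
    print(f"day {day}: {session}")
```

Every task is scheduled exactly once, and no session runs past the assumed fatigue point.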
The number of queries is another aspect of experimental design. The more
factors under experimental control, the larger must be the query set. Some
replication is desirable within each cell (combination of factors). Thus, the
more factors to be studied and/or controlled, the larger the sample size
required.
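
The arithmetic can be illustrated with a short Python sketch. It assumes a fully crossed design in which every observation uses a different query; the factor names, level counts, and replication figure are hypothetical.

```python
from math import prod


def queries_needed(levels_per_factor, replications_per_cell):
    """Number of cells (product of factor levels) times replications per cell."""
    cells = prod(levels_per_factor.values())
    return cells * replications_per_cell


# Hypothetical example: two factors, five replications per cell.
two_factors = {"language": 3, "searcher": 4}
print(queries_needed(two_factors, 5))

# Adding a third two-level factor doubles the number of cells and of queries.
three_factors = {**two_factors, "search_mode": 2}
print(queries_needed(three_factors, 5))
```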
For classical F-test analysis of variance (to be discussed in the next
section), Winer¹⁸ provides a method of determining the sample size per cell
which will detect a stated minimum difference d among k treatment means
at a specified significance level α and power p. For example, for α = 0.05,
p = 0.9, d = s/4, where s is the sample standard deviation obtained from a
previous sample, and k = 5, the sample size per cell is approximately 56. If d
is doubled to s/2, the sample size is approximately 14, i.e. it is reduced by a
factor of 4. In general, to double the discrimination power of a test one needs
to quadruple the sample size.
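
The scaling rule in this example can be written as a small Python sketch. It does not reproduce Winer's method; it only rescales a sample size that is already known, on the assumption that the required size is inversely proportional to the square of d. The function name rescaled_sample_size is hypothetical.

```python
import math


def rescaled_sample_size(n_ref, d_ref, d_new):
    """Rescale a known per-cell sample size n_ref, valid for difference d_ref,
    to a new minimum detectable difference d_new, assuming n is proportional to 1/d**2."""
    return math.ceil(n_ref * (d_ref / d_new) ** 2)


# Differences are expressed in units of the standard deviation s.
print(rescaled_sample_size(56, d_ref=0.25, d_new=0.5))    # d doubled: a quarter of the sample
print(rescaled_sample_size(56, d_ref=0.25, d_new=0.125))  # d halved: four times the sample
```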