This design has the property that the
"treatment effect", here the difference (V1-V2) in search performance
between the two system variants as measured, for example, by average
precision, can be
estimated free and clear of the main (additive) effects of searcher
and topic. Here, searcher and topic are treated statistically as
blocking factors. This means that even in the presence of differences
between searchers and topics, which clearly are anticipated, the
design will provide estimates of V1-V2 that are not contaminated by
these differences.

The Latin square can be described by a model equation. For example, the
performance of S1 using V1 on T1 can be modeled (ignoring for now any
interactions) as:

    m + s1 + v1 + t1 + e

(where: m is the grand mean of all performances, s1 is the effect
of searcher 1, v1 is the effect of system variant 1, t1 is the
effect of topic 1, and e is "error" - the effect of everything else.)
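The treatment effect (x), i.e., the difference between systems'
performance, is estimated by the mean of the two V1-V2 differences,
one per searcher. The grand mean and the main effects of searcher and
topic cancel, leaving the system difference (the error terms, written
here with the single symbol e, are dropped as well; in practice each
observation has its own error, so x equals v1-v2 only up to random error):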
x = ( [(m+s1+t1+v1+e)-(m+s1+t2+v2+e)] + [(m+s2+t2+v1+e)-(m+s2+t1+v2+e)] ) / 2
  = ( [t1-t2+v1-v2] + [t2-t1+v1-v2] ) / 2
  = ( 2*v1 - 2*v2 ) / 2
  = v1 - v2
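
The cancellation above can be checked numerically. The following is a minimal Python sketch, not part of the guidelines. All effect values are invented for illustration, and the error term is set to zero (an assumption made only so the arithmetic is exact). It shows that each single V1-V2 difference is contaminated by the topic effect, while their mean recovers v1 - v2.

```python
# Additive model for the 2x2 Latin square: performance = m + s + v + t.
# The error term e is assumed to be zero here; all numbers are hypothetical.
m = 0.30                                   # grand mean
searcher = {"S1": 0.05, "S2": -0.05}       # searcher effects s1, s2
variant = {"V1": 0.04, "V2": -0.04}        # system variant effects v1, v2
topic = {"T1": 0.10, "T2": -0.10}          # topic effects t1, t2


def performance(s, v, t):
    """Modeled performance of searcher s using variant v on topic t."""
    return m + searcher[s] + variant[v] + topic[t]


# Design: S1 runs V1 on T1 and V2 on T2; S2 runs V1 on T2 and V2 on T1.
diff_s1 = performance("S1", "V1", "T1") - performance("S1", "V2", "T2")
diff_s2 = performance("S2", "V1", "T2") - performance("S2", "V2", "T1")

# Treatment effect estimate: mean of the two within-searcher differences.
x = (diff_s1 + diff_s2) / 2
true_diff = variant["V1"] - variant["V2"]

print("S1 difference (includes t1-t2):", round(diff_s1, 6))
print("S2 difference (includes t2-t1):", round(diff_s2, 6))
print("mean of differences x:", round(x, 6))
print("v1 - v2:", round(true_diff, 6))
```

With a nonzero error on each observation, x would differ from v1 - v2 by the average of those errors rather than matching it exactly.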
For TRECVID 2003 we need to expand the design to an even 24 topics by
replicating the 2x2 square to create a 2x24 matrix (below). We
permute the columns so a searcher completes all work on one system
variant before beginning any work on the other variant. For each
searcher a tutorial and practice search would precede the first search
on each system. Because each search can take up to 15 minutes, we limit
each searcher to half the topics so the total maximum search time
for any given searcher is 3 hours - already a long time.
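
The time budget stated above follows from simple arithmetic, sketched here in Python. The constant names are hypothetical; the figures (24 topics, half per searcher, up to 15 minutes per search) come from the text. Tutorial and practice time is not included.

```python
# Worst-case search time per searcher, using the figures from the text.
TOTAL_TOPICS = 24              # topics in the expanded design
MAX_MINUTES_PER_SEARCH = 15    # upper bound on a single search

topics_per_searcher = TOTAL_TOPICS // 2                  # half the topics
max_minutes = topics_per_searcher * MAX_MINUTES_PER_SEARCH
max_hours = max_minutes / 60

print(topics_per_searcher, "topics,", max_minutes, "minutes,", max_hours, "hours")
```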