Two designs for interactive video search experiments


Design for measuring and comparing the effectiveness of 2 systems

The following is a design for an interactive video retrieval experiment aimed at measuring and comparing the effectiveness of two system variants (V1, V2) using 24 topics (Tn) and either 8, 16, or 24 searchers (Sn), each of whom searches 12 topics (assuming at most 15 minutes per search). No searcher searches the same topic more than once.

The design allows the difference in performance between two system variants run at one site to be estimated free and clear of the main (additive) effects of searcher and topic, and provides some information about interactions. It does not solve cross-site comparison problems.

The full design is built from many 2-searcher-by-2-topic (2x2) latin squares. The basic 2x2 latin square looks like this:


      T1   T2
S1    V1   V2
S2    V2   V1

It should be interpreted as follows: searcher S1 searches topic T1 using variant V1 and topic T2 using V2, while S2 does the reverse.
This design has the property that the "treatment effect", here the difference (V1-V2) in search performance between the two system variants as measured for example by average precision, can be estimated free and clear of the main (additive) effects of searcher and topic. Here, searcher and topic are treated statistically as blocking factors. This means that even in the presence of differences between searchers and topics, which clearly are anticipated, the design will provide estimates of V1-V2 that are not contaminated by these differences.

Here is a model equation for the latin square. For example, the performance of S1 using V1 on T1 can be modeled (ignoring for now any interactions) as:
    m + s1 + v1 + t1 + e
(where:  m is the grand mean of all performances, s1 is the effect of searcher 1, v1 is the effect of system variant 1,  t1 is the effect of topic 1, and e is "error" - the effect of everything else.)

The treatment effect (x), i.e., the difference between systems' performance, is estimated by the mean of the two V1-V2 differences, from which the main effects of topic and searcher fall out, leaving the system difference:
x  = ( [(m+s1+t1+v1+e)-(m+s1+t2+v2+e)] + [(m+s2+t2+v1+e)-(m+s2+t1+v2+e)] ) / 2
   = ( [ t1-t2+v1-v2] + [t2-t1+v1-v2] ) / 2
   = ( 2*v1 - 2*v2 ) / 2
   = v1 - v2
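The cancellation above can be checked numerically. The following sketch plugs hypothetical (arbitrary) effect sizes into the additive model and confirms that the latin-square estimate equals v1 - v2 no matter what the searcher and topic effects are (error term set to zero here):

```python
m = 0.5                     # grand mean
s1, s2 = 0.10, -0.04        # searcher effects (arbitrary)
t1, t2 = 0.20, -0.15        # topic effects (arbitrary)
v1, v2 = 0.08, 0.02         # system-variant effects (arbitrary)

# the four cells of the 2x2 square, per the additive model
p_s1t1 = m + s1 + t1 + v1   # S1 on T1 with V1
p_s1t2 = m + s1 + t2 + v2   # S1 on T2 with V2
p_s2t1 = m + s2 + t1 + v2   # S2 on T1 with V2
p_s2t2 = m + s2 + t2 + v1   # S2 on T2 with V1

# mean of the two V1-V2 differences, as in the derivation above
x = ((p_s1t1 - p_s1t2) + (p_s2t2 - p_s2t1)) / 2
print(round(x, 10) == round(v1 - v2, 10))   # True
```

Changing s1, s2, t1, t2 to any other values leaves the estimate unchanged, which is the "free and clear of blocking factors" property in action.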

For TRECVID 2003 we need to expand the design to cover all 24 topics by replicating the 2x2 square to create a 2x24 matrix (below). We permute the columns so that a searcher completes all work on one system variant before beginning any work on the other. For each searcher, a tutorial and practice search would precede the first search on each system. Because each search can take up to 15 minutes, we limit each searcher to half the topics, so the total maximum search time for any given searcher is 3 hours - already a long time.

The estimate of V1-V2 is contaminated to some extent by the presence of an interaction between topic and searcher. An interaction occurs when the effect of one factor is dependent on the level of another. Therefore, we replicate the 2x24 design in pairs of searchers to create an 8x24 design, so the contaminating effect of the topic-by-searcher interaction is reduced by averaging the multiple estimates of V1-V2 that are available, one for each 2x2 latin square. This is analogous to averaging replicate measurements of a single quantity in order to reduce the measurement uncertainty. (Full information on interactions is possible only if we allow participants to observe the same topic with both the control and experimental systems, i.e., by using a full factorial design, and this is not acceptable due to learning effects.)
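The variance-reduction argument can be illustrated with a small simulation. The interaction scale (0.05) and variant effects below are assumed values chosen for illustration; the point is only that averaging the estimates from four 2x2 squares gives a smaller typical error than relying on a single square:

```python
import random

random.seed(7)
V1, V2 = 0.08, 0.02                         # true variant effects (hypothetical)

def square_estimate():
    # st[i][j]: random searcher-by-topic interaction for searcher i, topic j
    st = [[random.gauss(0, 0.05) for _ in range(2)] for _ in range(2)]
    d1 = (V1 + st[0][0]) - (V2 + st[0][1])  # S1: V1 on T1 minus V2 on T2
    d2 = (V1 + st[1][1]) - (V2 + st[1][0])  # S2: V1 on T2 minus V2 on T1
    return (d1 + d2) / 2

# typical error of a single-square estimate vs. the average of 4 squares
err_one = [abs(square_estimate() - (V1 - V2)) for _ in range(2000)]
err_avg = [abs(sum(square_estimate() for _ in range(4)) / 4 - (V1 - V2))
           for _ in range(2000)]
print(sum(err_one) / 2000 > sum(err_avg) / 2000)   # True: averaging helps
```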



        T1-T6   T7-T12   T13-T18   T19-T24
S1      V1      -        V2        -
S2      V2      -        V1        -
S3      V1      -        -         V2
S4      V2      -        -         V1
S5      -       V1       V2        -
S6      -       V2       V1        -
S7      -       V1       -         V2
S8      -       V2       -         V1

(Each cell covers a block of six consecutive topics; "-" means the searcher does not search those topics. Each searcher thus searches 12 topics, six with each variant, and each topic is searched by four searchers, twice with each variant.)
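The 8-searcher-by-24-topic assignment, encoded here as blocks of six topics (as read from the table above), can be verified mechanically. This is a verification sketch, not part of the design itself; it checks that each searcher gets 12 distinct topics and that every topic is searched exactly twice with each variant:

```python
from collections import Counter

blocks = {1: range(1, 7), 2: range(7, 13), 3: range(13, 19), 4: range(19, 25)}
plan = {  # searcher -> {variant: topic blocks}
    "S1": {"V1": [1], "V2": [3]}, "S2": {"V1": [3], "V2": [1]},
    "S3": {"V1": [1], "V2": [4]}, "S4": {"V1": [4], "V2": [1]},
    "S5": {"V1": [2], "V2": [3]}, "S6": {"V1": [3], "V2": [2]},
    "S7": {"V1": [2], "V2": [4]}, "S8": {"V1": [4], "V2": [2]},
}

counts = Counter()
for searcher, by_variant in plan.items():
    topics = [t for bs in by_variant.values() for b in bs for t in blocks[b]]
    assert len(topics) == 12 == len(set(topics))  # 12 topics, none repeated
    for variant, bs in by_variant.items():
        for b in bs:
            counts.update((t, variant) for t in blocks[b])

# every topic is searched exactly twice with each variant
print(all(counts[(t, v)] == 2 for t in range(1, 25) for v in ("V1", "V2")))
```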

Searchers should be assigned randomly. The order of topic presentation for a given system and searcher can be randomized. The above design can be repeated with up to 2 additional sets of 8 searchers.
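A minimal sketch of the two randomization steps just described, assuming hypothetical participant IDs P1-P8: recruited participants are shuffled into the searcher slots S1-S8, and the topic order within one searcher/system block is shuffled.

```python
import random

rng = random.Random()                       # seed here if reproducibility is needed
participants = ["P%d" % i for i in range(1, 9)]   # hypothetical participant IDs
rng.shuffle(participants)
slots = dict(zip(["S%d" % i for i in range(1, 9)], participants))

# randomize topic presentation order within one searcher/system block,
# e.g. topics T13-T18 searched on one variant
block = list(range(13, 19))
rng.shuffle(block)
print(len(slots) == 8 and sorted(block) == list(range(13, 19)))   # True
```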

Design for measuring the effectiveness of 1 system

The following is a design for an interactive video retrieval experiment aimed at measuring the effectiveness of one system (V1) using 24 topics (Tn) and either 6, 12, 18, or 24 searchers (Sn), each of whom searches 12 topics (assuming at most 15 minutes per search). The more searchers, the better the balance of order-related biases. No searcher searches the same topic more than once.


        T1-T6   T7-T12   T13-T18   T19-T24
S1      V1      V1       -         -
S2      V1      -        V1        -
S3      V1      -        -         V1
S4      -       V1       V1        -
S5      -       V1       -         V1
S6      -       -        V1        V1

(Each cell covers a block of six consecutive topics; "-" means the searcher does not search those topics. Each searcher takes one of the six possible pairs of topic blocks, so each searcher searches 12 topics and each topic is searched by three searchers.)
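As with the two-system design, the single-system assignment can be checked mechanically. This verification sketch (not part of the design itself) encodes the 6x24 layout as blocks of six topics, with each searcher taking one of the six possible pairs of blocks, and confirms the coverage properties:

```python
from collections import Counter

blocks = {1: range(1, 7), 2: range(7, 13), 3: range(13, 19), 4: range(19, 25)}
plan = {"S1": [1, 2], "S2": [1, 3], "S3": [1, 4],   # searcher -> topic blocks
        "S4": [2, 3], "S5": [2, 4], "S6": [3, 4]}

per_topic = Counter()
for searcher, bs in plan.items():
    topics = [t for b in bs for t in blocks[b]]
    assert len(set(topics)) == 12        # 12 distinct topics per searcher
    per_topic.update(topics)

# every topic is searched exactly three times
print(all(per_topic[t] == 3 for t in range(1, 25)))   # True
```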

Searchers are assigned randomly. The order of topic presentation for a given searcher can be randomized. The above design can be repeated with up to 3 additional sets of 6 searchers. The more searchers, the better the balance of order-related biases.