CBCD
Evaluation Plan TRECVID 2009 (v1)
First we define a version of precision and recall that can be used to
measure the accuracy of locating a copied fragment within a video. Precision is
defined as the percentage of the asserted copy that is indeed an actual copy,
and recall is defined as the percentage of the actual copy that is subsumed in
the asserted copy. F1 is then defined as the harmonic mean of precision and
recall.
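Purely as an illustration (not part of the official scoring tools), these three quantities can be computed from a single asserted extent and the corresponding reference extent as in the following sketch, where extents are hypothetical (start, end) pairs in seconds:

def locate_f1(asserted, reference):
    """Precision, recall and F1 for one asserted copy extent vs. the reference extent.

    asserted, reference: (start_sec, end_sec) tuples."""
    a_start, a_end = asserted
    r_start, r_end = reference
    overlap = max(0.0, min(a_end, r_end) - max(a_start, r_start))
    asserted_len = a_end - a_start
    reference_len = r_end - r_start
    precision = overlap / asserted_len if asserted_len > 0 else 0.0
    recall = overlap / reference_len if reference_len > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

For example, if the asserted copy is (10.0, 30.0) and the actual copy is (15.0, 40.0), the overlap is 15 seconds, so precision = 15/20 = 0.75, recall = 15/25 = 0.6, and F1 ≈ 0.67.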
TRECVID 2009 CBCD systems will be evaluated on:
1. How many queries they find the reference data for, or correctly tell
us there is none to find (by submitting zero hits). The reference data has been
found if and only if:
A. the asserted test video ID is correct,
B. no two result items for a given query and video overlap, and
C. at least one submitted extent overlaps to some degree with the reference
extent. In case multiple submitted extents overlap with the reference extent,
ONE mapping of submitted extents to reference extents will be determined for
each result set, and one candidate submitted extent will be chosen based on a
combination of the F1 (between submitted and reference extents) and the
decision score for each item. This alignment will be performed using the
Hungarian solution to the bipartite graph matching problem, with the submitted
and reference extents modeled as the nodes of the bipartite graph (a sketch of
this matching step is given after this list).
2. When a copy is detected, how accurately the run locates the reference
data in the test data.
3. How much elapsed time is required for query processing.
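As an illustration of the matching step in item 1.C (a sketch only, not NIST's actual implementation): assuming the F1 between extents is used as the matching weight, with the decision score left out for simplicity, the one-to-one assignment can be found with a standard Hungarian solver such as scipy.optimize.linear_sum_assignment:

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_extents(submitted, references, f1_fn):
    """One-to-one assignment of submitted extents to reference extents maximizing total F1.

    submitted, references: lists of (start_sec, end_sec) tuples;
    f1_fn: a function returning (precision, recall, f1), e.g. the locate_f1 sketch above."""
    if not submitted or not references:
        return []
    # Weight matrix of F1 values between every submitted/reference pair.
    weights = np.array([[f1_fn(s, r)[2] for r in references] for s in submitted])
    # linear_sum_assignment minimizes total cost, so negate the weights to maximize F1.
    rows, cols = linear_sum_assignment(-weights)
    # Keep only pairs that actually overlap (F1 > 0).
    return [(int(i), int(j)) for i, j in zip(rows, cols) if weights[i, j] > 0]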
Two differences from 2008’s evaluation are:
1. In 2009 we will require all participating groups to work on both the video-only and the audio+video queries. Work on the audio-only queries will be optional. For each query type we will require participating groups to submit at least 2 runs which differ only in the application profile (and associated operating point) for which the run is optimized: “no false alarms” or “balanced”, as described in the guidelines.
2. In 2009 we will require all participating groups to provide a decision
score threshold value believed to correspond to the best performance for the
run, measured in terms of NDCR. As in 2008, the evaluation will calculate the
minimal (best possible) decision score threshold for each run for comparison
with the one actually submitted.
The following measures will be used:
1: Actual and Minimal Normalized Detection Cost Rate and PMiss-RFA plot
The detection effectiveness will be measured for each individual
transformation. For each run, all results of individual transformations will be
concatenated in separate files and sorted by decision score. Subsequently, each
concatenated file (corresponding to a single transformation across all queries
from a given run) will be used to compute the probability of a miss error and
the false alarm rate (PMiss and RFA) at different operating points, by
truncating the list at a range of decision thresholds θ, sweeping from the
minimum decision score to the maximum score. As a first step, asserted copies
that overlap will be logged and removed from consideration. Secondly, the
computation of true positives will be based on only one submitted extent per
query (as defined by the mapping procedure outlined above). All other submitted
extents for this query count as false alarms. This procedure yields a list of
pairs of increasing PMiss and decreasing RFA values.
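Purely as an illustration of this threshold sweep (a simplified sketch, not the official scoring code), assuming the overlap filtering and extent mapping described above have already been applied, so that each query contributes at most one true positive:

def sweep_operating_points(items, n_target_queries, total_query_hours):
    """Compute (PMiss, RFA) operating points by sweeping the decision threshold.

    items: (decision_score, is_true_positive) pairs for one transformation,
           after overlap removal and extent mapping (at most one true positive per query).
    n_target_queries: number of queries that actually contain a copy.
    total_query_hours: total duration of all queries, in hours."""
    points = []
    for theta in sorted({score for score, _ in items}):
        kept = [is_tp for score, is_tp in items if score >= theta]
        true_pos = sum(1 for is_tp in kept if is_tp)
        false_pos = len(kept) - true_pos
        p_miss = (n_target_queries - true_pos) / n_target_queries   # FN / Ntarget
        r_fa = false_pos / total_query_hours                        # FP / Tqueries
        points.append((theta, p_miss, r_fa))
    return points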
These (PMiss, RFA) data points will be used to create a PMiss versus RFA error
plot (DET curve) for a given run and transformation. The two error rates are
then combined into a single detection cost rate, DCR, by assigning costs to
miss and false alarm errors:
DCR = CMiss · PMiss · Rtarget + CFA · RFA

where
CMiss and CFA are the costs of a miss and a false alarm, respectively;
PMiss and RFA are the conditional probability of a missed copy and the false alarm rate, respectively;
PMiss = FN / Ntarget, measured on the queries containing a copy (per transformation) (micro-averaged);
RFA = FP / Tqueries, measured on the full set of queries, where Tqueries is the total length (in hours) of all queries (per transformation);
Rtarget is the a priori target rate.
For this year, the parameters are defined as follows for the "no false alarm" profile:
Rtarget = 0.5/hr, CMiss = 1, CFA = 1000
For this year, the parameters are defined as follows for the "balanced" profile:
Rtarget = 0.5/hr, CMiss = 1, CFA = 1
The cost function separates two factors: the prior likelihood of an event
occurring, and the application model that defines the relative importance of
the two error types to the application. Using Rtarget in the formula has the
effect of equalizing the error types with respect to the priors. For instance,
in our case the RFA values will be low compared to PMiss because the prior for
a target is so low; multiplying PMiss by Rtarget reduces the influence of the
misses. Once the error rates are equalized, the costs set the relative
importance of the error types to the modeled application.
In order to compare the detection cost rate values across a range of parameter
values (e.g. with future CBCD tasks), DCR is normalized as follows:

NDCR = DCR / (CMiss · Rtarget)

When we define β = CFA / (CMiss · Rtarget), this leads to:

NDCR = PMiss + β · RFA
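As a purely illustrative, hypothetical example: suppose that for some transformation a run reaches PMiss = 0.2 and RFA = 0.05 false alarms per hour at a given threshold. Under the "no false alarm" profile, β = 1000 / (1 · 0.5) = 2000, so NDCR = 0.2 + 2000 · 0.05 = 100.2; under the "balanced" profile, β = 1 / (1 · 0.5) = 2, so NDCR = 0.2 + 2 · 0.05 = 0.3. This shows how heavily the "no false alarm" profile penalizes even a small false alarm rate relative to misses.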
The minimal normalized
detection cost rate (as a function of the decision threshold θ) and associated decision
threshold will be computed for each transformation, for each run.
The actual normalized
detection cost rate will also be computed for each transformation and run using
the submitted decision threshold value for optimal performance (see below).
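For illustration only (the official tooling may differ in how it selects the actual operating point), the minimal and actual NDCR can be obtained from the operating points of the sweep sketched earlier:

def minimal_and_actual_ndcr(points, beta, submitted_theta):
    """Minimal and actual normalized detection cost rate from swept operating points.

    points: (theta, p_miss, r_fa) triples, e.g. from sweep_operating_points above;
    beta:   CFA / (CMiss * Rtarget) for the chosen profile;
    submitted_theta: the decision threshold submitted with the run."""
    ndcr = [(theta, p_miss + beta * r_fa) for theta, p_miss, r_fa in points]
    minimal = min(ndcr, key=lambda t: t[1])
    # Actual NDCR: here taken at the swept threshold closest to the submitted one
    # (an assumption made for this sketch).
    actual = min(ndcr, key=lambda t: abs(t[0] - submitted_theta))
    return minimal, actual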
2: Copy location accuracy
In the scenario for 2009 this separate measure is seen as secondary and
diagnostic. It aims to assess the accuracy of finding the exact extent of the
copy in the reference video, once the system has correctly detected a copy.
The location accuracy will be measured for each individual transformation. The
asserted and actual extents of the copy in the reference data will be compared
using precision and recall, and these two numbers will be combined using the
F1 measure. Recall and precision will be measured at the optimal operating
point, where the normalized detection cost is minimal, and also at the
submitted run threshold.
3: Copy detection processing time
Mean time (in seconds) to process a query. Processing time is defined as the
full time required to process queries from mpg files to result file (including
decoding, analysis, feature extraction, any writing/reading of intermediate
results, any loading of reference subsets, and results output). Mean time is
the full processing time normalized by the number of queries.
Since the evaluation plan is based on evaluating systems over a range of operating points, it is important that systems overgenerate, i.e., output multiple candidate copies for each query with associated decision scores. We strongly suggest that systems compute at least one copy candidate alignment result line for each query and most videos in the reference video dataset, unless the decision score for a given query-video pair is zero; in that case, no result line is necessary. It is allowed to output multiple asserted copies within one reference video. Be aware, though, that the false alarm rate is a heavy component in the evaluation. Note also that asserted copies may not overlap. Generating multiple asserted copies per reference video is therefore a matter of a recall-precision trade-off.
A run is the output of a system (with a given set of parameters, training, etc) executed against all of the queries appropriate for the run type (video-only, audio-only, video+audio). A run will contain the following information in the following order, all in ASCII, one item per line unless otherwise indicated. The uppercase letter at the start of each line indicates the kind of data that follows.
T queryId elapsedQueryProcessingTimeInSeconds
Note 1: Time codes will be expressed using just digits (0-9) followed optionally by a decimal point (".") and additional digits, no other characters, and represent the number of elapsed seconds since the start of the reference or query video.
Note 2: Within a given result set for a given query, no two result items may overlap (i.e., have the same videoId and overlapping temporal extents). All such overlapping result items will be removed from consideration.
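As an illustration of Note 2 (a sketch only, not the official format checker), the following flags overlapping result items within the same query and videoId; it assumes each result item has been parsed into a dict with hypothetical keys 'query_id', 'video_id', 'ref_start' and 'ref_end', and it checks only the reference extents:

from collections import defaultdict

def find_overlaps(result_items):
    """Return pairs of result items that overlap within the same (queryId, videoId).

    result_items: dicts with keys 'query_id', 'video_id', 'ref_start', 'ref_end'
    (reference-extent boundaries in seconds)."""
    groups = defaultdict(list)
    for item in result_items:
        groups[(item['query_id'], item['video_id'])].append(item)
    overlaps = []
    for items in groups.values():
        items.sort(key=lambda it: it['ref_start'])
        for prev, cur in zip(items, items[1:]):
            if cur['ref_start'] < prev['ref_end']:
                overlaps.append((prev, cur))
    return overlaps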
Submissions will be transmitted to NIST via a webpage on the schedule listed in the guidelines webpage. A Perl script, which looks for some common format errors, is available in the active participant's area. Any run which contains errors will be rejected by the submission webpage. Here is an initial brief sample of one system's baseline run submission for the video-only queries task: