CBCD Evaluation Plan TRECVID 2009 (v1)

Evaluation

First we define a version of precision and recall that can be used to measure the accuracy of locating a copied fragment within a video. Precision is defined as the percentage of the asserted copy that is indeed an actual copy and recall is defined as the percentage of the actual copy that is subsumed in the asserted copy. Now F1 is defined as the harmonic mean of precision and recall.
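The extent-based precision, recall and F1 defined above can be sketched as follows. This is a minimal illustration, not the official scoring tool: the function name extent_scores and the (start, end) tuple representation in seconds are assumptions made for the example, and values are returned as fractions rather than percentages.

```python
def extent_scores(asserted, actual):
    """Return (precision, recall, f1) for two (start, end) extents in seconds."""
    a_start, a_end = asserted
    r_start, r_end = actual
    # Length of the temporal intersection; zero when the extents are disjoint.
    overlap = max(0.0, min(a_end, r_end) - max(a_start, r_start))
    asserted_len = a_end - a_start
    actual_len = r_end - r_start
    # Precision: share of the asserted copy that is an actual copy.
    precision = overlap / asserted_len if asserted_len > 0 else 0.0
    # Recall: share of the actual copy covered by the asserted copy.
    recall = overlap / actual_len if actual_len > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Asserted copy 10-30 s, actual copy 20-40 s: 10 s of overlap,
# so precision = recall = F1 = 0.5.
p, r, f1 = extent_scores((10.0, 30.0), (20.0, 40.0))
```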

TRECVID 2009 CBCD systems will be evaluated on:

1. How many queries they find the reference data for or correctly tell us there is none to find (by submitting zero hits). The reference data has been found if and only if:

A. the asserted test video ID is correct and

B. no two query result items for a given video overlap, and

C. at least one submitted extent overlaps to some degree with the reference extent. In case multiple submitted extents overlap with the reference extent, ONE mapping of submitted extents to ref extents for each result set will be determined and one candidate submitted extent will be chosen based on a combination of the F1 (between submitted and ref extents) and the decision score for each item. This alignment will be performed using the Hungarian Solution to the Bipartite Graph matching problem by modeling event observations as nodes in the bipartite graph.

2. When a copy is detected, how accurately the run locates the reference data in the test data.

3. How much elapsed time is required for query processing.

Two differences from 2008’s evaluation are:

1. In 2009 we will require all participating groups to work on the video-only and the audio+video queries. Work on the audio-only queries will be optional. For each query type we will require participating groups to submit at least 2 runs which differ only in the application profile (and associated operating point) for which the run is optimized: “no false alarms” or “balanced”, as described in the guidelines.

2. In 2009 we will require all participating groups to provide a decision score threshold value believed to correspond to the best performance for the run, measured in terms of NDCR. As in 2008, the evaluation will calculate the optimal decision score threshold (the one giving the minimal NDCR) for each run for comparison with the one actually submitted.

The following measures will be used:

1: Actual and Minimal Normalized Detection Cost Rate and P_Miss-R_FA plot

The detection effectiveness will be measured for each individual transformation. For each run, all results of individual transformations will be concatenated in separate files and sorted by decision score. Subsequently, each concatenated file (corresponding to a single transformation across all queries from a given run) will be used to compute the probability of a miss error and the false alarm rate (P_Miss and R_FA) at different operating points, by truncating the list at a range of decision thresholds θ, sweeping from the minimum decision score to the maximum score. As a first step, asserted copies that overlap will be logged and removed from consideration. Secondly, the computation of true positives will be based on only one submitted extent per query (as defined by the mapping procedure outlined above). All other submitted extents for this query count as false alarms. This procedure yields a list of pairs of increasing P_Miss and decreasing R_FA values. These data points will be used to create a P_Miss versus R_FA error plot (DET curve) for a given run and transformation. The two error rates are then combined into a single detection cost rate, DCR, by assigning costs to miss and false alarm errors:

DCR = C_Miss * P_Miss * R_target + C_FA * R_FA

where

C_Miss and C_FA are the costs of a Miss and a False Alarm, respectively,

P_Miss and R_FA are the conditional probability of a missed copy and the false alarm rate, respectively;

P_Miss = FN / N_target, measured on the queries containing a copy (per transformation, micro-averaged),

R_FA = FP / T_queries, measured on the full set of queries, where T_queries is the total length (in hours) of all queries (per transformation),

R_target is the a priori target rate.

For this year, the parameters are defined as follows for the "no false alarms" profile:

R_target = 0.5/hr

C_Miss = 1

C_FA = 1000

For this year, the parameters are defined as follows for the "balanced" profile:

R_target = 0.5/hr

C_Miss = 1

C_FA = 1
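To make the effect of the two profiles concrete, the sketch below computes the error rates and the detection cost rate, assuming the cost takes the form DCR = C_Miss * P_Miss * R_target + C_FA * R_FA implied by the definitions above. The names Profile, error_rates and dcr, and the counts in the example, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    c_miss: float    # cost of a miss
    c_fa: float      # cost of a false alarm
    r_target: float  # a priori target rate, per hour


# Parameter values as listed above for the two application profiles.
NOFA = Profile(c_miss=1.0, c_fa=1000.0, r_target=0.5)
BALANCED = Profile(c_miss=1.0, c_fa=1.0, r_target=0.5)


def error_rates(fn, n_target, fp, t_queries_hours):
    """P_Miss = FN / N_target and R_FA = FP / T_queries (per hour)."""
    return fn / n_target, fp / t_queries_hours


def dcr(profile, p_miss, r_fa):
    """Detection cost rate (assumed form, see lead-in)."""
    return profile.c_miss * p_miss * profile.r_target + profile.c_fa * r_fa


# Hypothetical counts: 20 of 100 copies missed, 3 false alarms
# over 6 hours of query video.
p_miss, r_fa = error_rates(fn=20, n_target=100, fp=3, t_queries_hours=6.0)
dcr_nofa = dcr(NOFA, p_miss, r_fa)          # about 500.1
dcr_balanced = dcr(BALANCED, p_miss, r_fa)  # about 0.6
```

With these made-up counts the "no false alarms" cost is almost entirely the false alarm term, while in the "balanced" profile the two terms are of comparable size.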

There are two factors that the cost function separates: the prior for the likelihood of an event occurring and the application model that defines the relative importance of the two error types to the application. Using R_target in the formulas has the effect of equalizing the error types with respect to the priors. For instance in our case, the R_FA values will be low compared to P_Miss, because the prior is so low for a target. P_Miss * R_target reduces the influence of the misses. Once the error rates are equalized, the costs set the relative importance of the error types to the modeled application.

In order to compare the detection cost rate values across a range of parameter values (e.g. with future CBCD tasks), DCR is normalized as follows:

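NDCR = DCR / (C_Miss * R_target) = P_Miss + (C_FA / (C_Miss * R_target)) * R_FA

When we define β = C_FA / (C_Miss * R_target), this leads to:

NDCR = P_Miss + β * R_FA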

The minimal normalized detection cost rate (as a function of the decision threshold θ) and associated decision threshold will be computed for each transformation, for each run.

The actual normalized detection cost rate will also be computed for each transformation and run using the submitted decision threshold value for optimal performance (see below).
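The difference between the minimal and the actual NDCR can be illustrated with a small threshold sweep. This is only a sketch under stated assumptions: the normalized cost is taken as NDCR = P_Miss + β * R_FA, a result item is accepted when its score is greater than or equal to the threshold, and the correctness flag of each item is assumed to have been decided already by the mapping procedure. The function names and the example data are hypothetical.

```python
import math


def ndcr_at(items, theta, n_target, t_hours, beta):
    """NDCR when only items with decision score >= theta are accepted.

    items: list of (decision_score, is_correct) pairs for one transformation.
    """
    kept = [is_correct for score, is_correct in items if score >= theta]
    tp = sum(kept)
    fp = len(kept) - tp
    p_miss = (n_target - tp) / n_target
    r_fa = fp / t_hours
    return p_miss + beta * r_fa


def minimal_ndcr(items, n_target, t_hours, beta):
    """Sweep all distinct scores (plus 'accept nothing') and return
    (minimal NDCR, threshold at which it is reached)."""
    candidates = sorted({score for score, _ in items}) + [math.inf]
    return min((ndcr_at(items, th, n_target, t_hours, beta), th)
               for th in candidates)


# Hypothetical result list: 4 queries contain a copy, 2 hours of queries.
items = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.2, False)]
beta_balanced = 1.0 / (1.0 * 0.5)  # C_FA / (C_Miss * R_target) = 2

best_cost, best_theta = minimal_ndcr(items, n_target=4, t_hours=2.0,
                                     beta=beta_balanced)
# Actual NDCR at a submitted threshold of 0.5.
actual_cost = ndcr_at(items, 0.5, n_target=4, t_hours=2.0,
                      beta=beta_balanced)
```

In this made-up example the minimal NDCR is 0.5, reached at a threshold of 0.8, while the submitted threshold of 0.5 gives an actual NDCR of 1.25.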

2: Copy location accuracy

In the scenario for 2009 this separate measure is seen as secondary and diagnostic. It aims to assess the accuracy of finding the exact extent of the copy in the reference video, once the system has correctly detected a copy. The location accuracy will be measured for each individual transformation. The asserted and actual extents of the copy in the reference data will be compared using precision and recall, and these two numbers will be combined using the F1 measure. Recall and precision will be measured at the optimal operating point, where the normalized detection cost is minimal, and also at the submitted run threshold.

3: Copy detection processing time

Mean time (in seconds) to process a query. Processing time is defined as the full time required to process queries from mpg files to result file (including decoding, analysis, feature extraction, any writing/reading of intermediate results, any loading of reference subsets, and results output). Mean time is the full processing time divided by the number of queries.

Submission requirements

Since the evaluation plan is based on evaluating systems on a range of operating points, it is important that systems overgenerate, i.e. output multiple candidate copies for each query with associated decision scores. We strongly suggest that systems compute at least one copy candidate alignment result line for each query and most videos in the reference video dataset, unless the decision score for a certain query-video pair is zero. In these cases, no result line is necessary. Systems may output multiple asserted copies within one reference video. Be aware, though, that the false alarm rate is a heavy component in the evaluation. Note also that asserted copies may not overlap. Generating multiple asserted copies per reference video is therefore a matter of a recall-precision trade-off.

A run is the output of a system (with a given set of parameters, training, etc.) executed against all of the queries appropriate for the run type (video-only, audio-only, video+audio). A run will contain the following information in the following order, all in ASCII, one item per line unless otherwise indicated. The uppercase letter at the start of each line indicates the kind of data that follows.

I runId - an ASCII string of not more than 10 alphanumeric characters, chosen by the submitting group, identifying the run uniquely for the submitting group. Note that this ID DOES NOT identify the participating group. NIST will add a separate group identifier. For example, a group could simply use a digit ("1", "2", ...) or something more explanatory ("sampled", "full", ...).
P profile - an ASCII string indicating the application profile the system was tuned to, either "NOFA" or "BALANCED"
V threshold value for optimal performance - a string representing a real number (optionally with a decimal point (“.”) and digits to the right of the decimal point). Used to calculate the actual NDCR.
S name of the operating system(s) used
C model of cpu used
M amount of memory available
table of processing times - one line for each query, where the elapsed time to process the query and complete the search is a real number representing seconds (optionally with a decimal point (“.”) and digits to the right of it). Time spent analyzing the test collection video before query processing begins will not be included.

    T queryId elapsedQueryProcessingTimeInSeconds

table of result items - it will contain no items if no copy is found.
Each result item will occupy one line and will include the following, in ASCII, arranged from left to right, separated by one or more spaces:

R - indicates result item data follows
queryId - a string, assigned by NIST, denoting the number of the query
videoId - the file name of the reference video from which the found copy came (name formatted exactly as found in the test data, e.g., BG_12332.mpg, not "bg_12332.MPG" or "BG_12332" etc.)
firstRefFrameTimeCode - the time code in the reference video of the first frame of the found copy. In the case of v+a queries use the first frame of the video copy.
lastRefFrameTimeCode - the time code in the reference video of the last frame of the found copy. In the case of v+a queries use the last frame of the video copy.
decisionScore - a real number (optionally with a decimal point (“.”) and digits to the right of the decimal point), indicating the relative weight of evidence that the copy was found in a given query. The higher the number, the stronger the evidence. Systems must ensure their decision scores have the following two characteristics. First, the values must form a non-uniform density function so that the relative evidential strength between two putative copies is discernible. Second, the decision score function must be consistent across queries for a single system so that measures using decision scores from multiple queries are meaningful. Scores need NOT be consistent in meaning across systems.
firstQueryFrameTimeCode - the time code in the query of the first frame of the found copy. In the case of v+a queries use the first frame of the video copy.

Note 1: Time codes will be expressed using just digits (0-9) followed optionally by one decimal point (".") and additional digits, no other characters, and represent the number of elapsed seconds since the start of the reference or query video.
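As an illustration of the result item layout and the time code format in Note 1, the following sketch parses and validates a single result line. It is not the official checking script: the names TIME_CODE, ResultItem and parse_result_line are hypothetical, the query ID and values in the example line are made up, and the pattern assumes that a decimal point, when present, must be followed by at least one digit.

```python
import re
from typing import NamedTuple

# Note 1: digits, optionally followed by one decimal point and more digits.
TIME_CODE = re.compile(r"[0-9]+(\.[0-9]+)?")


class ResultItem(NamedTuple):
    query_id: str
    video_id: str
    first_ref: float    # firstRefFrameTimeCode, seconds
    last_ref: float     # lastRefFrameTimeCode, seconds
    score: float        # decisionScore
    first_query: float  # firstQueryFrameTimeCode, seconds


def parse_result_line(line):
    """Parse one 'R ...' line; raise ValueError if it is malformed."""
    fields = line.split()  # fields are separated by one or more spaces
    if len(fields) != 7 or fields[0] != "R":
        raise ValueError(f"not a result item: {line!r}")
    _, query_id, video_id, first_ref, last_ref, score, first_query = fields
    for code in (first_ref, last_ref, first_query):
        if not TIME_CODE.fullmatch(code):
            raise ValueError(f"bad time code: {code!r}")
    return ResultItem(query_id, video_id, float(first_ref),
                      float(last_ref), float(score), float(first_query))


# Made-up example line in the field order listed above.
item = parse_result_line("R 1001 BG_12332.mpg 12.5 47 0.83 3.0")
```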

Note 2: Within a given result set for a given query, no two result items may overlap (i.e., have the same videoId and overlapping temporal extents). All such overlapping result items will be removed from consideration.
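The overlap rule in Note 2 can be sketched as a simple filter over the result items of one query. Two points are assumptions here rather than statements of the plan: every item involved in an overlap is dropped (not just the lower-scoring one), and extents that share an end point count as overlapping because both time codes refer to frames inside the copy. The function name and example data are hypothetical.

```python
def remove_overlapping(items):
    """Drop every item that overlaps another item of the same video.

    items: list of (video_id, first_time_code, last_time_code) tuples
    for ONE query, with time codes in seconds.
    """
    def overlaps(a, b):
        # Same reference video and intersecting closed intervals.
        return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

    return [a for i, a in enumerate(items)
            if not any(overlaps(a, b)
                       for j, b in enumerate(items) if i != j)]


# Made-up result set: the first two items overlap in BG_1.mpg.
results = [
    ("BG_1.mpg", 0.0, 10.0),
    ("BG_1.mpg", 8.0, 20.0),
    ("BG_1.mpg", 30.0, 40.0),
    ("BG_2.mpg", 5.0, 15.0),
]
kept = remove_overlapping(results)
# kept holds only the third and fourth items.
```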

Submissions will be transmitted to NIST via a webpage on the schedule listed in the guidelines webpage. A Perl script, which looks for some common format errors, is available in the active participant's area. Any run which contains errors will be rejected by the submission webpage. Here is an initial brief sample of one system's baseline run submission for the video-only queries task: