Final CBCD Evaluation Plan TRECVID 2008 (v1.3)
Wessel Kraaij - TNO; Paul Over, Jon Fiscus - NIST; Alexis Joly – INRIA
June 3, 2008
As used here, a copy is a segment of video derived from another video, usually by means of various transformations such as addition, deletion, modification (of aspect, color, contrast, encoding, ...), camcording, etc. Detecting copies is important for copyright control, business intelligence and advertisement tracking, law enforcement investigations, etc. Content-based copy detection offers an alternative to watermarking. The TRECVID copy detection task will be carried out in collaboration with members of the IMEDIA team at INRIA and will build on work demonstrated at CIVR 2007.
The required system task will be as follows: given a test collection of videos and a set of about 2000 queries (video-only segments), determine for each query the place, if any, at which some part of the query occurs, with possible transformations, in the test collection, together with a decision score. The decision score is a numeric value indicating the weight of evidence for the presence of a copy; larger values indicate stronger evidence. The set of possible video transformations is based, to the extent possible, on actually occurring transformations and is documented in the query construction plan (http://www-nlpir.nist.gov/projects/tv2008/TrecVid2008CopyQueries.pdf).
Each query will be constructed using tools developed by IMEDIA to include some randomization at various decision points in the construction of the query set. For each query, the tools will take a segment from the test collection, optionally transform it, embed it in some video segment which does not occur in the test collection, and then finally apply one or more transformations to the entire query segment. Some queries may contain no test segment; others may be composed entirely of the test segment. Transformations to be used will be published as part of the guidelines after discussion among participants. Here is the current plan for query creation.
Videos often contain audio. Sometimes the original audio is retained in the copied material, sometimes it is replaced by a new soundtrack. Nevertheless, audio is an important and strong feature for some application scenarios of video copy detection. Since detection of untransformed audio copies is relatively easy, and the primary interest of the TV community is in video analysis, it was decided to model the required CD task with video-only queries. However, since audio is of importance for practical applications, there will be two additional optional tasks: a task using transformed audio-only queries and one using transformed audio+video queries.
The audio-only queries will be generated along the same lines as the video-only queries: a set of 201 base audio-only queries is transformed by several techniques that are intended to be typical of those that would occur in real reuse scenarios: (1) bandwidth limitation (2) other coding-related distortion (e.g. subband quantization noise) (3) variable mixing with unrelated audio content. The transformed queries will be downloadable from NIST.
The audio+video queries will consist of the aligned versions of transformed audio and video queries, i.e., they will be various combinations of transformed audio and transformed video from a given base audio+video query. In this way sites can study the effectiveness of their systems for individual audio and video transformations and their combinations. These queries will not be downloadable. Rather, NIST will provide a list of how to construct each audio+video test query so that, given the audio-only queries and the video-only queries, sites can use a tool such as ffmpeg to construct the audio+video queries.
The reference dataset consists of approximately 100 hours of Sound & Vision data that was used as training and test data for the TV 2007 search and HLF tasks plus another 100 hours of Sound & Vision data that will be used as test data for the TV 2008 search and HLF tasks. In total there are 438 reference video files.
For development data, participants of the copy detection task can use the MUSCLE-VCD-2007 data. This is the data that was used for the copy detection evaluation at CIVR 2007. Note that the evaluation for TV 2008 has a different set-up, so the suitability of this data for development is limited.
Note: Since the choice of recall and in particular precision as primary evaluation measures for copy detection introduces a dependency on the class distribution in the test set, it was decided to measure error rates instead. The detection cost function (which is also used, in a slightly different version, in the event detection task) will be used as a framework to measure the cost associated with using the copy detection system in a particular scenario, e.g. a scenario where most test videos are non-copies and the cost of missing a copy is higher than the cost associated with dealing with a false alarm. Another improvement in the final version of the evaluation plan is that more than one false alarm is possible within a reference video. In this more realistic setting, multiple result entries are possible for each query. Only one result entry per query can potentially lead to a true positive, though.
First we define versions of precision and recall that can be used to measure the accuracy of locating a copied fragment within a video. Precision is defined as the percentage of the asserted copy that is part of the actual copy, and recall is defined as the percentage of the actual copy that is subsumed in the asserted copy. F1 is then defined as the harmonic mean of precision and recall.
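As an illustration, these extent-based precision, recall, and F1 values might be computed as follows. This is a sketch only; the (start, end) interval representation and the function name are assumptions for illustration, not part of the plan:

```python
def extent_scores(asserted, actual):
    """Precision, recall, and F1 for two temporal extents.

    asserted, actual: (start, end) pairs in seconds.
    Precision: fraction of the asserted copy lying inside the actual copy.
    Recall: fraction of the actual copy covered by the asserted copy.
    F1: harmonic mean of the two (0.0 when the extents do not overlap).
    """
    a_start, a_end = asserted
    r_start, r_end = actual
    # Length of the intersection of the two extents (0 if disjoint).
    overlap = max(0.0, min(a_end, r_end) - max(a_start, r_start))
    precision = overlap / (a_end - a_start) if a_end > a_start else 0.0
    recall = overlap / (r_end - r_start) if r_end > r_start else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap > 0 else 0.0
    return precision, recall, f1
```

For example, an asserted extent (10, 20) against an actual extent (15, 25) overlaps for 5 seconds, giving precision 0.5, recall 0.5, and F1 0.5.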
TRECVID 2008 CBCD systems will be evaluated on:
1. How many queries they find the reference data for or correctly tell us there is none to find (by submitting zero hits). The reference data has been found if and only if:
A. the asserted test video ID is correct, and
B. no two result items for the query within a given video overlap, and
C. at least one submitted extent overlaps to some degree with the reference extent. If multiple submitted extents overlap with the reference extent, ONE mapping of submitted extents to reference extents per result set will be determined, and one candidate submitted extent will be chosen: the one with the largest F1 (between the submitted and reference extents). In case of tied F1 scores, the first in temporal sequence will be chosen.
2. When a copy is detected, how accurately the run locates the reference data in the test data.
3. How much elapsed time is required for query processing.
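The extent mapping in criterion 1.C above can be sketched as follows. The function name and the (start, end) tuple layout are illustrative assumptions, not the official scorer; ties on F1 are broken here by earliest start time, following "first in temporal sequence":

```python
def choose_candidate(submitted, reference):
    """Pick the one submitted extent to score against the reference extent.

    submitted: list of (start, end) extents asserted for the query.
    reference: (start, end) extent of the actual copy.
    Returns the overlapping extent with the largest F1, ties broken by
    earliest start time, or None if no submitted extent overlaps.
    """
    def f1(ext):
        s0, s1 = ext
        r0, r1 = reference
        overlap = max(0.0, min(s1, r1) - max(s0, r0))
        if overlap == 0:
            return 0.0
        precision = overlap / (s1 - s0)
        recall = overlap / (r1 - r0)
        return 2 * precision * recall / (precision + recall)

    overlapping = [e for e in submitted if f1(e) > 0]
    if not overlapping:
        return None  # criterion C fails: no hit for this query
    # Largest F1 wins; -start prefers the earlier extent on tied F1.
    return max(overlapping, key=lambda e: (f1(e), -e[0]))
```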
The following measures will be used:
1: Minimal Normalized Detection Cost Rate and PMiss-RFA plot
The detection effectiveness will be measured for each individual transformation. For each run, all results for an individual transformation will be concatenated in a separate file and sorted by decision score. Subsequently, each concatenated file (corresponding to a single transformation across all queries from a given run) will be used to compute the probability of a miss error and the false alarm rate (PMiss and RFA) at different operating points, by truncating the list at a range of decision thresholds θ, sweeping from the minimum decision score to the maximum. As a first step, asserted copies that overlap will be logged and removed from consideration. Secondly, the computation of true positives will be based on only one submitted extent per query (as defined by the mapping procedure outlined above); all other submitted extents for that query count as false alarms. This procedure yields a list of pairs of increasing PMiss and decreasing RFA values. These data points will be used to create a PMiss versus RFA error plot (DET curve) for a given run and transformation. The two error rates are then combined into a single detection cost rate, DCR, by assigning costs to miss and false alarm errors:

DCR = CMiss · PMiss · Rtarget + CFA · RFA

where
CMiss and CFA are the costs of a Miss and a False Alarm, respectively,
PMiss and RFA are the conditional probability of a missed copy and the false alarm rate respectively;
PMiss = FN / Ntarget, measured on the queries containing a copy (134 per transformation for the video-only queries) (micro-average),
RFA = FP / Tqueries, measured on the full set of queries, where Tqueries is the total duration of all queries (201 queries per transformation for the video-only queries),
Rtarget is the a priori target rate.
For this year, the parameters are defined as follows:
Rtarget = 0.5/hr ,
CMiss = 10
CFA = 1
There are two factors that the cost function separates: the prior likelihood of an event occurring, and the application model that defines the relative importance of the two error types to the application. Using Rtarget in the formula has the effect of equalizing the error types with respect to the priors. In our case, for instance, the RFA values will be low compared to PMiss because the prior for a target is so low; multiplying PMiss by Rtarget reduces the influence of the misses accordingly. Once the error rates are equalized, the costs set the relative importance of the error types to the modeled application.
In order to compare detection cost rate values across a range of parameter values (e.g. with future CBCD tasks), DCR is normalized by the cost of the trivial system that rejects everything (CMiss · Rtarget):

DCRnorm = DCR / (CMiss · Rtarget) = PMiss + (CFA / (CMiss · Rtarget)) · RFA
The minimal normalized detection cost rate (as a function of the decision threshold θ) and associated decision threshold will be computed for each transformation, for each run.
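The threshold sweep described above can be sketched as follows. The input layout, a list of (decision score, is-true-positive) pairs for one transformation, and the function name are illustrative assumptions, not the official scoring interface:

```python
def min_ndcr(results, n_target, total_hours,
             c_miss=10.0, c_fa=1.0, r_target=0.5):
    """Minimal normalized detection cost rate over all thresholds.

    results: (decision_score, is_true_positive) pairs, one per result item.
    n_target: number of queries that actually contain a copy.
    total_hours: total duration of all queries, in hours (RFA is FA/hour).
    Normalized cost: NDCR = PMiss + (c_fa / (c_miss * r_target)) * RFA.
    """
    beta = c_fa / (c_miss * r_target)
    # Sweep the threshold from above the maximum score downwards, i.e.
    # accept results one at a time in decreasing score order.
    results = sorted(results, key=lambda r: r[0], reverse=True)
    hits, false_alarms = 0, 0
    best = 1.0  # rejecting everything gives PMiss = 1, RFA = 0, NDCR = 1
    for _score, is_tp in results:
        if is_tp:
            hits += 1
        else:
            false_alarms += 1
        p_miss = (n_target - hits) / n_target
        r_fa = false_alarms / total_hours
        best = min(best, p_miss + beta * r_fa)
    return best
```

With the 2008 parameters (CMiss = 10, CFA = 1, Rtarget = 0.5/hr), the weight on RFA is 1 / (10 · 0.5) = 0.2 per false alarm per hour.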
2: Copy location accuracy
In the scenario for 2008 this separate measure is seen as a secondary, diagnostic measure. It aims to assess the accuracy of locating the exact extent of the copy in the reference video once the system has correctly detected a copy. The location accuracy will be measured for each individual transformation. The asserted and actual extents of the copy in the reference data will be compared using precision and recall, and these two numbers will be combined using the F1 measure. Recall and precision will be measured at the optimal operating point, where the normalized detection cost is minimal.
3: Copy detection processing time
Mean time (in seconds) to process a query. Processing time is defined as the full time required to process queries from mpg files to result file, including decoding, analysis, feature extraction, any writing/reading of intermediate results, any loading of reference subsets, and results output. Mean time is the full processing time divided by the number of queries.
Since the evaluation plan is based on evaluating systems at a range of operating points, it is important that systems overgenerate, i.e. output multiple candidate copies for each query with associated decision scores. We strongly suggest that systems produce at least one candidate copy alignment result line for each query against most videos in the reference dataset, unless the decision score for a given query-video pair is zero, in which case no result line is necessary. Multiple asserted copies within one reference video are allowed. Be aware, though, that the false alarm rate is a heavy component in the evaluation, and note that asserted copies may not overlap. Generating multiple asserted copies per reference video is therefore a matter of a recall-precision trade-off.
A run is the output of a system (with a given set of parameters, training, etc.) executed against all of the queries appropriate for the run type (video-only, audio-only, audio+video). A run will contain the following information in the following order, all in ASCII, one item per line unless otherwise indicated. The uppercase letter at the start of each line indicates the kind of data that follows. The maximum number of runs per type is 3.
T queryId elapsedQueryProcessingTimeInSeconds
Note 1: Time codes will be expressed using just digits (0-9) and one decimal point ("."), no other characters, and represent the number of elapsed seconds since the start of the reference or query video.
Note 2: Within a given result set for a given query, no two result items may overlap (i.e., have the same videoId and overlapping temporal extents). All such overlapping result items will be removed from consideration.
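A validity check along the lines of Note 2 might look like this. The (videoId, start, end) tuple layout is a hypothetical simplification of the result-line format; note that, per the plan, all members of an overlapping pair are dropped, not just one:

```python
def remove_overlapping(items):
    """Drop result items that overlap another item for the same video.

    items: list of (video_id, start, end) tuples for one query's result set.
    Two items conflict when they share a video_id and their temporal
    extents intersect; every item involved in a conflict is removed.
    """
    bad = set()
    for i, (vid_i, s_i, e_i) in enumerate(items):
        for j, (vid_j, s_j, e_j) in enumerate(items):
            # Strict inequality: extents that merely touch do not overlap.
            if i < j and vid_i == vid_j and min(e_i, e_j) > max(s_i, s_j):
                bad.update((i, j))
    return [item for k, item in enumerate(items) if k not in bad]
```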
Submissions will be transmitted to NIST via this webpage starting on July 30. Participants may experiment with the submission form until July 28; on July 29 all submissions will be deleted. Use your active participant's userid/password to access it. A Perl script, which looks for some common format errors, is available in the active participant's area. Any run which contains errors will be rejected by the submission webpage. Here is an initial brief sample of one system's baseline run submission for the video-only queries task.
Modifications to the submission requirements as stated in the previous version of the guidelines:
1. No optimal threshold will be submitted
2. Runs will not be submitted in high-recall / high-precision pairs.
3. No use of “NONE” to indicate no copy exists
4. Max number of runs will be 3 per query type (where types = video-only, audio-only, audio+video)
5. Multiple result lines per reference video allowed as long as temporal extents do not overlap