TRECVID 2013 Guidelines

Guidelines for TRECVID 2013

(last updated: )

0. Table of Contents:

Introduction
Video data
Data use agreements handled by NIST
Task - Semantic indexing (SIN)
Task - Interactive surveillance event detection (SED)
Task - Instance search (INS)
Task - Multimedia event detection (MED)
Task - Multimedia event recounting (MER)
Schedule
Outstanding 2013 guideline work items
Workshop notebook
Contacts, email lists

1. Introduction:

The main goal of the TREC Video Retrieval Evaluation (TRECVID) is to promote progress in content-based analysis of and retrieval from digital video via open, metrics-based evaluation. TRECVID is a laboratory-style evaluation that attempts to model real world situations or significant component tasks involved in such situations.

Up until 2010, TRECVID used test data from a small number of known professional sources - broadcast news organizations, TV program producers, and surveillance systems - that imposed limits on program style, content, production qualities, language, etc. In 2003 - 2006 TRECVID supported experiments in automatic segmentation, indexing, and content-based retrieval of digital video using broadcast news in English, Arabic, and Chinese. TRECVID also completed two years of pilot studies on exploitation of unedited video rushes provided by the BBC. In 2007 - 2009 TRECVID provided participants with cultural, news magazine, documentary, and education programming supplied by the Netherlands Institute for Sound and Vision. Tasks using this video included segmentation, search, feature extraction, and copy detection. Systems were tested in rushes video summarization using the BBC rushes. Surveillance event detection was evaluated using airport surveillance video provided by the UK Home Office. Many resources created by NIST and the TRECVID community are available for continued research on this data independent of TRECVID. See the Past data section of the TRECVID website for pointers.

In 2010 TRECVID confronted known-item search and semantic indexing systems with a new set of Internet videos (referred to in what follows as IACC) characterized by a high degree of diversity in creator, content, style, production qualities, original collection device/encoding, language, etc - as is common in much "Web video". The collection also has associated keywords and descriptions provided by the video donor. The videos are available under Creative Commons licenses from the Internet Archive. The only selection criteria imposed by TRECVID beyond the Creative Commons licensing is one of video duration - they are short (less than 6 min). In addition to the IACC data set, NIST began developing an Internet multimedia test collection (HAVIC) with the Linguistic Data Consortium and used it in growing amounts (up to 4000 h) in TRECVID 2010-present. The airport surveillance video, introduced in TRECVID 2009, has been reused each year since.

New in 2013 will be video provided by the BBC. Programming from their long-running EastEnders series will be used in the instance search task. An additional 600 h of Internet Archive video available under Creative Commons licensing for research (IACC.2) will be used for the semantic indexing task.

In TRECVID 2013 NIST will evaluate systems on the following tasks using the [data] indicated:

Semantic indexing [IACC]
Interactive surveillance event detection [i-LIDS]
Instance search [BBC EastEnders]
Multimedia event detection [HAVIC]
Multimedia event recounting [HAVIC]

2. Video data:

A number of datasets are available for use in TRECVID 2013 and are described below.

Once you know which tasks you will be participating in, you can determine which data sets you need.
Then for each needed dataset, see below for information on how you get permission to use the data and how it will be distributed..

IACC.2.A-C

Three datasets (A,B,C) - totaling approximately 7300 Internet Archive videos (144 GB, 600 h) with Creative Commons licenses in MPEG-4/H.264 format with duration ranging from 10 s to 6.4 min and a mean duration of almost 5 min. Most videos will have some metadata provided by the donor available e.g., title, keywords, and description

Data use agreements and Distribution: Download for active participants from NIST/mirror servers. See Data use agreements

Master shot reference: Will be available to active participants by download from the active participant's area of the TRECVID website.

Master I-Frames: Will be extracted by NIST using ffmpeg for all videos in the IACC.2.A collection and made available to active participants by download from the active participant's area of the TRECVID website.

Automatic speech recognition (for English): We hope to be able to provide this from the same source as in previous years for the IACC.1 collection.

IACC.1.A-C

Three datasets (A,B,C) - totaling approximately 8000 Internet Archive videos (160 GB, 600 h) with Creative Commons licenses in MPEG-4/H.264 format with duration between 10s and 3.5 min. Most videos will have some metadata provided by the donor available e.g., title, keywords, and description

Data use agreements and Distribution: Available by download from the Internet Archive. See TRECVID Past Data page. Or download from the copy on the Dublin City University server, but use the collection.xml files (see TRECVID past data page) for instructions on how to check the current availability of each file.

Master shot reference: Available by download from the TRECVID Past Data page

Automatic speech recognition (for English): Available by download from the TRECVID Past Data page

IACC.1.tv10.training

Approximately 3200 Internet Archive videos (50 GB<, 200 h) with Creative Commons licenses in MPEG-4/H.264 format with durations between 3.6 and 4.1 min Most videos will have some metadata provided by the donor available e.g., title, keywords, and description

Data use agreements and Distribution: Available by download from the Internet Archive. See TRECVID Past Data page.

Master shot reference: Available by download from the TRECVID Past Data page

Common feature annotation: Available by download from the TRECVID Past Data page

Automatic speech recognition (for English): Available by download from the TRECVID Past Data page

Gatwick and i-LIDS MCT airport surveillance video

The data consist of about 150 h of airport surveillance video data (courtesy of the UK Home Office). The Linguistic Data Consortium has provided event annotations for the entire corpus. The corpus was divided into development and evaluation subsets. Annotations for 2008 development and test sets are available.

Data use agreements and Distribution:

Gatwick development data (2008 DevSet and 2008 EvalSet) by download from password-protected servers at NIST and mirror sites. See Data use agreements
2009 i-LIDS test data from United Kingdom's Centre for Applied Science and Technology (CAST) can be downloaded from NIST but only after CAST has received the required information and issued a userid/password. See here for details.

Development data annotations: available by download.

BBC EastEnders

Approximately 244 video files (totally 300 GB, 464 h) with associated metadata, each containing a week's worth of BBC EastEnders programs in MPEG-4/H.264 format.

Data use agreements and Distribution: Download and fill out the data permission agreement from the active participants' area of the TRECVID website. After the agreement has been processed by NIST and the BBC, the applicant will be contacted by Dublin City Univerisity with instructions on how to download from their servers. See Data use agreements

Master shot reference: Will be available to active participants by download from TRECVID 2013 active participant's area.

Automatic speech recognition (for English): Will be available to active participants by download from Dublin City University.

HAVIC

HAVIC is a large collection of Internet multimedia constructed by the Linguistic Data Consortium and NIST. Participants will receive training corpora, event training resources, and two development test collections. Participants will also receive the ~4,000 h MED Progress evaluation collection which was used for MED '12 and will continue to be used through MED '15 as the test collection.

Data use agreements and Distribution: Data licensing and distribution will be handled by the Linguistic Data Consortium. See the Multimedia Event Detection task webpage for details.

3. Data use agreements handled by NIST (Gatwick (2008), IACC.2, BBC EastEnders)

In order to be eligible to receive the data, you must have have applied for participation in TRECVID. Your application will be acknowledged by NIST with a team ID, and active participant's password, and information about how to obtain the data.

If you will be using i-LIDS (2009) or HAVIC data, NIST will NOT be handling the data use agreements. See the "Data Use Agreements and Distribution" section for i-LIDS or HAVIC.

If you will be using IACC.1 video, the data use agreements are available from the "Past data" webpage. You will be downloading the data from the Internet Archive. See the "Data Use Agreements and Distribution" section for IACC.1

If you will be using Gatwick(2008), IACC.2, or BBC EastEnders, data you will need to complete the relevant permission forms (from the active participant's area) and email the scanned page images for each form as one Adobe Acrobat pdf of the document to In your email include the following:
```
As Subject: "TRECVID data request"
In the body: your name
             your short team ID (given when you applied to participate)
             the kinds of data you will be using - one or more of the following:
	     Gatwick (2008), IACC.2, and/or BBC EastEnders 
```
You will receive instructions on how to download the data.

Please ask only for the test data (and optional development data) required for the task(s) you apply to participate in and intend to complete.

Requests are handled in the order they are received. Please allow 3 business days for NIST to respond to your request with the access codes you need to download the data using the information about data servers in the the active participant's area.

4. Semantic indexing:

Task coordinator: Georges Quénot

This task will be coordinated by Georges Quénot - with Han Dong from the Laboratoire d'Informatique de Grenoble and Stéphane Ayache from the Laboratoire d'Informatique Fondamentale de Marseille using support from the Quaero Programme and in collaboration with NIST. Cees Snoek and Xirong Li from University of Amsterdam will participate in the selection of the concept pairs for the paired-concept task.

Automatic assignment of semantic tags representing visual or multimodal concepts (previously "high-level features") to video segments can be fundamental technology for filtering, categorization, browsing, search, and other video exploitation. New technical issues to be addressed include methods needed/possible as collection size and diversity increase, when the number of concepts increases, and when concepts are related by an ontology. In 2013 the task will again support experiments in two areas introduced in 2012:

A no annotation condition for the main task: the idea is to promote the development of methods that permit the indexing of concepts in video shots using only data from the Web or archives without the need of additional annotations. The training data could for instance consist of images retrieved by a general purpose search engine (e.g. Google) using only the concept name and/or definition with only automatic processing of the returned images. This will not be implemented as a new variant of the task but by using additional categories for the training types besides the A to D ones (see below). By "no annotation", we mean here that no annotation should be manually done on the retrieved samples (either images or videos). Any annotation done by somebody else prior to the general search does not count. Methods developed in this context could be used for building indexing tools for any concept starting only from a name and a definition for it or from a simple query defined for it.
A concept pair task will be available in addition to the main one. It will consist in the detection of pairs of unrelated concepts instead of the detection of simple concepts. Note that concepts must both appear in at least one frame of a shot in order for the shot to be considered as containing the pair. The idea is to promote the development of methods for retrieving shots containing a combination of concepts that do better than just combining the output of individual concept detectors. To this end, each participating group will be required to submit a baseline run, which just combines for each pair the output of group's two independent single-concept detectors. It is a goal of the task to see if this baseline can be surpassed. The set of concept pairs are as listed below.
NEW! In addition, as an experiment, there will be an optional system output for each shot returned, which holds the component concept number (listed below) of the concept which appears first in the shot; a value of 0 for this output will indicate that the two concepts first appear simultaneously. Note that this does not change the rule that both concepts must occur together in at least one frame of the shot. The submission format will be updated by 1. May to reflect this addition. Results for this optional output will be returned and reported at the workshop, but will not affect the primary measures for the task.
```
   [911]  Telephones (117) + Girl (54)
   [912]  Kitchen (72) + Boy (16)
   [913]  Flags (261) + Boat_Ship (15)
   [914]  Boat_Ship (15) + Bridges (17)
   [915]  Quadruped (392) + Hand (59)
   [916]  Motorcycle (80) + Bus (19)
   [917]  Chair (25) + George_[W_]Bush(274)
   [918]  Flowers (53) + Animal (6)
   [919]  Explosion_Fire (49) + Dancing (38)
   [920]  Government-Leader (56) + Flags (261)
```

In addition, the semantic indexing task will introduce two new opportunities for experimentation:

An optional localization subtask will challenge systems participating in the main task to make their concept detection more precise in time and space. Currently systems are accurate to the level of the shot. In the localization subtask, systems will be asked to determine the presence of the concept temporally within the shot, i.e., with respect to a subset of the frames comprised by the shot, and, spatially, for each such frame that contains the concept, to a bounding rectangle. The localization will be restricted to 10 concepts from those chosen used in the main task.

Measurement of system progress for a fixed set of concepts and independent of the test data, across 3 years (2013-2015), will be supported as an option. In 2013 researchers can deposit at NIST the 3 sets of results from their 2013 system, run and submitted independently on just IACC.2.A, on just IACC.2.B, and on just IACC.2.C. In 2014 they can deposit the 2 sets of results from their 2014 system, run and submitted independently on just IACC.2.B and just IACC.2.C. And in 2015 they can deposit results from their 2015 system, run and submitted independently on and just IACC.2.C. The evaluation at NIST in 2014 can then compare performance of the 2013 and 2014 systems for a selected set of concepts and on the same IACC.2.B data. Similarly in 2015 for comparison of the 2013, 2014, and 2015 systems on the IACC.2.C video. Compete with yourself!

System tasks:

Main: Given the test collection (IACC.2.A), master shot reference, and single concept definitions, return for each target concept a list of at most 2000 shot IDs from the test collection ranked according to their likelihood of containing the target.

Localization subtask: For each concept from the list of 10 designated for localization, for each shot of the top-ranked 1000 returned in a main task run, for each I-Frame within the shot that contains the target, return the x,y coordinates of the upper left and lower right vertices of a bounding rectangle which contains all of the target concept and as little more as possible. Systems may find more than one instance of a concept per I-Frame and then may include more than one bounding box for that I-Frame, but only one will be used in the judging since the ground truth will contain only 1 per judged I-Frame, one chosen by the NIST assessor, at least in this first round.

Progress: The progress task is just the main task run additionally, independently on the progress data sets for 2013: just IACC.2.B and just IACC.2.C. (No localization)

Paired: Given the test collection (IACC.2.A), master shot reference, and concept-pair definitions, return for each target concept pair a list of at most 2000 shot IDs from the test collection ranked according to their likelihood of containing the target. In 2013 each participant in the paired concept task must submit a baseline run which just combines for each pair the output of group's two independent single-concept detectors. Optionally, participants may submit information for all pairs, indicating the temporal sequence in which the two concepts occur. Note this does not change the requirement that both concepts must occur within at least one frame in the shot in order for the shot to be considered as containing the concept.

Data:

The current test data set (IACC.2.A) will be 200 h drawn from the IACC.2 collection using videos with durations between 10 s and 6 min.

The progress test data sets (IACC.2.B-C) will be 2 additional non-overlapping collections of 200 h (IACC.2.B and IACC.2.C) each drawn randomly from the IACC.2 collection.

The development data set combines the development and test data sets of the 2010 and 2011 issues of the task, IACC.1.tv10.training, IACC.1.A, IACC.1.B, and IACC.1.C each containing about 200 h drawn from the IACC.1 collection using videos with durations ranging from 10s to just longer than 3.5 min. These datasets can be downloaded from the Internet Archive using information available on the TRECVID "past data" webpage.

Concepts and relations:

500 concepts were selected for the TRECVID 2011 semantic indexing task. In making this selection, the organizers drew from the 130 used in TRECVID 2010, the 374 selected by CU/Vireo for which there exist annotations on TRECVID 2005 data, and some from the LSCOM ontology. From these 500 concepts, 346 concepts were selected for the full task in 2011 as those for which there exist at least 4 positive samples in the final annotation. A spreadsheet of the concepts is available here with complete definitions and an alignment with CU-VIREO374 where appropriate. [Don't be confused by the multiple numberings in the spreadsheet - use the TV13-15 IDs in the concept lists below under "Submissions".] For 2013 the same list of 500 concepts has been used as a starting point for selecting the 60 single concepts for which participants must submit results in the main task and the 10 concept pairs in the paired concept task. The concepts (number to be determined) for localization will be a subset of the main task concepts - perhaps about 10.

The organizers have provided again a set of relations between the concepts. There are two types of relations: A implies B and A excludes B. Relations that can be derived by transitivity will not be included. Participants are free to use the relations or not and submissions are not required to comply with them.

It is expected that advanced methods will use the annotations of non-evaluated concepts and the ontology relations to improve the detection of the evaluated concepts. The use of the additional annotations and of ontology relations is optional and comparison between methods that use them and methods that do not is encouraged.

Collaborative annotation:

collaborative annotations

Common non-annotated training data:

The organizers will try to provide a common list of image URLs retrieved by issuing a query to a general purpose search engine for each target concept so that experiments in this category can be carried out using common non-annotated training data. This should remove the need for participants to do this preliminary work and should ease the comparison between methods by using the same collected data. This does not prevent other participants to do their own gathering of non-annotated training data.

Sharing of components:

wiki

Submission types:

Four types of submissions will be considered:

"main" in which participants will be ready to provide a result for these 60 single concepts drawn from the 346 used in TRECVID 2012

"loc" in which main task participants will also submit localization results for a subset of the 60 main concepts - see items marked with **.

"progress" in which participants will be ready to provide independent results for all and only the 60 main task concepts but against the IACC.2.A, IACC.2.B, and IACC.2.C data.

"pair" in which participants will be ready to submit results for 10 concept pairs

Optionally, information on the time sequence in which the two concepts appears may be submitted.

Training types:

P l e a s e n o t e t h e s e r e s t r i c t i o n s and this information on training types. The submission types (main and pair) are orthogonal to the training types (A, B, C ...).

Submissions:

Each team may submit a maximum of 4 prioritized main runs with 2 additional if they are of the "no annotation" training type and the others are not. One localization run may be submitted with each main submission. Each team may also submit up to 2 "pair" runs. Each team may submit up to 2 progress runs on each of the 2 progress datasets. The submission formats are described below.

Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., Xerces-J: java sax.SAXCount -v YourSubmision.xml.

Participants in the main version of the task against IACC.2.A will submit real results in each run for all and only the 60 selected concepts and for each concept at most 2000 shot IDs.

Progress runs (against IACC.2.A, IACC.2.B,and IACC.2.C) must each be submitted as a separate run and in any given year (e.g. 2013) the single test system used in all the runs (e.g. against IACC.2.A, B, and C) must be FROZEN before any runs for submission are begun.

Participants in the concept pair version of the task will submit real results for the 10 selected concept pairs and for each concept at most 2000 shot IDs. See the videoFeatureExtractionRunResult DTD for infomation on how optionally to encode the temporal sequence in which the overlapping concepts occur. Particpants must provide the sequence information for all concept pairs or none.

Participants in the localization subtask will submit in one file per run, the localization data for all and only the concept-containingI-Frames in at most the top 1000 shots submitted in the main task - but for all and only the subset of main concepts specified for localization. A standard set of I-Frames, grouped by each master shot and test video file, will be extracted by NIST using ffmpeg and made available to participants. Each line of submitted localization data will contain the following in groups of ASCII characters separated by 1 space. X and Y coordinates refer to the bounding rectangle. Assume the UpperLeft point in each frame image has coordinates (0,0), LowerRightX > UpperLeftX, LowerRightY > UpperLeftY.
```
Concept#
|   File# Frame#
|   |     |    UpperLeftX 
|   |     |    |   UpperLeftY
|   |     |    |   |   LowerRightX
|   |     |    |   |   |   LowerRightY 
|   |     |    |   |   |   |
xxx xxxxx xxxx xxx xxx xxx xxx
```
Here for download (though they may not display properly) is a DTD for concept (feature) extraction results of one main, paired or progress run, one for results from multiple runs, and a small example of what a site would send to NIST for evaluation. The localization information will be submitted in a flat file (no xml tags), the format of which will be determined before the Guidelines are final. Please check all your submissions to see that they are well-formed.
You may submit all your runs in one or multiple files as long as you do not break a run across files. EACH file you submit should begin, as in the example submission, with the DOCTYPE statement that refers to the the DTD at NIST via a URL and with a videoFeatureExtractionResults element even if only one run is included. Each submitted file must be compressed using just one of the following: gzip, tar, zip, or bzip2.
Remember to use the correct shot IDs in your submissions - the ones from video segment elements the mp7 files in the master shot reference with the format shotFILENUMBER_SHOTNUMBER. Do not use the ID associated with the video (TRECVID2013_FILENUMBER) or the one associated with a keyframe (shotFILENUMBER_SHOTNUMBER_RKF).
Submissions will be transmitted to NIST via this webpage, which requests the TRECVID active participant's userid and password. It is currently set up for testing only - use the "TEST" organization; all TEST submissions will be checked and feedback provided but then discarded.

Evaluation:

A subset of the submitted concept results (at least 20), to be announced only after the submission date, will be evaluated by assessors at NIST or at LIG using pooling and sampling.

Please note that NIST uses a number of rules in manual assessment of system output.

Measures (indexing):

Mean extended inferred average precision (mean xinfAP), which allows sampling density to vary e.g. so that it can be 100% in the top strata, which are most important for average precision.
As in past years, other detailed measures based on recall, precision will be provided by the trec_eval software.

Measures (localization): Temporal and spatial localization will be evaluated using precision and recall based on the judged items at two levels - the frame and the pixel, respectively. NIST will then calculate an average for each of these values for each concept and for each run.
For each shot that is judged to contain a concept, a subset of the shot's I-Frames will be viewed and annotated to locate the pixels representing the concept. The set of annotated I-Frames will then be used to evaluate the localization for the I-Frames submitted by the systems.

For each run, a total elapsed time in seconds will be reported.

Issues:

How can we encourage use of the (interframe) video context in improving localization over that done for isolated images? For example, assuming interframe context decreases with growing interframe time interval, should we make the sample of I-Frames we decide to judge for a given shot-concept "contiguous" in time?

5. Interactive surveillance event detection:

Task coordinator: Jon Fiscus

Detecting human behaviors efficiently in vast amounts surveillance video is fundamental technology for a variety of higher-level applications of critical importance to public safety and security. The use case addressed by this task is the retrospective exploration of surveillance video archives using a system designed to support the optimal division of labor between a human user and the software - an interactive system.

System task:

Given a collection of surveillance data files (e.g. that from an airport, or commercial establishment) for preprocessing, at test time take a small set of topics (search requests for known events) and for each return the elapsed search time and a list of video segments within the surveillance data files, ranked by likelihood of meeting the need described in the topic. Each search for an event by a searcher can take no more than 25 elapsed minutes, measured from the time the searcher is given the event to look for until the time the result set is considered final.

Data:

The test data will be the same as was used in the SED task in 2012.

Submissions:

Submissions will follow the same format and procedure as in the SED 2012 task. The number of submissions allowed will be determined by the time the Guidelines are final. Participants must submit at least one interactive run. An automatic version of each interactive run for comparison may also be submitted.

Evaluation:

It is assumed the user(s) are system experts and no attempt will be made to separate the contribution of the user and the system. The results for each system+user will be evaluated by NIST as to effectiveness - using standard search measures (e.g., probability of missed detection/false alarm, precision, recall, average precision)- self-reported speed, and user satisfaction (for interactive runs).

6. Instance search:

Task coordinator: Paul Over

An important need in many situations involving video collections (archive video search/reuse, personal video organization/search, surveillance, law enforcement, protection of brand/logo use) is to find more video segments of a certain specific person, object, or place, given a visual example.

In 2013 NIST will create about 2 dozen topics. Automatic and interactive systems will use the same set. New data will be available from the BBC EastEnders television series, prepared with major help from several participants in the AXES project (access to audiovisual archives), a four-year FP7 framework research project to develop tools that provide various types of users with new and engaging ways to interact with audiovisual libraries.

System task:

Given a collection of test video, a master shot reference, and a collection of topics (queries) that delimit a person, object, or place entity in some example video, locate for each topic up to the 1000 shots most likely to contain a recognizable instance of the entity. Interactive runs will likely return many fewer than 1000 shots.

Data:

Development data: A very small sample of the BBC Eastenders test data will be available from Dublin City University. No actual development data will be supplied.

Test data: The test data for 2013 will be BBC EastEnders video in MPEG-4 format. The example images in the topics will be in bmp format. See above for information on how to get a copy of the test data.

Topics: Each topic will consist of a set of 4 example frame images (bmp) drawn from test videos containing the item of interest. The shots from which example images are drawn for a given concept, will be filtered by NIST from system submissions for that concept before evaluation. For each frame image a binary mask of the region of interest will be provided. Each topic will also include an indication of the target type taken from this set of strings {PERSON, LOCATION, OBJECT}

Auxilliary data: Participants are allowed to use various publicly available EastEnders resources as long as they carefully note the use of each such resource by name in their workshop notebook papers. They are strongly encouraged to share information about the existence of such resources with other participants via the tv13.list as soon as they discover them.

Submissions:

Each team may submit a maximum of 4 prioritized runs. All runs will be evaluated but not all may be included in the pools for judgment. Submissions will be identified as either fully automatic or interactive. Interactive runs will be limited to 15 elapsed minutes per search.

Here for download (though they may not display properly) is the DTD for search results of one run, the one for results from multiple runs, and a small example of what a site would send to NIST for evaluation. Please check your submission to see that it is well-formed

You may submit all your runs in one or multiple files as long as you do not break a run across files. EACH file you submit should begin, as in the example submission, with the DOCTYPE statement and a videoSearchResults element even if only one run is included.

Submissions must be transmitted to NIST via this webpage.

Evaluation:

This task will be treated as a form of search and will accordingly be evaluated with average precision for each topic in each run and per-run mean average precision over all topics. Speed will also be measured.

7. Multimedia event detection:

Task coordinator: Jon Fiscus

Video is becoming a new means of documenting everything from recipes to how to change a tire of a car. Ever expanding multimedia video content necessitates development of new technologies for retrieving relevant videos based solely on the audio and visual content of the video clip.

System task:

Participants are tasked with building an automated system that can determine whether an event is present anywhere in a video clip using the content of the video clip only. The system inputs a set of "search" videos and an "event kit" (text and video describing the event). The system computes an "event score" (that gives the strength of evidence of the event) and an optional "recounting" of the event in the each search video in the input set.

The 2013 evaluation will consist of 20 "pre-specified" and 20 new "ad-hoc" event kits containing 100, 10 or 0 example event videos. Developers will receive the pre-specified event kits in advance of building metadata generator (which extracts and saves a representation for each video's content). Once the metadata generator has been locked and the metadata is created for the search video set, developers will process the pre-specified and ad-hoc events.

Participants may choose to either run their system on a set of 98,000 search videos or a subset containing approximately 32,000 video clips.

All participants must submit the results from their system that: (1) processes all 20 pre-specified, 100-example event kits and (2) processes either the evaluation search video set or its specified subset.

Participants choosing to process the 20 ad-hoc events must submit results from their system that: (1) processes all 20 ad-hoc, 100-example event kits and (2) processes either the evaluation search video set or its specified subset.

Participants may optionally submit results from their system that use fewer event kit video examples (i.e., 10 and 0 examples). Participants may also optionally submit partial results of their primary system (using 100 event examples) for one or more pre-defined conditons:

event detection using optical text recognition (OCR) only
event detection using automatic speech recognition (ASR) only,
event detection using video only (excluding OCR), and
event detection using non-speech audio (excluding ASR) only.

Data:

Participants will be provided new data divisions to facilitate cross-team comparison of results. The four data divisions will include research data, training data, an evaluation search video set, and two test search video sets.

Research data: Participants will be provided data and research event kits to develop algorithms and systems to design their video content metadata representation. This data may be augmented, annotated or altered to support each participant's research and development of their algorithms. However, none of this data may be used directly for training pre-specified or ad-hoc event classifiers to be used for the system.

Training data: This data consists of the event kit example videos and background training videos. The event kit text will identify required and supporting observable evidence of an event. This information can be used to inform the metadata representation and to design algorithms. This data is the only data to be used to train event classifiers. However, this data must NOT be extended, annotated, nor altered.

Test search video sets: Participants will be provided two search video sets for testing and publication of results. These videos cannot be used in event training nor defining/implementing the metadata representation or extraction. This data must NOT be extended, annotated nor altered.

Evaluation Search Set: This data will be used for blind testing of participants' system and will be the Progress Set (4,000 h of search videos used in MED'12) and a pre-defined Progress Subset (a 1,300 h subset of the Progress Set). This data must be kept "blind" and teams must not view or analyze any properties of the video set.

Submissions:

Please refer to 2013 MED evaluation plan for instructions on the submission process.

Evaluation:

Event Detection performance will be assessed using a Mean Average Precision as the primary evaluation measure. Detection threshold setting will be evaluated using a to-be-determined measure. NIST will provide the ground truth and evaluation tools. All participants will report their results in their TRECVID workshop notebook papers.

Details:

Please refer to the 2013 MED evaluation plan for instructions on the submission process.

Open issues

Define the measure of quality of threshold setting.

8. Multimedia event recounting:

Task coordinator: Greg Sanders

A MED system not only detects events in video clips but also recounts the evidences used to identify the event. These recountings help the user quickly and accurately locate their event of interest within the clips detected by the MED system.

System task:

The system task is to provide a recounting of the important evidence that a video clip contains an instance of an event of interest. For each piece of evidence, the recounting must include both text summarizing the evidence and a list of one or more spatiotemporal pointers (in a text format) locating the evidence within the clip. NIST will provide a DTD for the XML format.

MER participants are the participants in the MED evaluation whose MED submission also includes a recounting for each clip. If no evidence of the event is found by the MED system, the recounting will include no evidence. The recountings to be evaluated will be selected from

Pre-specified event kits with 100 exemplars
Clips from the Progress subset

Additional details of the evaluation procedures are in the 2013 MER Evaluation Specification Documents, which will appear in the near future and before the Guidelines are final.

Data:

See MED data section.

Submissions:

MER evaluates the recountings submitted with each participant’s MED results.

Submissions will be XML-compliant plain-ASCII text. A MER DTD and triage workstation will be made available to all participants.

At most one submission per participating MED team will be accepted. Due to limits on manual judging resources, NIST may need to restrict the number of submissions that are judged. If so, this will be based on the performance of the associated MED system and order of MER submission (first come, first served).

Please refer to the 2013 MER Evaluation Specification Document, when complete, for instructions on the submission process.

Evaluation:

The system's recountings will be evaluated by a panel of judges. Each judge will be provided with a "triage workstation," which will take as input the event kit (or query for the event of interest), the system's recountings that are being assessed, and the clips associated with each of those recountings. NIST will choose a subset of clips for which recountings will be judged for selected events. All submissions will be judged on the same set of clips and events. For each such event, the judges will try to (as rapidly as possible) use the recountings to find clips that contain an instance of the event of interest.

In the triage workstation, the judge will be shown all the pieces of evidence in the recounting, both the textual and the audiovisual via the spatiotemporal pointers. Each spatiotemporal pointer will have a temporal part that specifies the period of time in the clip (the snippet or excerpt) where the piece of evidence occurs. For pieces of evidence that are visual, the spatiotemporal pointer will also include a bounding box within the video frame at the beginning and at the end of the snippet. After reading the recounting and hearing/viewing all the snippets for it, the judge will indicate whether he/she believes the associated clip contains an instance of the event of interest; the judge will have three choices (1) does contain, (2) does not contain, or (3) cannot readily tell whether it contains an instance of the event of interest (the difficulty could lie in the recounting or in the event kit).

For each submission, for each event, NIST will determine how quickly and accurately the judge is able to use the recounting to determine if the clip contains the event. Submissions whose recountings enable the judges to perform that task the most rapidly and accurately will be considered the best.

All participants will report their results in their TRECVID workshop notebook papers.

Details:

The latest details will be available soon and before the Guidelines are final on the MER webpage.

9. Schedule:

The following are the target dates for 2013:

------------
1. Feb: NIST sends out Call for Participation in TRECVID 2013
21. Feb: Applications for participation in TRECVID 2013 due at NIST
------------
1. Mar: Final versions of TRECVID 2012 papers due at NIST; Development data available for most if not all tasks; MED: Sites receive Pre-Specified Event Kit disk drive (for new participants)
------------
1. Apr: Guidelines complete
------------
5. Jun: MED: Sites receive Progress Set disk drives (for new participants); SIN: Submission webpage open for trial submissions (highly recommended); SIN: Submission webpage open for final official submissions
INS: Instance search topics available from NIST
------------
1. Jul: MED: Dry Run begins
SED: Dry run starts
10. Jul: SIN: Semantic indexing task submissions due at NIST for evaluation.
16. Jul: SIN: Unevaluated semantic indexing submissions available for active participants
17. Jul - 22. Aug: SIN: Semantic indexing assessment at NIST
31. Jul: MED/MER: Dry run ends
SED: Dry run evaluation ends
------------
9. Aug: INS: submission webpage open for trial submissions (highly recommended)
19. Aug: INS: submission webpage open for final official submissions
23. Aug: INS: instance search submissions due at NIST
30. Aug: SIN: evaluation results returned
------------
1. Sep: TRECVID Workshop general and registration information
3. Sep - 25. Sep: INS: Instance search task assessment at NIST
10. Sep: MED/MER: Sites submit Pre-Specified (PS) event runs (with optional recountings)
MED: Sites receive Ad Hoc event (AH) kit disk drive
13. Sep: MED: NIST releases pre-adjudication PS scores
MED: Release of AH event kits
23. Sep: MED: Sites submit AH event runs
27. Sep: MER: NIST releases pre-adjudication event recounting results
INS: Instance search evaluation results returned
MED: NIST releases preliminary AH scores
------------
1. Oct: SED: Surveillance event detection task submissions due at NIST
4. Oct: Speaker proposals due at NIST by noon (Gaithersburg time)
SED: Final results returned to participants
MER: Judging on event recountings complete
11. Oct: MED: NIST releases PS and AH post-adjudicated results
MER: NIST releases post-adjudicated results
18. Oct: Agenda posted
22. Oct: Workshop notebook papers due at NIST (no late submissions allowed)
------------
11. Nov: Copyright forms due back at NIST (see Notebook papers for instructions)
12. Nov: TRECVID 2013 Workshop registration closes at 5:00PM EST
20-22. Nov (Wed.-Fri): TRECVID Workshop (2.5 days) at NIST in Gaithersburg, MD
----------------
1. Mar 2014: Final versions of TRECVID 2013 papers due at NIST

10. Outstanding 2013 guideline work items

Here is a list of work items that must be completed before the guidelines are considered to be final..

MED Evaluation Plan final
MER` Evaluation Plan final

11. Workshop notebook

Workshop notebook

Agreement Concerning Dissemination of TREC/TRECVID Results

12. Contacts, email lists:

TRECVID project leader:

General Coordinators:

Alan Smeaton (CLARITY: Centre for Sensor Web Technologies, Dublin City University)

(TNO, Radboud University Nijmegen)

Task coordinators:

See the guidelines section for the task of interest.

Email lists:

Information and discussion for active workshop participants

[email protected]
archive open to active participants only
NIST will subscribe the contact listed in your application to participate when we have received it. Additional members of active participant teams will be subscribed by NIST if they send email to indicating they want to be subscribed, the email address to use, their name, and providing the TRECVID 2013 active participant's password. Groups may combine the information for multiple team members in one email. Please use this format for submitting each name and email address (one name+address per line):
```
Pat Doe <[email protected]>
```
Once subscribed, you can post to this list by sending you thoughts as email to [email protected], where they will be sent out to EVERYONE subscribed to the list, i.e., all the other active participants.

Information and discussion on the surveillance event detection task

[email protected]
If you would like to subscribe, see the event detection webpage for contact information.

Information and discussion on the multimedia event detection task.
- [email protected]
- If you would like to subscribe, see the MED webpage for contact information.

National Institute of
Standards and Technology Home

Last updated:
Date created: Tuesday, 4-Jan-13
For further information contact