The main goal of the TREC Video Retrieval Evaluation (TRECVID) is to promote progress in content-based analysis of and retrieval from digital video via open, metrics-based evaluation. TRECVID is a laboratory-style evaluation that attempts to model real world situations or significant component tasks involved in such situations.
Up until 2010, TRECVID used test data from a small number of known professional sources - broadcast news organizations, TV program producers, and surveillance systems - that imposed limits on program style, content, production qualities, language, etc. In 2003 - 2006 TRECVID supported experiments in automatic segmentation, indexing, and content-based retrieval of digital video using broadcast news in English, Arabic, and Chinese. TRECVID also completed two years of pilot studies on exploitation of unedited video rushes provided by the BBC. In 2007 - 2009 TRECVID provided participants with cultural, news magazine, documentary, and education programming supplied by the Netherlands Institute for Sound and Vision. Tasks using this video included segmentation, search, feature extraction, and copy detection. Systems were tested in rushes video summarization using the BBC rushes. Surveillance event detection was evaluated using airport surveillance video provided by the UK Home Office. Many resources created by NIST and the TRECVID community are available for continued research on this data independent of TRECVID. See the Past data section of the TRECVID website for pointers.
In 2010 TRECVID confronted known-item search and semantic indexing systems with a new set of videos (referred to in what follows as IACC.1) characterized by a high degree of diversity in creator, content, style, production qualities, original collection device/encoding, language, etc. - as is common in much "web video". The collection also has associated keywords and descriptions provided by the video donor. The videos are available under Creative Commons licenses from the Internet Archive. The only selection criterion imposed by TRECVID beyond the Creative Commons licensing was video duration - the videos are short (less than 4 minutes). In addition to the IACC.1 data set, NIST began developing an Internet multimedia test collection (HAVIC) with the Linguistic Data Consortium and used it in growing amounts in TRECVID 2010 and 2011. The airport surveillance video used in TRECVID 2009 was used again in 2010 and 2011, as were some of the Sound and Vision (S&V) videos used in 2009.
TRECVID 2012 will continue 5 tasks from 2011 with variations and add one new task (multimedia event recounting). NIST will evaluate systems on the following tasks using the [data] indicated:
A number of datasets are available for use in TRECVID 2012 and are described below.
Approximately 8000 Internet Archive videos (50GB, 200 hours) with Creative Commons licenses in MPEG-4/H.264, with durations between 10 seconds and 3.5 minutes. Most videos will have some donor-supplied metadata available, e.g., title, keywords, and description.
Data use agreements and Distribution: Download for active participants from NIST/mirror servers. See Data use agreements
Master shot reference: Will be available to active participants by download from NIST
Automatic speech recognition (for English): Looking for someone willing to provide this.
Approximately 8000 Internet Archive videos (50GB, 200 hours) with Creative Commons licenses in MPEG-4/H.264, with durations between 10 seconds and 3.5 minutes. Most videos will have some donor-supplied metadata available, e.g., title, keywords, and description.
Data use agreements and Distribution: Download for active participants from NIST/mirror servers. See Data use agreements
Master shot reference: Available by download from the TRECVID Past Data page
Automatic speech recognition (for English): Available by download from the TRECVID Past Data page
Approximately 8000 Internet Archive videos (50GB, 200 hours) with Creative Commons licenses in MPEG-4/H.264, with durations between 10 seconds and 3.5 minutes. Most videos will have some donor-supplied metadata available, e.g., title, keywords, and description.
Data use agreements and Distribution: Download for active participants from NIST server. See Data use agreements
Master shot reference: Available by download from the TRECVID Past Data page
Automatic speech recognition (for English): Available by download from the TRECVID Past Data page
Approximately 3200 Internet Archive videos (50GB, 200 hours) with Creative Commons licenses in MPEG-4/H.264, with durations between 3.6 and 4.1 minutes. Most videos will have some donor-supplied metadata available, e.g., title, keywords, and description.
Data use agreements and Distribution: Download for active participants from NIST server. See Data use agreements
Master shot reference: Available by download from the TRECVID Past Data page
Common feature annotation: Available by download from the TRECVID Past Data page
Automatic speech recognition (for English): Available by download from the TRECVID Past Data page
Rushes are the raw material from which programs/films are made in the editing room. The rushes used in TRECVID 2011 included those for several dramatic series as well as for travel programming. The videos have been segmented at arbitrary points to produce a collection of about 21,000 equal-length, short clips.
Data use agreements and Distribution: Download from NIST server. See Data use agreements
Video from Flickr.com, available under a Creative Commons license for research only, will be provided in webm format. Each shot will be a separate webm file.
Data use agreements and Distribution: Download for active participants from NIST server. See Data use agreements
The data consist of about 150 hours of airport surveillance video data (courtesy of the UK Home Office). The Linguistic Data Consortium has provided event annotations for the entire corpus. The corpus was divided into development and evaluation subsets. Annotations for 2008 development and test sets are available.
Data use agreements and Distribution:
Development data annotations: available by download.
HAVIC is designed to be a large new collection of Internet multimedia and is being constructed by the Linguistic Data Consortium and NIST. A new, ~4,000 hour Progress Test collection will be provided to participants and used through MED '15 as the test collection.
Data use agreements and Distribution:
Data licensing and distribution will be handled by the Linguistic Data Consortium. See the Multimedia Event Detection task webpage for details.
In order to be eligible to receive the data, you must have applied for participation in TRECVID. Your application will be acknowledged by NIST with a team ID, an active participant's password, and information about how to obtain the data.
If you will be using i-LIDS (2009) or HAVIC data, NIST will NOT be handling the data use agreements. See the "Data Use Agreements and Distribution" section for i-LIDS and HAVIC.
If you will be using Gatwick (2008), IACC, Sound and Vision, BBC, or Flickr data, you will need to complete the relevant permission forms and email the scanned page images as one Adobe Acrobat PDF of the document. In your email include the following:
As Subject: "TRECVID data request"
In the body:
- your name
- your short team ID (given when you applied to participate)
- the kinds of data you will be using - one or more of the following: S&V, Gatwick (2008), IACC.1, and/or BBC
Please ask only for the test data (and optional development data) required for the task(s) you apply to participate in and intend to complete.
Requests are handled in the order they are received. Please allow 3 business days for NIST to respond to your request with the access codes you need to download the data using the information about data servers in the active participant's area.
This task will be coordinated by Georges Quénot - with Franck Thollard from the Laboratoire d'Informatique de Grenoble and Stéphane Ayache from the Laboratoire d'Informatique Fondamentale de Marseille using support from the Quaero Programme and in collaboration with NIST. Cees Snoek and Xirong Li from University of Amsterdam proposed the "concept pair" variant of the task and will participate in the selection of the pairs.
Automatic assignment of semantic tags representing visual or multimodal concepts (previously "high-level features") to video segments can be fundamental technology for filtering, categorization, browsing, search, and other video exploitation. New technical issues to be addressed include methods needed/possible as collection size and diversity increase, when the number of concepts increases, and when concepts are related by an ontology.
The task will remain the same as in 2010 and 2011, but, considering the feedback from the poll about the 2011 edition of the task (see the SIN 2011 overview slides), we will pause the increase in the number of concepts to be processed this year. Slight adjustments will be made to the concept lists, but the counts will remain comparable.
Also, considering feedback from the poll that pointed to a lack of novelty, and considering suggestions in this direction, two novelties will be offered to participants as pilot extensions in 2012:
[901] Beach + Mountain
[902] Old_People + Flags
[903] Animal + Snow
[904] Bird + Waterscape_waterfront
[905] Dog + Indoor
[906] Driver + Female_Human_Face
[907] Person + Underwater
[908] Table + Telephone
[909] Two_People + Vegetation
[910] Car + bicycle
Given the test collection, master shot reference, and concept definitions, return for each concept a list of at most 2000 shot IDs from the test collection, ranked according to their likelihood of containing the concept.
The test data set (IACC.1.C) will be 200 hours drawn from the IACC.1 collection using videos with durations between 10 seconds and 3.5 minutes.
The development data set combines the development and test data sets of the 2010 and 2011 editions of the task - IACC.1.tv10.training, IACC.1.A, and IACC.1.B - each containing 200 hours drawn from the IACC.1 collection using videos with durations ranging from 10 seconds to just longer than 3.5 minutes.
500 concepts have been selected for the TRECVID 2011 semantic indexing task. In making this selection, the organizers have drawn from the 130 used in TRECVID 2010, the 374 selected by CU/Vireo for which there exist annotations on TRECVID 2005 data, and some from the LSCOM ontology. From these 500 concepts, 346 concepts were selected for the full task in 2011 as those for which there exist at least 4 positive samples in the final annotation. A spreadsheet of the concepts is available here with complete definitions and an alignment with CU-VIREO374 where appropriate. [Don't be confused by the multiple numberings in the spreadsheet - use the TV_11 IDs in the concept lists below under "Submissions".] For 2012, the same list of 500 concepts will be used but the selection for the light and full tasks may change.
Of the 346 test concepts, the organizers plan to judge a subset of 50 (20 by NIST plus 30 by Quaero): 20 for the light task, drawn from the list of 50 defined above, plus, for the full task, an additional 30 beyond that list of 50 but within the 346 mentioned above. The concepts selected for evaluation for both the light and the full tasks will not be known to participants at submission time, so that participants really work on detection methods for large sets of concepts. All 10 concept pairs will be evaluated.
Among the 346 concepts of the full task, several are likely to lead to discussions. These are left in the set, but they will not be selected for assessment, as is the case for concepts found to be too frequent or too rare in the development set.
The organizers have again provided a set of relations between the concepts. There are two types of relations: A implies B, and A excludes B. Relations that can be derived by transitivity will not be included. Participants are free to use the relations or not, and submissions are not required to comply with them.
It is expected that advanced methods will use the annotations of non-evaluated concepts and the ontology relations to improve the detection of the evaluated concepts. The use of the additional annotations and of ontology relations is optional and comparison between methods that use them and methods that do not is encouraged.
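As an illustration of one way such relations might be exploited, the following minimal Python sketch post-processes per-shot concept scores so that "A implies B" never leaves B scored below A, and "A excludes B" pushes the weaker of the two scores down. The data structures (dicts of scores keyed by concept and shot ID) and the adjustment heuristic itself are assumptions made for illustration, not part of the task definition.

    # Illustrative only: one possible heuristic for applying the provided
    # "implies"/"excludes" concept relations to raw detection scores.
    def apply_relations(scores, implies, excludes):
        # scores: {concept: {shot_id: score in [0, 1]}}
        # implies, excludes: lists of (A, B) concept-name pairs
        adjusted = {c: dict(s) for c, s in scores.items()}
        for a, b in implies:
            # "A implies B": a shot scored high for A should score at least as high for B.
            for shot, s_a in adjusted[a].items():
                if adjusted[b].get(shot, 0.0) < s_a:
                    adjusted[b][shot] = s_a
        for a, b in excludes:
            # "A excludes B": the concepts cannot co-occur, so cap the weaker score.
            for shot in set(adjusted[a]) & set(adjusted[b]):
                if adjusted[a][shot] >= adjusted[b][shot]:
                    adjusted[b][shot] = min(adjusted[b][shot], 1.0 - adjusted[a][shot])
                else:
                    adjusted[a][shot] = min(adjusted[a][shot], 1.0 - adjusted[b][shot])
        return adjusted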
Three types of submissions will be considered:
Please note these restrictions and this information on training types. The submission types (light, full, and pair) are orthogonal to the training types (A, B, C, ...).
Each team may submit a maximum of 4 prioritized runs with 2 additional if they are of the "no annotation" training type and the others are not. Each team may also submit up to 2 "pair" runs. All runs will be evaluated but not all may be included in the pools for judgment. The submission format is described below.
Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., Xerces-J: java sax.SAXCount -v YourSubmission.xml.
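For teams that prefer not to use Xerces-J, a rough Python equivalent using the lxml library is sketched below. The DTD and submission file names are placeholders; this is only a convenience check, not the official validator.

    # A minimal sketch of checking a run file against a DTD with lxml.
    from lxml import etree

    def validate_submission(dtd_path, submission_path):
        dtd = etree.DTD(open(dtd_path, "rb"))
        doc = etree.parse(submission_path)   # raises if the XML is not well-formed
        if dtd.validate(doc):
            return True
        for error in dtd.error_log.filter_from_errors():
            print(error)                     # report DTD violations
        return False

    if __name__ == "__main__":
        print(validate_submission("run.dtd", "YourSubmission.xml"))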
A subset of the submitted concept results (at least 20), to be announced only after the submission date, will be evaluated by assessors at NIST or at LIG using pooling and sampling.
Please note that NIST uses a number of rules in manual assessment of system output.
Measures (per run):
The known-item search task models the situation in which someone knows of a video, has seen it before, believes it is contained in a collection, but doesn't know where to look. To begin the search process, the searcher formulates a text-only description, which captures what the searcher remembers about the target video.
In TRECVID 2010, 78% of the known-items were found by at least one run; in 2011 65% were found. Participants are encouraged to focus on why 22% - 35% of known-items were not found by current approaches in 2010 and 2011 and what more successful approaches can be developed to reduce that percentage for the new topics of 2012.
Given a text-only description of the video desired (i.e. a topic) and a test collection of video with associated metadata:
The topic will also contain a list of 1-5 words or short phrases, each identifying an object/person/location that must be visible in the target video.
The test data set to be searched (IACC.1.C) will be 200 hours drawn from the IACC.1 collection using videos with durations between 10 seconds and 3.5 minutes.
Approximately 300 new text-only topics will be used. (A subset of 24 to be used for interactive systems.) Here are some example topics with links to the target video. Actual topics will contain a list of "visual cues" - a comma-delimited list of words and phrases that identify people, things, places, actions, etc. which should appear in the target video.
Last year's (2011) test data (IACC.1.B) is available to active participants by download from NIST. Topics from 2011 are available from the TRECVID Past Data page.
Please note these restrictions and this information on training types.
Each team may submit a maximum of 4 prioritized runs.
Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., Xerces-J: java sax.SAXCount -v YourSubmission.xml.
In addition to the 4 training (A-D) conditions listed above, each search run will declare whether
Here for download (though they may not display properly) is a DTD for search results of one run, one for results from multiple runs, and a small example of what a site would send to NIST for evaluation. Please check your submission to see that it is well-formed.
You may submit all your runs in one or multiple files as long as you do not break a run across files. EACH file you submit should begin, as in the example submission, with the DOCTYPE statement and a videoSearchResults element even if only one run is included.
Submissions will be transmitted to NIST via this webpage. The submission process for this task does NOT check your submission for correctness. You must do that before you submit.
Ground truth will be known when the topic is created, so scoring will be automatic. Automatic runs will be scored against the ground truth using the mean inverted rank at which the known item is found, or an equivalent measure. Interactive runs will be scored in terms of whether the item was found, elapsed time, and user satisfaction.
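For internal development, mean inverted rank can be computed with a few lines of code. The sketch below assumes you know, for each topic, the 1-based rank at which your system returned the known item (or None if it was not returned); it is illustrative only and is not NIST's official scoring code.

    # Illustrative computation of mean inverted rank over a set of topics.
    def mean_inverted_rank(ranks):
        # ranks: one entry per topic; the 1-based rank of the known item, or None if missed
        return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

    # Example: found at rank 1 and rank 4, missed on the third topic.
    print(mean_inverted_rank([1, 4, None]))   # (1 + 0.25 + 0) / 3 = 0.4166...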
Detecting human behaviors efficiently in vast amounts of surveillance video is fundamental technology for a variety of higher-level applications of critical importance to public safety and security. The use case addressed by this task is the retrospective exploration of surveillance video archives using a system designed to support the optimal division of labor between a human user and the software - an interactive system.
Given a collection of surveillance data files (e.g. that from an airport, or commercial establishment) for preprocessing, at test time take a small set of topics (search requests for known events) and for each return the elapsed search time and a list of video segments within the surveillance data files, ranked by likelihood of meeting the need described in the topic. Each search for an event by a searcher can take no more than 25 elapsed minutes, measured from the time the searcher is given the event to look for until the time the result set is considered final.
The test data will be the same as was used in the SED task in 2011.
Submissions will follow the same format and procedure as in the SED 2011 task. The number of submissions allowed will be determined by the time the Guidelines are final. Participants must submit at least one interactive run. An automatic version of each interactive run for comparison may also be submitted.
In this pilot, it is assumed the users are system experts, and no attempt will be made to separate the contribution of the user and the system. The results for each system+user pair will be evaluated by NIST as to effectiveness - using standard search measures (e.g., precision, recall, average precision) - self-reported speed, and user satisfaction (for interactive runs).
An important need in many situations involving video collections (archive video search/reuse, personal video organization/search, surveillance, law enforcement, protection of brand/logo use) is to find more video segments of a certain specific person, object, or place, given a visual example.
In 2012 we will have 21 topics. Automatic and interactive systems will use the same set. New data will be available from Flickr under Creative Commons licenses, prepared with major help from several participants in the AXES project (access to audiovisual archives), a four-year FP7 research project to develop tools that provide various types of users with new and engaging ways to interact with audiovisual libraries.
Given a collection of test clips (files) and a collection of queries that delimit a person, object, or place entity in some example video, locate for each query up to the 1000 clips most likely to contain a recognizable instance of the entity. Interactive runs will likely return many fewer than 1000 clips. Note that this year the unit of measure will be the clip, not the shot. The original videos will be automatically divided into clips of arbitrary length, and each clip must be processed as if no others existed. Each query will consist of a set of 5 or so example frame images (bmp) drawn at intervals from one or two videos containing the item of interest. For each frame image a binary mask of the region of interest will be provided. Each query will also include an indication of the target type, taken from this set of strings: {PERSON, LOCATION, OBJECT}.
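As a sketch of how a system might read one query example, the Python fragment below loads an example frame image together with its binary mask and keeps only the masked region of interest. It uses Pillow and NumPy, and the file names are invented for illustration; the actual naming convention will be specified with the topics.

    # Illustrative loading of one query example: a bmp frame plus its binary mask.
    import numpy as np
    from PIL import Image

    def load_query_example(frame_path, mask_path):
        frame = np.asarray(Image.open(frame_path).convert("RGB"))
        mask = np.asarray(Image.open(mask_path).convert("L")) > 0   # True inside the region of interest
        roi = frame * mask[:, :, None]                              # zero out pixels outside the mask
        return frame, mask, roi

    # Hypothetical file names; the real layout comes with the topics.
    frame, mask, roi = load_query_example("example_frame.bmp", "example_mask.bmp")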
Development data: The BBC rushes test data used in 2011 is available from NIST by download to active participants. The instance search queries are available from the Past data section of the TRECVID website.
Test data: The test data for 2012 will be Internet video in webm format, downloaded from Flickr under Creative Commons licenses. There will be 70000+ short files. Each clip must be processed as if no others existed. The example images in the topics will be in bmp format. See above for information on how to get a copy of the test data.
Each team may submit a maximum of 4 prioritized runs. All runs will be evaluated but not all may be included in the pools for judgment. Submissions will be identified as either fully automatic or interactive. Interactive runs will be limited to 15 elapsed minutes per search.
Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., Xerces-J: java sax.SAXCount -v YourSubmission.xml.
Here for download (though they may not display properly) is the DTD for search results of one run, the one for results from multiple runs, and a small example of what a site would send to NIST for evaluation. Please check your submission to see that it is well-formed.
You may submit all your runs in one or multiple files as long as you do not break a run across files. EACH file you submit should begin, as in the example submission, with the DOCTYPE statement and a videoSearchResults element even if only one run is included.
Submissions will be transmitted to NIST via this webpage.
This pilot version of the task will treat it as a form of search and will evaluate it accordingly with average precision for each query in each run and per-run mean average precision over all queries. As part of the pilot, alternative evaluation schemes will be discussed and tested if possible. While speed and location accuracy are also definitely of interest here, of these two only speed may be measured in the pilot.
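For internal testing, average precision per query and mean average precision per run can be approximated as sketched below, assuming a ranked clip list per query and a set of relevant clip IDs per query. This is illustrative only; official results will come from NIST's judgments and scoring tools.

    # Illustrative average precision (per query) and mean average precision (per run).
    def average_precision(ranked_clips, relevant):
        hits, precision_sum = 0, 0.0
        for i, clip in enumerate(ranked_clips, start=1):
            if clip in relevant:
                hits += 1
                precision_sum += hits / i
        return precision_sum / len(relevant) if relevant else 0.0

    def mean_average_precision(run, judgments):
        # run: {query_id: ranked list of up to 1000 clip IDs}
        # judgments: {query_id: set of relevant clip IDs}
        aps = [average_precision(run[q], judgments[q]) for q in judgments]
        return sum(aps) / len(aps)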
Ever expanding multimedia content on the Internet necessitates development of new technologies for content understanding and search for a wide variety of commerce, research, and government applications.
Given a collection of test videos and a list of test events, indicate whether each of the test events is present anywhere in each of the test videos and give the strength of evidence for each such judgment. The 2012 evaluation will offer both the "Pre-Specified Event Detection" task, where developers receive event kits in advance of building the Content Description Representation (CDR), and a pilot "Ad Hoc Event Detection" task, where developers are given event kits after the CDR is frozen.
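To make the expected system behavior concrete, the sketch below shows one simple way to organize per-clip, per-event output: a detection score (strength of evidence) and a thresholded yes/no decision. The field names, the scoring function, and the per-event thresholds are illustrative assumptions; the official submission format is defined in the MED evaluation plan.

    # Illustrative structure of MED-style output: for every (event, clip) pair,
    # a strength-of-evidence score and a decision obtained by thresholding.
    def run_detection(events, clips, score_fn, thresholds):
        results = []
        for event in events:
            for clip in clips:
                score = score_fn(event, clip)            # system-specific evidence score
                results.append({
                    "event": event,
                    "clip": clip,
                    "score": score,
                    "decision": score >= thresholds[event],
                })
        return results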
Systems will be tested on the 20 Pre-Specified events (the 10 MED '11 Test Events and 10 new events) and 5 new Ad Hoc events.
HAVIC Training Data: Participants will be provided the MED '10, MED '11 Development, and MED '11 Test Collections for system development and internal testing.
HAVIC Testing Data: A new, ~4,000 hour Progress Test collection will be provided to participants. The Progress set will be used through MED '15 as the test collection.
Please refer to the 2012 MED evaluation plan (soon to be published) for instructions on the submission process.
Participants will choose to evaluate their system on all events for the task (20 events for the Pre-Specified event task and/or 5 events for the Ad Hoc event pilot task) or a subset thereof.
Participants will evaluate their systems during development using the ground truth and tools provided by NIST. All participants will report their results in their TRECVID workshop notebook papers.
The latest details are available on the MED webpage.
Once a system locates a specific event in a video clip, a user may wish to analyze the evidence indicating that the event was present. An important goal is for that evidence to be semantically meaningful to a human.
Given an event kit, and a video clip that contains the event, produce a recounting that summarizes the key evidence of the event. For this first evaluation, the recounting will be text-only.
All Multimedia Event Recounting (MER) participants are required to produce a recounting for 30 selected video clips where it is known that the clip contains a specific MER event. There will be five events chosen from the MED pre-specified events list, and six video clips per event.
MER participants who are also participating in the MED (pre-specified) evaluation are required to produce a recounting for each clip that their MED system declares as containing one of the five MER events.
All MER participants will be tested using six video clips for each of the five MER events.
MER participants who also participate in MED will be tested using six additional video clips for each of the five MER events, as described in the 2012 MER Evaluation Specification document.
Test video clips will be selected from the HAVIC Progress Test Data.
Submissions will be XML-compliant plain-ASCII text. A MER DTD and rendering tool will be made available to all participants.
Please refer to the 2012 MER Evaluation Specification document for instructions on the submission process.
The system's recounting summarizations will be evaluated by a panel of judges, first to test if the MER output allows judges who have not seen the clips to identify which event is represented by the recounting; and second to test if the recounting is sufficiently expressive to allow judges to match each of the six MER outputs (from a single system, for a single event) to the specific clip from which it was derived.
All participants will report their results in their TRECVID workshop notebook papers.
The latest details are available on the MER webpage.
The following are the target dates for 2012:
Here is a list of work items that must be completed before the guidelines are considered final.
Once subscribed, you can post to this list by sending your thoughts as email to tv12.list@nist.gov, where they will be sent out to EVERYONE subscribed to the list, i.e., all the other active participants.