The main goal of the TREC Video Retrieval Evaluation (TRECVID) is to promote progress in content-based analysis of and retrieval from digital video via open, metrics-based evaluation. TRECVID is a laboratory-style evaluation that attempts to model real-world situations or significant component tasks involved in such situations.
Up until 2010, TRECVID used test data from a small number of known professional sources - broadcast news organizations, TV program producers, and surveillance systems - that imposed limits on program style, content, production qualities, language, etc. From 2003 to 2006 TRECVID supported experiments in automatic segmentation, indexing, and content-based retrieval of digital video using broadcast news in English, Arabic, and Chinese. TRECVID also completed two years of pilot studies on exploitation of unedited video rushes provided by the BBC. From 2007 to 2009 TRECVID provided participants with cultural, news magazine, documentary, and education programming supplied by the Netherlands Institute for Sound and Vision. Tasks using this video included segmentation, search, feature extraction, and copy detection. Systems were also tested on summarization of the BBC rushes video. Surveillance event detection was evaluated using airport surveillance video provided by the UK Home Office. Many resources created by NIST and the TRECVID community are available for continued research on this data independent of TRECVID. See the Past data section of the TRECVID website for pointers.
In 2010 TRECVID confronted known-item search and semantic indexing systems with a new set of videos (referred to in what follows as IACC.1) characterized by a high degree of diversity in creator, content, style, production qualities, original collection device/encoding, language, etc. - as is common in much "web video". The collection also has associated keywords and descriptions provided by the video donor. The videos are available under Creative Commons licenses from the Internet Archive. The only selection criterion imposed by TRECVID beyond the Creative Commons licensing was one of video duration - the videos are short (less than 4 minutes). In addition to the IACC.1 data set, NIST began developing an Internet multimedia test collection (HAVIC) with the Linguistic Data Consortium and used it in an exploratory pilot task in TRECVID 2010. The airport surveillance video used in TRECVID 2009 was used again in 2010, as were some of the Sound and Vision (S&V) videos used in 2009.
TRECVID 2011 will continue the 6 tasks from 2010. NIST will evaluate systems on the following tasks using the [data] indicated:
Please note that, due to the date of the ACM Multimedia Conference and to accommodate travelers wanting to attend both ACM MM and TRECVID, the TRECVID 2011 Workshop will be later than usual, namely 5-7 December 2011.
A number of datasets are available for use in TRECVID 2011 and are described below.
Approximately 8000 Internet Archive videos (50GB, 200 hours) with Creative Commons licenses in MPEG-4/H.264 with durations between 10 seconds and 3.5 minutes. Most videos will have some donor-provided metadata available, e.g., title, keywords, and description.
Data use agreements and Distribution: Download for active participants from NIST/mirror servers. See Data use agreements
Master shot reference: Will be available to active participants by download from NIST
Automatic speech recognition (for English): Looking for someone willing to provide this.
Approximately 8000 Internet Archive videos (50GB, 200 hours) with Creative Commons licenses in MPEG-4/H.264 with durations between 10 seconds and 3.5 minutes. Most videos will have some donor-provided metadata available, e.g., title, keywords, and description.
Data use agreements and Distribution: Download for active participants from NIST server. See Data use agreements
Master shot reference: Available by download from the TRECVID Past Data page
Automatic speech recognition (for English): Available by download from the TRECVID Past Data page
Approximately 3200 Internet Archive videos (50GB, 200 hours) with Creative Commons licenses in MPEG-4/H.264 with durations between 3.6 and 4.1 minutes. Most videos will have some donor-provided metadata available, e.g., title, keywords, and description.
Data use agreements and Distribution: Download for active participants from NIST server. See Data use agreements
Master shot reference: Available by download from the TRECVID Past Data page
Common feature annotation: Available by download from the TRECVID Past Data page
Automatic speech recognition (for English): Available by download from the TRECVID Past Data page
Data use agreements and Distribution: tv9.sv.test (114.8 GB) in MPEG-1 by download from NIST server.
Rushes are the raw material from which programs/films are made in the editing room. Rushes to be used in TRECVID 2011 include those for several dramatic series as well as for travel programming. The videos have been segmented at arbitrary points to produce a collection of equal-length, short clips.
Data use agreements and Distribution: Download from NIST server. See Data use agreements
The data consist of about 150 hours of airport surveillance video data (courtesy of the UK Home Office). The Linguistic Data Consortium has provided event annotations for the entire corpus. The corpus was divided into development and evaluation subsets. Annotations for 2008 development and test sets are available.
Data use agreements and Distribution:
Development data annotations: available by download.
HAVIC is designed to be a large new collection of Internet multimedia and is being constructed by the Linguistic Data Consortium and NIST.
Data use agreements and Distribution:
Data licensing and distribution will be handled by the Linguistic Data Consortium. See the Multimedia Event Detection task webpage for details.
In order to be eligible to receive the data, you must have applied for participation in TRECVID. Your application will be acknowledged by NIST with a team ID, an active participant's password, and information about how to obtain the data.
If you will be using i-LIDS (2009) or HAVIC data, NIST will NOT be handling the data use agreements. See the "Data Use Agreements and Distribution" section for i-LIDS and HAVIC
If you will be using Gatwick (2008), IACC, Sound and Vision, or BBC, you will need to complete the relevant permission forms (from the active participant's area) and email the scanned page images as one Adobe Acrobat PDF of the document to NIST. In your email include the following:
As Subject: "TRECVID data request"
In the body:
your name
your short team ID (given when you applied to participate)
the kinds of data you will be using - one or more of the following: S&V, Gatwick (2008), IACC.1, and/or BBC
Please ask only for the test data (and optional development data) required for the task(s) you apply to participate in and intend to complete.
Within a few days after the permission forms have been received, you will be emailed the access codes you need to download the data using the information about data servers in the active participant's area.
This task will be coordinated by Georges Quénot - with Franck Thollard, Andy Tseng, Bahjat Safadi from the Laboratoire d'Informatique de Grenoble and Stéfane Ayache from the Laboratoire d'Informatique Fondamentale de Marseille using support from the Quaero Programme and in collaboration with NIST.
Automatic assignment of semantic tags representing high-level features or concepts to video segments can be fundamental technology for filtering, categorization, browsing, search, and other video exploitation. New technical issues to be addressed include methods needed/possible as collection size and diversity increase, when the number of features increases, and when features are related by an ontology.
Given the test collection, master shot reference, and concept/feature definitions, return for each feature a list of at most 2000 shot IDs from the test collection, ranked according to the likelihood that the feature is present in the shot.
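For illustration only, here is a minimal sketch of producing such a ranked list from per-shot confidence scores; the shot IDs, scores, and the rank_shots helper are all hypothetical, not part of the task infrastructure:

    # Sketch: turn hypothetical per-shot confidence scores into the ranked
    # result list the task asks for (at most 2000 shot IDs per feature).
    def rank_shots(scores, limit=2000):
        """scores: dict mapping shot ID -> detection confidence (assumed)."""
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:limit]

    # Example with made-up shot IDs and scores:
    scores = {"shot123_1": 0.91, "shot123_2": 0.15, "shot124_1": 0.57}
    print(rank_shots(scores))  # ['shot123_1', 'shot124_1', 'shot123_2']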
The test data set (IACC.1.B) will be 200 hours drawn from the IACC.1 collection using videos with durations between 10 seconds and 3.5 minutes.
Two development data sets (IACC.1.A and IACC.1.tv10.training) will each be 200 hours drawn from the IACC.1 collection using videos with durations ranging from 10 seconds to just longer than 3.5 minutes.
Two types of submissions will be considered:
346 concepts have been selected for the TRECVID 2011 semantic indexing task. In making this selection the organizers have drawn from the 130 used in TRECVID 2010, the 374 selected by CU/Vireo for which there exist annotations on TRECVID 2005 data, and some from the LSCOM ontology. The 346 concepts are those for which there exist at least 4 positive samples in the final annotation. A spreadsheet of the concepts is available here with complete definitions and an alignment with CU-VIREO374 where appropriate. Don't be confused by the multiple numberings - use the TV_10 IDs.
The organizers have again provided a set of relations between the concepts. There are two types of relations: A implies B and A excludes B. Relations that can be derived by transitivity will not be included. Participants are free to use the relations or not, and submissions are not required to comply with them. The organizers again organized a collaborative annotation for the new concepts. The annotations will be provided in the same format as in previous years, but they will not be as complete as in previous years. The ontology relations and an active learning method will be used in the collaborative annotation to ensure that each annotation made is as useful as possible - especially that as many positive samples of the sparse concepts as possible are obtained, while the negative ones are as close as possible to the class boundary. The actual fraction of the collection that will be annotated will depend upon the efficiency of the use of ontology relations and upon the number of teams that participate in the collaborative annotation. As in previous years, only the teams that have completed their share of the annotations will have access to the full annotation.
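To illustrate how the two relation types can be exploited, here is a rough sketch of propagating partial annotations to a fixed point; the encoding of labels and relations below is an assumption for illustration, not the distributed format:

    # Sketch: propagate annotation labels through "implies"/"excludes"
    # relations. Assumes no relation connects a concept to itself.
    def propagate(labels, implies, excludes):
        """labels:   dict concept -> dict shot_id -> True/False (partial).
        implies:  list of (a, b) pairs meaning "a implies b".
        excludes: list of (a, b) pairs meaning "a excludes b"."""
        changed = True
        while changed:                  # iterate to a fixed point
            changed = False
            for a, b in implies:
                for shot, pos in list(labels.get(a, {}).items()):
                    if pos and labels.setdefault(b, {}).get(shot) is None:
                        labels[b][shot] = True    # a positive => b positive
                        changed = True
            for a, b in excludes:
                for shot, pos in list(labels.get(a, {}).items()):
                    if pos and labels.setdefault(b, {}).get(shot) is None:
                        labels[b][shot] = False   # a positive => b negative
                        changed = True
        return labels

Because transitively derivable relations are not distributed, iterating until nothing changes recovers their consequences.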
Of the 346 test concepts, the organizers plan to judge a subset of 50 (20 by NIST plus 30 by Quaero): the 20 for the light task will be drawn from the subset of 50 defined above, and the additional 30 for the full task will come from beyond that list of 50 but within the 346 mentioned above. The concepts selected for evaluation in both the light and the full tasks will not be known to participants at submission time, so that participants really work on detection methods for large sets of concepts.
Among the 346 concepts of the full task, several are likely to lead to discussion. These are left in the set, but they will not be selected for assessment; the same holds for those concepts that are found to be too frequent or too rare in the development set.
It is expected that advanced methods will use the annotations of non-evaluated concepts and the ontology relations to improve the detection of the evaluated concepts. The use of the additional annotations and of ontology relations is optional and comparison between methods that use them and methods that do not is encouraged.
Please note these restrictions and this information on training types.
Each team may submit a maximum of 4 prioritized runs. All runs will be evaluated but not all may be included in the pools for judgment. The submission format is described below.
Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., Xerces-J: java sax.SAXCount -v YourSubmission.xml.
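If Xerces-J is not convenient, a basic check can be scripted, for example in Python as sketched below; note this catches well-formedness errors only, and validation against the DTD still requires a validating parser:

    import sys
    import xml.sax

    try:
        # The default SAX parser catches well-formedness errors; full DTD
        # validation still needs a validating parser (Xerces-J, lxml, ...).
        xml.sax.parse(sys.argv[1], xml.sax.ContentHandler())
        print("well-formed")
    except xml.sax.SAXParseException as err:
        print("parse error:", err)
        sys.exit(1)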
A subset of the submitted feature results (at least 20), to be announced only after the submission date, will be evaluated by assessors at NIST using pooling and sampling.
Please note that NIST uses a number of rules in manual assessment of system output.
Measures (per run):
The known-item search task models the situation in which someone knows of a video, has seen it before, believes it is contained in a collection, but doesn't know where to look. To begin the search process, the searcher formulates a text-only description, which captures what the searcher remembers about the target video.
Given a text-only description of the video desired (i.e. a topic) and a test collection of video with associated metadata:
The topic will also contain a list of 1-5 words or short phrases, each identifying an object/person/location that must be visible in the target video
The test data set to be searched (IACC.1.B) will be 200 hours drawn from the IACC.1 collection using videos with durations between 10 seconds and 3.5 minutes.
Approximately 300 new text-only topics will be used. (A subset of 25 to be used for interactive systems.) Here are some example topics with links to the target video. Actual topics will contain a list of "visual cues" - a comma-delimited list of words and phrases that identify people, things, places, actions, etc. which should appear in the target video.
Test data (IACC.1.B) is available by download from NIST. Topics from 2010 are available from the TRECVID Past Data page. Last year's (2010) test data (IACC.1.A) is available to active participants by download from NIST.
Please note these restrictions and this information on training types.
Each team may submit a maximum of 4 prioritized runs.
Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., Xerces-J: java sax.SAXCount -v YourSubmission.xml.
In addition to the 4 training (A-D) conditions listed above, each search run will declare whether
Here for download (though they may not display properly) is a DTD for search results of one run, one for results from multiple runs, and a small example of what a site would send to NIST for evaluation. Please check your submission to see that it is well-formed.
You may submit all your runs in one or multiple files as long as you do not break a run across files. EACH file you submit should begin, as in the example submission, with the DOCTYPE statement and a videoSearchResults element even if only one run is included.
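For illustration, a minimal sketch of writing a file with that overall shape follows; the DTD file name and everything inside the root element are placeholders to be filled in according to the supplied DTDs:

    # Sketch: write a single-run submission file with the required DOCTYPE
    # and videoSearchResults root element. All names other than
    # videoSearchResults are placeholders.
    header = ('<?xml version="1.0" encoding="UTF-8"?>\n'
              '<!DOCTYPE videoSearchResults SYSTEM "videoSearchResults.dtd">\n')

    with open("myrun.xml", "w") as f:
        f.write(header)
        f.write("<videoSearchResults>\n")
        # ... one or more runs go here, structured per the supplied DTD ...
        f.write("</videoSearchResults>\n")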
Submissions will be transmitted to NIST via a webpage. The submission process for this task does NOT check your submission for correctness. You must do that before you submit.
Ground truth will be known when the topic is created. Scoring will be automatic. Automatic runs will be scored against the ground truth using the mean inverted rank at which the known item is found, or equivalent. Interactive runs will be scored in terms of whether the item was found, elapsed time, and user satisfaction.
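A sketch of the automatic measure as described, assuming one known item per topic and run results represented as ranked lists of item IDs:

    # Sketch: mean inverted rank for automatic known-item runs. A topic
    # scores 1/rank if the known item appears at that (1-based) rank in
    # the result list, and 0 if it is absent.
    def mean_inverted_rank(results, known_items):
        """results:     dict topic -> ranked list of returned item IDs.
        known_items: dict topic -> the single known-item ID."""
        total = 0.0
        for topic, target in known_items.items():
            ranked = results.get(topic, [])
            if target in ranked:
                total += 1.0 / (ranked.index(target) + 1)  # 1-based rank
        return total / len(known_items)   # unfound items contribute 0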
As used here, a copy is a segment of video derived from another video, usually by means of various transformations such as addition, deletion, modification (of aspect, color, contrast, encoding, ...), camcording, etc. Detecting copies is important for copyright control, business intelligence and advertisement tracking, law enforcement investigations, etc. Content-based copy detection offers an alternative to watermarking. The TRECVID copy detection task will be based on the framework tested in TRECVID 2008, which used the CIVR 2007 Muscle benchmark.
Based on results from 2008-2009 and other evidence of the importance of a full multimedia approach to copy detection, only one sort of query will be tested in 2011: audio+video. Video-only and audio-only queries will be created in MPEG-1 format and distributed but not tested separately again.
At least two runs will be required - one for each of two application profiles. One profile will aim to reduce the false alarm rate to 0 and then optimize the probability of miss and the speed. The second will set an equal cost for false alarms and misses.
The required system task will be as follows: given a test collection of videos and a set of queries, determine for each query the place, if any, that some part of the query occurs, with possible transformations, in the test collection. The set of possible transformations will be based to the extent possible on actually occurring transformations.
Each query will be constructed using tools developed by IMEDIA to include some randomization at various decision points in the construction of the query set. Some manual procedures (e.g. for the camcording transformation) were used in 2008. The automatic tools developed by IMEDIA for TRECVID 2008 are available for download.
As in 2008 and 2009, NIST will create a set of audio-only queries and a set of video-only queries for download and provide an ffmpeg script which participants will use to recombine the transformed video-only and audio-only queries to create the ~10000 audio+video queries.
For each query, the tools will take a segment from the test collection, optionally transform it, embed it in some video segment which does not occur in the test collection, and then finally apply a video transformation to the entire query segment. Analogous processing will be used to create audio-only versions of each query. Some queries may contain no test segment; others may be composed entirely of the test segment. Video transformations used in 2008 are documented in the general plan for query creation and in the final video transformations document with examples. For 2011 we will use the subset of the 2008 video transformations listed here:
The audio-only component queries will be generated along the same lines as the video-only component queries: a set of base audio-only queries is transformed by several techniques that are intended to be typical of those that would occur in real reuse scenarios: (1) bandwidth limitation, (2) other coding-related distortion (e.g., subband quantization noise), (3) variable mixing with unrelated audio content. The audio transformations to be used in 2011 (the same as in 2009) are documented here. The transformed queries will be downloadable from NIST.
The final audio+video queries will consist of the aligned versions of transformed audio and video queries, i.e., they will be various combinations of transformed audio and transformed video from a given base audio+video query. In this way sites can study the effectiveness of their systems for individual audio and video transformations and their combinations. These queries will not be downloadable. Rather, NIST will provide a list of how to construct each final audio+video test query so that, given the audio-only queries and the video-only queries, sites can use a tool such as ffmpeg to construct the audio+video queries. Constructing the final queries at NIST for download would not be practical given the enormous size of that data.
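NIST will supply the actual script; for illustration, a recombination step along these lines could be done with ffmpeg as sketched below. The file names are made up, and the exact flags depend on your ffmpeg version and the distributed file formats:

    import subprocess

    def mux(video_path, audio_path, out_path):
        # Take the video stream from the first input and the audio stream
        # from the second, copying both without re-encoding.
        subprocess.check_call([
            "ffmpeg", "-i", video_path, "-i", audio_path,
            "-map", "0:v", "-map", "1:a",
            "-c", "copy",
            out_path,
        ])

    # mux("q0123.vonly.mpg", "q0123.aonly.mp3", "q0123.av.mpg")  # made-up names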
For testing, the reference data will be identical to that used in 2010, ~12000 files: IACC.1.A + IACC.1.tv10.training. The collection.xml file covering both data sets is available here. Both the files marked use="devel" and those marked use="test" in that file are being used as test files for the copy detection task in 2011. The non-reference data will not be identified other than to say it will be drawn from Internet Archive videos available under Creative Commons licenses. Since the query set is large and randomized, reusing the test data should allow comparison of 2011 systems to those from 2010.
For development, the reference video and copy detection queries used in TRECVID 2009 will be available (tv9.sv.test, tv7.sv.devel, tv7.sv.test) from the NIST server - though these videos will differ markedly from the 2011 test data. See these instructions for information on how to obtain these videos.
Each team may submit at most 4 runs. As explained in the evaluation document at least two runs are required from each team. Information on submission of copy detection runs has been collected in a separate document.
Submissions will be transmitted to NIST via a webpage.
Detecting human behaviors efficiently in vast amounts of surveillance video, both retrospectively and in real time, is fundamental technology for a variety of higher-level applications of critical importance to public safety and security.
In light of results for 2010, in 2011 we will rerun the 2009 task/data using the 2009 ground truth. Systems will process 45 hours of data and will be evaluated on 15 hours for the selected events. Evaluation will be in terms of a cost function based on a weighted sum of the probability of miss and the rate of false alarms.
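A sketch of the general shape of such a cost function; the weights and normalization here are placeholders, not the official evaluation parameters:

    # Sketch: weighted sum of miss probability and false alarm rate.
    # w_miss and w_fa are placeholder weights, not the official values.
    def detection_cost(n_miss, n_true, n_fa, hours, w_miss=1.0, w_fa=1.0):
        p_miss = n_miss / n_true    # probability of missing a true event
        r_fa = n_fa / hours         # false alarms per hour of video
        return w_miss * p_miss + w_fa * r_fa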
An important need in many situations involving video collections (archive video search/reuse, personal video organization/search, surveillance, law enforcement, protection of brand/logo use) is to find more video segments of a certain specific person, object, or place, given a visual example.
In 2011 this will again be a pilot task - evaluated by NIST but intended mainly to explore task definition and evaluation issues using data and an evaluation framework in hand. It is a first approximation to the desired full task, using a smaller number of topics, a simpler identification of the target entity, and less accuracy in locating the instance than would be desirable in a full evaluation of the task.
Given a collection of test clips (files) and a collection of queries that delimit a person, object, or place entity in some example video, locate for each query up to the 1000 clips most likely to contain a recognizable instance of the entity. Interactive runs will likely return many fewer than 1000 clips. Note that this year the unit of measure will be the clip, not the shot. NIST will automatically divide the test videos into clips of an arbitrary length. Each query will consist of a set of 5 or so example frame images drawn at intervals from a video containing the item of interest. For each frame image a binary mask of the region of interest will be provided. Each query will also include an indication of the target type, taken from this set of strings: {PERSON, CHARACTER, LOCATION, OBJECT}.
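As a sketch of how a binary mask might be used to isolate the region of interest in an example frame (the file names and image formats below are assumptions, not the distributed query format):

    import numpy as np
    from PIL import Image

    # Made-up file names for one query's frame image and its binary mask.
    frame = np.asarray(Image.open("q9001.frame1.png").convert("RGB"))
    mask = np.asarray(Image.open("q9001.mask1.png").convert("L")) > 0

    roi = frame.copy()
    roi[~mask] = 0    # black out everything outside the region of interest
    Image.fromarray(roi).save("q9001.roi1.png")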
Development data: The test data used in 2010 (tv9.sv.test) is available from NIST by download to active participants. The instance search queries are available from the Past data section of the TRECVID website.
Test data: BBC rushes. In this second approximation to the full task, most queries will target objects that recur in the rushes. Recurring people and locations will be included as needed to create a set of approximately 25 topics. We also plan to apply some transformations at random to the test clips - approximating differences you might see if a clip came from a different camera or was taken under different lighting conditions, etc.
The rushes will be automatically divided into short, roughly equal-length clips and renamed so the clip name does not indicate the original video. Each clip must be processed as if no others existed.
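For development purposes, something similar can be approximated with ffmpeg's segment muxer; the clip length and naming below are arbitrary choices, not the official segmentation:

    import subprocess

    def split_into_clips(video_path, seconds=20):
        subprocess.check_call([
            "ffmpeg", "-i", video_path,
            "-f", "segment", "-segment_time", str(seconds),
            "-c", "copy",               # cut at keyframes, no re-encode
            "clip%04d.mpg",
        ])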
Each team may submit a maximum of 4 prioritized runs. All runs will be evaluated but not all may be included in the pools for judgment. Submissions will be identified as either fully automatic or interactive. Interactive runs will be limited to 15 elapsed minutes per search.
Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., Xerces-J: java sax.SAXCount -v YourSubmission.xml.
Here for download (though they may not display properly) is the DTD for search results of one run, the one for results from multiple runs, and a small example of what a site would send to NIST for evaluation. Please check your submission to see that it is well-formed.
You may submit all your runs in one or multiple files as long as you do not break a run across files. EACH file you submit should begin, as in the example submission, with the DOCTYPE statement and a videoSearchResults element even if only one run is included.
Submissions will be transmitted to NIST via a webpage.
This pilot version of the task will treat it as a form of search and will evaluate it accordingly, with average precision for each query in each run and per-run mean average precision over all queries. As part of the pilot, alternative evaluation schemes will be discussed and tested if possible. While speed and location accuracy are also definitely of interest here, of these two only speed may be measured in the pilot.
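For concreteness, a sketch of the basic (non-interpolated) average precision computation follows; note the official evaluation may use inferred or sampled variants given pooled judgments:

    def average_precision(ranked, relevant):
        """ranked: ordered list of returned clip IDs; relevant: set of true IDs."""
        hits, total = 0, 0.0
        for i, clip in enumerate(ranked, start=1):
            if clip in relevant:
                hits += 1
                total += hits / i       # precision at this recall point
        return total / len(relevant) if relevant else 0.0

    def mean_average_precision(run, truth):
        """run: dict query -> ranked list; truth: dict query -> set of relevant IDs."""
        return sum(average_precision(run.get(q, []), rel)
                   for q, rel in truth.items()) / len(truth)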
Ever expanding multimedia content on the Internet necessitates development of new technologies for content understanding and search for a wide variety of commerce, research, and government applications.
Given a collection of test videos and a list of test events, indicate whether each of the test events is present anywhere in each of the test videos and give the strength of evidence for each such judgment.
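As a rough sketch of the shape of that output - a strength-of-evidence score plus a present/absent judgment per (video, event) pair - where the threshold, score range, and dictionary encoding are assumptions, not the official format:

    def med_decisions(scores, threshold=0.5):
        """scores: dict (clip_id, event_id) -> strength-of-evidence in [0, 1]."""
        return {key: (score, score >= threshold)
                for key, score in scores.items()}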
The events for the 2010 MED pilot task were:
15 new events with textual descriptions will be released for the 2011 MED task. They will be described in detail in the March 1st event kit release.
The MED '11 evaluation will make use of a new data set of 40,000 video clips, against which the 15 new events described above will be tested.
Please refer to Appendix B in the 2011 MED evaluation plan for instructions on the submission process.
From the set of events defined by NIST, participants will choose which event(s) to evaluate.
Participants will evaluate their systems using the ground truth and tools provided by NIST. All participants will report their results in their TRECVID workshop notebook papers.
The latest details are available on the MED webpage.
The following are the target dates for 2011:
Here is a list of work items that must be completed before the guidelines are considered to be final.
Once subscribed, you can post to this list by sending your thoughts as email to tv11.list@nist.gov, where they will be sent out to EVERYONE subscribed to the list, i.e., all the other active participants.