The main goal of the TREC Video Retrieval Evaluation (TRECVID) is to promote progress in content-based analysis of and retrieval from digital video via open, metrics-based evaluation. TRECVID is a laboratory-style evaluation that attempts to model real world situations or significant component tasks involved in such situations.
Up until 2010, TRECVID used test data from a small number of known professional sources - broadcast news organizations, TV program producers, and surveillance systems - that imposed limits on program style, content, production qualities, language, etc. From 2003 to 2006 TRECVID supported experiments in automatic segmentation, indexing, and content-based retrieval of digital video using broadcast news in English, Arabic, and Chinese. TRECVID also completed two years of pilot studies on exploitation of unedited video rushes provided by the BBC. From 2007 to 2009 TRECVID provided participants with cultural, news magazine, documentary, and education programming supplied by the Netherlands Institute for Sound and Vision. Tasks using this video included segmentation, search, feature extraction, and copy detection. Systems were also tested on summarization of the BBC rushes video. Surveillance event detection was evaluated using airport surveillance video provided by the UK Home Office. Many resources created by NIST and the TRECVID community are available for continued research on this data independent of TRECVID. See the Past data section of the TRECVID website for pointers.
In 2010 TRECVID confronted known-item search and semantic indexing systems with a new set of videos (referred to in what follows as IACC.1) characterized by a high degree of diversity in creator, content, style, production qualities, original collection device/encoding, language, etc. - as is common in much "web video". The collection also has associated keywords and descriptions provided by the video donor. The videos are available under Creative Commons licenses from the Internet Archive. The only selection criterion imposed by TRECVID beyond the Creative Commons licensing will be video duration - the videos will be short (less than 4 minutes). In addition to the IACC.1 data set, NIST is developing an Internet multimedia test collection (HAVIC) with the Linguistic Data Consortium and plans to use it in an exploratory pilot task in TRECVID 2010. The airport surveillance video (Gatwick and i-LIDS MCT) used in TRECVID 2009 will be used again in 2010, as will some of the Sound and Vision (S&V) videos used in 2009.
In TRECVID 2010 NIST evaluated systems on the following tasks using the data indicated:
TRECVID 2010 also offered the following pilot task:
A number of datasets are available for use in TRECVID 2010. They are described below, together with their associated resources and information about how each is distributed.
Approximately 8000 Internet Archive videos (50 GB, 200 hours) with Creative Commons licenses, in MPEG-4/H.264, with durations between 10 seconds and 3.5 minutes. Most videos will have some donor-provided metadata available, e.g., title, keywords, and description.
Distribution: Download from NIST/mirror servers.
Master shot reference: Available by download to active participants
Automatic speech recognition (for English): Looking for someone willing to provide this.
Approximately 3200 Internet Archive videos (50 GB, 200 hours) with Creative Commons licenses, in MPEG-4/H.264, with durations between 3.6 and 4.1 minutes. Most videos will have some donor-provided metadata available, e.g., title, keywords, and description.
Distribution: Download from NIST/mirror servers, from archive.org, or send NIST an empty IDE hard drive formatted for NTFS to be filled and returned.
Master shot reference: Available by download to active participants
Common feature annotation: To be provided by download to teams that contribute annotations
Automatic speech recognition (for English): Looking for someone willing to provide this.
tv9.sv.test (114.8 GB) in MPEG-1 is available from NIST by download. See instructions.
The data consist of about 150 hours of airport surveillance video data (courtesy of the UK Home Office). The Linguistic Data Consortium has provided event annotations for the entire corpus. The corpus was divided into development and evaluation subsets. Annotations for 2008 development and test sets are available.
Distribution:
Gatwick development data (2008 DevSet and 2008 EvalSet) by download from password-protected servers at NIST and mirror sites; new 2009 i-LIDS test data from UK Home Office. See here for details.
Development data annotations: available by download.
HAVIC is designed to be a large new collection of Internet multimedia. Construction by the Linguistic Data Consortium and NIST will begin early in 2010.
Distribution:
See the Multimedia Event Detection task webpage for details.
In order to be eligible to receive the data, you must have applied for participation in TRECVID. Your application will be acknowledged by NIST with a team ID, an active participant's password, and information about how to obtain the data.
Then you will need to complete the relevant permission forms (from the active participant's area) and email the scanned page images as one Adobe Acrobat PDF of the document to NIST. In your email include the following:
As Subject: "TRECVID data request"
In the body:
  - your name
  - your short team ID (given when you applied to participate)
  - the kinds of data you will be using (S&V, i-LIDS, IACC.1)
Please ask only for the test data (and optional development data) required for the task(s) you apply to participate in and intend to complete.
Within a few days after the permission forms have been received, you will be emailed the access codes you need to download the data using the information about data servers in the active participant's area.
This task will be coordinated by Georges Quénot - with Franck Thollard, Andy Tseng, Bahjat Safadi from the Laboratoire d'Informatique de Grenoble and Stéfane Ayache from the Laboratoire d'Informatique Fondamentale de Marseille using support from the Quaero Programme and in collaboration with NIST.
Automatic assignment of semantic tags to video segments can be fundamental technology for filtering, categorization, browsing, search, and other video exploitation. New technical issues to be addressed include methods needed/possible as collection size and diversity increase, when the number of features increases, and when features are related by an ontology.
Given the test collection, master shot reference, and feature definitions, return for each feature a list of at most 2000 shot IDs from the test collection, ranked according to their likelihood of containing the feature.
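For illustration only, here is a minimal sketch of producing such a ranked list from per-shot detection scores; the shot IDs and scores are hypothetical toy values, and a real submission must additionally follow the submission format and DTD described below.

    # Minimal sketch: turn per-shot detection scores for one concept into the
    # ranked list required by the task (at most 2000 shot IDs, best first).
    # Shot IDs and scores below are hypothetical toy values.

    def ranked_shot_list(scores, max_results=2000):
        """scores: dict mapping shot ID to a detection score for one concept."""
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:max_results]

    # Toy example for one concept:
    toy_scores = {"shot1_3": 0.91, "shot2_7": 0.15, "shot1_9": 0.66}
    print(ranked_shot_list(toy_scores))  # ['shot1_3', 'shot1_9', 'shot2_7']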
The test data set (IACC.1.A) will be 200 hours drawn from the IACC.1 collection using videos with durations between 10 seconds and 3.5 minutes.
A development data set (IACC.1.tv10.training) will be 200 hours drawn from the IACC.1 collection using videos with durations just longer than 3.5 minutes.
130 concepts have been selected for the TRECVID 2010 semantic indexing task. These include all the TRECVID "high level features" from 2005 to 2009 plus a selection of LSCOM concepts so that we end up with a number of generic-specific relations among them. The goal is to promote research on methods for indexing many concepts and using ontology relations between them. Also it is expected that these concepts will be useful for the content-based (known item) search task.
All these concepts will be annotated in the collaborative annotation, but only a fraction of them (20) will be evaluated by NIST. It is expected that advanced methods will use the annotations of non-evaluated concepts and the ontology relations to improve the detection of the evaluated concepts. The use of the additional annotations and of ontology relations is optional, and comparison between methods that use them and methods that do not is encouraged. Some concepts that were judged problematic in previous years, like "corporate leader" or "entertainment", have been kept in the selection, but these will not be among the evaluated ones. Participants are free to use them or not.
Two types of submissions will be considered: "full" (or "regular"), in which participants will be required to provide a result for all the proposed concepts, and "light", in which participants will be required to provide a result only for a predefined list of 10 concepts. The 10 features for the light runs are as follows (giving the feature number, short name, and description; see Table 1 (an Excel spreadsheet) for more information):
The list of concepts that will be evaluated for the full submissions in addition to those that will be evaluated for the light submissions will not be known before the submission deadline so that participants in the full evaluation will really work on detection methods for large sets of concepts.
The TRECVID 2005 to 2009 HLFs have been included in the concept set in order to favor the reuse of already available annotations and judgments and to encourage cross-domain evaluations, though these might not be part of the TRECVID 2010 evaluation. The concept definitions may have changed slightly for the following reasons: when two (or more) very similar HLFs were found in different TRECVID editions, they were merged into a single TRECVID 2010 concept; when one or more previous TRECVID HLFs were found to be very similar to an LSCOM concept, whether or not this was intentional when defining these HLFs, they were mapped to this LSCOM concept. Finally, almost all previous TRECVID HLFs could be matched to an LSCOM concept, and that LSCOM concept becomes the TRECVID 2010 reference. In some cases the match is exact or almost exact; in other cases the match is only approximate.
Table 1 (in an Excel spreadsheet) gives the list of TRECVID 2010 concepts with a main reference to their LSCOM identifier when available. It also gives the correspondence with HLFs from previous TRECVID editions, indicating whether the match is considered to be exact, almost exact, or only approximate. This table will be completed with matches to concepts from other evaluation campaigns (e.g., the Pascal VOC challenge or ImageCLEF), again to encourage and facilitate cross-domain evaluations.
The annotations will be provided in the same format as in previous years, but unlike in previous years they will not be complete. The ontology relations and an active learning method will be used in the collaborative annotation to ensure that each annotation made is as useful as possible and, especially, that as many positive samples of the sparse concepts as possible are obtained, while the negative ones are as close as possible to the class boundary. The actual fraction of the collection that will be annotated will depend upon the efficiency of the use of ontology relations and upon the number of teams that participate in the collaborative annotation. As in previous years, only the teams that have completed their share of the annotations will have access to the full annotation.
Ontology relations are available in a text file with two types of relations: A implies B and A excludes B. Relations that can be derived by transitivity will not be included. Participants are free to use the relations or not and submissions are not required to comply with them.
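As a rough illustration of how such relations might be exploited, the sketch below parses relations of the form "A implies B" / "A excludes B" (the exact file syntax is an assumption), closes the implications transitively, and propagates a shot's concept scores upward to implied concepts. This is one possible strategy, not a prescribed method.

    # Sketch of one possible use of the ontology relations. The file syntax is
    # assumed to be one relation per line, e.g. "Dog implies Animal"; the real
    # file format may differ. Implications are closed transitively and each
    # detection score is propagated upward to implied concepts.

    from collections import defaultdict

    def load_relations(path):
        implies, excludes = defaultdict(set), defaultdict(set)
        with open(path) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 3 and parts[1] == "implies":
                    implies[parts[0]].add(parts[2])
                elif len(parts) == 3 and parts[1] == "excludes":
                    excludes[parts[0]].add(parts[2])
        return implies, excludes

    def transitive_closure(implies):
        closed = {a: set(bs) for a, bs in implies.items()}
        changed = True
        while changed:
            changed = False
            for a in list(closed):
                for b in list(closed[a]):
                    for c in closed.get(b, ()):
                        if c not in closed[a]:
                            closed[a].add(c)
                            changed = True
        return closed

    def propagate(scores, closed_implies):
        """scores: {concept: score} for one shot. If A implies B, the shot's
        score for B is raised to at least its score for A."""
        out = dict(scores)
        for a, score in scores.items():
            for b in closed_implies.get(a, ()):
                out[b] = max(out.get(b, 0.0), score)
        return out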
Please note these restrictions and this information on training types.
Each team may submit a maximum of 4 prioritized runs. All runs will be evaluated but not all may be included in the pools for judgment. The submission format is described below.
Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., Xerces-J: java sax.SAXCount -v YourSubmission.xml.
A subset of the submitted feature results (at least 20), to be announced only after the submission date, will be evaluated by assessors at NIST using pooling and sampling.
Please note that NIST uses a number of rules in manual assessment of system output.
Measures (per run):
The known-item search task models the situation in which someone knows of a video, has seen it before, believes it is contained in a collection, but doesn't know where to look. To begin the search process, the searcher formulates a text-only description, which captures what the searcher remembers about the target video.
Given a text-only description of the video desired (i.e. a topic) and a test collection of video with associated metadata:
The topic will also contain a list of 1-5 words or short phrases, each identifying an object/person/location that must be visible in the target video.
The test data set (IACC.1.A) will be 200 hours drawn from the IACC.1 collection using videos with durations between 10 seconds and 3.5 minutes. 200 additional hours will be added each year up to a total of 600 hours.
100 - 200 new text-only topics each year. (A subset of 24 to be used for interactive systems.) Here are some example topics with links to the target video. Actual topics will contain a list of "visual cues" - a comma-delimited list of words and phrases that identify people, things, places, actions, etc. which should appear in the target video.
Please note these restrictions and this information on training types.
Each team may submit a maximum of 4 prioritized runs.
Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., Xerces-J: java sax.SAXCount -v YourSubmission.xml.
In addition to the 4 training (A-D) conditions listed above, each search run will declare whether
Here for download (though they may not display properly) is a DTD for search results of one run, one for results from multiple runs, and a small example of what a site would send to NIST for evaluation. Please check your submission to see that it is well-formed.
You may submit all your runs in one or multiple files as long as you do not break a run across files. EACH file you submit should begin, as in the example submission, with the DOCTYPE statement and a videoSearchResults element even if only one run is included.
Submissions will be transmitted to NIST via this webpage. The submission process for this task does NOT check your submission for correctness. You must do that before you submit.
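A quick local check of well-formedness can be done with a standard XML parser, as in the sketch below. Note that this checks well-formedness only, not validity against the DTD; for validation use a validating parser such as the Xerces-J command shown above. The file name is just an example.

    # Quick local sanity check before uploading: parse the submission file and
    # fail loudly if it is not well-formed XML. This does NOT validate against
    # the DTD; use a validating parser (e.g. Xerces-J) for that.

    import sys
    import xml.sax

    class _Counter(xml.sax.ContentHandler):
        def __init__(self):
            self.elements = 0
        def startElement(self, name, attrs):
            self.elements += 1

    def check(path):
        handler = _Counter()
        xml.sax.parse(path, handler)  # raises SAXParseException if malformed
        print(f"{path}: well-formed, {handler.elements} elements")

    if __name__ == "__main__":
        check(sys.argv[1] if len(sys.argv) > 1 else "YourSubmission.xml")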
Ground truth will be known when the topic is created. Scoring will be automatic. Automatic runs will be scored against the ground truth using the mean inverted rank at which the known item is found, or an equivalent measure. Interactive runs will be scored in terms of whether the item was found, elapsed time, and user satisfaction.
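For orientation, the sketch below computes mean inverted (reciprocal) rank under the common convention of scoring 1/rank when the known item is found and 0 otherwise; the official NIST scoring may differ in details.

    # Sketch of mean inverted (reciprocal) rank scoring for automatic known-item
    # runs, assuming 1/rank when the known item appears in the ranked list and 0
    # when it does not; the official scoring tool may differ in details.

    def mean_inverted_rank(results, known_items):
        """results: {topic_id: [video_id, ...] ranked best first}
           known_items: {topic_id: video_id of the target}"""
        total = 0.0
        for topic, target in known_items.items():
            ranked = results.get(topic, [])
            total += 1.0 / (ranked.index(target) + 1) if target in ranked else 0.0
        return total / len(known_items)

    # Toy example: topic 1 found at rank 2, topic 2 not found -> (0.5 + 0) / 2 = 0.25
    print(mean_inverted_rank({1: ["v9", "v3"], 2: ["v1"]}, {1: "v3", 2: "v7"}))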
As used here, a copy is a segment of video derived from another video, usually by means of various transformations such as addition, deletion, modification (of aspect, color, contrast, encoding, ...), camcording, etc. Detecting copies is important for copyright control, business intelligence and advertisement tracking, law enforcement investigations, etc. Content-based copy detection offers an alternative to watermarking. The TRECVID copy detection task will be based on the framework tested in TRECVID 2008, which used the CIVR 2007 Muscle benchmark.
Based on results from 2008-2009 and other evidence of the importance of a full multimedia approach to copy detection, only one sort of query will be tested in 2010: audio+video. Video-only and audio-only queries will be created in MPEG-1 format and distributed but not tested again separately.
At least two audio+video query runs will be required - one for each of two application profiles. One profile will aim to reduce the false alarm rate to 0 and then optimize the probability of miss and the speed. The second will set an equal cost for false alarms and misses.
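One way the two profiles might translate into system tuning is sketched below on hypothetical development data: the first profile picks the threshold that still rejects every false alarm, while the second minimizes an equally weighted sum of misses and false alarms. Both the data layout and the simple cost formula are illustrative assumptions, not the official evaluation measure.

    # Hedged sketch of how the two application profiles might drive decision
    # threshold selection on development data. Each candidate match is a pair
    # (score, is_copy), where is_copy=True marks a genuine copy.

    def threshold_no_false_alarm(candidates):
        """Profile 1: lowest threshold that still rejects every false alarm."""
        fa_scores = [s for s, is_copy in candidates if not is_copy]
        return max(fa_scores) + 1e-9 if fa_scores else 0.0

    def threshold_equal_cost(candidates):
        """Profile 2: threshold minimizing (#misses + #false alarms),
        with equal unit costs (illustrative only)."""
        best_t, best_cost = 0.0, float("inf")
        for t in sorted({s for s, _ in candidates}):
            misses = sum(1 for s, c in candidates if c and s < t)
            fas = sum(1 for s, c in candidates if not c and s >= t)
            if misses + fas < best_cost:
                best_t, best_cost = t, misses + fas
        return best_t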
The required system task will be as follows: given a test collection of videos and a set of queries, determine for each query the place, if any, that some part of the query occurs, with possible transformations, in the test collection. The set of possible transformations will be based to the extent possible on actually occurring transformations.
Each query will be constructed using tools developed by IMEDIA to include some randomization at various decision points in the construction of the query set. Some manual procedures (e.g. for the camcording transformation) were used in 2008. The automatic tools developed by IMEDIA for TRECVID 2008 are available for download.
As in 2008 and 2009, NIST will create a set of audio-only queries and a set of video-only queries for download and provide an ffmpeg script which participants will use to recombine the transformed video-only and audio-only queries to create the ~10000 audio+video queries.
For each query, the tools will take a segment from the test collection, optionally transform it, embed it in some video segment which does not occur in the test collection, and then finally apply a video transformation to the entire query segment. Analogous processing will be used to create audio-only versions of each query. Some queries may contain no test segment; others may be composed entirely of the test segment. Video transformations used in 2008 are documented in the general plan for query creation and in the final video transformations document with examples. For 2010 we will use the subset of the 2008 video transformations listed here:
The audio-only queries will be generated along the same lines as the video-only queries: a set of base audio-only queries is transformed by several techniques that are intended to be typical of those that would occur in real reuse scenarios: (1) bandwidth limitation, (2) other coding-related distortion (e.g. subband quantization noise), (3) variable mixing with unrelated audio content. The audio transformations to be used in 2010 (the same as in 2009) are documented here. The transformed queries will be downloadable from NIST.
The audio+video queries will consist of the aligned versions of transformed audio and video queries, i.e., they will be various combinations of transformed audio and transformed video from a given base audio+video query. In this way sites can study the effectiveness of their systems for individual audio and video transformations and their combinations. These queries will not be downloadable. Rather, NIST will provide a list of how to construct each audio+video test query so that, given the audio-only queries and the video-only queries, sites can use a tool such as ffmpeg to construct the audio+video queries.
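As a sketch of the recombination step, the code below muxes one transformed video-only query with the corresponding transformed audio-only query using ffmpeg via subprocess. File names are hypothetical, and the authoritative commands and options are those in the construction list and script that NIST provides.

    # Sketch of recombining one transformed video-only query with its transformed
    # audio-only counterpart into an audio+video query using ffmpeg. File names
    # are hypothetical; the commands NIST distributes are authoritative.

    import subprocess

    def combine(video_path, audio_path, out_path):
        subprocess.run(
            ["ffmpeg", "-y",
             "-i", video_path,             # transformed video-only query
             "-i", audio_path,             # transformed audio-only query
             "-map", "0:v", "-map", "1:a",
             "-c:v", "copy",               # keep the transformed video stream as-is;
             out_path],                    # audio is re-encoded to fit the container
            check=True)

    # Example (hypothetical file names):
    # combine("query0001.video.mpg", "query0001.audio.wav", "query0001.av.mpg")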
For testing, the reference data will be identical to the 400 hours and ~12000 files in the test and training data for the semantic indexing task: IACC.1.A + IACC.1.tv10.training. The non-reference data will not be identified other than to say it will be drawn from Internet Archive videos available under Creative Commons licenses.
For development, the reference video and copy detection queries used in TRECVID 2009 will be available (tv9.sv.test, tv7.sv.devel, tv7.sv.test) from the NIST server - though these videos will differ markedly from the 2010 test data. See these instructions for information on how to obtain these videos.
Each team may submit at most 4 runs. As explained in the evaluation document at least two runs are required from each team. Information on submission of copy detection runs has been collected in a separate document.
Submissions will be transmitted to NIST via this webpage.
Detecting human behaviors efficiently in vast amounts of surveillance video, both retrospectively and in real time, is fundamental technology for a variety of higher-level applications of critical importance to public safety and security.
In light of results from 2009, in 2010 we will rerun the 2009 task/data using the 2009 ground truth. Systems will process 45 hours of data and will be evaluated on 15 hours for the selected events. Evaluation will be in terms of a cost function based on a weighted sum of the probability of missed detections and the rate of false alarms.
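For concreteness, a cost of that general form can be written as below; the weights and normalization are placeholders for illustration and are not the official constants used by the NIST scoring tools.

    # Minimal sketch of a detection cost of the stated form: a weighted sum of the
    # probability of missed detections and the rate of false alarms. The weights
    # and false-alarm normalization are placeholders, not the official constants.

    def detection_cost(n_miss, n_true, n_fa, hours_processed,
                       w_miss=1.0, w_fa=0.1):
        p_miss = n_miss / n_true if n_true else 0.0   # missed-detection probability
        r_fa = n_fa / hours_processed                 # false alarms per hour
        return w_miss * p_miss + w_fa * r_fa

    # Example: 20 of 50 events missed, 30 false alarms over 15 hours:
    print(detection_cost(n_miss=20, n_true=50, n_fa=30, hours_processed=15))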
The following changes to the 2009 task are planned for 2010:
An important need in many situations involving video collections (archive video search/reuse, personal video organization/search, surveillance, law enforcement, protection of brand/logo use) is to find more video segments of a certain specific person, object, or place, given a visual example.
In 2010 this will be a pilot task - evaluated by NIST but intended mainly to explore task definition and evaluation issues using data and an evaluation framework in hand - in a first approximation to the desired full task using a smaller number of topics, a simpler identification of the target entity, and less accuracy in locating the instance than would be desirable in a full evaluation of the task.
Given a collection of test videos, a master shot reference, and a collection of queries that delimit a person, object, or place entity in some example video, locate for each query the 1000 shots most likely to contain a recognizable instance of the entity. Each query will consist of a set of 5 or so example frame images drawn at intervals from a video containing the item of interest. For each frame image several versions of the image marking the target region will be provided along with the name of video from which the example images were taken and the type of the target (PERSON, CHARACTER, OBJECT, or LOCATION).
Test data: Sound and Vision data from TRECVID 2007-2009 (tv9.sv.test). In a first approximation to the full task, most queries will target actors that recur in Sound and Vision programs - in different clothes, costumes, settings, etc. Where possible we will look for other targets.
As this is a pilot task, participants are encouraged to help by examining the test data and contributing up to 5 topics per team with non-person/character targets. Each such topic is to be sent only to NIST by 7 June and must include the names of one or more Sound and Vision files from the test set and one or more Sound and Vision files from outside the test set. NIST will remove duplicates and select a subset of such topics for inclusion with the topics produced by NIST.
Each team may submit a maximum of 4 prioritized runs. All runs will be evaluated but not all may be included in the pools for judgment.
Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., Xerces-J: java sax.SAXCount -v YourSubmission.xml.
Here for download (though they may not display properly) is the DTD for search results of one run, the one for results from multiple runs, and a small example of what a site would send to NIST for evaluation. Please check your submission to see that it is well-formed.
You may submit all your runs in one or multiple files as long as you do not break a run across files. EACH file you submit should begin, as in the example submission, with the DOCTYPE statement and a videoSearchResults element even if only one run is included.
Submissions will be transmitted to NIST via this webpage.
This pilot version of the task will treat it as a form of search and will evaluate it accordingly with average precision for each query in each run and per-run mean average precision over all queries. As part of the pilot, alternative evaluation schemes will be discussed and tested if possible. While speed and location accuracy are also definitely of interest here, of these two only speed may be measured in the pilot.
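The per-query and per-run measures just described amount to standard average precision and mean average precision; a minimal sketch is given below, assuming the usual normalization by the total number of relevant shots (the official scoring tools should be treated as definitive).

    # Sketch of the evaluation described above: average precision (AP) for each
    # query and mean average precision (MAP) over all queries in a run. Dividing
    # by the total number of relevant shots is the usual convention assumed here.

    def average_precision(ranked, relevant):
        """ranked: list of shot IDs, best first; relevant: set of relevant shot IDs."""
        hits, precision_sum = 0, 0.0
        for i, shot in enumerate(ranked, start=1):
            if shot in relevant:
                hits += 1
                precision_sum += hits / i
        return precision_sum / len(relevant) if relevant else 0.0

    def mean_average_precision(run, judgments):
        """run: {query_id: ranked list}; judgments: {query_id: set of relevant shots}."""
        aps = [average_precision(run.get(q, []), rel) for q, rel in judgments.items()]
        return sum(aps) / len(aps) if aps else 0.0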
Ever expanding multimedia content on the Internet necessitates development of new technologies for content understanding and search for a wide variety of commerce, research, and government applications.
In 2010 this task will be treated as exploratory, i.e., the emphasis will be on supporting initial exploration of the new video collection, task definition, evaluation framework, and a variety of technical approaches to the system task - not system rankings.
Given a collection of test videos and a list of test events, indicate whether each of the test events is present anywhere in each of the test videos and give the strength of evidence for each such judgment.
3 new events with textual descriptions have been defined for the 2010 pilot task. The events will be:
100 hours of short, diverse videos from the Internet will be divided into two datasets of approximately equal duration that will be distributed: one for development and one for testing.
Submissions to NIST will be required only to allow NIST to bootstrap improvement of the test set ground truth. After the submission deadline, each submitting site will be provided with the test data ground truth.
From the set of events defined by NIST, participants will choose which event(s) to evaluate.
Participants will evaluate their systems using the ground truth and tools provided by NIST. All participants will report their results in their TRECVID workshop notebook papers along with all of the following:
For this 2010 pilot evaluation, NIST will not associate the identification of sites with their results in publications. But NIST will openly present an overview and discuss the results at the TRECVID workshop.
The latest details are available on the MED webpage.
The following are the target dates for 2010:
NOTE: The schedule for the exploratory task - event detection in Internet multimedia - will be developed by NIST with participants over the next several months. Sample data is expected to be available by June 2010.
Here is a list of work items that must be completed before the guidelines are considered to be final.
Once subscribed, you can post to this list by sending your thoughts as email to tv10list@nist.gov, where they will be sent out to EVERYONE subscribed to the list, i.e., all the other active participants.