Video to Text Description (VTT)
Automatic annotation of videos using natural language text descriptions has been a long-standing goal
of computer vision. The task requires understanding many concepts, such as objects, actions, scenes,
person-object relations, the temporal order of events, and many others. In recent years, major
advances in computer vision techniques have enabled researchers to attempt this problem. Many application scenarios can greatly benefit from such technology,
such as summarizing videos in natural language, facilitating the search and browsing of
video archives using such descriptions, and describing videos to the blind. In addition, learning to
interpret videos and the temporal relations among events in them will likely contribute to other computer
vision tasks, such as the prediction of future events from video.
System Task
The VTT Task has traditionally been divided into multiple subtasks. However, this year we will only conduct the main Description Generation task.
The Matching and Ranking and Fill-in-the-Blanks subtasks from previous editions have been discontinued.
Description Generation:
Given a set of X short videos (between 3 and 10 seconds long), the subtask is as follows:
- For each video, automatically generate a single sentence that best describes the video using natural language.
- For each generated sentence, the system will also provide a confidence score between 0 and 1 indicating how confident the system is in the quality of that sentence. This score will be used for analysis only and will not be used during the evaluation (one possible way to derive such a score is sketched below).
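The task does not prescribe how the confidence score is computed; any value in [0, 1] that reflects the system's belief in its own output is acceptable. The sketch below shows one common heuristic, assuming a captioning model that exposes per-token log-probabilities for the generated sentence (the `caption_model.generate` call is a hypothetical stand-in for whatever system a team uses).

```python
import math

def sentence_confidence(token_logprobs):
    """Map per-token log-probabilities of a generated caption to a score in [0, 1].

    Uses the exponential of the mean token log-probability (a length-normalized
    likelihood). Any monotone mapping into [0, 1] would satisfy the run format.
    """
    if not token_logprobs:
        return 0.0
    mean_logprob = sum(token_logprobs) / len(token_logprobs)
    return max(0.0, min(1.0, math.exp(mean_logprob)))

# Hypothetical usage (captioning system not specified by the task):
# sentence, logprobs = caption_model.generate(video_segment)
# confidence = sentence_confidence(logprobs)
```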
Data Resources
The Test Dataset
The test dataset will contain video segments from the V3C1 collection, which is part of the Vimeo Creative Commons Collection (V3C). V3C1 consists of 7475 Vimeo videos, which total about 1000 hours. These videos are divided into over 1 million segments. For this task, we only use a small subset of the segments, where each segment is between 3 and 10 seconds long.
Approximately 2000 videos will be selected and annotated for the description generation subtask. Each video will be annotated 5 times.
You must sign and return this agreement to access the data. Please sign and return the form as soon as possible to ensure that you can download the test data as it becomes available.
Annotators will be asked to combine into a single sentence, if appropriate and available, four facets of the video they are describing:
- Who is the video showing? This includes concrete objects and beings (kinds of persons, animals, things, etc.)
- What are the objects and beings doing? This includes generic actions, conditions/states, or events.
- Where is the video taken? This can be the locale, site, or place (geographic or architectural).
- When is the video taken? For example, the time of the day, or the season, etc.
The distribution of the test dataset will be announced via the active participants mailing list and according to the published schedule.
Run Submission Types
Systems are required to choose a run type based on the training data and the features used: the training data may consist of image and/or video captioning datasets, and the features may be visual only or audio and visual. The run type is written as follows (a small sketch of constructing and checking this string follows the two lists below):
runtype = <training_data_type> <training_features_type>
The training data types are specified as follows:
- I: Only image captioning datasets were used for training.
- V: Only video captioning datasets were used for training.
- B: Both image and video captioning datasets were used for training.
The feature types are specified as follows:
- V: Only visual features are used.
- A: Both audio and visual features are used.
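A minimal sketch of how a team might assemble and sanity-check the two-character run type string before writing a submission file; the helper and constant names here are illustrative, not prescribed by the task.

```python
TRAINING_DATA_TYPES = {"I", "V", "B"}   # image-only, video-only, both
FEATURE_TYPES = {"V", "A"}              # visual-only, audio+visual

def make_run_type(training_data_type: str, features_type: str) -> str:
    """Concatenate the two codes into the runtype string, e.g. 'V' + 'A' -> 'VA'."""
    if training_data_type not in TRAINING_DATA_TYPES:
        raise ValueError(f"training data type must be one of {sorted(TRAINING_DATA_TYPES)}")
    if features_type not in FEATURE_TYPES:
        raise ValueError(f"feature type must be one of {sorted(FEATURE_TYPES)}")
    return training_data_type + features_type

# Example: video captioning data only, audio + visual features -> 'VA'
assert make_run_type("V", "A") == "VA"
```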
Run Submission Format:
The first three lines of each run submission are reserved, and should include the information listed below (preferably in the given order).
- Each run file must use the first line to declare the task.
- D: Description Generation Subtask
The first line for all run files will be:
Task=D
- The second line of the run file is used to declare the run type. For example, a run that uses only video captioning datasets for training, and uses both audio and visual features, is specified as follows:
runType=VA
- The third line of the run file is used to declare the loss function used by the run. The loss function is specified as a string, and it should be the commonly used name of a well-known loss function. An example of this line is as follows:
loss=categorical crossentropy
Description Generation Subtask
- Teams are allowed to submit up to 4 runs for the subtask.
- Please identify a single submission run as your team's primary run. This is done by adding the string ".primary" to the run name.
- The run file format is: <Video_ID> <Confidence_Score> <Description_Sentence>
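The sketch below shows one way to write a complete run file with the three reserved header lines followed by one line per video. The helper name, the example values, and the single-space field delimiter are assumptions for illustration; check the submission instructions for the exact delimiter expected between fields.

```python
def write_run_file(path, run_type, loss_name, results):
    """Write a Description Generation run file.

    `results` is an iterable of (video_id, confidence, description) tuples;
    this helper and its field names are illustrative, not prescribed by the task.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write("Task=D\n")
        f.write(f"runType={run_type}\n")
        f.write(f"loss={loss_name}\n")
        for video_id, confidence, description in results:
            # Assumed single-space delimiter; the description is the rest of the line.
            f.write(f"{video_id} {confidence:.3f} {description}\n")

# Hypothetical primary run with a single video.
write_run_file(
    "myteam.primary",
    run_type="VA",
    loss_name="categorical crossentropy",
    results=[("1234", 0.87, "a man rides a bicycle along a beach during the day")],
)
```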
General Instructions
- All automatic text descriptions should be in English.
- The submission page will be posted here at a later date.
Evaluation and Metrics:
Description Generation Task
- Scoring will be performed automatically using standard metrics such as METEOR, BLEU, CIDEr, and SPICE. Additional automatic metrics may also be used (a minimal local scoring sketch follows this list).
- Direct Assessment (DA) will be used to score the primary runs of each team. This metric uses crowd workers to score the descriptions.
- A semantic textual similarity (STS) metric will be used to measure how semantically related the system-generated description is to the ground-truth sentences.
- Systems are also encouraged to take into consideration the four facets that annotators used as guidelines when generating their automatic descriptions.
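Official scoring uses its own tooling for the metrics above. As a rough local check before submission, a team might compute a sentence-level BLEU score for a generated description against the reference annotations (each test video has five). The sketch below uses NLTK for this; the choice of NLTK, the smoothing method, and the example sentences are assumptions, not part of the official evaluation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def local_bleu(candidate: str, references: list[str]) -> float:
    """Sentence-level BLEU of one generated description against its references.

    A rough local check only; official TRECVID scoring uses its own tooling
    for BLEU, METEOR, CIDEr, and SPICE.
    """
    smoother = SmoothingFunction().method1  # avoid zero scores on short sentences
    return sentence_bleu(
        [ref.lower().split() for ref in references],
        candidate.lower().split(),
        smoothing_function=smoother,
    )

# Example with hypothetical reference annotations.
refs = [
    "a man rides a bicycle along a beach during the day",
    "a cyclist pedals next to the ocean on a sunny afternoon",
]
print(local_bleu("a man is riding a bike near the sea", refs))
```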