Video to Text Description (VTT)

Task Coordinators: George Awad, Asad Butt, Yvette Graham, and Afzal Godil

Automatic annotation of videos using natural language text descriptions has been a long-standing goal of computer vision. The task involves understanding many concepts, such as objects, actions, scenes, person-object relations, and the temporal order of events. In recent years, major advances in computer vision techniques have enabled researchers to attempt this problem. Many application scenarios can benefit greatly from such technology, for example summarizing videos in natural language, facilitating search and browsing of video archives using such descriptions, and describing videos to the blind. In addition, learning video interpretation and the temporal relations of events in a video will likely contribute to other computer vision tasks, such as predicting future events from videos.

System Tasks

Description Generation (Main Task):

Given a set of X short videos (between 5 and 15 seconds long), the task is as follows:

For each video, automatically generate a single sentence that best describes the video using natural language.
The systems will also provide a confidence score between 0 and 1 for each generated sentence, which tells us how confident the system is in the "goodness" of the generated sentence. This score will be helpful for analysis and will not be used during the evaluation.
A small fixed set of 300 videos, which was also included in 2021 and 2023, will be part of the testing videos to measure system progress for teams participating in multiple years.

VTT Robustness SubTask (optional subtask):

Building robust multimodal systems is critical for deploying them in real-world applications. Despite this, little attention has been paid to detecting and improving the robustness of multimodal systems.

The VTT robustness subtask focuses on developing technology that reduces the gap in performance between training sets and real-world testing cases. Its goal is to promote the development of algorithms that can handle the various types of perturbations and corruptions observed in real-world multimodal data.

The robustness experiments will be evaluated based on the system's performance on the test dataset with natural corruptions and perturbations. These will include spatial corruptions, temporal corruptions, spatio-temporal corruptions, different types of noise, and compression artifacts. The dataset will be created synthetically by a program that applies different levels of corruption and perturbation to both the audio and video channels.
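The program used to build the robustness dataset is not described here, so the following is only an illustrative sketch of two of the perturbation families named above: additive noise (a spatial corruption) and random frame dropping (a temporal corruption). The function names, the frame representation (nested lists of grayscale pixel values), and the parameters `sigma` and `drop_prob` are all assumptions made for this example.

```python
import random

def add_gaussian_noise(frame, sigma, rng):
    """Spatial corruption: add zero-mean Gaussian noise to each pixel, clamped to 0..255."""
    return [
        [min(255, max(0, round(pixel + rng.gauss(0.0, sigma)))) for pixel in row]
        for row in frame
    ]

def drop_frames(frames, drop_prob, rng):
    """Temporal corruption: randomly drop frames, always keeping the first one."""
    if not frames:
        return []
    kept = [frames[0]]
    for frame in frames[1:]:
        if rng.random() >= drop_prob:
            kept.append(frame)
    return kept

# Hypothetical toy video: 10 frames of 3x4 mid-gray pixels.
rng = random.Random(0)
video = [[[128] * 4 for _ in range(3)] for _ in range(10)]
corrupted = [add_gaussian_noise(f, sigma=20.0, rng=rng)
             for f in drop_frames(video, drop_prob=0.3, rng=rng)]
```

Raising `sigma` or `drop_prob` gives a stronger corruption level, which is one simple way to probe how quickly a captioning system degrades.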

Robustness Dataset

The robustness dataset will be built from the same videos as the main task testing dataset.
Teams should treat each testing dataset (the main dataset and the robustness dataset) independently. This means processing them separately, without using any knowledge of one dataset to influence results on the other.

Data Resources

The Test Dataset

The test dataset will contain video segments from the V3C3 collection, which is part of the Vimeo Creative Commons Collection (V3C). V3C3 consists of 11215 Vimeo videos, which total about 1500 hours. These videos are divided into over 1.6 million segments. For this task, we only use a small subset of the segments, where each segment is between 5 and 15 seconds long.
Approximately 2000 videos will be selected and annotated for the task. Each video will be annotated 5 times.
You must sign and return this agreement to access the data. Please sign and return the form as soon as possible to ensure that you can download the test data as it becomes available.
Annotators will be asked to include and combine in 1 sentence, if appropriate and available, four facets of the video they are describing:
- Who is the video showing? This includes concrete objects and beings (kinds of persons, animals, things, etc.)
- What are the objects and beings doing? This includes, but is not limited to, generic actions, conditions/state, events, and important spoken/heard information.
- Where is the video taken? This can be the locale, site, or place (geographic or architectural).
- When is the video taken? For example, the time of the day, or the season, etc.
The distribution of the test dataset will be announced via the active participants mailing list and according to the published schedule.

The Development Dataset

The training dataset is available here. Videos and ground truth from previous VTT tasks have been collected here to make it convenient for participants to access the data. To access the VTT development dataset, please make sure you have submitted the data agreement.

Run Submission Types

Systems are required to choose between run types based on the types of training data and features used. We are concerned with the type of training data used (images and/or video) and the types of features used (visual only or audio+visual). The run type is to be reported in submitted runs as follows:

runType=<training_data_type><training_features_type>
The training data types are specified as follows:

I: Only image captioning datasets were used for training.
V: Only video captioning datasets were used for training.
B: Both image and video captioning datasets were used for training.

The feature types are specified as follows:

V: Only visual features are used.
A: Both audio and visual features are used.
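The two-letter run type can be checked mechanically before submission. The sketch below is a hypothetical helper, not part of any official tooling; it assumes the code is written as the training data letter followed by the feature letter, as in the `VA` example.

```python
# Letter codes taken from the lists above.
TRAINING_DATA_TYPES = {
    "I": "image captioning datasets only",
    "V": "video captioning datasets only",
    "B": "both image and video captioning datasets",
}
FEATURE_TYPES = {
    "V": "visual features only",
    "A": "audio and visual features",
}

def describe_run_type(code):
    """Validate a two-letter run type and return a readable description."""
    if len(code) != 2:
        raise ValueError(f"run type must be exactly two letters, got {code!r}")
    data_letter, feature_letter = code
    if data_letter not in TRAINING_DATA_TYPES:
        raise ValueError(f"unknown training data type {data_letter!r}")
    if feature_letter not in FEATURE_TYPES:
        raise ValueError(f"unknown feature type {feature_letter!r}")
    return (f"training data: {TRAINING_DATA_TYPES[data_letter]}; "
            f"features: {FEATURE_TYPES[feature_letter]}")

print(describe_run_type("VA"))
```

Note that `V` means different things in the two positions, so `VV` is valid (video data, visual features) while `AV` is not.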

Run Submission Format:

The first three lines of each run submission are reserved, and should include the information listed below (preferably in the given order).

Each run file must use the first line to declare the task.
1. D: Description Generation
2. R: Robustness Subtask
Examples: The first line for a main task run file will be:
Task=D
While for the Robustness subtask:
Task=R
The second line of the run file is used to declare the run type. For example, a run that uses only video captioning datasets for training, and uses both audio and visual features is specified as follows:

runType=VA
The third line of the run file is used to declare the loss function used by the run. This loss function is specified as a string, and it should be the commonly used name for well-known loss functions. An example of this line is as follows:

loss=categorical crossentropy
Teams are allowed to submit up to 4 runs in each of the two tasks (main and robustness).
Please identify a single submission run as your team's primary run in the main task. This is done by adding the string ".primary" to the run name.
After the first 3 lines, the run file should include the video descriptions in the following format:
<Video_ID> <Confidence_Score> <Description_Sentence>
A sample run is available: Here
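As a rough illustration of the layout described above, the following sketch parses a run file held in a string. It is not an official validator. It assumes that fields on a description line are separated by whitespace and that the description is the remainder of the line; the video IDs, scores, and sentences in `SAMPLE_RUN` are invented for the example.

```python
SAMPLE_RUN = """\
Task=D
runType=VA
loss=categorical crossentropy
1 0.87 a man plays guitar on a stage at night
2 0.42 two dogs run across a grassy field
"""

def parse_run(text):
    """Return (header dict, list of (video_id, confidence, sentence)) for a run file."""
    lines = [line for line in text.splitlines() if line.strip()]
    # The first three lines are reserved; a dict tolerates any ordering of them.
    header = dict(line.split("=", 1) for line in lines[:3])
    for key in ("Task", "runType", "loss"):
        if key not in header:
            raise ValueError(f"missing header line for {key}")
    if header["Task"] not in ("D", "R"):
        raise ValueError(f"unknown task {header['Task']!r}")
    descriptions = []
    for line in lines[3:]:
        video_id, score_text, sentence = line.split(maxsplit=2)
        score = float(score_text)
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"confidence {score} outside [0, 1] for video {video_id}")
        descriptions.append((video_id, score, sentence))
    return header, descriptions

header, descriptions = parse_run(SAMPLE_RUN)
```

Running a check like this before submission catches missing header lines and out-of-range confidence scores early.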

General Instructions

All automatic text descriptions should be in English.
The run submission page link will be posted here at a later date.

Evaluation and Metrics:

Description Generation Task

Runs will be scored automatically using standard metrics such as METEOR, BLEU, CIDEr, and SPICE. Additional automatic metrics may also be used.
Direct Assessment (DA) will be used to score the primary runs of each team. This metric uses crowd workers to score the descriptions.
A semantic similarity metric (STS) will be used to measure how semantically related the system-generated description is to the ground truth sentences.
Systems are encouraged to also take into consideration and use the four facets that annotators used as a guideline to generate their automated descriptions.
Automatic metric scoring as well as direct assessment will be applied to the 300 progress videos to measure system progress over the past 3 years.
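To give a feel for how n-gram metrics such as BLEU compare a generated description against several reference annotations, the sketch below computes the clipped ("modified") n-gram precision that BLEU is built on. It is a simplified illustration, not the official scoring: full BLEU also combines several n-gram orders and applies a brevity penalty, and the lowercase whitespace tokenization and example sentences here are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate sentence against reference sentences."""
    cand_counts = ngrams(candidate.lower().split(), n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any single reference.
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in ngrams(reference.lower().split(), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Each candidate n-gram is credited at most as often as it appears in a reference.
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

references = ["a man is playing a guitar on stage", "someone plays guitar"]
print(modified_precision("a man plays a guitar", references, 1))  # 1.0
print(modified_precision("a man plays a guitar", references, 2))  # 0.5
```

Clipping is what prevents a degenerate description that repeats one common word from scoring well, and having multiple references per video gives the candidate more valid phrasings to match.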