Video to Text Description (VTT)

Task Coordinators: Asad Butt and Yvette Graham

Automatic annotation of videos using natural language text descriptions has been a long-standing goal of computer vision. The task involves understanding many concepts, such as objects, actions, scenes, person-object relations, and the temporal order of events. In recent years, major advances in computer vision techniques have enabled researchers to attempt this problem. Many application scenarios can benefit from such technology, including video summarization in natural language, search and browsing of video archives using such descriptions, and describing videos to the blind. In addition, learning video interpretation and the temporal relations of events in a video will likely contribute to other computer vision tasks, such as predicting future events from the video.

System Task

The VTT Task has been divided into two subtasks:

Description Generation Subtask
Matching and Ranking Subtask

Starting in 2019, the description generation subtask has been designated as a core/mandatory subtask, whereas the matching and ranking subtask is optional for all VTT task participants.

Description Generation Subtask (Core):

Given a set of X short videos (between 3 and 10 seconds long), the subtask is as follows:

For each video, automatically generate a text description (1 sentence) independently and without taking into consideration the existence of any annotated descriptions for the videos.
The systems also provide a confidence score between 0 and 1 for each generated sentence, which tells us how confident the system is in the "goodness" of the generated sentence. This is a new addition to the task to help us with analysis and to get an insight into how the systems work. This score will not be used during the evaluation.

Matching and Ranking Subtask (Optional):

Given a set of X short videos, along with 5 sets of text descriptions (each composed of at least X sentences), the subtask is as follows:

For each video and each of the 5 sets, return a ranked list of the text descriptions most likely to correspond to (that is, to have been annotated for) the video.
The text description sets this year will contain extra sentences that do not match any of the videos in the dataset. This does not have any effect on the task, which remains unchanged. The ranking positions of these sentences will give us some insight into the working of the systems.

Data Resources

The Testing Dataset

A new dataset will be introduced for VTT this year. The dataset will be selected from a large collection known as V3C2, which is part of the Vimeo Creative Commons Collection (V3C). V3C2 consists of 9760 Vimeo videos, which total about 1300 hours. These videos are divided into over 1.4 million segments. For this task, we only use a small subset of the segments, where each segment is between 3 and 10 seconds long. Participants will receive approximately 2000 testing videos each year.
You must sign and return this agreement to access the data. Please sign and return the form as soon as possible to ensure that you can download the test data as soon as it becomes available. Each of the approximately 2000 selected videos will be annotated 5 times.
Annotators will be asked to include and combine in 1 sentence, if appropriate and available, four facets of the video they are describing:
- Who is the video showing? This includes concrete objects and beings (kinds of persons, animals, things, etc.)
- What are the objects and beings doing? This includes generic actions, conditions/state or events.
- Where is the video taken? This can be the locale, site, or place (geographic or architectural).
- When is the video taken? For example, the time of the day, or the season, etc.
The distribution of the testing dataset will be announced via the active participants mailing list and according to the published schedule.

The Development Dataset

The training dataset is available here. Videos and ground truth from previous VTT tasks have been collected here to make it convenient for participants to access the data.

Run Submission Types

Systems are required to choose between run types based on the types of training data and features used. We are concerned with the type of training data used (images and/or video) and the types of features used (visual only or audio+visual). The run type is to be written as follows:

runtype = <training_data_type> <training_features_type>
The training data types are specified as follows:

I: Only image captioning datasets were used for training.
V: Only video captioning datasets were used for training.
B: Both image and video captioning datasets were used for training.

The feature types are specified as follows:

V: Only visual features are used.
A: Both audio and visual features are used.

Run Submission Format:

The first three lines of each run submission are reserved, and should include the information listed below.

Each run file must use the first line to declare the task. As described above, the subtasks are:
1. D: Description Generation Subtask
2. M: Matching and Ranking Subtask
The first line for a run file for the description generation subtask then is as follows:

Task=D
The second line of the run file is used to declare the run type. For example, a run that uses only video captioning datasets for training, and uses both audio and visual features, is specified as follows:

runType=VA
The third line of the run file is used to declare the loss function used by the run. This loss function is specified as a string, and it should be the commonly used name for well-known loss functions. An example of this line is as follows:

loss=categorical crossentropy
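The three reserved lines can be produced programmatically. The following is a minimal sketch, not official tooling: the function name `build_header` and the constant names are hypothetical, and it assumes the run type is written as the two codes with no separator, as in the "runType=VA" example above.

```python
# Hypothetical helper for the three reserved header lines of a run file.
# The allowed codes are taken from the task description; the names are illustrative.
VALID_TASKS = {"D", "M"}               # D: Description Generation, M: Matching and Ranking
VALID_TRAINING_DATA = {"I", "V", "B"}  # image only, video only, both
VALID_FEATURES = {"V", "A"}            # visual only, audio + visual


def build_header(task, training_data, features, loss):
    """Return the three reserved header lines as a list of strings."""
    if task not in VALID_TASKS:
        raise ValueError(f"unknown task code: {task!r}")
    if training_data not in VALID_TRAINING_DATA:
        raise ValueError(f"unknown training data type: {training_data!r}")
    if features not in VALID_FEATURES:
        raise ValueError(f"unknown feature type: {features!r}")
    if not loss.strip():
        raise ValueError("loss function name must not be empty")
    return [
        f"Task={task}",
        # Assumption: the two codes are concatenated, as in "runType=VA".
        f"runType={training_data}{features}",
        f"loss={loss.strip()}",
    ]


if __name__ == "__main__":
    for line in build_header("D", "V", "A", "categorical crossentropy"):
        print(line)
```

Rejecting unknown codes up front is a design choice of this sketch; it catches a mistyped run type before the file is written.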

Description Generation Subtask

Teams are allowed to submit up to 4 runs for the subtask.
Please identify a single submission run as your team's primary run. This is done by adding the string ".primary" to the run name.
The run file format is: <Video_ID> <Confidence_Score> <Description_Sentence>
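As an illustration of the run file format above, the following sketch formats one result line. It is not an official specification: it assumes the three fields are separated by single spaces, that four decimal places are acceptable for the confidence score, and that the sentence must stay on one line. The function name and the video ID in the usage example are hypothetical.

```python
def format_description_line(video_id, confidence, sentence):
    """Format one line as: <Video_ID> <Confidence_Score> <Description_Sentence>."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence score must be between 0 and 1")
    # Collapse newlines and repeated whitespace so the sentence stays on one line.
    sentence = " ".join(sentence.split())
    if not sentence:
        raise ValueError("description sentence must not be empty")
    # Assumption: space-separated fields and a fixed 4-decimal confidence.
    return f"{video_id} {confidence:.4f} {sentence}"


if __name__ == "__main__":
    # "1001" is a made-up video ID used only for illustration.
    print(format_description_line("1001", 0.87, "a man rides a bike\n on a road"))
```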

Matching and Ranking Subtask

Systems are allowed to submit up to 4 runs for each description set (A, B, ..., E).
Please use the strings ".A.", ".B.", etc. in your run file names to distinguish the run files for different description sets in the "Matching and Ranking" subtask.
The run file format is: <Video_ID> <Rank> <Description_ID>
A run should include results for all the testing videos (no missing Video_ID will be allowed).
No duplicate result pairs of "Rank" AND "Video_ID" are allowed (please submit only 1 unique set of ranks per Video_ID).
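The two constraints above (every testing video present, no duplicate pair of Rank and Video_ID) can be checked before submission. The following is a rough sketch under the assumption that the run rows have already been parsed into (Video_ID, Rank, Description_ID) tuples; the function name and the sample IDs are hypothetical.

```python
def validate_ranking_rows(rows, expected_video_ids):
    """Return a list of error messages for a Matching and Ranking run.

    rows: iterable of (video_id, rank, description_id) tuples.
    expected_video_ids: IDs of all testing videos.
    """
    errors = []
    seen_pairs = set()
    seen_videos = set()
    for video_id, rank, _description_id in rows:
        pair = (video_id, rank)
        if pair in seen_pairs:
            errors.append(f"duplicate rank {rank} for Video_ID {video_id}")
        seen_pairs.add(pair)
        seen_videos.add(video_id)
    # Every testing video must appear at least once in the run.
    for video_id in sorted(set(expected_video_ids) - seen_videos):
        errors.append(f"missing Video_ID {video_id}")
    return errors


if __name__ == "__main__":
    rows = [("v1", 1, "d1"), ("v1", 1, "d2"), ("v2", 1, "d3")]
    for message in validate_ranking_rows(rows, ["v1", "v2", "v3"]):
        print(message)
```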

General Instructions

All automatic text descriptions should be in English.
Submission details will be made available soon.

Evaluation and Metrics:

Description Generation Subtask

Runs will be scored automatically using standard metrics from machine translation, such as METEOR, BLEU, and CIDEr. Additional automatic metrics may also be used.
Direct Assessment (DA) may be used to score the primary runs of each team. This metric uses crowd workers to score the descriptions.
A semantic similarity metric will be used to measure how semantically related the system-generated description is to the ground truth sentences.
Systems are encouraged to take into account the four facets that annotators used as a guideline when generating their automated descriptions.

Matching and Ranking Subtask

Runs will be scored automatically against the ground truth using the mean inverted rank metric, which is based on the rank the system assigned to the correct annotation for each video.
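For intuition, the following sketch computes a mean inverted rank under two assumptions that are not stated in the task description: the score for a video is 1 divided by the rank of its correct description, and a video whose correct description is absent from the ranked list contributes 0. The function name and the sample IDs are hypothetical, and the official scoring script may differ.

```python
def mean_inverted_rank(ranked, truth):
    """Average of 1/rank of the correct description over all videos.

    ranked: dict mapping video_id to a list of description IDs, best first.
    truth: dict mapping video_id to its correct description ID.
    """
    if not truth:
        raise ValueError("ground truth must not be empty")
    total = 0.0
    for video_id, correct_id in truth.items():
        ranking = ranked.get(video_id, [])
        if correct_id in ranking:
            # Ranks are 1-based, so the top position scores 1.0.
            total += 1.0 / (ranking.index(correct_id) + 1)
        # Assumption: a missing correct description contributes 0.
    return total / len(truth)


if __name__ == "__main__":
    ranked = {"v1": ["d3", "d1"], "v2": ["d2"]}
    truth = {"v1": "d1", "v2": "d2"}
    print(mean_inverted_rank(ranked, truth))  # (1/2 + 1/1) / 2 = 0.75
```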

Changes from Previous Years:

For teams that have previously participated in the VTT task, the following are some of the major changes in this year's task:

A new dataset is being introduced. We plan to continue this dataset for future iterations of this task.
For the Description Generation subtask, systems will now also report a confidence score for each generated sentence. This will help us to analyze and better understand how the systems work. This confidence score will not be used to evaluate the systems.
For the Matching and Ranking subtask, the 5 sets of text descriptions will contain more sentences than the number of videos. The ranking will still be done in the same manner as previously.

Digital Video Retrieval at NIST