Video to Text Description (VTT)
Automatic annotation of videos using natural language text descriptions has been a long-standing goal
of computer vision. The task requires understanding many concepts, such as objects, actions, scenes,
person-object relations, the temporal order of events, and many others. In recent years, major advances
in computer vision techniques have enabled researchers to tackle this problem. Many application scenarios
can greatly benefit from such technology, such as summarizing videos in natural language, searching and
browsing video archives using textual descriptions, describing videos to the blind, etc. In addition,
learning video interpretation and the temporal relations of events in a video will likely contribute to
other computer vision tasks, such as the prediction of future events from video.
The VTT Task has been divided into two subtasks:
- Description Generation Subtask
- Fill-in-the-Blanks Subtask
The Matching and Ranking subtask from previous editions has been discontinued.
The Fill-in-the-Blanks subtask is introduced to evaluate systems' understanding of videos and natural language descriptions.
Teams are encouraged to participate in both subtasks, but may participate in either one if they so choose.
Description Generation Subtask:
Given a set of X short videos (between 3 and 10 seconds long), the subtask is as follows:
- For each video, automatically generate a single sentence that best describes the video using natural language.
- The systems will also provide a confidence score between 0 and 1 for each generated sentence, which tells us how confident the system is in the "goodness" of the generated sentence. This score will be helpful for analysis and will not be used during the evaluation.
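For illustration only, the sketch below shows one naive way to produce such a description and confidence score: sample the middle frame of a clip and caption it with an off-the-shelf image captioning model. The libraries, the model name, and the fixed placeholder confidence are assumptions made for this sketch, not part of the task guidelines.

    # Naive single-frame captioning sketch (illustration only, not an official baseline).
    # Assumes opencv-python, Pillow, torch, and transformers are installed.
    import cv2
    from PIL import Image
    from transformers import pipeline

    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

    def describe_video(video_path):
        """Caption the middle frame of a short clip; return (sentence, confidence)."""
        cap = cv2.VideoCapture(video_path)
        n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames // 2, 0))
        ok, frame = cap.read()
        cap.release()
        if not ok:
            return "", 0.0
        # OpenCV frames are BGR; convert to RGB before captioning.
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        sentence = captioner(image)[0]["generated_text"]
        # Fixed placeholder confidence; a real system would derive this from model scores.
        return sentence, 0.5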
Fill-in-the-Blanks Subtask:
Given a set of X short videos and a corresponding description sentence with a blank (denoting a missing word or words), the subtask is as follows:
- Return the most appropriate word or words to fill in the blank and complete the sentence for each video. The blank will represent a single concept, but not necessarily a single word. Some examples are below, where the words in italics could be replaced with a blank:
- A little boy is playing with a dog.
- A group of people are attending a birthday party.
- Children throw rocks in a pond and laugh.
- The systems will provide a confidence score between 0 and 1 for each generated word. This score will be helpful for analysis and will not be used during the evaluation.
The Fill-in-the-Blanks subtask will be evaluated manually to determine how well the generated word(s) complete the sentence.
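For intuition only, a minimal text-only sketch is shown below: it ignores the video entirely and fills the blank with a pretrained masked language model, so it captures only language priors. The blank marker ("___"), the library calls, and the model name are assumptions, and this model predicts a single token even though a blank may require several words.

    # Text-only fill-in-the-blank sketch (ignores the video; illustration only).
    # Assumes the transformers library is installed; predicts a single token per blank.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model="bert-base-uncased")

    def fill_blank(sentence_with_blank):
        """Replace an assumed "___" blank marker with [MASK]; return (word, confidence)."""
        masked = sentence_with_blank.replace("___", "[MASK]")
        best = unmasker(masked)[0]          # highest-probability completion
        return best["token_str"], best["score"]

    print(fill_blank("A little boy is playing with a ___."))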
The Testing Dataset
The testing dataset will contain video segments from the V3C2 collection, which is part of the Vimeo Creative Commons Collection (V3C). V3C2 consists of 9760 Vimeo videos, which total about 1300 hours. These videos are divided into over 1.4 million segments. For this task, we only use a small subset of the segments, where each segment is between 3 and 10 seconds long.
The testing dataset will be different for each subtask. Approximately 2000 videos will be selected and annotated for the description generation subtask. Each video will be annotated 5 times. The number of videos for the fill-in-the-blanks subtask is yet to be determined.
You must sign and return the required agreement form to access the data. Please sign and return the form as soon as possible to ensure that you can download the test data as it becomes available.
Annotators will be asked to include and combine into a single sentence, if appropriate and available, four facets of the video they are describing:
- Who is the video showing? This includes concrete objects and beings (kinds of persons, animals, things, etc.)
- What are the objects and beings doing? This includes generic actions, conditions/state or events.
- Where is the video taken? This can be the locale, site, or place (geographic or architectural).
- When is the video taken? For example, the time of the day, or the season, etc.
The distribution of the testing dataset will be announced via the active participants mailing list and according to the published schedule.
Run Submission Types
Systems are required to declare a run type based on the type of training data used (images and/or video) and the type of features used (visual only, or audio and visual). The run type is to be written as follows:
runtype = <training_data_type> <training_features_type>
The training data types are specified as follows:
- I: Only image captioning datasets were used for training.
- V: Only video captioning datasets were used for training.
- B: Both image and video captioning datasets were used for training.
The feature types are specified as follows:
- V: Only visual features are used.
- A: Both audio and visual features are used.
Run Submission Format:
The first three lines of each run submission are reserved, and should include the information listed below.
- Each run file must use the first line to declare the subtask. As described above, the subtasks are:
  - D: Description Generation Subtask
  - F: Fill-in-the-Blanks Subtask
  The first line of a run file for the description generation subtask is therefore "D".
- The second line of the run file is used to declare the run type. For example, a run that uses only video captioning datasets for training and uses both audio and visual features is specified as follows:
  runtype = V A
- The third line of the run file is used to declare the loss function used by the run. The loss function is specified as a string, which should be the commonly used name of a well-known loss function. An example of this line is as follows:
  cross-entropy
Description Generation Subtask
- Teams are allowed to submit up to 4 runs for the subtask.
- Please identify a single submission run as your team's primary run. This is done by adding the string ".primary" to the run name.
- The run file format is:     <Video_ID>         <Confidence_Score>         <Description_Sentence>
Fill-in-the-Blanks Subtask
- Teams will submit a single run for this subtask.
- The run file format is:     <Video_ID>         <Confidence_Score>         <Generated_Word(s)>
- A run should include exactly one line for each video, i.e., no missing or duplicate "Video_ID" will be allowed.
- All automatic text descriptions should be in English.
- Submission details will be made available soon.
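As a non-authoritative sketch of the format described above, the helper below writes a description generation run file: the three reserved header lines followed by one line per video. The separators, file name, and example values are assumptions pending the official submission details.

    # Sketch of writing a description generation run file (format assumed from this page).
    def write_description_run(path, runtype, loss_name, results):
        """results: iterable of (video_id, confidence, sentence), exactly one entry per test video."""
        with open(path, "w", encoding="utf-8") as f:
            f.write("D\n")                              # line 1: subtask identifier
            f.write(f"runtype = {runtype}\n")           # line 2: training data + feature types
            f.write(f"{loss_name}\n")                   # line 3: name of the loss function used
            for video_id, confidence, sentence in results:
                f.write(f"{video_id}\t{confidence:.3f}\t{sentence}\n")

    # Example: a primary run trained on video captioning data with audio+visual features.
    write_description_run(
        "myteam.primary.txt",                           # ".primary" marks the team's primary run
        runtype="V A",
        loss_name="cross-entropy",
        results=[("1234", 0.87, "a man rides a bicycle down a hill")],
    )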
Evaluation and Metrics:
Description Generation Subtask
- Scoring will be done automatically using standard metrics such as METEOR, BLEU, CIDEr, and SPICE. Additional automatic metrics may also be used (an illustrative scoring sketch follows this list).
- Direct Assessment (DA) will be used to score the primary runs of each team. This metric uses crowd workers to score the descriptions.
- A semantic similarity metric (STS) will be used to measure how semantically similar the system-generated description is to the ground truth sentences.
- Systems are encouraged to consider and use the four facets that annotators were given as a guideline when generating their automated descriptions.
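For a rough sense of this kind of automatic scoring (this is not the official evaluation code), the sketch below computes a smoothed BLEU score against the reference sentences and an STS-style cosine similarity using a sentence-embedding model; the libraries and the model name are assumptions.

    # Illustrative automatic scoring sketch (not the official evaluation code).
    # Assumes nltk and sentence-transformers are installed.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from sentence_transformers import SentenceTransformer, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def score_description(candidate, references):
        """Return (BLEU, max cosine similarity) of a generated sentence vs. reference captions."""
        bleu = sentence_bleu(
            [r.split() for r in references],
            candidate.split(),
            smoothing_function=SmoothingFunction().method1,
        )
        cand_emb = embedder.encode(candidate, convert_to_tensor=True)
        ref_emb = embedder.encode(references, convert_to_tensor=True)
        sts = util.cos_sim(cand_emb, ref_emb).max().item()
        return bleu, sts

    print(score_description("a boy is playing with a dog",
                             ["a little boy plays with a dog in the yard"]))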
Fill-in-the-Blanks Subtask
- Scoring will be done using manual evaluation. Assessors will view the video and its associated sentence with the system-generated word to determine how well it fills in the blank. They will provide a score within a predetermined range, such as 0-100.
Changes from Previous Years:
For teams that have previously participated in the VTT task, the following are some of the major changes in this year's task:
- The Matching and Ranking subtask has been discontinued.
- A new Fill-in-the-Blanks subtask has been introduced. Systems will be provided with a video and a corresponding description with a blank. The systems will generate appropriate word(s) to fill in the blank.