Video to Text Description (VTT)

Task Coordinators: Asad Butt and Yvette Graham

Automatic annotation of videos using natural language text descriptions has been a long-standing goal of computer vision. The task involves understanding many concepts, such as objects, actions, scenes, person-object relations, and the temporal order of events. In recent years, major advances in computer vision techniques have enabled researchers to attempt this problem. Many applications can benefit from such technology, including video summarization in natural language, search and browsing of video archives using such descriptions, and describing videos to the blind. In addition, learning video interpretation and the temporal relations of events in a video will likely contribute to other computer vision tasks, such as predicting future events from the video.

System Task

The VTT Task has traditionally been divided into two subtasks. However, this year we will only conduct the main Description Generation task. The Matching and Ranking and the Fill-in-the-Blanks subtasks from previous editions have been discontinued.

Description Generation:

Given a set of X short videos (between 3 and 10 seconds long), the subtask is as follows:

For each video, automatically generate a single sentence that best describes the video using natural language.
Systems will also provide a confidence score between 0 and 1 for each generated sentence, indicating how confident the system is in the quality of that sentence. This score will be helpful for analysis but will not be used during the evaluation.

Data Resources

The Test Dataset

The test dataset will contain video segments from the V3C1 collection, which is part of the Vimeo Creative Commons Collection (V3C). V3C1 consists of 7475 Vimeo videos, which total about 1000 hours. These videos are divided into over 1 million segments. For this task, we only use a small subset of the segments, where each segment is between 3 and 10 seconds long.
Approximately 2000 videos will be selected and annotated for the description generation subtask. Each video will be annotated 5 times.
You must sign and return this agreement to access the data. Please sign and return the form as soon as possible to ensure that you can download the test data as it becomes available.
Annotators will be asked to combine in one sentence, if appropriate and available, four facets of the video they are describing:
- Who is the video showing? This includes concrete objects and beings (kinds of persons, animals, things, etc.)
- What are the objects and beings doing? This includes generic actions, conditions/state or events.
- Where is the video taken? This can be the locale, site, or place (geographic or architectural).
- When is the video taken? For example, the time of the day, or the season, etc.
The distribution of the test dataset will be announced via the active participants mailing list and according to the published schedule.

The Development Dataset

The training dataset is available here. Videos and ground truth from previous VTT tasks have been collected here to make it convenient for participants to access the data.

Run Submission Types

Systems are required to choose between run types based on the types of training data and features used. We are concerned with the type of training data used (images and/or video) and the types of features used (visual only or audio+visual). The run type is to be written as follows:

runtype = <training_data_type> <training_features_type>
The training data types are specified as follows:

I: Only image captioning datasets were used for training.
V: Only video captioning datasets were used for training.
B: Both image and video captioning datasets were used for training.

The feature types are specified as follows:

V: Only visual features are used.
A: Both audio and visual features are used.
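
The two code tables above can be combined and checked programmatically. The following is a minimal sketch in Python, not an official tool: the function and constant names are hypothetical, and it assumes the two letters are concatenated without a separator, as in the runType=VA example given in the submission format.

```python
# Hypothetical helper for composing and checking run types.
# Assumption: the run type is the training data code followed directly
# by the feature code (e.g. "VA"), with no separator.

TRAINING_DATA_TYPES = {
    "I": "image captioning datasets only",
    "V": "video captioning datasets only",
    "B": "both image and video captioning datasets",
}

FEATURE_TYPES = {
    "V": "visual features only",
    "A": "audio and visual features",
}


def build_run_type(training_data: str, features: str) -> str:
    """Return the two-letter run type, or raise ValueError on an unknown code."""
    if training_data not in TRAINING_DATA_TYPES:
        raise ValueError(f"unknown training data type: {training_data!r}")
    if features not in FEATURE_TYPES:
        raise ValueError(f"unknown feature type: {features!r}")
    return training_data + features


def describe_run_type(run_type: str) -> str:
    """Expand a two-letter run type into a human-readable description."""
    if len(run_type) != 2:
        raise ValueError(f"run type must have exactly two letters: {run_type!r}")
    data_code, feature_code = run_type
    build_run_type(data_code, feature_code)  # reuse the validation above
    return f"{TRAINING_DATA_TYPES[data_code]}; {FEATURE_TYPES[feature_code]}"
```

For example, build_run_type("V", "A") yields "VA". Note that the letter V has a different meaning in each position, so the order of the two letters matters.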

Run Submission Format:

The first three lines of each run submission are reserved, and should include the information listed below (preferably in the given order).

Each run file must use its first line to declare the task. Since only the Description Generation subtask (D) is conducted, the first line of every run file will be:
Task=D
The second line of the run file is used to declare the run type. For example, a run that uses only video captioning datasets for training and uses both audio and visual features is specified as follows:

runType=VA
The third line of the run file is used to declare the loss function used by the run. This loss function is specified as a string, and it should be the commonly used name for well-known loss functions. An example of this line is as follows:

loss=categorical crossentropy
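
The three reserved lines can be generated by a small helper. This is a sketch under assumptions rather than a reference implementation: the function name build_header is hypothetical, and the key spellings (Task, runType, loss) are simply copied from the examples above.

```python
# Hypothetical helper that builds the three reserved header lines.
# Key spellings follow the examples in this section.

VALID_TRAINING_DATA = {"I", "V", "B"}  # image, video, both
VALID_FEATURES = {"V", "A"}            # visual only, audio + visual


def build_header(run_type: str, loss: str) -> list[str]:
    """Return the task, run type, and loss lines in the suggested order."""
    if (
        len(run_type) != 2
        or run_type[0] not in VALID_TRAINING_DATA
        or run_type[1] not in VALID_FEATURES
    ):
        raise ValueError(f"invalid run type: {run_type!r}")
    loss = loss.strip()
    if not loss:
        raise ValueError("the loss function name must not be empty")
    return ["Task=D", f"runType={run_type}", f"loss={loss}"]


header = build_header("VA", "categorical crossentropy")
print("\n".join(header))
```

The header lines would then be followed by one line per video in the run file.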

Description Generation Subtask

Teams are allowed to submit up to 4 runs for the subtask.
Please identify a single submission run as your team's primary run. This is done by adding the string ".primary" to the run name.
The run file format is: <Video_ID> <Confidence_Score> <Description_Sentence>
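
A result line in this format can be written and read back as sketched below. Several details are assumptions, not part of the guidelines: the fields are taken to be separated by single spaces, the video ID is assumed to contain no spaces, and the confidence score is printed with four decimal places. The function names are hypothetical.

```python
# Hypothetical helpers for one result line:
#   <Video_ID> <Confidence_Score> <Description_Sentence>
# Assumptions: single-space separators, no spaces inside the video ID.


def format_line(video_id: str, confidence: float, sentence: str) -> str:
    """Format one result line, checking the confidence range."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be between 0 and 1: {confidence}")
    # Collapse whitespace (including newlines) so the sentence stays on one line.
    sentence = " ".join(sentence.split())
    if not sentence:
        raise ValueError("the description sentence must not be empty")
    return f"{video_id} {confidence:.4f} {sentence}"


def parse_line(line: str) -> tuple[str, float, str]:
    """Split a result line into video ID, confidence score, and sentence."""
    # maxsplit=2 keeps the spaces inside the sentence intact.
    video_id, confidence, sentence = line.rstrip("\n").split(" ", 2)
    return video_id, float(confidence), sentence


line = format_line("1234", 0.87, "a man rides a bike  on a beach\n")
print(line)  # 1234 0.8700 a man rides a bike on a beach
```

Splitting with a maximum of two splits is the key design choice here, because the description sentence itself contains spaces.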

General Instructions

All automatic text descriptions should be in English.
The submission page will be posted here at a later date.

Evaluation and Metrics:

Description Generation Task

Runs will be scored automatically using standard metrics such as METEOR, BLEU, CIDEr, and SPICE. Additional automatic metrics may also be used.
Direct Assessment (DA) will be used to score the primary runs of each team. This metric uses crowd workers to score the descriptions.
A semantic similarity metric (STS) will be used to measure how semantically related the system-generated description is to the ground truth sentences.
Systems are encouraged to take into account the four facets that annotators used as a guideline when generating their automated descriptions.
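
To give a feel for how n-gram overlap metrics such as BLEU compare a generated sentence against several ground truth sentences, the sketch below computes the clipped (modified) n-gram precision that BLEU builds on. It is an illustration only and not the official scoring code: full BLEU also combines several n-gram orders and applies a brevity penalty, and the whitespace tokenization and lowercasing used here are simplifying assumptions.

```python
from collections import Counter

# Illustrative only: clipped n-gram precision, the building block of BLEU.
# Assumptions: lowercase whitespace tokenization, a single n-gram order.


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate: str, references: list[str], n: int = 1) -> float:
    """Fraction of candidate n-grams found in the references, with clipping."""
    candidate_counts = Counter(ngrams(candidate.lower().split(), n))
    if not candidate_counts:
        return 0.0
    # For each n-gram, the maximum count seen in any single reference.
    max_reference_counts: Counter = Counter()
    for reference in references:
        reference_counts = Counter(ngrams(reference.lower().split(), n))
        for gram, count in reference_counts.items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    # Clip each candidate count so repeated words are not over-rewarded.
    clipped = sum(
        min(count, max_reference_counts[gram])
        for gram, count in candidate_counts.items()
    )
    return clipped / sum(candidate_counts.values())


references = ["the cat is on the mat", "there is a cat on the mat"]
score = modified_precision("the the the the the the the", references)
print(score)  # 2/7: "the" appears at most twice in any one reference
```

Clipping is what lets multiple ground truth sentences per video be used sensibly: a candidate word is credited at most as often as it appears in the most generous single reference.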

Digital Video Retrieval at NIST