Video to Text Description (VTT)
Automatic annotation of videos using natural language text descriptions has been a long-standing goal
of computer vision. The task involves understanding many concepts, such as objects, actions, scenes,
person-object relations, the temporal order of events, and many others. In recent years there have been major
advances in computer vision techniques that have enabled researchers to attempt this problem. Many application scenarios can greatly benefit from such technology,
such as video summarization in the form of natural language, facilitating the search and browsing of
video archives using such descriptions, describing videos to the blind, etc. In addition, learning video
interpretation and temporal relations of events in the video will likely contribute to other computer
vision tasks, such as prediction of future events from the video.
The VTT Task has been divided into two subtasks:
- Description Generation Subtask
- Matching and Ranking Subtask
Starting in 2019, the description generation subtask has been designated as a core/mandatory
subtask, whereas the matching and ranking subtask is optional
for all VTT task participants.
Description Generation Subtask (Core):
Given a set of X short videos (up to 10 sec long), the subtask is as follows:
- For each video, automatically generate a text description (1 sentence), independently and without access to any annotated descriptions for the videos.
Matching and Ranking Subtask (Optional):
Given a set of X short videos, along with 5 sets of text descriptions (each composed of X sentences), the subtask is as follows:
- For each video and each of the 5 description sets, return a ranked list of the text descriptions in that set, ordered by how likely each is to be the one annotated for the video.
The Testing Dataset
Unlike previous years, this year's VTT testing dataset will contain
videos from two sources.
It is important that
regardless of the source, the dataset is considered as a whole, and
each run file is the output of a single system run over the entire
dataset. Approximately 2000 videos will be randomly selected and
annotated. Each video will be annotated 5 times by 5 different
annotators. The two sources are:
- Vine Videos: NIST has collected over 50k Twitter Vine videos, where each video
is about 6 seconds long. Participants will receive URLs for the
videos and will download them directly from Vine.
- Flickr Creative Commons: TRECVID Flickr Creative Commons (TFCC) videos will be
available for download from NIST servers. You must sign and return
the required form to access the data. Please do so as soon as
possible to ensure that you can download the test data as it becomes
available. Each video is less than 10 seconds long.
Annotators will be asked to include and combine in 1 sentence, if appropriate and available, four facets of the video they are describing:
- Who is the video describing? This includes concrete objects and beings (kinds of persons, animals, things, etc.)
- What are the objects and beings doing? This includes generic actions, conditions/states, or events.
- Where is the video taken? This can be the locale, site, or place (geographic or architectural).
- When is the video taken? For example, the time of the day, or the season, etc.
The distribution of the testing dataset will be announced via the active participants mailing list and according to the published schedule.
Run Submission Types
Systems are required to choose between run types based on the type of training data used:
- Run type 'I': Only image captioning datasets were used for training.
- Run type 'V': Only video captioning datasets were used for training.
- Run type 'B': Both image and video captioning datasets were used for training.
Run Submission Format:
- Each run file must use the first line to declare the task. As described above, the subtasks are:
- Task "D": Description Generation Subtask
- Task "M": Matching and Ranking Subtask
- The second line of the run file is used to declare the run type ('I', 'V', or 'B').
- Please use the run validators available from the active participants area here to validate your runs.
Description Generation Subtask
- Teams are allowed to submit up to 4 runs for the subtask.
- Please identify a single submission run as your team's primary run. This is done by adding the string ".primary" to the run name.
- The run file format is: URL_ID Description_Sentence
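Based on the format above, a minimal sketch of writing a description-generation run file follows. The exact syntax of the two header lines is an assumption here; the official run validator is authoritative.

```python
import os
import tempfile


def write_description_run(path, run_type, results):
    """Write a description-generation run file.

    First line declares the task, second the run type (assumed syntax);
    then one 'URL_ID Description_Sentence' line per video.
    results: iterable of (url_id, sentence) pairs.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write("D\n")            # task declaration line (assumed syntax)
        f.write(f"{run_type}\n")  # run type line: 'I', 'V', or 'B' (assumed syntax)
        for url_id, sentence in results:
            f.write(f"{url_id} {sentence}\n")


# Example usage: a primary run trained only on video captioning data.
run_path = os.path.join(tempfile.mkdtemp(), "myteam.primary")
write_description_run(run_path, "V",
                     [("1234", "a man plays a guitar on a stage")])
```

Remember to validate the resulting file with the validators from the active participants area before submission.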
Matching and Ranking Subtask
- Systems are allowed to submit up to 4 runs for each description set (A, B, ..., E).
- Please use the strings ".A.", ".B.", etc. as part of your run file names to differentiate between run files for the different description sets in the "Matching and Ranking" subtask.
- The run file format is: URL_ID Rank Description_ID
- A run must include results for all the testing video URLs (no missing URL_IDs are allowed).
- No duplicate "URL_ID" and "Rank" pairs are allowed (please submit only 1 unique set of ranks per URL_ID).
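A small sanity checker for the last two constraints might look like the sketch below (illustrative only; the official validators in the active participants area are authoritative):

```python
def check_matching_run(result_lines, all_url_ids):
    """Check matching/ranking result lines of the form 'URL_ID Rank Description_ID'.

    Verifies that every test URL_ID appears and that no URL_ID repeats
    a rank. Raises ValueError on the first violation.
    """
    ranks_seen = {}  # URL_ID -> set of ranks already used for it
    for line in result_lines:
        url_id, rank, _desc_id = line.split()
        used = ranks_seen.setdefault(url_id, set())
        if rank in used:
            raise ValueError(f"duplicate rank {rank} for {url_id}")
        used.add(rank)
    missing = set(all_url_ids) - ranks_seen.keys()
    if missing:
        raise ValueError(f"missing URL_IDs: {sorted(missing)}")
    return True
```

For example, `check_matching_run(["v1 1 d3", "v2 1 d7"], ["v1", "v2"])` passes, while a run that ranks two descriptions at position 1 for the same video, or omits a test video entirely, raises an error.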
Evaluation and Metrics:
Description Generation Task
- Scoring will be automatic, using standard machine translation metrics such as METEOR, BLEU, and CIDEr.
- Direct Assessment (DA) may be used to score the primary runs of each team. This metric uses crowd workers to score the descriptions.
- A semantic similarity metric will be used to measure how semantically related the system-generated description is to the ground truth sentences.
- Systems are encouraged to use the four facets that annotators were given as a guideline when generating their automated descriptions.
Matching and Ranking Subtask
- Scoring will be automatic against the ground truth using the mean inverted rank metric, which uses the rank the system assigned to the correct annotation for each video.
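As a sketch, mean inverted rank averages the reciprocal of the rank given to each video's correct description (assuming rank 1 is the top of the list):

```python
def mean_inverted_rank(correct_ranks):
    """correct_ranks: for each video, the rank (1 = top) that the
    system gave to the correct description of that video."""
    return sum(1.0 / r for r in correct_ranks) / len(correct_ranks)


# Correct descriptions ranked 1st, 2nd, and 4th across three videos:
score = mean_inverted_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3
```

A perfect system, which ranks the correct description first for every video, scores 1.0; lower ranks pull the score toward 0.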
- This year the dataset has been augmented to
include more video sources. Vine videos will be distributed as URLs,
whereas the remaining videos will be downloaded from NIST servers. Systems will give outputs for
the entire dataset regardless of the source of the videos. Details
are in the Data Resources section.