Video to Text Description (VTT)
Automatic annotation of videos using natural language text descriptions has been a long-standing goal
of computer vision. The task involves understanding many concepts, such as objects, actions, scenes,
person-object relations, the temporal order of events, and many others. In recent years there have been major
advances in computer vision techniques that have enabled researchers to attempt this problem. Many application scenarios can greatly benefit from such technology,
such as video summarization in the form of natural language, facilitating the search and browsing of
video archives using such descriptions, describing videos to the blind, etc. In addition, learning video
interpretation and temporal relations of events in the video will likely contribute to other computer
vision tasks, such as prediction of future events from the video.
The VTT Task has been divided into two subtasks:
- Description Generation Subtask
- Matching and Ranking Subtask
Starting in 2019, the description generation subtask has been designated as a core/mandatory
subtask, whereas the matching and ranking subtask is optional
for all VTT task participants.
Description Generation Subtask (Core):
Given a set of X short videos (up to 10 sec long), the subtask is as follows:
- For each video, automatically generate a text description (1 sentence), independently and without access to any annotated descriptions for the videos.
Matching and Ranking Subtask (Optional):
Given a set of X short videos, along with 5 sets of text descriptions (each composed of X sentences), the subtask is as follows:
- For each video and each of the 5 description sets, return a ranked list of the text descriptions in that set, ordered by how likely each is to be the one annotated for the video.
The Testing Dataset
Unlike previous years, this year's VTT testing dataset will contain
videos from two sources.
It is important that
regardless of the source, the dataset is considered as a whole, and
each run file is the output of a single system run over the entire
dataset. Approximately 2000 videos will be randomly selected and
annotated. Each video will be annotated 5 times by 5 different
annotators. The two sources are:
- Vine Videos: NIST has collected over 50k Twitter Vine videos, where each video
is about 6 seconds long. Participants will receive URLs for the
videos and will download them directly from Vine.
- Flickr Creative Commons: TRECVID Flickr Creative Commons (TFCC) videos will be
available for download from NIST servers. You must sign and return
the required form to access the data. Please do so as soon as
possible to ensure that you can download the test data as it becomes
available. Each video is less than 10 seconds long.
Annotators will be asked to include and combine in 1 sentence, if appropriate and available, four facets of the video they are describing:
- Who is the video describing? This includes concrete objects and beings (kinds of persons, animals, things, etc.)
- What are the objects and beings doing? This includes generic actions, conditions/states, or events.
- Where is the video taken? This can be the locale, site, or place (geographic or architectural).
- When is the video taken? For example, the time of the day, or the season, etc.
The distribution of the testing dataset will be announced via the active participants mailing list and according to the published schedule.
Run Submission Types
Systems are required to choose between run types based on the type of training data used:
- Run type 'I': Only image captioning datasets were used for training.
- Run type 'V': Only video captioning datasets were used for training.
- Run type 'B': Both image and video captioning datasets were used for training.
Run Submission Format:
- Each run file must use the first line to declare the task. As described above, the subtasks are:
- Task "D": Description Generation Subtask
- Task "M": Matching and Ranking Subtask
- The second line of the run file is used to declare the run type ('I', 'V', or 'B').
- Please use the run validators available from the active participants area here to validate your runs.
Description Generation Subtask
- Teams are allowed to submit up to 4 runs for the subtask.
- Please identify a single submission run as your team's primary run. This is done by adding the string ".primary" to the run name.
- The run file format is: URL_ID Description_Sentence
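Based on the format above, a minimal sketch of writing a description-generation run file follows. The exact syntax of the two header lines is an assumption here; the official run validator is authoritative.

```python
import os
import tempfile


def write_description_run(path, run_type, results):
    """Write a description-generation run file.

    First line declares the task, second the run type (assumed syntax);
    then one 'URL_ID Description_Sentence' line per video.
    results: iterable of (url_id, sentence) pairs.
    """
    with open(path, "w", encoding="utf-8") as f:
        f.write("D\n")            # task declaration line (assumed syntax)
        f.write(f"{run_type}\n")  # run type line: 'I', 'V', or 'B' (assumed syntax)
        for url_id, sentence in results:
            f.write(f"{url_id} {sentence}\n")


# Example usage: a primary run trained only on video captioning data.
run_path = os.path.join(tempfile.mkdtemp(), "myteam.primary")
write_description_run(run_path, "V",
                     [("1234", "a man plays a guitar on a stage")])
```

Remember to validate the resulting file with the validators from the active participants area before submission.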
Matching and Ranking Subtask
- Systems are allowed to submit up to 4 runs for each description set (A, B, ..., E).
- Please use the strings ".A.", ".B.", etc. as part of your run file names to differentiate between run files for the different description sets in the "Matching and Ranking" subtask.
- The run file format is: URL_ID Rank Description_ID
- A run must include results for all the testing video URLs (no missing URL_IDs are allowed).
- No duplicate "URL_ID" and "Rank" pairs are allowed (please submit only 1 unique set of ranks per URL_ID).
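A small sanity checker for the last two constraints might look like the sketch below (illustrative only; the official validators in the active participants area are authoritative):

```python
def check_matching_run(result_lines, all_url_ids):
    """Check matching/ranking result lines of the form 'URL_ID Rank Description_ID'.

    Verifies that every test URL_ID appears and that no URL_ID repeats
    a rank. Raises ValueError on the first violation.
    """
    ranks_seen = {}  # URL_ID -> set of ranks already used for it
    for line in result_lines:
        url_id, rank, _desc_id = line.split()
        used = ranks_seen.setdefault(url_id, set())
        if rank in used:
            raise ValueError(f"duplicate rank {rank} for {url_id}")
        used.add(rank)
    missing = set(all_url_ids) - ranks_seen.keys()
    if missing:
        raise ValueError(f"missing URL_IDs: {sorted(missing)}")
    return True
```

For example, `check_matching_run(["v1 1 d3", "v2 1 d7"], ["v1", "v2"])` passes, while a run that ranks two descriptions at position 1 for the same video, or omits a test video entirely, raises an error.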
Evaluation and Metrics:
Description Generation Task
- Scoring will be automatic, using standard machine translation metrics such as METEOR, BLEU, and CIDEr.
- Direct Assessment (DA) may be used to score the primary runs of each team. This metric uses crowd workers to score the descriptions.
- A semantic similarity metric will be used to measure how semantically related the system-generated description is to the ground truth sentences.
- Systems are encouraged to use the four facets that annotators were given as a guideline when generating their automated descriptions.
Matching and Ranking Subtask
- Scoring will be automatic against the ground truth using the mean inverted rank metric, which uses the rank the system assigned to the correct annotation for each video.
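As a sketch, mean inverted rank averages the reciprocal of the rank given to each video's correct description (assuming rank 1 is the top of the list):

```python
def mean_inverted_rank(correct_ranks):
    """correct_ranks: for each video, the rank (1 = top) that the
    system gave to the correct description of that video."""
    return sum(1.0 / r for r in correct_ranks) / len(correct_ranks)


# Correct descriptions ranked 1st, 2nd, and 4th across three videos:
score = mean_inverted_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3
```

A perfect system, which ranks the correct description first for every video, scores 1.0; lower ranks pull the score toward 0.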
- This year the dataset has been augmented to
include more video sources. Vine videos will be distributed as URLs,
whereas the remaining videos will be downloaded from NIST servers. Systems will give outputs for
the entire dataset regardless of the source of the videos. Details
are in the Data Resources section.