Video to Text Description
Automatic annotation of videos using natural language text descriptions has been a long-standing goal
of computer vision. The task involves understanding many concepts, such as objects, actions, scenes,
person-object relations, and the temporal order of events. In recent years there have been major
advances in computer vision techniques, which have enabled researchers to begin working practically on
this problem. Many application scenarios can greatly benefit from this technology, such as video
summarization in the form of natural language, search and browsing of video archives using such
descriptions, and describing videos to the blind. In addition, learning video interpretation and the
temporal relations of events in a video will likely contribute to other computer vision tasks, such as
prediction of future events from video.
A dataset of more than 50k Twitter Vine videos has been collected by NIST. Each video has a total duration of about
6 seconds. In this task a subset of about 2000 Vine videos will be randomly selected and annotated.
Each video will be annotated Y times (where Y <= 5) by Y different annotators.
Annotators will be asked to include and combine into one sentence, where appropriate and available, four facets of the video they are describing:
- Who the video is describing, such as concrete objects and beings (kinds of persons, animals, things)
- What the objects and beings are doing (generic actions, conditions/states, or events)
- Where, such as the locale or site (kind of place, geographic or architectural)
- When, such as the time of day or season
The 2016 and 2017 testing data are available and can be used by participating systems as training data.
Given a set of X URLs of Vine videos and Y sets of text descriptions (each composed of X sentences), systems are asked to submit results for two subtasks:
Matching and Ranking:
- Return for each video URL a ranked list of the most likely text descriptions that correspond (i.e., were
annotated) to the video, from each of the Y sets.
- Scoring will be automatic against the ground truth, using the mean inverted rank at which the
annotated item is found, or an equivalent measure.
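As a sketch of how such scoring might work (the official scorer may differ in details), the mean inverted rank is the average over videos of the reciprocal of the 1-based rank at which the annotated description appears in the system's ranked list:

```python
def mean_inverted_rank(ranked_lists, ground_truth):
    """Mean inverted (reciprocal) rank sketch.

    ranked_lists: dict mapping URL_ID -> list of descriptions, best first.
    ground_truth: dict mapping URL_ID -> the annotated description.
    For each video, credit 1/rank of the ground-truth description
    (1-based); 0 credit if it does not appear. Average over all videos.
    """
    total = 0.0
    for url_id, ranking in ranked_lists.items():
        truth = ground_truth[url_id]
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(ranked_lists)
```

For example, if one video's annotated description is ranked first (credit 1.0) and another's is ranked second (credit 0.5), the mean inverted rank is 0.75.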
Description Generation:
- Automatically generate for each video URL a text description (one sentence) independently, without
taking into consideration the existence of the Y sets.
- Scoring will be automatic, using standard machine translation metrics such as METEOR, BLEU, and CIDEr.
- A semantic similarity metric will also be used to measure how semantically related the system-generated description is to the ground truth sentences.
- Systems are encouraged to take into consideration and use the four facets that annotators followed as a guideline when generating their automated descriptions.
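To illustrate the n-gram-overlap idea underlying metrics like BLEU, here is a simplified clipped unigram precision (the core of BLEU-1, omitting higher-order n-grams and the brevity penalty); the actual evaluation would use full METEOR/BLEU/CIDEr implementations:

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Simplified BLEU-1 core: the fraction of candidate words that also
    appear in a reference, with each word's count clipped to the maximum
    number of times it occurs in any single reference sentence."""
    cand = Counter(candidate.lower().split())
    # For each word, the most times it appears in any one reference
    max_ref = Counter()
    for ref in references:
        for word, count in Counter(ref.lower().split()).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(count, max_ref[word]) for word, count in cand.items())
    return clipped / max(1, sum(cand.values()))
```

For "a man walks a dog" against the reference "a man is walking a dog in the park", four of the five candidate words match ("walks" does not), giving 0.8; a full metric would also reward longer matching n-grams and stemmed matches.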
Systems are required to choose between two run types based on the type of training data they used:
- Run type 'V': Training using Vine videos (either TRECVID-provided or non-TRECVID Vine data).
- Run type 'N': Training using only non-Vine videos.
- Each run file must declare the run type on its first line, as in the example below for V runs:
- For the Description Generation subtask, please identify a single run as your team's primary submission out of the 4 allowed runs.
- For each testing subset in the "Matching and Ranking" subtask, systems are allowed to submit up to 4 runs for each description set (A, B, etc.).
- Please use the strings ".A.", ".B.", etc. as part of your run file names to differentiate between run files for different description sets in the "Matching and Ranking" subtask.
- A run must include results for all of the testing video URLs (no missing video URL_IDs are allowed).
- No duplicate result pairs of "rank" AND "URL_ID" are allowed (please submit only one unique set of ranks per URL_ID).
- All automatic text descriptions should be in English.
- Please validate your run submissions for errors using the provided submission checkers.
- Submissions will be transmitted to NIST via a password-protected webpage.
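The completeness and duplicate rules above can be sketched as a small validator. The run file layout here is hypothetical (a run-type line followed by "rank URL_ID" lines); the provided submission checkers define the authoritative format:

```python
def validate_run(lines, expected_url_ids):
    """Sketch of the checks described above, assuming a HYPOTHETICAL
    format: first line declares the run type ('V' or 'N'), and each
    following line is '<rank> <URL_ID>'. Returns a list of error
    messages; an empty list means the run passed these checks."""
    errors = []
    if not lines or lines[0].strip() not in ("V", "N"):
        errors.append("first line must declare run type V or N")
    seen_pairs, seen_ids = set(), set()
    for line in lines[1:]:
        rank, url_id = line.split()
        if (rank, url_id) in seen_pairs:
            errors.append(f"duplicate rank/URL_ID pair: {rank} {url_id}")
        seen_pairs.add((rank, url_id))
        seen_ids.add(url_id)
    missing = expected_url_ids - seen_ids
    if missing:
        errors.append(f"missing URL_IDs: {sorted(missing)}")
    return errors
```

A run covering every expected URL_ID exactly once, with a valid run-type line, would produce no errors; a missing URL_ID or a repeated rank/URL_ID pair would each add one.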