Pilot Task: Video to Text Description
Automatic annotation of videos with natural language text descriptions has been a long-standing goal
of computer vision. The task requires understanding of many concepts such as objects, actions, scenes,
person-object relations, and the temporal order of events. In recent years there have been major
advances in computer vision techniques that have enabled researchers to begin working practically on
this problem. Many application scenarios can greatly benefit from such technology,
such as video summarization in the form of natural language, facilitating the search and browsing of
video archives using such descriptions, describing videos to the blind, etc. In addition, learning video
interpretation and the temporal relations of events in a video will likely contribute to other computer
vision tasks, such as predicting future events from video.
A dataset of more than 50k Twitter Vine videos has been collected, each about 6 seconds long.
In this showcase/pilot task a subset of 1880 Vine videos will be randomly selected and annotated.
Each video will be annotated Y times, by Y different annotators.
Annotators will be asked to include and combine into 1 sentence, if appropriate and available, four facets of the video they are describing:
- Who: concrete objects and beings in the video (kinds of persons, animals, things)
- What: what the objects and beings are doing (generic actions, conditions/states, or events)
- Where: the locale, site, or place (kind of place, geographic or architectural)
- When: e.g., time of day or season
The 2016 testing data and a readme file about submission formats can be found here. These data can be used by participating systems as training data.
Given a set of X URLs of Vine videos and Y sets of text descriptions (each composed of X sentences), systems are asked to develop and submit results for two subtasks:
Matching and Ranking:
- For each video URL, return a ranked list of the most likely text descriptions that correspond
(i.e., were annotated) to the video, from each of the Y sets.
- Results will be scored automatically against
the ground truth using mean inverted rank (the inverted rank at which the annotated item is found) or an equivalent measure.
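For illustration, mean inverted rank could be computed along these lines (a minimal sketch; the function name and data structures are hypothetical and not part of the official scoring tool):

```python
def mean_inverted_rank(ranked_lists, ground_truth):
    """Compute mean inverted rank (MIR) over a set of videos.

    ranked_lists: dict mapping video URL_ID -> ranked list of candidate
                  description IDs (best first).
    ground_truth: dict mapping video URL_ID -> the ID of the description
                  actually annotated for that video.
    A ground-truth item absent from a ranked list contributes 0 to the mean.
    """
    total = 0.0
    for url_id, ranking in ranked_lists.items():
        truth = ground_truth[url_id]
        if truth in ranking:
            # Ranks are 1-based: 1/1 for a correct top answer, 1/2 next, ...
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(ranked_lists)
```

For example, if one video's annotated description is ranked second (1/2) and another's is ranked first (1/1), the MIR over the two videos is 0.75.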
Description Generation:
- For each video URL, automatically generate a text description (1 sentence) independently,
without taking into consideration the existence of the Y sets.
- Results will be scored automatically using standard metrics from machine translation such as METEOR, BLEU, and CIDEr.
- A semantic similarity metric will also be used to measure how semantically related the system-generated description is to the ground-truth sentences.
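To give intuition for the machine translation metrics, a simplified sentence-level BLEU-1 (clipped unigram precision with a brevity penalty) can be sketched in plain Python. The official evaluation uses the standard metric implementations; this self-contained version is illustrative only and assumes a non-empty, pre-tokenized candidate:

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """Simplified sentence-level BLEU-1: clipped unigram precision
    multiplied by a brevity penalty.

    candidate:  list of tokens produced by the system.
    references: list of reference token lists (the Y ground-truth sentences).
    """
    cand_counts = Counter(candidate)
    # Clip each candidate unigram count by its maximum count in any reference.
    max_ref = Counter()
    for ref in references:
        for tok, cnt in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], cnt)
    clipped = sum(min(cnt, max_ref[tok]) for tok, cnt in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty uses the reference length closest to the candidate's
    # (ties broken toward the shorter reference).
    ref_len = min((len(r) for r in references),
                  key=lambda rl: (abs(rl - len(candidate)), rl))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * precision
```

A perfect match scores 1.0; a candidate sharing half its unigrams with an equal-length reference scores 0.5.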
- Systems are encouraged to take into consideration and use the four facets that annotators followed as a guideline when generating their automated descriptions.
- For the Description Generation subtask, please identify a single submission run as your team's primary run out of the 4 allowed runs.
- For each testing subset in the "Matching and Ranking" subtask, systems are allowed to submit up to 4 runs per description set (A, B, etc.).
- Please use the string "set.2.", "set.3.", etc as part of your run file names to differentiate between different testing subsets in the "Matching and Ranking" subtask.
- Please use the strings ".A.", ".B.", etc. as part of your run file names to differentiate between run files for different description sets in the "Matching and Ranking" subtask.
- A run should include results for all testing video URLs (no missing URL_IDs are allowed).
- No duplicate result pairs of "rank" AND "URL_ID" are allowed (please submit only 1 unique set of ranks per URL_ID). Please consult the readme file for more information on the submission format.
- All automatic text descriptions should be in English.
- Please validate your run submissions for errors using the submission checkers here.
- Submissions will be transmitted to NIST via a password-protected webpage.
- Finalizing the number of Vine URLs for testing and the number of annotation sets [RESOLVED].
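The naming and completeness rules above could be sanity-checked with a small script such as the following (an illustrative sketch only; all function and parameter names here are hypothetical, and the official submission checker remains authoritative):

```python
import re

def validate_run(run_rows, expected_url_ids, filename):
    """Lightweight sanity checks for a "Matching and Ranking" run.

    run_rows:         list of (rank, url_id) pairs parsed from the run file.
    expected_url_ids: set of all testing video URL_IDs.
    filename:         run file name, expected to embed the testing-subset
                      marker ("set.2.", "set.3.", ...) and the description-set
                      marker (".A.", ".B.", ...).
    Returns a list of problem descriptions (empty if the run looks valid).
    """
    problems = []
    if not re.search(r"set\.\d+\.", filename):
        problems.append("filename missing testing-subset marker, e.g. 'set.2.'")
    if not re.search(r"\.[A-Z]\.", filename):
        problems.append("filename missing description-set marker, e.g. '.A.'")
    # No duplicate (rank, URL_ID) result pairs are allowed.
    seen = set()
    for rank, url_id in run_rows:
        if (rank, url_id) in seen:
            problems.append(f"duplicate (rank, URL_ID) pair: ({rank}, {url_id})")
        seen.add((rank, url_id))
    # Every testing video URL_ID must have results.
    submitted_ids = {url_id for _, url_id in run_rows}
    for missing in sorted(expected_url_ids - submitted_ids):
        problems.append(f"missing results for URL_ID: {missing}")
    return problems
```

Running the check before upload catches the most common rejection causes (bad file names, duplicate pairs, missing URL_IDs) without waiting for the server-side checker.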