Video to Text Description
Automatic annotation of videos using natural language text descriptions has been a long-standing goal
of computer vision. The task involves understanding many concepts, such as objects, actions, scenes,
person-object relations, and the temporal order of events, among others. In recent years there have been
major advances in computer vision techniques which have enabled researchers to start working practically
on solving this problem. Many application scenarios can greatly benefit from such technology, such as
video summarization in the form of natural language, facilitating the search and browsing of video
archives using such descriptions, describing videos to the blind, etc. In addition, learning to interpret
videos and the temporal relations of events within them will likely contribute to other computer vision
tasks, such as predicting future events from video.
Video Dataset
A dataset of more than 50k Twitter Vine videos has been collected by NIST. Each video has a duration of
about 6 seconds. In this task a subset of about 2000 Vine videos will be randomly selected and annotated.
Each video will be annotated Y times (where Y <= 5) by Y different annotators.
Annotators will be asked to include and combine in one sentence, if appropriate and available, four facets of the video they are describing:
- Who is the video describing? Concrete objects and beings (kinds of persons, animals, things)
- What are the objects and beings doing? (generic actions, conditions/states, or events)
- Where? The locale or site (kind of place, geographic or architectural)
- When? For example, the time of day or season
The 2016 and 2017 testing data are available and can be used by participating systems as training data.
System Task
Given a set of X URLs of Vine videos and Y sets of text descriptions (each composed of X sentences), systems are asked to submit results for two subtasks:
Matching and Ranking:
- Return, for each video URL, a ranked list of the most likely text descriptions that correspond to
(were annotated for) the video, from each of the Y sets.
- Results will be scored automatically against the ground truth using the mean inverted rank at which
the annotated item is found, or an equivalent metric (see the sketch after this list).
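As an illustration only, the following is a minimal sketch of mean inverted (reciprocal) rank scoring, assuming each run returns, per video URL_ID, a ranked list of description IDs and that the ground truth maps each URL_ID to its annotated description. The function and variable names are illustrative and are not part of the official evaluation tools.

    def mean_inverted_rank(ranked_lists, ground_truth):
        """ranked_lists: dict mapping url_id -> list of description IDs, best first.
        ground_truth: dict mapping url_id -> the description ID actually annotated."""
        total = 0.0
        for url_id, truth in ground_truth.items():
            ranking = ranked_lists.get(url_id, [])
            if truth in ranking:
                # ranks are 1-based; a correct description near the top scores higher
                total += 1.0 / (ranking.index(truth) + 1)
            # if the annotated description is missing from the list, it contributes 0
        return total / len(ground_truth)

    # Example: the annotated description is ranked 1st for one video and 2nd for another
    print(mean_inverted_rank({"v1": ["d3", "d7"], "v2": ["d9", "d4"]},
                             {"v1": "d3", "v2": "d4"}))  # (1.0 + 0.5) / 2 = 0.75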
Description Generation:
- Automatically generate, for each video URL, a text description (one sentence), independently and
without taking the existence of the Y sets into consideration.
- Results will be scored automatically using standard machine translation metrics such as METEOR, BLEU, and CIDEr (see the sketch after this list).
- In addition, a semantic similarity metric will be used to measure how semantically related the system-generated description is to the ground-truth sentences.
- Systems are also encouraged to take into consideration the four facets that annotators used as a guideline when generating their automated descriptions.
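To illustrate the style of automatic scoring against the Y ground-truth sentences, below is a minimal sketch using NLTK's sentence-level BLEU as a stand-in. The official METEOR/BLEU/CIDEr tooling and the semantic similarity metric are not specified in this section, and the example sentences are made up.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def bleu_for_video(generated, references):
        """generated: the system sentence; references: the Y annotator sentences."""
        hypothesis = generated.lower().split()
        refs = [r.lower().split() for r in references]
        # smoothing avoids zero scores for the short sentences typical of 6-second videos
        return sentence_bleu(refs, hypothesis,
                             smoothing_function=SmoothingFunction().method1)

    score = bleu_for_video(
        "a man is playing a guitar on a street at night",
        ["a man plays guitar on the street at night",
         "someone is playing a guitar outdoors in the evening"])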
Run Types
Systems are required to choose between two run types, based on the type of training data used:
- Run type 'V': Training using Vine videos (either TRECVID-provided or non-TRECVID Vine data).
- Run type 'N': Training using only non-Vine videos.
Run Submissions
- Each run file must declare the run type on its first line, as in the example below for V runs:
runType=V
- For the Description Generation subtask, please identify one of the 4 allowed runs as your team's primary submission run.
- For each testing subset in the "Matching and Ranking" subtask, systems are allowed to submit up to 4 runs for each description set (A, B, etc.).
- Please use the strings ".A.", ".B.", etc. as part of your run file names to distinguish the run files for different description sets in the "Matching and Ranking" subtask.
- A run should include results for all the testing video URLs (no missing video URL_ID will be allowed).
- No duplicate result pairs of "rank" AND "URL_ID" are allowed (please submit only 1 unique set of ranks per URL_ID).
- All automatic text descriptions should be in English.
- Please validate your run submissions for errors using the provided submission checkers (a simplified example checker is sketched after this list).
- Submissions will be transmitted to NIST via a password-protected webpage.
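As a rough illustration of the kinds of checks the submission checkers perform, the sketch below validates a hypothetical "Matching and Ranking" run file. The exact layout beyond the runType header is not specified in this section, so a whitespace-separated "URL_ID rank description_ID" line format is assumed for illustration only; the provided official checkers remain authoritative.

    def check_run(path, expected_url_ids):
        with open(path) as f:
            lines = [line.strip() for line in f if line.strip()]
        # the first line must declare the run type, e.g. "runType=V"
        assert lines[0].startswith("runType="), "first line must declare the run type"
        seen_pairs = set()
        seen_urls = set()
        for line in lines[1:]:
            url_id, rank = line.split()[:2]  # assumed layout: URL_ID rank description_ID
            # no duplicate result pairs of rank AND URL_ID are allowed
            assert (url_id, rank) not in seen_pairs, "duplicate rank for " + url_id
            seen_pairs.add((url_id, rank))
            seen_urls.add(url_id)
        # results must cover every testing video URL_ID
        missing = set(expected_url_ids) - seen_urls
        assert not missing, "missing URL_IDs: " + ", ".join(sorted(missing))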
Issues: