Deep Video Understanding Grand Challenge Queries

Overview

The Deep Video Understanding (DVU) Grand Challenge testing queries can be found in the following directory structure:

In addition to the above folders, there is a MOVIE_NAME.entity.types.txt file for each movie which shows the type of each key entity in the movie (persons, locations, animals). Each entity name should have corresponding image examples in the "images" folder. Also, queries at the movie and scene level use those exact same names to refer to persons or entities (locations) in the movie.

All relations between entities (person to person, person to location) are to be selected from the relationship ontology provided:

  1. Please see the movie-level relationships ontology: https://www-nlpir.nist.gov/projects/trecvid/dvu/dvu.development.dataset/movie_knowledge_graph/
  2. Please see the scene-level ontology of relationships, interactions, locations, and sentiments:
    https://www-nlpir.nist.gov/projects/trecvid/dvu/dvu.development.dataset/vocab.dvu.json
    The relationships in this JSON file include all relationships in the movie-level relationships ontology.
All interactions between persons are to be selected from the provided vocab.dvu.json file, as are all scene sentiments. Please note that the term "entity" in this file refers to a location in the above entity.types files. A sample XML and DTD will be provided/updated and should be followed to correctly format your run submissions (please check the website for updates).
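Since all relations, interactions, and sentiments must come from the provided ontologies, it is worth validating answers against vocab.dvu.json before submission. Below is a minimal sketch; the key layout of the JSON (category names mapping to lists of allowed labels) and the labels themselves are assumptions for illustration, not the file's actual schema.

```python
import json

def load_vocab(path):
    """Load the ontology file (key layout assumed, not guaranteed)."""
    with open(path) as f:
        return json.load(f)

def is_allowed(vocab, category, label):
    """Check that a label appears in the given ontology category."""
    return label in vocab.get(category, [])

# Hypothetical vocabulary mirroring the categories named above.
vocab = {
    "relationships": ["Teacher_At", "Parent_Of"],
    "interactions": ["talks to", "hugs"],
    "sentiments": ["happy", "angry"],
}
print(is_allowed(vocab, "interactions", "talks to"))  # True
print(is_allowed(vocab, "relationships", "hugs"))     # False
```

Running every candidate answer through such a check guards against labels that would be rejected as outside the ontology.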

New for 2022

New in 2022, the DVU challenge is introducing an evaluation element to address explainability of results within the video understanding domain. To that end, some movie-level and scene-level queries were selected to require systems to submit, alongside their results, either a scene id (indicating which scene the system believes best provides evidence for the relation asked in a movie-level relationship question, query type 3) or a temporal segment (start time & end time) localizing the chosen interaction in a scene for scene-level query types 3 and 4.

Query types

The following is an overview of the movie-level and scene-level query types:

Movie-level

Movie-level Query type 1:

Find all possible paths question. The task for systems will be to list all possible paths from the source person to the target person.
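Enumerating every path between two people in the movie's relationship graph can be sketched as a depth-first search over simple (cycle-free) paths. The graph layout, names, and relations below are hypothetical illustrations, not data from the challenge:

```python
def all_paths(graph, source, target, path=None):
    """Enumerate every simple (cycle-free) path from source to target.

    graph: dict mapping a person to a list of (relation, person) edges.
    """
    path = (path or []) + [source]
    if source == target:
        return [path]
    paths = []
    for relation, nxt in graph.get(source, []):
        if nxt not in path:  # skip visited nodes to keep paths simple
            paths.extend(all_paths(graph, nxt, target, path))
    return paths

# Hypothetical relationship graph.
graph = {
    "Alice": [("Friend_Of", "Bob"), ("Coworker_Of", "Carol")],
    "Bob": [("Sibling_Of", "Carol")],
    "Carol": [],
}
print(all_paths(graph, "Alice", "Carol"))
# [['Alice', 'Bob', 'Carol'], ['Alice', 'Carol']]
```

A real system would also carry the relation labels along each path, since the answer is the chain of relationships, not just the chain of people.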

Movie-level Query type 2:

Fill in the part of graph question. The task for systems will be to identify the person / entity labelled Unknown_#. All of Unknown's relations with other people / entities / concepts are listed. In cases where one of these related nodes occurs more than once in the part-of-graph questions, that node's name has been replaced with <BLANK>. Therefore any node labelled <BLANK> is guaranteed to be one of the nodes named in this group of questions. The subject type will always be the source person we are asking about. The predicate will always be that person’s relation with another person, entity, or concept. The subject in this question always contains the Unknown you are being asked to identify.
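One way to resolve Unknown_# is to match its listed relations against the relation sets of candidate entities in the system's own movie graph. A minimal sketch, with all names and relations hypothetical:

```python
def identify_unknown(listed_relations, candidates):
    """Return candidates whose relation set covers the listed triples.

    listed_relations: set of (predicate, object) pairs given for Unknown_#.
    candidates: dict name -> set of (predicate, object) pairs that the
    system has extracted for that person/entity.
    """
    return [name for name, rels in candidates.items()
            if listed_relations <= rels]

# Hypothetical query and candidate graph.
listed = {("Parent_Of", "Eve"), ("Lives_In", "Farmhouse")}
candidates = {
    "Adam": {("Parent_Of", "Eve"), ("Lives_In", "Farmhouse"),
             ("Friend_Of", "Bob")},
    "Bob": {("Lives_In", "Farmhouse")},
}
print(identify_unknown(listed, candidates))  # ['Adam']
```

Handling <BLANK> nodes would add a layer on top of this: each <BLANK> can be bound to any node already named in the question group, so matching becomes a small constraint-satisfaction step rather than a plain subset test.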

Movie-level Query type 3:

Multiple choice questions. The task for systems will be to identify the correct answer for Unknown out of the 6 possible answers provided. The subject type will always be the source person we are asking about. The predicate will always be that person’s relation with another person, entity, or concept. In this question the Unknown you are being asked to identify will always be in either the predicate or the object. NOTE: For questions that ask for relationships, there is an attribute called 'scene' (e.g. scene="") to be used by systems when submitting the final answer, to indicate the scene id that best represents the relationship they selected.

Example: <item type="Relation" answer="Teacher_At" scene="6"/>
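Answer items like the example above can be generated with the standard library's XML tools rather than string concatenation, which avoids escaping mistakes. The attribute names come from the example; the element's place inside the full run-submission XML is governed by the DTD mentioned above:

```python
import xml.etree.ElementTree as ET

def relation_answer(answer, scene_id):
    """Build an <item> element carrying the selected relation and the
    scene id that best evidences it (attributes as in the example)."""
    item = ET.Element("item", type="Relation", answer=answer,
                      scene=str(scene_id))
    return ET.tostring(item, encoding="unicode")

print(relation_answer("Teacher_At", 6))
```

This prints a self-closing <item> element equivalent to the example, ready to be attached under the appropriate parent element of the run file.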

**************************************************************************************************************

Scene-level

Scene-level Query type 1:

Find the unique scene. Given a full, inclusive list of interactions unique to a specific scene in the movie, teams should find which scene this is. The subject type will always be the scene to be identified. Teams will return the scene id, based on the segmented-scenes reference files (csv files) and/or the segmented movie shots.
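Because the list of interactions is full and inclusive, this query amounts to a set-equality test (not a subset test) between the query and each scene's interaction set. A minimal sketch, with hypothetical scene data:

```python
def find_unique_scene(query_interactions, scenes):
    """Return ids of scenes whose full interaction set equals the query.

    scenes: dict scene_id -> set of (subject, interaction, object) triples.
    """
    query = set(query_interactions)
    return [sid for sid, inters in scenes.items() if inters == query]

# Hypothetical interaction triples per scene.
scenes = {
    1: {("Alice", "talks to", "Bob")},
    2: {("Alice", "talks to", "Bob"), ("Bob", "hugs", "Alice")},
}
query = [("Alice", "talks to", "Bob"), ("Bob", "hugs", "Alice")]
print(find_unique_scene(query, scenes))  # [2]
```

The same equality-versus-subset distinction matters in practice: a subset test would also match any larger scene that merely contains the listed interactions.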

Scene-level Query type 2:

Fill in the graph space. Find the person (labelled Unknown_#) in a specific scene with the following interactions with others. Teams will be given a scene number, and a list of interactions to and from other people. Teams should find the only person in that scene with those interactions.

Scene-level Query type 3:

Find next interaction in scene X between person Y and person Z. Given a specific scene X and a specific interaction between person Y and person Z, participants will be asked to select the next interaction between person Y and person Z, in scene X or X + N, from a set of multiple-choice options of different interactions. The two persons in this question are always the subject and object, while the interaction is the predicate. NOTE: Two questions in this query type are selected to ask systems to submit a start time and end time of the interaction they selected as their answer. The XML attributes are 'start_time' and 'end_time'. The timestamps should be relative to the scene in the question and given in seconds (elapsed since the beginning of the scene).

Example: <item type="Interaction" answer="talks to" start_time="30" end_time="35"/>
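Since the timestamps must be seconds elapsed since the beginning of the scene, systems that localize interactions in absolute movie time need to subtract the scene's start offset before filling in the attributes. A minimal sketch; all timestamp values are hypothetical:

```python
def scene_relative(seg_start, seg_end, scene_start):
    """Convert absolute movie times (seconds) into seconds elapsed since
    the start of the scene, as start_time/end_time expect."""
    return seg_start - scene_start, seg_end - scene_start

# Hypothetical: interaction at 930-935 s of the movie, scene starts at 900 s.
start_time, end_time = scene_relative(930.0, 935.0, 900.0)
print(f'<item type="Interaction" answer="talks to" '
      f'start_time="{start_time:.0f}" end_time="{end_time:.0f}"/>')
# <item type="Interaction" answer="talks to" start_time="30" end_time="35"/>
```

Scene start offsets would come from the segmented-scenes reference files (csv files) mentioned under scene-level query type 1.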

Scene-level Query type 4:

Find previous interaction in scene X between person Y and person Z. Given a specific scene X and a specific interaction between person Y and person Z, participants will be asked to select the previous interaction between person Y and person Z, in scene X or X - N, from a set of multiple-choice options of different interactions. The two persons in this question are always the subject and object, while the interaction is the predicate. NOTE: Two questions in this query type are selected to ask systems to submit a start time and end time of the interaction they selected as their answer. The XML attributes are 'start_time' and 'end_time'. The timestamps should be relative to the scene in the question and given in seconds (elapsed since the beginning of the scene).

Example: <item type="Interaction" answer="talks to" start_time="30" end_time="35"/>

Scene-level Query type 5:

Find the 1-to-1 relationship between scenes and natural language descriptions. Given a natural language description, find the correct scene that matches it. Possible scenes will be given as multiple-choice options. The textual description of the scene will be defined in the <item description>.

Scene-level Query type 6:

Classify the sentiment of a given scene. Given a specific movie scene and a set of possible sentiments, classify the correct sentiment label for that scene.