Deep Video Understanding (DVU)
Deep video understanding is a difficult task that requires computer vision systems to develop a deep analysis and understanding
of the relationships between different entities in video, to use known information to reason about other, more hidden information, and
to populate a knowledge graph (KG) with all acquired information. The aim of the proposed task is to push the limits of multimedia analysis techniques toward
analysing long-duration videos holistically and extracting useful knowledge that can be utilized to solve different kinds of queries.
The knowledge targeted by the queries includes both visual and non-visual elements. Participating systems should take into consideration all available modalities
(speech, image/video, and in some cases text).
Because movies can simulate the real world (people, relationships, locations, actions and interactions, motivations, intentions, etc.), they provide an excellent testbed with the needed data, and the DVU task therefore exercises its
challenge on the movie domain. As videos and multimedia data become increasingly popular and usable by users in different domains, the research,
approaches, and techniques that this task aims to foster will be highly relevant in the coming years.
System Task
The task for participating researchers will be: given a whole original
movie (e.g., 1.5-2 hours long),
image snapshots of the main entities (persons, locations, and concepts) per movie, and
an ontology of relationships, interactions, locations, and sentiments used
to annotate each movie at the global movie level (relationships between entities) as well as at the fine-grained scene level (scene sentiment, interactions between characters, and locations of scenes),
systems are expected to generate a knowledge base of the main actors and their relations (such as family, work, social, etc.) over the whole movie, and of the interactions between them at the scene level.
This representation can then be used to answer a set of queries at the movie level and/or scene level (see query types below) per movie.
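For illustration only, the minimal sketch below (in Python, with hypothetical entity, relationship, and interaction names that are not taken from the task data or its ontology) shows one possible way such a knowledge base of movie-level relationships and scene-level interactions could be represented as typed triples:

from dataclasses import dataclass, field

# Hypothetical, minimal knowledge-graph structures; the names and fields
# below are illustrative assumptions, not the official DVU schema.
@dataclass
class MovieKG:
    # Movie-level relationship triples: (subject entity, relationship, object entity).
    relationships: list[tuple[str, str, str]] = field(default_factory=list)
    # Scene-level interaction triples, keyed by scene id.
    interactions: dict[int, list[tuple[str, str, str]]] = field(default_factory=dict)

    def add_relationship(self, subj: str, rel: str, obj: str) -> None:
        self.relationships.append((subj, rel, obj))

    def add_interaction(self, scene_id: int, subj: str, action: str, obj: str) -> None:
        self.interactions.setdefault(scene_id, []).append((subj, action, obj))

# Example usage with made-up entities from an imaginary movie.
kg = MovieKG()
kg.add_relationship("Anna", "sister_of", "Ben")    # movie-level relation
kg.add_interaction(12, "Anna", "talks_to", "Ben")  # scene-level interaction
kg.add_interaction(13, "Ben", "hugs", "Anna")
print(kg.relationships, kg.interactions)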
The task supports two tracks (subtasks), and teams can join one or both: a Movie track, where participants are asked queries at the whole-movie level, and a Scene track, where queries
are targeted at specific movie scenes.
The DVU challenge is also running externally at ACM Multimedia as a grand challenge. Participants are encouraged to take part
in the grand challenge as well, where they will be exposed to more comprehensive query types and can submit their solutions as publications in the conference proceedings.
Note that the schedule for the grand challenge is different from the TRECVID DVU task schedule. The organizers will make their best effort
to unify the testing dataset used at TRECVID and the ACM MM Grand Challenge. For detailed information, please check the DVU grand challenge website:
https://sites.google.com/view/dvuchallenge2022/
In addition to the grand challenge at ACM Multimedia, the organizers are also running a DVU-related workshop at the 24th ACM International Conference on Multimodal Interaction (7-11 Nov 2022).
All teams are invited to submit a paper on their work to the workshop. Papers will be peer reviewed and included in the conference proceedings.
The Deep Video Understanding workshop website: https://sites.google.com/view/dvu2022-workshop
Data Resources
The Development Dataset
A set of 14 Creative Commons (CC) movies (total duration of 17.5 hr) previously utilized in the 2020 and 2021 ACM Multimedia DVU Grand Challenges, including their
movie-level and scene-level annotations. The movies have been collected from public websites such as Vimeo and the Internet Archive.
In total, the 14 movies span diverse genres and comprise 621 scenes, 1572 entities, 650 relationships, and 2491 interactions.
The development dataset can be accessed from this URL.
Please consult the documentation folder readme files for more information on the contents of the dataset.
The Testing Dataset
A set of 6 new movies will be distributed to participating teams. The movies have been licensed by NIST from the Kinolorberedu platform.
All task participants will be able to download the movies after signing a data agreement. Please refer to the TRECVID 2022 schedule
for availability of testing data, queries, and run submissions.
Subtasks and Query types
Metrics
- Movie-level: question answering
Scores for this query type are calculated as the number of correct answers divided by the total number of questions.
- Movie-level: fill in the graph space
Results are treated as a ranked list of result items for each unknown variable; a Reciprocal Rank score is calculated per unknown variable and a Mean Reciprocal Rank (MRR) per query.
- Scene-level: find next or previous interaction
Scores for this query type are calculated as the number of correct answers divided by the total number of questions.
- Scene-level: find the unique scene
Results are treated as a ranked list of result items for each unknown variable; a Reciprocal Rank score is calculated per unknown variable and a Mean Reciprocal Rank (MRR) per query (a scoring sketch for both metrics is given after this list).
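To make the scoring concrete, the sketch below (Python, with assumed input structures; the official evaluation code may differ) computes accuracy for the question-answering style queries and Mean Reciprocal Rank for the ranked-list queries:

# Illustrative scoring sketch; the input formats here are assumptions,
# not the official DVU evaluation format.

def accuracy(predicted, gold):
    """Question-answering style queries: correct answers / total questions."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

def mean_reciprocal_rank(ranked_lists, gold_answers):
    """Ranked-list queries: reciprocal rank per unknown variable, averaged per query.

    ranked_lists: one ranked candidate list per unknown variable.
    gold_answers: the correct item for each unknown variable.
    """
    rr_scores = []
    for candidates, gold in zip(ranked_lists, gold_answers):
        rr = 0.0
        if gold in candidates:
            rr = 1.0 / (candidates.index(gold) + 1)  # ranks are 1-based
        rr_scores.append(rr)
    return sum(rr_scores) / len(rr_scores)

# Example: two unknown variables in one "fill in the graph space" query.
print(accuracy(["A", "B", "C"], ["A", "D", "C"]))                       # 2 / 3
print(mean_reciprocal_rank([["x", "y"], ["p", "q", "r"]], ["y", "r"]))  # (1/2 + 1/3) / 2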
Run submission format
Each participating team can submit up to 4 runs per track (movie or scene). Each run should contain results for all queries in the testing dataset.
Please see the provided DTD files for run formats of both
movie-level and
scene-level results.
Also provided are small XML examples of a
movie-level run and a
scene-level run.
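As a rough illustration of producing a run file programmatically, the sketch below uses Python's xml.etree.ElementTree; the element and attribute names are placeholders only and do not reflect the actual DTDs, which participants should follow instead:

import xml.etree.ElementTree as ET

# Placeholder structure only -- the real element and attribute names are
# defined by the official movie-level and scene-level DTD files.
root = ET.Element("dvuRun", attrib={"team": "exampleTeam", "run": "1"})
query = ET.SubElement(root, "queryResult", attrib={"queryId": "q1"})
ET.SubElement(query, "answer").text = "exampleAnswer"

ET.ElementTree(root).write("example_run.xml", encoding="utf-8", xml_declaration=True)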
Below are sample queries and responses for movie-level and scene-level queries:
- Movie-level question answering sample query:
- Movie-level question answering sample response:
- Movie-level fill in the graph sample query:
- Movie-level fill in the graph sample response:
- Scene-level next interaction sample query:
- Scene-level next interaction sample response:
- Scene-level previous interaction sample query:
- Scene-level previous interaction sample response:
- Scene-level find unique scene sample query:
- Scene-level find unique scene sample response:

