Recent advancements in large multimodal models have significantly improved AI's ability to process and understand complex data across multiple modalities, including text, images, and video. However, true comprehension of video content remains a formidable challenge, requiring AI systems to integrate visual, auditory, and temporal information to answer questions meaningfully. The Video Question Answering (VQA) Challenge aims to rigorously assess the capabilities of state-of-the-art multimodal models in understanding and reasoning about video content. Participants in this challenge will develop and test models that answer a diverse set of questions based on video segments, covering various levels of complexity, from factual retrieval to complex reasoning. The challenge track will serve as a critical evaluation framework to measure progress in video understanding, helping identify strengths and weaknesses in current AI architectures. By fostering innovation in multimodal learning, this track will contribute to advancing AI’s ability to process dynamic visual narratives, enabling more reliable and human-like interaction with video-based information.
Both the AG and MC tasks will share the same dataset of testing videos. However, each task will follow a different submission deadline to avoid any overlap.
The test dataset will contain video links to approximately 2,000 YouTube Shorts. More details about domain, quantity, etc. will be added later. The distribution of the test dataset will be announced via the active participants mailing list and according to the published schedule.
Topics (queries) for the AG task will take the following format:
Q_ID, Video_ID, Question
Topics (queries) for the MC task will take the following format:
Q_ID, Video_ID, Question, option_1, option_2, option_3, option_4

Teams are welcome to utilize any training and development datasets available in the public domain for research purposes. Examples of existing datasets can be found HERE. Coordinators will announce any plans to release dedicated training/validation datasets for this track.
All runs should be submitted as comma-delimited CSV files in the formats shown below for both tasks.
Runs for the AG task should follow this submission format:
Q_ID, Video_ID, Rank, Answer, Time (sec)

Runs for the MC task should follow this submission format:
Q_ID, Video_ID, Rank, option_X
News magazines, science news, news reports, documentaries, educational programming, and archival video
TV Episodes
Airport Security Cameras & Activity Detection
Video collections from News, Sound & Vision, Internet Archive, Social Media, BBC EastEnders