
Video Question Answering (VQA)

Task Coordinators: George Awad, Sanjay Purushotham, Yvette Graham, and Afzal Godil

Recent advancements in large multimodal models have significantly improved AI's ability to process and understand complex data across multiple modalities, including text, images, and video. However, true comprehension of video content remains a formidable challenge, requiring AI systems to integrate visual, auditory, and temporal information to answer questions meaningfully. The Video Question Answering (VQA) Challenge aims to rigorously assess the capabilities of state-of-the-art multimodal models in understanding and reasoning about video content. Participants in this challenge will develop and test models that answer a diverse set of questions based on video segments, covering various levels of complexity, from factual retrieval to complex reasoning. The challenge track will serve as a critical evaluation framework to measure progress in video understanding, helping identify strengths and weaknesses in current AI architectures. By fostering innovation in multimodal learning, this track will contribute to advancing AI’s ability to process dynamic visual narratives, enabling more reliable and human-like interaction with video-based information.

System Tasks

Answer Generation (AG)

Given a set of X short videos (approx. 30 sec long) and a set of questions (one per video), the task is as follows:
  • For each video, automatically generate up to 10 textual answers, in a ranked-list format, to the provided question using natural language. For each answer, also report the real time (in seconds) spent generating it.
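A minimal sketch, in Python, of the per-question loop implied above: produce a ranked list of up to 10 answers and record the time (in seconds) at which each answer becomes available. The helper names (generate_answers, answer_one) and the row layout are illustrative placeholders, not part of the task specification.

import time

def generate_answers(video_path, question, k=10):
    # Placeholder for a team's own multimodal model; yields candidate answers in rank order.
    for i in range(k):
        yield f"candidate answer {i + 1}"

def answer_one(q_id, video_id, video_path, question):
    rows = []
    start = time.perf_counter()
    for rank, answer in enumerate(generate_answers(video_path, question), start=1):
        elapsed = time.perf_counter() - start  # seconds elapsed when this answer was produced
        rows.append((q_id, video_id, rank, answer, round(elapsed)))
    return rows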

Multiple Choice (MC)

Given a set of X short videos (approx. 30 sec long) and, for each video, a question with a set of answer options (QA pairs), the task is as follows:
  • For each video, automatically rank the provided answer options from most likely correct to least likely correct.

Both the AG and MC tasks will share the same dataset of testing videos. However, each task will follow a different submission deadline to avoid any overlap.

Data Resources

  • The Test Dataset

    The test dataset will contain video links for approx. 2,000 YouTube Shorts. More details about domain, quantity, etc. will be added later. The distribution of the test dataset will be announced via the active participants mailing list and according to the published schedule.

  • Topics format

  • Answer Generation (AG) Task

    Topics (queries) for the AG task will take the following format:

    Q_ID , Video_ID, Question

    where Q_ID is the query ID, Video_ID is the ID of the video the question refers to, and Question is the natural-language question to be answered (a small parsing sketch follows the Data Resources list below). For example:
    1, tui89Xr_iri , what happened after the woman entered the room?
  • Multiple Choice (MC) Task

    Topics (queries) for the MC task will take the following format:

    Q_ID , Video_ID, Question, option_1, option_2, option_3, option_4

    For example:
    1, tui89Xr_iri , what happened after the woman entered the room?, found a party, the room was empty, a man surprised her, a dog jumped on her
  • The Development Dataset

    Teams are welcome to utilize any training and development datasets available in the public domain for research purposes. Examples of existing datasets can be found HERE. Coordinators will announce any plans for the release of dedicated training/validation datasets for this track.
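As a convenience, here is a hedged sketch of how the topic files for either task could be parsed. It assumes plain comma-delimited lines exactly as in the examples above, with no commas inside the question or option text; read_topics is a hypothetical helper, not an official tool.

import csv

def read_topics(path):
    # Read AG or MC topics: Q_ID, Video_ID, Question[, option_1 ... option_4]
    topics = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, skipinitialspace=True):
            q_id, video_id, question, *options = (field.strip() for field in row)
            # `options` is empty for AG topics and holds the four choices for MC topics.
            topics.append({"q_id": q_id, "video_id": video_id,
                           "question": question, "options": options})
    return topics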

Run Submission Format:

All runs should be submitted in comma-delimited CSV file format as shown below for both tasks.

Answer Generation (AG) Task

Runs for the AG task should follow this submission format:

Q_ID , Video_ID, Rank, Answer, Time (sec)

For example:
1, tui89Xr_iri , 1, she found a surprise birthday party, 5
1, tui89Xr_iri , 2, she found party, 6
1, tui89Xr_iri , 3, she found a group of people, 8
.
.
1, tui89Xr_iri , 10, a dog barked at her,10
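A minimal writer for the AG run format shown above, assuming each row is a (Q_ID, Video_ID, Rank, Answer, Time-in-seconds) tuple; the helper name and output file name are illustrative.

import csv

def write_ag_run(path, rows):
    # Each row: Q_ID, Video_ID, Rank, Answer, Time (sec)
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)

# Example matching the rows shown above (hypothetical file name):
write_ag_run("ag_run.csv", [
    (1, "tui89Xr_iri", 1, "she found a surprise birthday party", 5),
    (1, "tui89Xr_iri", 2, "she found party", 6),
])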

Multiple Choice (MC) Task

Runs for the MC task should follow this submission format:

Q_ID , Video_ID, Rank, option_X

For example:
1, tui89Xr_iri , 1, the room was empty
1, tui89Xr_iri , 2, a dog jumped on her
1, tui89Xr_iri , 3, found a party
1, tui89Xr_iri , 4, a man surprised her
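A sketch of producing MC run rows, under the assumption that a team's model can assign a confidence score to each option; score_option is a stand-in for that model, the video path is hypothetical, and the ordering simply sorts options by score.

import csv

def score_option(video_path, question, option):
    # Placeholder: a real system would return a model confidence for this option.
    return 0.0

def write_mc_rows(writer, q_id, video_id, video_path, question, options):
    ranked = sorted(options,
                    key=lambda opt: score_option(video_path, question, opt),
                    reverse=True)
    for rank, option in enumerate(ranked, start=1):
        writer.writerow([q_id, video_id, rank, option])  # Q_ID, Video_ID, Rank, option_X

with open("mc_run.csv", "w", newline="", encoding="utf-8") as f:
    write_mc_rows(csv.writer(f), 1, "tui89Xr_iri", "videos/tui89Xr_iri.mp4",
                  "what happened after the woman entered the room?",
                  ["found a party", "the room was empty",
                   "a man surprised her", "a dog jumped on her"])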

Evaluation and Metrics:

Answer Generation Task

  • Answers in this task will be scored automatically using metrics such as STS (Semantic Textual Similarity), METEOR, and BERTScore
  • NDCG (Normalized Discounted Cumulative Gain)
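To illustrate what the STS component might look like, the sketch below computes cosine similarity between sentence embeddings of a candidate answer and a reference answer using the sentence-transformers library. The model name is only an example, and the official scoring pipeline may use different tools and references.

from sentence_transformers import SentenceTransformer, util

# Example embedding model; the official evaluation may use a different one.
model = SentenceTransformer("all-MiniLM-L6-v2")

def sts_score(candidate, reference):
    emb = model.encode([candidate, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))  # cosine similarity in [-1, 1]

print(sts_score("she found a surprise birthday party",
                "the woman walked into a surprise party"))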

Multiple Choice Task

  • Top-1 accuracy
  • Mean Reciprocal Rank (MRR)
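For reference, both MC metrics are straightforward to compute from a run. This sketch assumes two hypothetical structures: runs maps each question ID to its ranked option list, and correct maps each question ID to the ground-truth option.

def top1_accuracy(runs, correct):
    # Fraction of questions whose rank-1 option is the correct one.
    return sum(ranked[0] == correct[q] for q, ranked in runs.items()) / len(runs)

def mean_reciprocal_rank(runs, correct):
    # Average of 1 / (1-based rank of the correct option) over all questions.
    total = 0.0
    for q, ranked in runs.items():
        total += 1.0 / (ranked.index(correct[q]) + 1)
    return total / len(runs)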
