Recent advancements in large multimodal models have significantly improved AI's ability to process and understand complex data across multiple modalities, including text, images, and video. However, true comprehension of video content remains a formidable challenge, requiring AI systems to integrate visual, auditory, and temporal information to answer questions meaningfully. The Video Question Answering (VQA) Challenge aims to rigorously assess the capabilities of state-of-the-art multimodal models in understanding and reasoning about video content. Participants in this challenge will develop and test models that answer a diverse set of questions based on video segments, covering various levels of complexity, from factual retrieval to complex reasoning. The challenge track will serve as a critical evaluation framework to measure progress in video understanding, helping identify strengths and weaknesses in current AI architectures. By fostering innovation in multimodal learning, this track will contribute to advancing AI’s ability to process dynamic visual narratives, enabling more reliable and human-like interaction with video-based information.
Both the Answer Generation (AG) and Multiple Choice (MC) tasks will share the same test video dataset. However, each task will follow a different submission deadline to avoid any overlap.
The test dataset will contain a new set of video links for approximately 1,457 YouTube Shorts (annotated in 2025). In addition, we plan to annotate a new set of longer video scenes (approximately 400 videos) from the movie domain, subject to the availability of resources. More details will be added once confirmed.
Topics (queries) for the AG task will use the following JSON file format:
[
{
"Q_ID":"integer (a query ID)",
"Video_ID":"string (video ID, can be either YouTube video id or local video clip file name)",
"Question":"string (question about the video to be answered by systems automatically)"
}
]
For example:
[
{
"Q_ID":1,
"Video_ID":"tui89Xr_iri",
"Question":"what happened after the woman entered the room?"
},
{
"Q_ID":2,
"Video_ID":"Hut_563Wop",
"Question":"How many times did the man go inside and outside the room?"
}
]
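As a rough illustration, a minimal Python sketch for loading the AG topic file could look as follows (the file name ag_topics.json is a placeholder, not an official name):

import json

# Load the AG topic file (the file name here is a placeholder for the official release).
with open("ag_topics.json", "r", encoding="utf-8") as f:
    topics = json.load(f)

# Each topic carries a query ID, a video ID, and the question text.
for topic in topics:
    print(topic["Q_ID"], topic["Video_ID"], topic["Question"])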
Topics (queries) for the MC task will use the following JSON file format:
[
{
"Q_ID":"integer (a query ID)",
"Video_ID":"string (video ID, can be either YouTube video id or local video clip file name)",
"Question":"string (question about the video to be answered by systems automatically)",
"Options":[
{"option":"string (a possible answer to the question about the video)"},
{"option":"string (a possible answer to the question about the video)"},
{"option":"string (a possible answer to the question about the video)"},
{"option":"string (a possible answer to the question about the video)"}
]
}
]
For example:
[
{
"Q_ID":1,
"Video_ID":"tui89Xr_iri",
"Question":"what happened after the woman entered the room?",
"Options":[
{"option":"found a party"},
{"option":"the room was empty"},
{"option":"a man surprised her"},
{"option":"a dog jumped on her"}
]
},
{
"Q_ID":2,
"Video_ID":"Hut_563Wop",
"Question":"How many times did the man go inside and outside the room?",
"Options":[
{"option":"3 times"},
{"option":"2 times"},
{"option":"4"},
{"option":"5 times only"}
]
}
]
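For the MC task, systems additionally receive the candidate options. A minimal sketch for reading them, again assuming a placeholder file name mc_topics.json:

import json

# Load the MC topic file (placeholder file name).
with open("mc_topics.json", "r", encoding="utf-8") as f:
    topics = json.load(f)

for topic in topics:
    # Options are a list of objects, each holding one candidate answer string.
    options = [o["option"] for o in topic["Options"]]
    print(topic["Q_ID"], topic["Video_ID"], topic["Question"], options)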
Teams are welcome to use any training and development datasets available in the public domain for research purposes.
The training dataset is now available (please use your TREC active participant username and password): VQA Training Dataset, with a readme file.
All runs should be submitted in JSON file format as shown below for both tasks.
Runs for the AG task should follow this JSON structure:
[
{
"Q_ID":"integer (query ID as provided in the testing file)",
"Video_ID":"string (video ID corresponding to the Q_ID as provided in the testing file)",
"Answers":[
{"Rank": "integer" (rank of this answer), "Answer":"string", "Time": "float" (time to generate this answer)}
]
}
]
Example:
[
{
"Q_ID": 1,
"Video_ID": "tui89Xr_iri",
"Answers":[
{"Rank": 1, "Answer": "she found a surprise birthday party", "Time":2.5},
{"Rank": 2, "Answer": "she found party", "Time":1.5},
{"Rank": 3, "Answer": "she found a group of people", "Time":3},
{},
{},
{"Rank": 10, "Answer": "a dog barked at her", "Time":7}
]
},
{
"Q_ID": 2,
"Video_ID":"GWi31Xr_iui",
"Answers":[
{"Rank": 1, "Answer": "4 times", "Time":2.5},
{"Rank": 2, "Answer": "2 times", "Time":1.5},
{"Rank": 3, "Answer": "eight times", "Time":3},
{},
{},
{"Rank": 10, "Answer": "nine times", "Time":7}
]
}
]
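As a sketch of how a run file in this structure could be produced, the snippet below assumes a hypothetical answer_question(video_id, question) function that returns answer strings in decreasing order of confidence; neither the function nor the output file name is part of the official tooling, and the same elapsed time is reused for all answers to a query for simplicity.

import json
import time

def build_ag_run(topics, answer_question, out_path="ag_run.json"):
    # Build and write an AG run; answer_question is a hypothetical system call.
    run = []
    for topic in topics:
        start = time.time()
        answers = answer_question(topic["Video_ID"], topic["Question"])
        elapsed = time.time() - start  # systems generating answers separately can record per-answer times
        run.append({
            "Q_ID": topic["Q_ID"],
            "Video_ID": topic["Video_ID"],
            "Answers": [
                {"Rank": rank, "Answer": ans, "Time": elapsed}
                for rank, ans in enumerate(answers, start=1)
            ],
        })
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(run, f, indent=2)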
Runs for the MC task should follow this JSON file format:
[
{
"Q_ID":"integer (query ID as provided in the testing file)",
"Video_ID":"string (video ID corresponding to the Q_ID as provided in the testing file)",
"Answers":[
{"Rank": "integer" (rank of this answer), "Answer":"string (one of the provided answer options in the testing queries)"}
]
}
]
Example:
[
{
"Q_ID":1,
"Video_ID":"tui89Xr_iri",
"Answers":[
{"Rank": 1, "Answer":"the room was empty"},
{"Rank": 2, "Answer":"a dog jumped on her"},
{"Rank": 3, "Answer":"found a party"},
{"Rank": 4, "Answer":"a man surprised her"}
]
},
{
"Q_ID":2,
"Video_ID":"Hut_563Wop",
"Answers":[
{"Rank": 1, "Answer":"5 times only"},
{"Rank": 2, "Answer":"4"},
{"Rank": 3, "Answer":"2 times"},
{"Rank": 4, "Answer":"3 times"}
]
}
]
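Before submitting, it may help to check that each MC answer string exactly matches one of the options given in the topic file. A minimal validation sketch, assuming the placeholder file names used above:

import json

def validate_mc_run(topics_path="mc_topics.json", run_path="mc_run.json"):
    # Check that every ranked answer in an MC run is one of the options
    # provided for the corresponding query in the topic file.
    with open(topics_path, "r", encoding="utf-8") as f:
        allowed_options = {t["Q_ID"]: {o["option"] for o in t["Options"]} for t in json.load(f)}
    with open(run_path, "r", encoding="utf-8") as f:
        run = json.load(f)
    for entry in run:
        allowed = allowed_options.get(entry["Q_ID"], set())
        for answer in entry["Answers"]:
            if answer["Answer"] not in allowed:
                print(f'Q_ID {entry["Q_ID"]}: "{answer["Answer"]}" is not a provided option')

validate_mc_run()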
Evaluation scripts can be accessed at https://gitlab.nist.gov/gitlab/retrieval/vqa-trec-evaluation. The scoring code will be updated to reflect the current year's guidelines.