Recent advancements in large multimodal models have significantly improved AI's ability to process and understand complex data across multiple modalities, including text, images, and video. However, true comprehension of video content remains a formidable challenge, requiring AI systems to integrate visual, auditory, and temporal information to answer questions meaningfully. The Video Question Answering (VQA) Challenge aims to rigorously assess the capabilities of state-of-the-art multimodal models in understanding and reasoning about video content. Participants in this challenge will develop and test models that answer a diverse set of questions based on video segments, covering various levels of complexity, from factual retrieval to complex reasoning. The challenge track will serve as a critical evaluation framework to measure progress in video understanding, helping identify strengths and weaknesses in current AI architectures. By fostering innovation in multimodal learning, this track will contribute to advancing AI’s ability to process dynamic visual narratives, enabling more reliable and human-like interaction with video-based information.
Both the Answer Generation (AG) and Multiple Choice (MC) tasks will share the same test video dataset. However, each task will follow a different submission deadline to avoid any overlap.
The test dataset will contain a new set of video links for approximately 1,457 YouTube Shorts (annotated in 2025). In addition, we plan to annotate a new set of longer video scenes (approximately 400 videos) from the movie domain, subject to the availability of resources. More details will be added once confirmed.
Topics (queries) for the AG task will use the following JSON file format:
[
{
"Q_ID":"integer (a query ID)",
"Video_ID":"string (video ID, can be either YouTube video id or local video clip file name)",
"Question":"string (question about the video to be answered by systems automatically)"
}
]
For example:
[
{
"Q_ID":1,
"Video_ID":"tui89Xr_iri",
"Question":"what happened after the woman entered the room?"
},
{
"Q_ID":2,
"Video_ID":"Hut_563Wop",
"Question":"How many times did the man go inside and outside the room?"
}
]
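As a rough illustration, a minimal Python sketch for loading the AG topic file could look as follows (the file name ag_topics.json is a placeholder, not an official name):

import json

# Load the AG topic file (the file name here is a placeholder for the official release).
with open("ag_topics.json", "r", encoding="utf-8") as f:
    topics = json.load(f)

# Each topic carries a query ID, a video ID, and the question text.
for topic in topics:
    print(topic["Q_ID"], topic["Video_ID"], topic["Question"])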
Topics (queries) for the MC task will use the following JSON file format:
[
{
"Q_ID":"integer (a query ID)",
"Video_ID":"string (video ID, can be either YouTube video id or local video clip file name)",
"Question":"string (question about the video to be answered by systems automatically)",
"Options":[
{"option":"string (a possible answer to the question about the video)"},
{"option":"string (a possible answer to the question about the video)"},
{"option":"string (a possible answer to the question about the video)"},
{"option":"string (a possible answer to the question about the video)"}
]
}
]
For example:
[
{
"Q_ID":1,
"Video_ID":"tui89Xr_iri",
"Question":"what happened after the woman entered the room?",
"Options":[
{"option":"found a party"},
{"option":"the room was empty"},
{"option":"a man surprised her"},
{"option":"a dog jumped on her"}
]
},
{
"Q_ID":2,
"Video_ID":"Hut_563Wop",
"Question":"How many times did the man go inside and outside the room?",
"Options":[
{"option":"3 times"},
{"option":"2 times"},
{"option":"4"},
{"option":"5 times only"}
]
}
]
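For the MC task, systems additionally receive the candidate options. A minimal sketch for reading them, again assuming a placeholder file name mc_topics.json:

import json

# Load the MC topic file (placeholder file name).
with open("mc_topics.json", "r", encoding="utf-8") as f:
    topics = json.load(f)

for topic in topics:
    # Options are a list of objects, each holding one candidate answer string.
    options = [o["option"] for o in topic["Options"]]
    print(topic["Q_ID"], topic["Video_ID"], topic["Question"], options)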
Teams are welcome to use any training and development datasets available in the public domain for research purposes.
The training dataset is now available (please use your TREC active participant username and password): VQA Training Dataset, with a readme file.
All runs should be submitted in JSON file format as shown below for both tasks.
Runs for the AG task should follow this JSON structure:
[
{
"Q_ID":"integer (query ID as provided in the testing file)",
"Video_ID":"string (video ID corresponding to the Q_ID as provided in the testing file)",
"Answers":[
{"Rank": "integer" (rank of this answer), "Answer":"string", "Time": "float" (time to generate this answer)}
]
}
]
Example:
[
{
"Q_ID": 1,
"Video_ID": "tui89Xr_iri",
"Answers":[
{"Rank": 1, "Answer": "she found a surprise birthday party", "Time":2.5},
{"Rank": 2, "Answer": "she found party", "Time":1.5},
{"Rank": 3, "Answer": "she found a group of people", "Time":3},
{},
{},
{"Rank": 10, "Answer": "a dog barked at her", "Time":7}
]
},
{
"Q_ID": 2,
"Video_ID":"GWi31Xr_iui",
"Answers":[
{"Rank": 1, "Answer": "4 times", "Time":2.5},
{"Rank": 2, "Answer": "2 times", "Time":1.5},
{"Rank": 3, "Answer": "eight times", "Time":3},
{},
{},
{"Rank": 10, "Answer": "nine times", "Time":7}
]
}
]
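As a sketch of how a run file in this structure could be produced, the snippet below assumes a hypothetical answer_question(video_id, question) function that returns answer strings in decreasing order of confidence; neither the function nor the output file name is part of the official tooling, and the same elapsed time is reused for all answers to a query for simplicity.

import json
import time

def build_ag_run(topics, answer_question, out_path="ag_run.json"):
    # Build and write an AG run; answer_question is a hypothetical system call.
    run = []
    for topic in topics:
        start = time.time()
        answers = answer_question(topic["Video_ID"], topic["Question"])
        elapsed = time.time() - start  # systems generating answers separately can record per-answer times
        run.append({
            "Q_ID": topic["Q_ID"],
            "Video_ID": topic["Video_ID"],
            "Answers": [
                {"Rank": rank, "Answer": ans, "Time": elapsed}
                for rank, ans in enumerate(answers, start=1)
            ],
        })
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(run, f, indent=2)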
Runs for the MC task should follow this JSON file format:
[
{
"Q_ID":"integer (query ID as provided in the testing file)",
"Video_ID":"string (video ID corresponding to the Q_ID as provided in the testing file)",
"Answers":[
{"Rank": "integer" (rank of this answer), "Answer":"string (one of the provided answer options in the testing queries)"}
]
}
]
Example:
[
{
"Q_ID":1,
"Video_ID":"tui89Xr_iri",
"Answers":[
{"Rank": 1, "Answer":"the room was empty"},
{"Rank": 2, "Answer":"a dog jumped on her"},
{"Rank": 3, "Answer":"found a party"},
{"Rank": 4, "Answer":"a man surprised her"}
]
},
{
"Q_ID":2,
"Video_ID":"Hut_563Wop",
"Answers":[
{"Rank": 1, "Answer":"5 times only"},
{"Rank": 2, "Answer":"4"},
{"Rank": 3, "Answer":"2 times"},
{"Rank": 4, "Answer":"3 times"}
]
}
]
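Before submitting, it may help to check that each MC answer string exactly matches one of the options given in the topic file. A minimal validation sketch, assuming the placeholder file names used above:

import json

def validate_mc_run(topics_path="mc_topics.json", run_path="mc_run.json"):
    # Check that every ranked answer in an MC run is one of the options
    # provided for the corresponding query in the topic file.
    with open(topics_path, "r", encoding="utf-8") as f:
        allowed_options = {t["Q_ID"]: {o["option"] for o in t["Options"]} for t in json.load(f)}
    with open(run_path, "r", encoding="utf-8") as f:
        run = json.load(f)
    for entry in run:
        allowed = allowed_options.get(entry["Q_ID"], set())
        for answer in entry["Answers"]:
            if answer["Answer"] not in allowed:
                print(f'Q_ID {entry["Q_ID"]}: "{answer["Answer"]}" is not a provided option')

validate_mc_run()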
Evaluation scripts can be accessed at https://gitlab.nist.gov/gitlab/retrieval/vqa-trec-evaluation. The scoring code will be updated to reflect the current year's guidelines.