Video Recognition and Retrieval at the TRECVID Benchmark

ECCV 2018 Tutorial

September 8th, 2018 - Munich, Germany



This half-day tutorial reviews the history of the TREC Video Retrieval Evaluation (TRECVID) and the resources it provides to the computer vision community in general and to video retrieval research in particular. Several of the TRECVID tasks/tracks will be discussed, highlighting participant approaches and lessons learned. By the end of the tutorial, participants are expected to have gained knowledge and practical experience in building the basic pipeline components needed for each task. The topics presented in each of the TRECVID tutorial sessions are described below:

Lecture 1: Introduction to TRECVID

George Awad, National Institute of Standards and Technology

In this short introduction, we will discuss the history of TRECVID, including its objectives, the different tasks and datasets that the project has supported since 2001, its impact on the research community, the resources available, and future directions.

Lecture 2: Video To Text (VTT)

Alan Smeaton, Dublin City University

This section of the course will cover the operation of the TRECVID Video-to-Text task, including the data used, the approaches taken by participants [1], lessons learned, and the unique way in which video caption generation is evaluated [2].

Lecture 3: Ad Hoc Video Search (AVS)

Kazuya Ueki, Meisei University; Waseda University

The TRECVID Ad-hoc Video Search task models the use case of an end user who is looking for segments of video containing persons, objects, activities, locations, etc., and combinations of these [3]. The task is defined as follows: given a test collection, a reference shot segmentation, and a set of ad-hoc queries, return for each query a ranked list of at most 1000 shot IDs from the test collection. This tutorial section will give an overview of the video retrieval systems submitted to the task [4], covering (i) development of a large concept bank, (ii) selection of search keywords from an ad-hoc query using natural language processing, (iii) selection of concept classifiers for each search keyword, and (iv) fusion of the selected classifiers. We will also discuss current research issues and future research directions.
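The four steps above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the method of any submitted system: the concept bank is assumed to be a dictionary mapping a concept name to per-shot classifier scores, keyword selection is plain stop-word removal, classifier selection is exact name matching, and fusion is a simple score average. All names (STOP_WORDS, select_keywords, select_concepts, fuse_and_rank, search) are hypothetical.

```python
from typing import Dict, List

# Hypothetical stop-word list; a real system would use proper NLP tooling.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "at", "with", "and",
              "find", "shots"}

# Assumed layout of the concept bank: concept name -> {shot ID: score}.
ConceptBank = Dict[str, Dict[str, float]]


def select_keywords(query: str) -> List[str]:
    """Step (ii): keep the query words that are not stop words."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]


def select_concepts(keywords: List[str], bank: ConceptBank) -> List[str]:
    """Step (iii): keep keywords that exactly name a classifier in the bank."""
    return [k for k in keywords if k in bank]


def fuse_and_rank(concepts: List[str], bank: ConceptBank,
                  max_results: int = 1000) -> List[str]:
    """Step (iv): average the selected classifiers' scores and rank shots.

    A shot that a classifier did not score contributes 0 for that concept.
    """
    if not concepts:
        return []
    fused: Dict[str, float] = {}
    for concept in concepts:
        for shot_id, score in bank[concept].items():
            fused[shot_id] = fused.get(shot_id, 0.0) + score / len(concepts)
    # Highest fused score first; ties are broken by shot ID for determinism.
    ranked = sorted(fused, key=lambda s: (-fused[s], s))
    return ranked[:max_results]


def search(query: str, bank: ConceptBank, max_results: int = 1000) -> List[str]:
    """Return a ranked list of at most max_results shot IDs for a query."""
    keywords = select_keywords(query)
    concepts = select_concepts(keywords, bank)
    return fuse_and_rank(concepts, bank, max_results)


# Toy concept bank (step (i) would build a much larger one).
toy_bank = {
    "dog": {"shot1_1": 0.9, "shot1_2": 0.2},
    "beach": {"shot1_1": 0.4, "shot1_2": 0.8, "shot2_1": 0.6},
}
toy_result = search("Find shots of a dog on the beach", toy_bank)
```

The default cap of 1000 mirrors the task definition; averaging is only one of many possible fusion rules.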

Lecture 4: Activity Recognition (MED/SED)

Cees Snoek, University of Amsterdam

This tutorial highlights lessons learned towards spatiotemporal detection of activities, like ‘working on a woodworking project’, ‘open trunk’ and ‘winning a race without a vehicle’, in the context of the multimedia event detection (MED) and surveillance event detection (SED) tasks. In the first part of the lecture we consider the scenario where on the order of ten to a hundred examples are available, and we provide an overview of supervised classification approaches to spatiotemporal activity detection. As activities become more and more specific, it is unrealistic to assume that ample spatiotemporal examples to learn from will be commonly available [5]. That is why we turn our attention to zero-shot retrieval approaches in the second part. The key to activity recognition when examples are absent is to have a lingual video representation. Once the video is represented in a textual form, standard retrieval metrics can be used. We cover video representation learning algorithms that emphasize semantic embeddings and detail how these representations allow for accurate activity retrieval and are also able to translate and summarize activities in video content [6], even in the absence of training examples.
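As a minimal sketch of the zero-shot idea, assume each video has already been mapped to a weighted bag of terms by some learned representation (that mapping is not shown here, and the weights below are made up for illustration). An activity description can then be matched against videos with an ordinary text similarity measure. Cosine similarity is used purely as an example; the function names cosine and zero_shot_rank and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Mapping


def cosine(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def zero_shot_rank(activity: str,
                   video_terms: Dict[str, Dict[str, float]]) -> List[str]:
    """Rank videos for an activity description without any video examples.

    video_terms is the assumed textual representation:
    video ID -> {term: weight}.
    """
    query = Counter(activity.lower().split())
    scores = {vid: cosine(query, terms) for vid, terms in video_terms.items()}
    # Best match first; ties are broken by video ID for determinism.
    return sorted(scores, key=lambda v: (-scores[v], v))


# Made-up term weights standing in for the output of a learned representation.
toy_videos = {
    "v1": {"open": 0.9, "trunk": 0.8, "car": 0.5},
    "v2": {"race": 0.9, "running": 0.7},
    "v3": {"wood": 0.8, "trunk": 0.3},
}
toy_ranking = zero_shot_rank("open trunk", toy_videos)
```

No labeled video of 'open trunk' is needed at query time; only the textual representation of each video and the words of the activity description are compared.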

Lecture 5: Instance Search (INS)

Shin'ichi Satoh, National Institute of Informatics

The TRECVID Instance Search task [7] aims at exploring technologies to efficiently and effectively search for and retrieve specific objects from videos, given visual examples. The task focuses in particular on finding "instances" of an object, person, or location. This tutorial section will give an overview of the Instance Search task, followed by the standard pipeline, including short-list generation with the bag-of-visual-words technique, handling of geometric information and context, and efficiency management such as the inverted index. Since specific object retrieval, along with related datasets, is very popular in the computer vision community, differences between the problem settings in the computer vision community and those in TRECVID INS will also be discussed.
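The short-list generation step can be illustrated with a small inverted index over visual words. This is a sketch under stated assumptions rather than the pipeline of any particular system: local descriptors are assumed to be already quantized into integer visual-word IDs, scoring uses a smoothed tf-idf dot product chosen for illustration, and geometric verification is omitted. The function names build_inverted_index and shortlist are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


def build_inverted_index(
        shot_words: Dict[str, List[int]]) -> Dict[int, Dict[str, int]]:
    """Map each visual word to the shots containing it, with term counts."""
    index: Dict[int, Dict[str, int]] = defaultdict(dict)
    for shot_id, words in shot_words.items():
        for word, count in Counter(words).items():
            index[word][shot_id] = count
    return index


def shortlist(query_words: List[int], index: Dict[int, Dict[str, int]],
              num_shots: int, top_k: int = 100) -> List[str]:
    """Score only shots that share a visual word with the query example."""
    scores: Dict[str, float] = defaultdict(float)
    for word, q_count in Counter(query_words).items():
        postings = index.get(word, {})
        if not postings:
            continue
        # Smoothed inverse document frequency (an illustrative choice).
        idf = math.log(1.0 + num_shots / len(postings))
        for shot_id, count in postings.items():
            scores[shot_id] += (q_count * idf) * (count * idf)
    # Highest score first; ties are broken by shot ID for determinism.
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return ranked[:top_k]


# Toy database: shot ID -> visual-word IDs of its keyframe.
toy_shots = {"s1": [3, 3, 7, 9], "s2": [7, 8], "s3": [1, 2]}
toy_index = build_inverted_index(toy_shots)
toy_shortlist = shortlist([3, 7], toy_index, num_shots=len(toy_shots))
```

Because only the posting lists of the query's visual words are visited, shots with no word in common (s3 here) are never scored, which is where the efficiency gain comes from. The short list would then typically be re-ranked using geometric information.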

Relevant publications:

[1] G. Awad, C. G. M. Snoek, A. F. Smeaton, and G. Quénot, ITE Transactions on Media Technology and Applications, 4(3), 187-208, 2016.
[2] Y. Graham, G. Awad, and A. F. Smeaton, “Evaluation of Automatic Video Captioning Using Direct Assessment”, arXiv preprint arXiv:1710.10586, 2017.
[3] G. Awad, A. Butt, J. Fiscus, D. Joy, A. Delgado, M. Michel, A. F. Smeaton, Y. Graham, W. Kraaij, G. Quénot, M. Eskevich, R. Ordelman, G. J. F. Jones, and B. Huet, “TRECVID 2017: Evaluating ad-hoc and instance video search, events detection, video captioning and hyperlinking,” In Proc. of TRECVID 2017, 2017.
[4] K. Ueki, K. Hirakawa, K. Kikuchi, T. Ogawa, and T. Kobayashi, “Waseda_Meisei at TRECVID 2017: Ad-hoc Video Search,” In Proc. of TRECVID 2017, 2017.
[5] A. Habibian, T. Mensink, and C. G. M. Snoek, “Video2vec Embeddings Recognize Events when Examples are Scarce,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, iss. 10, pp. 2089-2103, 2017.
[6] P. l. Mettes and C. G. M. Snoek, “Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions,” in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017.
[7] G. Awad, W. Kraaij, P. Over, and S. Satoh, “Instance Search Retrospective with Focus on TRECVID,” International Journal of Multimedia Information Retrieval, Vol. 6, No. 1, pp. 1-29, 2017.