ECCV 2018 Tutorial
September 8th, 2018 - Munich, Germany
Organizers
- George Awad
- Alan Smeaton
- Cees Snoek
- Shin'ichi Satoh
- Kazuya Ueki
Introduction
This half-day tutorial will review the history of the TREC Video Retrieval Evaluation (TRECVID) and the resources it provides to the computer vision community in general and to video retrieval in particular. Several of the TRECVID tasks/tracks will be discussed, highlighting participant approaches and lessons learned. By the end of the tutorial, participants are expected to have gained knowledge and practical experience in building the basic pipeline components needed for each of the tasks. Below is a description of the topics that will be presented in each of the TRECVID tutorial sessions:
Lecture 1: Introduction to TRECVID
George Awad, National Institute of Standards and Technology
In this short introduction, we will discuss the history of TRECVID, including its objectives, the different tasks and datasets the project has supported since 2001, its impact on the research community, available resources, and future directions.
Lecture 2: Video To Text (VTT)
Alan Smeaton, Dublin City University
This section of the course will cover the operation of the TRECVID Video-to-Text task, including the data used, the approaches taken by participants [1], lessons learned, and the unique way in which video caption generation is evaluated [2].
Lecture 3: Ad Hoc Video Search (AVS)
Kazuya Ueki, Meisei University; Waseda University
The TRECVID Ad-hoc Video Search task aims to model the end-user search use case: a user looking for video segments containing persons, objects, activities, locations, etc., and combinations thereof [3]. The task is defined as follows: given a test collection, a reference shot segmentation, and a set of ad-hoc queries, return for each query a ranked list of at most 1000 shot IDs from the test collection. This tutorial section will give an overview of the video retrieval systems submitted to the task [4], covering (i) the development of a large concept bank, (ii) the selection of search keywords from an ad-hoc query using natural language processing, (iii) the selection of concept classifiers for each search keyword, and (iv) methods for fusing the scores of the selected classifiers. We will also discuss current research issues and future research directions.
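To make steps (i)-(iv) concrete, here is a minimal, runnable sketch of such a concept-bank pipeline on toy data. All names (CONCEPT_BANK, select_keywords, etc.) and the random scores are hypothetical illustrations, not the Waseda_Meisei system; a real system would plug in pre-trained concept classifiers, proper NLP, and learned fusion weights.

```python
# Minimal sketch of a concept-bank AVS pipeline (hypothetical names throughout;
# not the Waseda_Meisei implementation). Assumes per-shot concept scores have
# already been computed by pre-trained classifiers.
import numpy as np

# Hypothetical concept bank: concept name -> per-shot scores (n_shots,).
rng = np.random.default_rng(0)
N_SHOTS = 50
CONCEPT_BANK = {c: rng.random(N_SHOTS) for c in
                ["person", "dog", "beach", "car", "running"]}
SHOT_IDS = [f"shot{i}_1" for i in range(N_SHOTS)]

def select_keywords(query):
    """(i)-(ii): crude keyword selection; a real system would use NLP
    (POS tagging, stop-word removal, synonym expansion)."""
    stop = {"a", "the", "on", "of", "in", "and", "or", "with"}
    return [w for w in query.lower().split() if w not in stop]

def select_classifiers(keyword):
    """(iii): map a keyword to matching concept classifiers (here a simple
    substring match; real systems use WordNet/embedding similarity)."""
    return [c for c in CONCEPT_BANK if keyword in c or c in keyword]

def search(query, max_results=1000):
    """(iv): late fusion (mean) of the selected classifiers' scores, returning
    a ranked list of at most 1000 shot IDs, as the task requires."""
    selected = {c for kw in select_keywords(query) for c in select_classifiers(kw)}
    if not selected:
        return []
    fused = np.mean([CONCEPT_BANK[c] for c in selected], axis=0)
    ranked = np.argsort(-fused)[:max_results]
    return [SHOT_IDS[i] for i in ranked]

print(search("a dog running on the beach")[:5])
```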
Lecture 4: Activity Recognition (MED/SED)
Cees Snoek, University of Amsterdam
This tutorial highlights lessons learned towards the spatiotemporal detection of activities, like ‘working on a woodworking project’, ‘open trunk’, and ‘winning a race without a vehicle’, in the context of the multimedia event detection (MED) and surveillance event detection (SED) tasks. In the first part of the lecture we consider the scenario where on the order of ten to a hundred examples are available, and provide an overview of supervised classification approaches to spatiotemporal activity detection. As activities become more and more specific, it is unrealistic to assume that ample spatiotemporal examples to learn from will be commonly available [5]. That is why we turn our attention to zero-shot retrieval approaches in the second part. The key to activity recognition when examples are absent is a lingual video representation: once the video is represented in textual form, standard retrieval metrics can be used. We cover video representation learning algorithms that emphasize semantic embeddings and detail how these representations allow for accurate activity retrieval and are also able to translate and summarize activities in video content [6], even in the absence of training examples.
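As an illustration of the zero-shot idea, the sketch below ranks videos for a textual query without any event training examples, using toy word embeddings as the shared lingual space. The data and names are hypothetical; this is a simplification in the spirit of semantic-embedding methods such as Video2vec [5], not their actual implementation.

```python
# Minimal zero-shot retrieval sketch using a lingual video representation
# (toy data; not Video2vec itself). Each video is represented by concept
# detector scores; both the query and the concepts live in a shared
# word-embedding space, so no event-specific training examples are needed.
import numpy as np

rng = np.random.default_rng(1)
CONCEPTS = ["saw", "wood", "hammer", "trunk", "car", "race", "running"]
# Hypothetical word embeddings (a real system would use word2vec/GloVe).
EMBED = {w: rng.normal(size=8)
         for w in CONCEPTS + ["woodworking", "project", "open"]}

def normalize(v):
    return v / (np.linalg.norm(v) + 1e-9)

def video_semantic_embedding(concept_scores):
    """Lingual representation: concept-score-weighted average of the
    concept word embeddings."""
    return normalize(sum(s * EMBED[c] for c, s in zip(CONCEPTS, concept_scores)))

def query_embedding(query):
    words = [w for w in query.lower().split() if w in EMBED]
    return normalize(sum(EMBED[w] for w in words))

# Toy collection: per-video concept detector scores in [0, 1].
videos = {f"vid{i}": rng.random(len(CONCEPTS)) for i in range(10)}

def zero_shot_rank(query):
    """Rank videos by cosine similarity between the query embedding and
    each video's semantic embedding."""
    q = query_embedding(query)
    scores = {v: float(q @ video_semantic_embedding(s)) for v, s in videos.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(zero_shot_rank("working on a woodworking project")[:3])
```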
Lecture 5: Instance Search (INS)
Shin'ichi Satoh, National Institute of Informatics
The TRECVID Instance Search task [7] aims at exploring technologies to efficiently and effectively search for and retrieve specific objects from videos given visual examples. The task focuses especially on finding "instances" of an object, person, or location. This tutorial section will give an overview of the Instance Search task, followed by the standard pipeline, including shortlist generation using the bag-of-visual-words technique, handling of geometric information and context, and efficiency techniques such as inverted indexing. Since specific-object retrieval, along with the related datasets, is very popular in the computer vision community, the differences between the problem settings in the computer vision community and those in TRECVID INS will also be discussed.
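The following sketch illustrates the shortlist-generation stage on toy data: local descriptors are quantized into visual words and candidate shots are scored through a TF-IDF-weighted inverted index. All names and data are hypothetical; a real INS system would quantize SIFT-like descriptors against a much larger vocabulary and follow the shortlist with geometric verification.

```python
# Minimal sketch of INS shortlist generation: bag-of-visual-words with a
# TF-IDF-weighted inverted index (toy data; vocabulary and descriptors are
# random stand-ins for a trained visual vocabulary and SIFT-like features).
import math
from collections import Counter, defaultdict

import numpy as np

rng = np.random.default_rng(2)
VOCAB = rng.normal(size=(64, 16))  # hypothetical visual vocabulary (64 words)

def quantize(descriptors):
    """Assign each local descriptor to its nearest visual word."""
    d2 = ((descriptors[:, None, :] - VOCAB[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Toy database: each shot is a bag of quantized local descriptors.
db = {f"shot{i}": quantize(rng.normal(size=(100, 16))) for i in range(20)}

# Inverted index: visual word -> {shot: term frequency}; plus IDF weights,
# so query time touches only shots sharing words with the query.
index = defaultdict(dict)
for shot, words in db.items():
    for w, tf in Counter(words.tolist()).items():
        index[w][shot] = tf
idf = {w: math.log(len(db) / len(postings)) for w, postings in index.items()}

def shortlist(query_descriptors, k=5):
    """Sparse TF-IDF scoring via the inverted index; the resulting shortlist
    would then be re-ranked with geometric verification."""
    scores = Counter()
    for w, qtf in Counter(quantize(query_descriptors).tolist()).items():
        for shot, tf in index.get(w, {}).items():
            scores[shot] += qtf * tf * idf.get(w, 0.0) ** 2
    return [s for s, _ in scores.most_common(k)]

print(shortlist(rng.normal(size=(80, 16))))
```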
Relevant publications:
[1] G. Awad, C. G. M. Snoek, A. F. Smeaton, and G. Quénot, "TRECVid Semantic Indexing of Video: A 6-Year Retrospective," ITE Transactions on Media Technology and Applications, vol. 4, no. 3, pp. 187-208, 2016.
[2] Y. Graham, G. Awad, and A. F. Smeaton, "Evaluation of Automatic Video Captioning Using Direct Assessment," arXiv preprint arXiv:1710.10586, 2017.
[3] G. Awad, A. Butt, J. Fiscus, D. Joy, A. Delgado, M. Michel, A. F. Smeaton, Y. Graham, W. Kraaij, G. Quénot, M. Eskevich, R. Ordelman, G. J. F. Jones, and B. Huet, "TRECVID 2017: Evaluating Ad-hoc and Instance Video Search, Events Detection, Video Captioning and Hyperlinking," in Proc. of TRECVID 2017, 2017.
[4] K. Ueki, K. Hirakawa, K. Kikuchi, T. Ogawa, and T. Kobayashi, "Waseda_Meisei at TRECVID 2017: Ad-hoc Video Search," in Proc. of TRECVID 2017, 2017.
[5] A. Habibian, T. Mensink, and C. G. M. Snoek, "Video2vec Embeddings Recognize Events when Examples are Scarce," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 10, pp. 2089-2103, 2017.
[6] P. Mettes and C. G. M. Snoek, "Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions," in Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 2017.
[7] G. Awad, W. Kraaij, P. Over, and S. Satoh, "Instance Search Retrospective with Focus on TRECVID," International Journal of Multimedia Information Retrieval, vol. 6, no. 1, pp. 1-29, 2017.