Section 1. System Description

A video is represented using low-level visual information (static and motion descriptors) and a sequence of model vectors, as explained in the following. To capture motion information, improved dense trajectories with Fisher Vector (FV) encoding are used to describe the entire video with a high-dimensional motion feature vector. To extract static visual descriptors and model vectors, each video is decoded and a set of key-frames is extracted at fixed temporal intervals (one key-frame every 6 seconds). Then SIFT, opponentSIFT, rgbSIFT and rgbSURF descriptors are extracted from each key-frame using dense sampling, followed by VLAD encoding (applied separately to each of the four descriptors). Each key-frame is also represented with a set of model vectors (i.e., vectors of responses of concept detectors); these detectors are built using the aforementioned VLAD-encoded static features and linear SVMs, and cover 346 concepts (the TRECVID SIN 2014 dataset concepts; the detectors are those trained for the SIN 2014 task). Subsequently, the VLAD-encoded static features and the model vectors of all key-frames of a video are averaged, yielding a set of global video descriptor vectors and a global model vector. All of the above feature vectors of a video are then concatenated into a single high-dimensional feature vector. These feature vectors are used to build one detector per event; each detector combines a new, very fast nonlinear discriminant analysis (DA) technique for dimensionality reduction with a linear SVM for the final classification. For the MER task, the derived key-frame-level model vectors are employed, along with a variant of the above DA method, to identify the most characteristic concepts for the specified event and the given video; this also allows for temporal localization, by inspecting the corresponding key-frame-level model vectors before their temporal averaging into a single model vector per video.

Section 2. Metadata Generator Description

Motion and static visual features are exploited as described in the following (minimal illustrative sketches of these encoding and fusion steps follow this section):
- Improved dense trajectories (DT) are employed, providing the following low-level features: Histogram of Oriented Gradients (HOG), Histogram of Optical Flow (HOF) and Motion Boundary Histograms (MBH). Hellinger kernel normalization is applied to the resulting feature vectors, followed by Fisher Vector (FV) encoding with 256 GMM codewords. Subsequently, the three feature vectors are concatenated to yield the final motion feature descriptor of each video.
- Each video is decoded into a set of key-frames at fixed temporal intervals. Four different local descriptors (SIFT, opponentSIFT, rgbSIFT, rgbSURF) with dense sampling are applied to extract local visual information from every key-frame. The extracted low-level features are aggregated into a global image representation using VLAD encoding.
- Each key-frame is also represented with a set of model vectors, one per feature extraction procedure, using 346 pre-trained concept detectors. The concepts used are the TRECVID SIN 2014 dataset concepts. The model vectors referring to the same key-frame are aggregated using the arithmetic mean operator, and subsequently the model vectors of all key-frames of a video are averaged to represent the video.
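The motion encoding can be sketched as follows. This is a minimal Fisher Vector implementation, assuming a diagonal-covariance GMM (here scikit-learn's GaussianMixture with 256 components, matching the codebook size above) has already been fitted on a sample of trajectory descriptors; the Hellinger normalization helper and all function names are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def hellinger_normalize(X):
    """RootSIFT-style Hellinger normalization: L1-normalize each row, then signed sqrt."""
    X = X / (np.abs(X).sum(axis=1, keepdims=True) + 1e-12)
    return np.sign(X) * np.sqrt(np.abs(X))

def fisher_vector(X, gmm):
    """Encode local descriptors X (N x D) into a 2*K*D Fisher Vector,
    using a fitted diagonal-covariance GMM."""
    N = X.shape[0]
    gamma = gmm.predict_proba(X)                     # posteriors, N x K
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_
    sigma = np.sqrt(var)                             # K x D (diagonal std devs)
    parts = []
    for k in range(gmm.n_components):
        diff = (X - mu[k]) / sigma[k]                # N x D
        g = gamma[:, k:k + 1]                        # N x 1
        parts.append((g * diff).sum(axis=0) / (N * np.sqrt(w[k])))               # grad wrt means
        parts.append((g * (diff ** 2 - 1)).sum(axis=0) / (N * np.sqrt(2 * w[k])))  # grad wrt stds
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))           # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)         # L2 normalization

# gmm = GaussianMixture(256, covariance_type="diag").fit(hellinger_normalize(sample))
# video_fv = fisher_vector(hellinger_normalize(trajectory_features), gmm)
```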
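The static descriptors are aggregated per key-frame with VLAD. The sketch below assumes a k-means codebook learned offline; the codebook size and all names are illustrative, since the paper does not specify the VLAD vocabulary size.

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad(X, kmeans):
    """Encode local descriptors X (N x D) into a K*D VLAD vector."""
    centers = kmeans.cluster_centers_                # K x D codebook
    assign = kmeans.predict(X)                       # nearest codeword per descriptor
    v = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        members = X[assign == k]
        if members.size:
            v[k] = (members - centers[k]).sum(axis=0)  # accumulate residuals
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))              # power normalization
    return v / (np.linalg.norm(v) + 1e-12)           # L2 normalization

# kmeans = KMeans(n_clusters=256).fit(sample_descriptors)  # codebook size is illustrative
# keyframe_vlad = vlad(dense_sift_descriptors, kmeans)
```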
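Finally, the temporal averaging and concatenation described in Sections 1 and 2 reduce to the following sketch; the array shapes are illustrative, except for the 346-dimensional model vectors stated in the text.

```python
import numpy as np

def video_level_vector(static_vlads, model_vecs, motion_fv):
    """Temporal averaging and concatenation, as in Section 1.
    static_vlads: list of (n_keyframes x d_i) VLAD matrices, one per local descriptor;
    model_vecs:   n_keyframes x 346 detector responses (already averaged over
                  the four feature types per key-frame);
    motion_fv:    the 1-D Fisher Vector of the whole video."""
    averaged = [m.mean(axis=0) for m in static_vlads]   # global static descriptors
    averaged.append(model_vecs.mean(axis=0))            # global model vector
    averaged.append(motion_fv)
    return np.concatenate(averaged)                     # single high-dimensional vector
```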
Section 3. Semantic Query Generator Description

Our semantic queries are produced manually, by visual inspection of the event description kit and of our concept detectors. Thus, a set of selected concepts is used as the semantic query for each event.

Section 4. Event Query Generator Description

We use nonlinear discriminant analysis (DA) to derive a lower-dimensional embedding of the original data, and then employ fast linear SVMs in the resulting subspace to learn the events. In particular, for dimensionality reduction we utilize a new, very fast kernel subclass-based method, which has been shown to outperform other DA approaches. (A minimal sketch of this two-stage pipeline appears at the end of this document.)

Section 5. Event Search Description

We apply each of the trained event detectors to every video of the MED14-EvalSub set and rank the resulting scores in descending order. The 300th score of the ranked list is selected as the threshold for the 010Ex task, and the 100th score for the 100Ex task (see the thresholding sketch at the end of this document).

Section 6. Training Data and Knowledge Sources

a. TRECVID SIN 2014 dataset, for training the concept detectors
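Returning to Section 4, the overall two-stage event detector can be sketched as a projection step followed by a linear SVM. The authors' fast kernel subclass-based DA is not reproduced here; scikit-learn's LinearDiscriminantAnalysis serves only as a generic stand-in for the projection, and all names are illustrative.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC

def build_event_detector():
    """Two-stage detector: discriminant projection, then a linear SVM.
    LinearDiscriminantAnalysis is a placeholder for the paper's fast
    kernel subclass-based DA, which is not reproduced here."""
    return make_pipeline(
        StandardScaler(),
        LinearDiscriminantAnalysis(),
        LinearSVC(C=1.0),
    )

# detector = build_event_detector().fit(X_train, y_train)  # y: event vs. background
# scores = detector.decision_function(X_test)
```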
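The score thresholding of Section 5 amounts to picking a fixed rank in the sorted score list, as in this short sketch (function names are ours; the ranks 300 and 100 are from the text).

```python
import numpy as np

def score_threshold(scores, rank):
    """Return the rank-th highest detector score as the detection threshold."""
    return np.sort(scores)[::-1][rank - 1]

# Per Section 5: the 300th score for the 010Ex task, the 100th for 100Ex.
# thr_010ex = score_threshold(video_scores, 300)
# thr_100ex = score_threshold(video_scores, 100)
```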