[Section 1] System Description
We used five visual and audio features, Gaussian mixture model (GMM) supervectors [1][2], and a Support Vector Machine (SVM).
1. Appearance: Dense HOG features with spatial pyramid (HOG_SP) [3]
2. Color: RGB-SIFT features with spatial pyramid (RGBSIFT_SP)
3. Motion: Dense trajectories with MBH features [4]
4. Motion: Dense HOG features with velocity pyramid (HOG_VP) [5]
5. Audio: MFCC features (MFCC)

[Section 2] Metadata Generator Description
We used GMM supervector kernels with visual and audio features. The features extracted from a clip are converted into a GMM supervector by the maximum a posteriori (MAP) adaptation technique (a sketch of this step is given after Section 6). As the prior distribution for MAP adaptation, we used a Universal Background Model (UBM), which was estimated from all the features in the training video clips. The following five feature types were used.
1. Appearance: Dense HOG features with spatial pyramid (HOG_SP) [3]
   The HOG features are extracted from 4x4-pixel grids, and each HOG feature has 32 dimensions. We applied PCA without reducing the dimension. The number of Gaussian mixtures is 512. Pyramid structure: 1x1, 2x2, and 3x1 (a sketch of the cell assignment is given after Section 6).
2. Color: RGB-SIFT features with spatial pyramid (RGBSIFT_SP)
   We extracted SIFT from the three color channels, giving a 384-dimensional feature. We applied PCA to reduce it to 64 dimensions. The number of Gaussian mixtures is 512. Pyramid structure: 1x1, 2x2, and 3x1.
3. Motion: Dense trajectories with MBH features [4]
   Resizing: we resize each frame to a width of 160 pixels and skip every other frame. We reduce the MBH feature to 64 dimensions by PCA. The number of Gaussian mixtures is 512.
4. Motion: Dense HOG features with velocity pyramid (HOG_VP) [5]
   We extract HOG features from 5x5-pixel grids at a 1-second frame interval, and each HOG feature has 32 dimensions. We applied PCA without reducing the dimension. The number of Gaussian mixtures is 512. Pyramid structure: original, 2 directions, 4 directions.
5. Audio: MFCC features (MFCC)

[Section 3] Semantic Query Generator Description
We used only signal-level features and no concept detectors, so we created blank semantic queries.

[Section 4] Event Query Generator Description
As the classifier, we used a Support Vector Machine (SVM) with an RBF kernel. A distance matrix is created from the provided training videos on the IO server and the precomputed GMM supervectors, and the SVM model is then trained on it (a sketch of this step is given after Section 6).

[Section 5] Event Search Description
We create the distance matrix between the training and testing videos and obtain an SVM score for each video and each feature. We then calculate each clip's detection score as the weighted average of the five SVM scores, where each SVM score is first normalized to [0, 1] by a sigmoid function. The fusion weights were determined by 2-fold cross-validation. The detection threshold is the average of the thresholds from the 2-fold cross-validation: the training set is divided into two parts, and for each part we compute the threshold that minimizes the NDC (a sketch of this step is given after Section 6).

[Section 6] Training data and knowledge sources
We did not use resources beyond the provided MED corpora.
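The following is a minimal sketch, in Python with NumPy, of the MAP-adaptation step of Section 2: the UBM means are adapted to the features of one clip and concatenated into a supervector. The function name, the relevance factor r, and the per-block normalization are illustrative assumptions; the exact settings are those of [1][2].

    import numpy as np

    def map_adapt_supervector(X, weights, means, variances, r=16.0):
        """X: (T, D) frame features of one clip; weights: (C,) and
        means/variances: (C, D) of a diagonal-covariance UBM.
        r is the relevance factor (an assumed typical value)."""
        # Per-frame log-likelihoods under each Gaussian (constants dropped).
        log_p = (np.log(weights)
                 - 0.5 * np.log(variances).sum(axis=1)
                 - 0.5 * (((X[:, None, :] - means[None]) ** 2)
                          / variances[None]).sum(axis=2))
        # Posterior responsibilities gamma_t(c).
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Zeroth- and first-order statistics.
        n = gamma.sum(axis=0)                       # (C,)
        F = gamma.T @ X                             # (C, D)
        # MAP update of the means; UBM weights and variances are kept.
        adapted = (F + r * means) / (n[:, None] + r)
        # Scale each block so that the inner product of two supervectors
        # approximates a GMM kernel, then concatenate.
        return (np.sqrt(weights)[:, None] * adapted
                / np.sqrt(variances)).reshape(-1)   # (C * D,)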
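HOG_SP and RGBSIFT_SP (Section 2) use a 1x1 + 2x2 + 3x1 spatial pyramid. The sketch below shows one way to assign feature points to the pyramid cells so that one supervector can be built per cell and concatenated; the helper's name and interface, and the orientation of the 3x1 grid, are assumptions for illustration.

    import numpy as np

    def pyramid_cells(points, width, height):
        """points: (N, 2) array of (x, y) feature locations in pixels.
        Returns one index array per pyramid cell (1 + 4 + 3 = 8 cells)."""
        x, y = points[:, 0], points[:, 1]
        cells = [np.arange(len(points))]            # 1x1 cell: all points
        # 2x2 grid, then 3x1 (taken here as vertical strips; the actual
        # orientation is not specified in this description).
        for gx, gy in [(2, 2), (3, 1)]:
            cx = np.minimum((x * gx / width).astype(int), gx - 1)
            cy = np.minimum((y * gy / height).astype(int), gy - 1)
            cells.extend(np.where((cx == i) & (cy == j))[0]
                         for j in range(gy) for i in range(gx))
        return cells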
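For the event query generator of Section 4, an RBF kernel over precomputed supervector distances can be plugged into an SVM via scikit-learn's precomputed-kernel interface, as sketched below. This is only an illustration under assumed settings: the kernel-width heuristic and C are placeholders, not the values used by the system.

    import numpy as np
    from sklearn.svm import SVC

    def train_and_score(D_train, y_train, D_test, gamma=None):
        """D_train: (n_train, n_train) squared distances between training
        supervectors; D_test: (n_test, n_train) test-to-training distances;
        y_train: binary event labels (1 = positive clip)."""
        if gamma is None:
            gamma = 1.0 / np.mean(D_train)          # assumed width heuristic
        clf = SVC(kernel='precomputed', C=1.0)
        clf.fit(np.exp(-gamma * D_train), y_train)  # RBF kernel from distances
        # Raw decision values; Section 5 maps them to [0, 1] with a sigmoid.
        return clf.decision_function(np.exp(-gamma * D_test))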
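The late fusion and threshold selection of Section 5 can be sketched as follows. Candidate thresholds are searched over the observed scores, and beta is derived from the commonly used MED cost parameters (C_Miss = 80, C_FA = 1, P_target = 0.001, giving beta of about 12.49); the actual constants are those of the evaluation plan, and the fusion weights are taken as given.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def ndc(scores, labels, thr, beta=12.4875):
        """Normalized detection cost PMiss + beta * PFA."""
        det = scores >= thr
        p_miss = np.mean(~det[labels == 1])         # missed positives
        p_fa = np.mean(det[labels == 0])            # false alarms
        return p_miss + beta * p_fa

    def fuse_and_threshold(svm_scores, labels, weights, folds):
        """svm_scores: (n_clips, 5) raw SVM outputs, one column per feature;
        weights: fusion weights from the 2-fold cross-validation;
        folds: the two training subsets as index arrays."""
        fused = sigmoid(svm_scores) @ weights       # normalize, then fuse
        thresholds = []
        for idx in folds:                           # one threshold per sub-set
            cand = np.unique(fused[idx])
            costs = [ndc(fused[idx], labels[idx], t) for t in cand]
            thresholds.append(cand[int(np.argmin(costs))])
        return fused, float(np.mean(thresholds))    # averaged threshold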
References
[1] N. Inoue, et al., "TokyoTech+Canon at TRECVID 2012", TRECVID Workshop, 2012.
[2] N. Inoue, et al., "A Fast MAP Adaptation Technique for GMM-supervector-based Video Semantic Indexing", In Proc. of ACM Multimedia, pp. 1357-1360, 2011.
[3] S. Lazebnik, et al., "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories", In Proc. of CVPR, pp. 2169-2178, 2006.
[4] H. Wang and C. Schmid, "Action Recognition with Improved Trajectories", In Proc. of ICCV, 2013.
[5] Z. Liang, N. Inoue and K. Shinoda, "Event Detection by Velocity Pyramid", In Proc. of the 20th Anniversary International Conference on Multimedia Modeling (MMM), pp. 353-364, 2014.

Contact information:
Zhuolin Liang, Tokyo Institute of Technology (zhuolin@ks.cs.titech.ac.jp)
Tran Hai Dang, Tokyo Institute of Technology (dang@ks.cs.titech.ac.jp)