====Section 1, System Description====
The CMU system utilizes multiple modalities, classifiers and fusion methods to perform Multimedia Event Detection. The modalities include visual, audio and text. For 010Ex and 100Ex, two classifiers were used: linear SVM and linear regression. The fusion method used for 010Ex and 100Ex is the Multistage Hybrid Late Fusion, which combines many different fusion algorithms. For the SQ and 000Ex runs, we utilize concept detection results from 3,000 concept detectors during the SQG and ES stages.

====Section 2, Metadata Generator Description====
We first extract the following features:
1. SIFT bag-of-words and Fisher vectors, similar to last year.
2. Color SIFT (CSIFT) bag-of-words and Fisher vectors, similar to last year.
3. Transformed Color Histogram (TCH) bag-of-words and Fisher vectors, similar to last year.
4. Motion SIFT (MoSIFT) bag-of-words and Fisher vectors, similar to last year.
5. STIP bag-of-words and Fisher vectors, similar to last year.
6. MFCC bag-of-words and Fisher vectors, similar to last year.
7. CMU Improved Dense Trajectory. CMU Improved Dense Trajectory improves the original Improved Dense Trajectory in two ways: first, it achieves temporal scale invariance; second, it encodes spatial location information into the Fisher vector representation.
8. Automatic Speech Recognition (ASR). Our ASR system is based on Kaldi, an open-source speech recognition toolkit. We build the HMM/GMM acoustic model with speaker adaptive training. Our trigram language model is pruned aggressively to speed up decoding. When applied to the evaluation data, our ASR system generates the best hypothesis for each utterance. Two passes of decoding are performed with an overall real-time factor of 8.
9. Optical character recognition (OCR) from Nuance, similar to last year.
10. Semantic concept detectors. The shot-based semantic concept detectors are trained with a pipeline designed by our team, based on CascadeSVM and self-paced learning. Our system includes more than 3,000 concept detectors trained over around 2.7 million shots. Broadly, the detectors cover people, scenes, body movements, activities, sports, etc.
11. Large Scale Audio Features, similar to last year.
12. Noisemes (audio semantic classifiers) and emotion classifiers trained on IEMOCAP.
13. Deep convolutional neural networks (DCNN) trained on ImageNet.
14. UC. The UC representation is derived by projecting MFCC features onto the surface of a spherical manifold. The underlying hypothesis is that such a projection can enhance scale-invariant characteristics of the data. Conventional affinity-propagation-based clustering is performed on a collection of training data that have been projected onto the manifold. Subsequently, the clusters are used to compute bag-of-words representations of recordings, by projecting feature vectors from the recordings onto the sphere and assigning them to clusters.
15. Acoustic unit descriptors (AUDs). AUDs are automatically discovered temporally-patterned sound units that attempt to capture the underlying structure in the audio. The units are learned through statistical analysis of a large corpus of audio recordings, employing constraints that capture the expected natural structure of the data, such as the power-law occurrence of units. Subsequently, all recordings are decoded into sequences of these units. The AUDs representation is a bag-of-words representation computed from the unit sequences.

The bag-of-words features are passed through an explicit feature map so that linear classifiers can be applied. All features then go through dimensionality reduction so that event search can complete within 5 minutes.
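As a concrete illustration of the explicit-feature-map step, the following minimal Python sketch assumes scikit-learn's additive chi-squared approximation and synthetic bag-of-words histograms; the exact feature map, classifier settings and dimensionality-reduction step are not specified here, and all values below are illustrative only.

    import numpy as np
    from sklearn.kernel_approximation import AdditiveChi2Sampler
    from sklearn.svm import LinearSVC

    # Illustrative data: each row is a video-level bag-of-words histogram.
    rng = np.random.default_rng(0)
    X = rng.random((200, 500))
    X /= X.sum(axis=1, keepdims=True)         # L1-normalize the histograms
    y = rng.integers(0, 2, size=200)          # 1 = event exemplar, 0 = background

    # Explicit feature map: approximates an additive chi-squared kernel so that
    # a fast linear classifier can stand in for a much slower kernel SVM.
    feature_map = AdditiveChi2Sampler(sample_steps=2)
    X_mapped = feature_map.fit_transform(X)

    clf = LinearSVC(C=1.0)
    clf.fit(X_mapped, y)
    scores = clf.decision_function(X_mapped)  # ranking scores for event search

The appeal of this setup is that prediction with a linear model reduces to a single dot product per video, which is what keeps event search within the time budget.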
====Section 3, Semantic Query Generation Description====
Semantic query generation (SQG) proceeds in the following steps:
1. Find highly similar terms between the event-kit text and the concept vocabulary through word2vec and word-similarity computation based on indexed Wikipedia articles.
2. A user then goes through the list of relevant concepts found in the previous step and filters out irrelevant ones. The reliability/importance of each concept and the dataset on which it was trained are also considered during filtering.
3. The final query XML is generated once the concepts are confirmed by the user.

====Section 4, Event Query Generation Description====
000Ex: The 000Ex EQG is essentially the same as SQG except for one difference: for Pre-Specified events we explored a preliminary weighting scheme using 5,000 negative samples. Therefore the 000Ex run and the SQ run are likely to differ for Pre-Specified events, but they are identical for Ad-Hoc events.
010Ex/100Ex: These two conditions use the same pipeline for EQG. The 10/100 exemplars are combined with the 5,000 negatives to train linear SVM and linear regression models on all 47 features. We then train early fusion classifiers. Because this year we use linear classifiers with explicit feature maps that approximate kernel dot products, we cannot use the traditional double fusion. Instead, we use the proposed Multistage Hybrid Late Fusion algorithm to learn the fusion weights.

====Section 5, Event Search Description====
SQ/000Ex ES: For visual concepts, our search uses an attribute retrieval model. For ASR and OCR, it uses a standard language-model retrieval approach, with different smoothing methods applied to short and long queries. After retrieving the ranked lists for ASR, OCR and concepts, we apply a normalized fusion technique to fuse the ranked lists according to the weights provided in SQG (a sketch of such a fusion step is given at the end of this document).
010Ex/100Ex ES: We run prediction with the 47 × 2 models (a linear SVM and a linear regression model per feature) trained during the EQG step. The final prediction scores are computed by merging the ranked lists according to the fusion weights learned in EQG. The R0 threshold is computed using the same method as last year.
PRF for 000Ex and 010Ex: We utilize the MMPRF and Self-Paced Reranking methods to run pseudo-relevance feedback (PRF). DCNN, improved dense trajectory and MFCC features were used during PRF. We did not run PRF for the SQ and 100Ex runs.

====Section 6, Training Data and Knowledge Sources====
1. The TRECVID Semantic Indexing 2014 dataset was used to train concept detectors, whose predictions were then computed on the MED corpora.
2. ImageNet
3. UCF101 (http://crcv.ucf.edu/data/UCF101.php)
4. HMDB51 (http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/)
5. Yahoo Flickr Creative Commons (http://webscope.sandbox.yahoo.com/catalog.php?datatype=i&did=67)
6. IEMOCAP (http://sail.usc.edu/iemocap/)
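The normalized fusion of ranked lists described in Section 5 can be sketched as follows. This is a minimal illustration only, assuming min-max score normalization, a weighted sum over modalities, and a default score of zero for videos missing from a list; the modality weights and video IDs below are hypothetical and do not reflect the weights actually produced by SQG or the Multistage Hybrid Late Fusion.

    import numpy as np

    def minmax(scores):
        """Rescale a 1-D score array to [0, 1]; a constant list maps to all zeros."""
        lo, hi = scores.min(), scores.max()
        return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

    def fuse_ranked_lists(ranked_lists, weights):
        """Weighted sum of per-modality scores after normalization.

        ranked_lists: dict mapping modality name -> {video_id: raw score}
        weights:      dict mapping modality name -> fusion weight (e.g., from SQG)
        Returns (video_id, fused score) pairs sorted best-first.
        """
        videos = sorted(set().union(*(r.keys() for r in ranked_lists.values())))
        fused = np.zeros(len(videos))
        for modality, raw in ranked_lists.items():
            s = np.array([raw.get(v, 0.0) for v in videos])   # missing score -> 0
            fused += weights.get(modality, 0.0) * minmax(s)
        order = np.argsort(-fused)
        return [(videos[i], float(fused[i])) for i in order]

    # Toy example with hypothetical video IDs and per-modality scores.
    lists = {
        "asr":      {"HVC001": 2.1, "HVC002": 0.3, "HVC003": 1.7},
        "ocr":      {"HVC002": 0.9, "HVC003": 0.1},
        "concepts": {"HVC001": 0.6, "HVC002": 0.8, "HVC003": 0.2},
    }
    weights = {"asr": 0.3, "ocr": 0.2, "concepts": 0.5}
    print(fuse_ranked_lists(lists, weights))

The 010Ex/100Ex runs merge ranked lists in a similar weighted fashion, with the weights learned during EQG rather than specified in the semantic query.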