1. System Description:

The Sesame system uses the following modalities: EuvisionHigh, EuvisionLow, VideoStory, DTFV action recognition, ASR, ACR, MFCC, and TextSearch. Each modality generates its own metadata and event query. At the end of an event search, a fusion system combines the scores from the different modalities.

The EuvisionHigh system uses deep learning classification scores as video-level features and an SVM classifier to find positive videos.

The EuvisionLow system uses deep-learning-inspired features at the video level and an SVM classifier to find positive videos.

The VideoStory system uses deep-learning-inspired features at the video level in combination with a VideoStory embedding [1] and an SVM classifier to find positive videos.

The DTFV system uses Dense Trajectories to extract low-level action features and a linear SVM for classification.

The ASR system computes probabilistic word lattices from which we extract video-based 1-gram word counts for MED. After stemming, the counts form the metadata and are mapped to features using a log-mapping. These features are used to train a linear SVM with an l1 penalty.

The ACR system uses 73 acoustic models trained on the acoustic event annotations of SRI, ICSI, and CMU on the RESEARCH set. A linear SVM was used to train event classifiers using modified MFCCs with Fisher vector encoding. These classifiers are used to obtain localized information about the prominence of each of these acoustic events in each video. The SVM responses of the 73 classifiers, computed every second, form the metadata for this modality. We convert this signal into a fixed-length signature of length 292 for each video and use an RBF-SVM model for the final MED classification.

The MFCC system performs MED using low-level features only. The metadata consists of Fisher-vector-encoded MFCCs. A linear SVM is used for MED model training.

The TextSearch system uses the combined text detections from the 1-best output of both ASR and VOCR to determine whether a video depicts the event. We use a probabilistic language-modeling retrieval model, backed by a Markov Random Field (MRF). Here the event kit is considered the query and is scored against candidate video clips. Conditional probabilities are obtained from frequency counts over English-language Wikipedia, with Laplacian noise modeling. Following Metzler and Croft (2005) [2], we apply Dirichlet smoothing to merge background statistics with those from the clip level. All input text is lowercased and stemmed before use. Our MRF implementation permits dependencies between words, to form phrases, and also permits certain terms or concepts in the query to be weighted more highly than others.

The fusion system performs fusion in the following ways:
- 100Ex and 010Ex: A logistic-regression-based fusion is used to combine scores from the different modalities and provides a theoretically optimal threshold for the R0 metric.
- SQ: Scores from each modality are Z-normalized using statistics computed on the RESEARCH set on events E001-5, then fused using a sigmoid mapping followed by averaging.

2. Metadata Generator Description:

The EuvisionHigh system computes the deep learning classification scores on two frames per second. The scores are averaged per video to obtain a video-level representation.

The EuvisionLow system computes the deep learning features on two frames per second. The features are averaged per video to obtain a video-level representation.
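For concreteness, the frame-to-video pooling and SVM training used by the Euvision pipelines can be sketched as follows. This is a minimal illustration under stated assumptions, not the deployed implementation: the deep feature extractor is assumed to have already produced one feature vector per sampled frame, all function and variable names are hypothetical, and a linear SVM is used purely for simplicity.

```python
import numpy as np
from sklearn.svm import LinearSVC

def video_level_representation(frame_features):
    """Average per-frame deep features (sampled at 2 fps) into one video-level vector."""
    return np.asarray(frame_features, dtype=np.float64).mean(axis=0)

def train_event_detector(pos_videos, neg_videos, C=1.0):
    """Train a linear SVM event detector from video-level representations.

    pos_videos / neg_videos are lists of (n_frames x feature_dim) arrays of
    per-frame deep features (hypothetical inputs).
    """
    X = np.vstack([video_level_representation(v) for v in pos_videos + neg_videos])
    y = np.array([1] * len(pos_videos) + [0] * len(neg_videos))
    return LinearSVC(C=C).fit(X, y)
```

In the actual system the kernel, regularization, and feature dimensionality differ per modality; the sketch only shows the average-then-classify structure.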
The VideoStory system computes the deep learning features on two frames per second. The features are averaged per video to obtain a video-level representation.

The DTFV system uses dense trajectory features encoded with Fisher vectors. Features are computed over 2-second-long segments as well as over the entire video. For each segment, we also compute scores for 164 action concepts by applying linear SVMs to the DTFV features.

The ASR system uses an English ASR model trained on conversational telephone data and adapted to meetings data. We performed supervised acoustic model adaptation using the LDC201208 release, and unsupervised adaptation using first-pass recognition. We also performed supervised and unsupervised language model adaptation to the ALADDIN domain. ASR is used to compute probabilistic word lattices from which we extract video-based 1-gram word counts for MED. After stemming, the counts form the metadata.

The ACR system generates metadata in the following steps:
- Extract modified MFCC features from the audio signal every 25ms with a 10ms shift, using:
  * 8000Hz sampling rate
  * 40 linearly spaced filters
  * 20 cepstral coefficients
  * gain normalization
  * single deltas
- Compute Fisher vectors against a 256-Gaussian GMM codebook trained using the same MFCC features on the RESEARCH set:
  * obtain derivatives with respect to the means and variances
  * l2 normalization
  * square-root normalization
- Evaluate the acoustic event classifiers on the audio track of the video every second, using overlapping windows of width 2 seconds.

The MFCC system extracts modified MFCCs using 25ms windows shifted by 10ms, using:
  * 8000Hz sampling rate
  * 40 linearly spaced filters
  * 20 cepstral coefficients
  * gain normalization
  * DCT contextualization using a 31ms window, appending the 3rd DCT coefficient to the signal
It then computes Fisher vectors against a 256-Gaussian GMM codebook trained using the same MFCC features on the RESEARCH set:
  * obtain derivatives with respect to the means and variances
  * l2 normalization
  * square-root normalization

The TextSearch system uses the textual detections from ASR and VOCR directly. The metadata language consists of words and phrases deemed to be indicative of the event; a real-valued weight can be associated with each (default value 1.0).

3. Semantic Query Generator Description:

The Sesame system generates semantic queries using the VideoStory, TextSearch, and action concept modules.

The VideoStory system's semantic query generator takes the event description and turns it into a VideoStory representation.

The TextSearch semantic query generator starts from a set of English-language seed terms and uses circular similarities in a distributed word space (computed with word2vec [3] over Wikipedia) to identify additional terms that are similar or related to the seed terms. To facilitate rapid selection and removal of terms, we further cluster the additional terms to identify thematic groups. Each group is then described using 1-3 centroids, to provide a quick summary of its contents.

The action concept system takes query terms from the event kit text and finds the closest concept in our concept hierarchy.
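As a rough illustration of the term-expansion and clustering step used by the TextSearch semantic query generator, the sketch below retrieves neighbors of each seed term by cosine similarity in a word2vec space and groups them with k-means. This is an assumption-laden sketch, not the deployed code: the model path, the parameter values, and the use of gensim and scikit-learn are hypothetical.

```python
# Hypothetical sketch of TextSearch term expansion; model path and parameters are assumptions.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

kv = KeyedVectors.load_word2vec_format("wikipedia_word2vec.txt")  # hypothetical model file

def expand_terms(seed_terms, per_seed=20, n_groups=5):
    """Return candidate expansion terms grouped into thematic clusters."""
    candidates = set()
    for seed in seed_terms:
        if seed in kv:
            candidates.update(w for w, _ in kv.most_similar(seed, topn=per_seed))
    candidates = sorted(candidates - set(seed_terms))
    if not candidates:
        return {}
    vectors = np.vstack([kv[w] for w in candidates])
    labels = KMeans(n_clusters=min(n_groups, len(candidates)), n_init=10).fit_predict(vectors)
    groups = {}
    for word, label in zip(candidates, labels):
        groups.setdefault(int(label), []).append(word)
    return groups
```

Each group could then be summarized by the terms closest to its cluster centroid, mirroring the 1-3 centroid summaries mentioned above.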
4. Event Query Generator Description:

The EuvisionHigh system's event query generator uses the deep learning classification scores as features to train an SVM classifier that serves as the event detector.

The EuvisionLow system's event query generator uses the deep learning features to train an SVM classifier that serves as the event detector.

The VideoStory system's event query generator uses the deep learning features with a VideoStory embedding. A trained SVM classifier serves as the event detector.

The DTFV system uses a linear SVM learned from video-level DTFV features, using positives from the event kit and the given background videos. We also compute scores for the tags in the SQG on training data and keep only the three top-ranking tags.

The ASR system maps the stemmed word counts of training videos to log-counts using a soft cutoff of 1e-4. Counts from all the words are concatenated into a feature vector of dimension around 40,000. A linear SVM is then trained on these feature vectors using 10-fold cross-validation to tune the regularization factor, as well as to generate training scores for fusion. Videos with zero counts are used for training.

The ACR system concatenates four different kinds of statistics to create a vector of length 4 * 73 = 292 for each video:
  * average score
  * average of the 5 top-scoring frames
  * minimum of the 5 top-scoring frames
  * acoustic event response when run on the whole video
It trains an RBF-SVM classifier using 10-fold cross-validation to tune the regularization factor, as well as to generate training scores for fusion.

The MFCC system trains a linear SVM classifier using 10-fold cross-validation to tune the regularization factor, as well as to generate training scores for fusion.

Based on initial experiments with training concept weights, the TextSearch system forgoes the training phase and uses the Semantic Query directly in Event Search.

The Fusion system generates event models in the following ways:
- 100Ex and 010Ex: Using 10-fold cross-validation scores from every modality, form a feature vector for each trial (event/video pair) consisting of:
  * concatenated scores from each modality, using 0 if the score is missing
  * concatenated binary indicators for each modality, set to 1 if the score is missing for this trial and 0 otherwise
  Train a logistic regression classifier on these feature vectors for each event. Use 10-fold cross-validation to tune the regularization parameter. Use iterative retraining and modality selection if the trained LLR weights are negative. Compute and store the prior for each event, as well as the trained log-likelihood ratios obtained via cross-validation on the training data.

5. Event Search Description:

The EuvisionHigh system ranks videos based on the SVM classification results using the deep learning classification scores.

The EuvisionLow system ranks videos based on the SVM classification results using the deep learning features.

The VideoStory system ranks videos based on the SVM classification results using the deep learning features.

The DTFV system applies the learned linear SVM to the DTFV features of each video in the progress set.

The ASR system maps the stemmed word counts of testing videos to log-counts using a soft cutoff of 1e-4. The trained linear SVM is used to predict MED scores for videos whose expected word count is above 1e-3. Videos that do not make this cutoff have a score that is considered “missing” for fusion purposes.

The ACR system concatenates the four different kinds of statistics to create a vector of length 292 and then applies the previously trained RBF-SVM classifier.

The MFCC system simply applies the previously trained SVM classifier.

The TextSearch system converts a given event query into a Markov Random Field per Metzler and Croft (2005) [2], asserting each concept as a node. This model is then used to score each video.
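To make the TextSearch scoring concrete, the following sketch scores a clip's ASR/VOCR text against query terms using Dirichlet-smoothed query likelihood. It is simplified to the unigram, equal-weight case (the actual MRF also models term dependencies and per-concept weights), and the names, the background-count floor, and the value of mu are assumptions.

```python
import math
from collections import Counter

def textsearch_score(query_terms, clip_terms, background_counts, background_total, mu=2500.0):
    """Dirichlet-smoothed query-likelihood score of a clip's ASR/VOCR text (hypothetical sketch)."""
    clip_counts = Counter(clip_terms)
    clip_len = sum(clip_counts.values())
    score = 0.0
    for term in query_terms:
        # Background probability from Wikipedia-style frequency counts, with a small floor
        # standing in for the Laplacian noise modeling described above (assumption).
        p_bg = (background_counts.get(term, 0) + 0.5) / (background_total + 1.0)
        # Dirichlet smoothing: blend clip-level counts with background statistics.
        p = (clip_counts.get(term, 0) + mu * p_bg) / (clip_len + mu)
        score += math.log(p)
    return score
```

Phrase (dependency) cliques and per-concept weights from the full MRF would enter as additional weighted terms in the sum.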
The Fusion system performs the final fusion of detection scores in the following ways:
- 100Ex and 010Ex: Using ES scores from every modality, form a feature vector for each trial (event/video pair) as done during training. Apply the cross-validation models and subtract the training prior to obtain likelihood ratios. For 010Ex, the selected threshold was the optimal Bayesian threshold, assuming a similar prior on the training and test data. For 100Ex, we selected the threshold that maximized R0 on the training data using the cross-validation log-likelihood ratios.
- SQ: ES scores from the two modalities were Z-normalized using statistics computed on the RESEARCH set on events E001-5, then mapped to [0,1] using a sigmoid function. Mean score averaging gives the final fused score, with a score of 0 assumed for missing scores. Thresholds were computed using the RESEARCH set on E001-5.

6. Training data and knowledge sources:

The EuvisionHigh system's deep learning concepts are based on annotations from ImageNet.
The EuvisionLow system's deep learning features are based on annotations from ImageNet.
The VideoStory system's deep learning features are based on annotations from ImageNet. The VideoStory embedding is based on videos crawled from YouTube [1].
The Sesame system's action concepts are learned on the UCF101 action dataset [4] and the NIST SIN dataset.
The TextSearch vocabularies are from:
- English-language Wikipedia
- English-language Gigaword

7. References:

[1] A. Habibian, T. Mensink, and C.G.M. Snoek. VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events. In ACM Multimedia, 2014.
[2] D. Metzler and W.B. Croft. A Markov Random Field Model for Term Dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2005), pp. 472-479, 2005.
[3] word2vec, https://code.google.com/p/word2vec/
[4] UCF101 action dataset, http://crcv.ucf.edu/data/UCF101.php