1. System Description

The BBNVISER system consists of four main components: (a) a metadata generator (MG), (b) a semantic query generator (SQG), (c) an event query generator (EQG), and (d) an event search module (ESM).

(a) The MG extracts multiple low-level audio and visual features, ASR and OCR transcripts, and audio-visual semantic features. All semantic metadata are in English and can be objects, actions, or combinations of both.

(b) The SQG expands and projects the user query into the system vocabularies (ASR, OCR, audio-visual) and keeps the most significant concepts.

(c) For the 100Ex and 010Ex training conditions, the EQG trains a linear SVM model for each event. The event models involve early fusion of features within a given modality and late fusion of the ASR, OCR, and audio-visual modalities. For the 000Ex training condition, the EQG expands the semantic query (SQ) XML into the corresponding concept lexicon space.

(d) The ESM searches the video test database for a particular event and returns a list of videos ranked by the probability of belonging to the searched event. It also returns a detection threshold above which a video is considered a positive for the searched event.

More details on these modules are provided in the next sections.

2. Metadata Generator Description

The BBNVISER metadata generator (MG) extracts the following features:
- Appearance: D-SIFT Fisher vectors
- Color: Opponent D-SIFT Fisher vectors
- Motion: dense-trajectory HOG, HOF, and MBH Fisher vectors
- Audio: MFCC Fisher vectors
- Deep learning: ImageNet DCNN features
- VideoWords audio-visual features
- ASR word lattice features
- OCR word lattice features

All the above features are stored in a compressed format for efficient disk I/O (see the storage sketch at the end of this description). The MG runs on a cluster of compute nodes. It takes about 14 days to extract all the above features from ~200,000 videos.

3. Semantic Query Generator Description

We use the Indri Document Retrieval System to search a static offline corpus of around 100,000 Wikipedia and Gigapedia documents for the words/terms most relevant to the given event name. These words (weighted by a TF-IDF measure) are in turn projected onto the most semantically similar concepts using a text-corpus knowledge source (such as Gigaword); a toy sketch of this expansion and projection is given at the end of this description.

4. Event Query Generator Description

For the 100Ex (resp. 010Ex) training condition, we trained linear soft-margin SVM classifiers for each feature based on 100 (resp. 10) positives and about 5,000 negatives (event background). The parameters of the SVM classifiers are optimized using k-fold cross-validation, and early and late fusion are performed (see the training sketch at the end of this description). The EQG runs on a single 16-core COTS computer. Training an event model takes 10 to 30 minutes, depending on the number of cross-validation folds and the sampling of the parameter space. For the 000Ex training condition, we expanded the semantic query by adding more relevant concepts from our concept vocabularies.

5. Event Search Description

The score of each test video for a particular event is obtained by applying the event model to the test video. Since this only involves the computation of inner products and some posterior score normalization, the event search is very fast: it takes about 2 minutes to rank a list of ~200,000 videos across all modalities on a single 16-core COTS computer (see the scoring sketch at the end of this description).

6. Training Data and Knowledge Sources

- The event models were trained using the official TRECVID training partitions.
- The audio-visual concept detectors were trained on the research-set videos and on a set of YouTube videos.
- The deep learning architecture was trained on the ImageNet 2012 dataset.
- The speech recognition models and the speech activity detection models were trained on about 1,600 hours of broadcast news data and a small amount of web data.
- The OCR models and the videotext detection models were trained on document images with various fonts and about 1,000 broadcast news video frames.
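
Storage sketch (Section 2). The text states only that the extracted features are stored in a compressed format for efficient disk I/O; the actual format is not specified. The snippet below is a minimal sketch, assuming NumPy compressed archives as a stand-in, of how per-video Fisher vectors could be written and read back. The file name and feature keys are made up for illustration.

```python
"""Minimal illustration of compressed per-video feature storage (Section 2).

The BBNVISER on-disk format is not described in the text; np.savez_compressed
is an assumed stand-in for "a compressed format for efficient disk I/O".
"""
import numpy as np

def save_video_features(path, video_id, features):
    """Write one video's feature vectors (e.g. Fisher vectors) compressed."""
    np.savez_compressed(path, video_id=video_id, **features)

def load_video_features(path):
    """Read the compressed archive back into a dict of arrays."""
    with np.load(path) as data:
        return {k: data[k] for k in data.files}

# Toy example: D-SIFT and MFCC Fisher vectors for one hypothetical video.
features = {"dsift_fv": np.random.rand(65536).astype(np.float32),
            "mfcc_fv": np.random.rand(16384).astype(np.float32)}
save_video_features("HVC000001.npz", "HVC000001", features)
print(sorted(load_video_features("HVC000001.npz").keys()))
```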
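
Query-expansion sketch (Section 3). This is not the BBNVISER implementation: Indri and the Gigaword-based similarity model are replaced here by scikit-learn TF-IDF over a toy corpus and a toy word-embedding dictionary, purely to illustrate the two stages the text describes (retrieve TF-IDF-weighted expansion terms, then project them onto the system's concept vocabulary). The corpus, concept vocabulary, and embeddings are all assumptions.

```python
"""Sketch of semantic query generation (Section 3): expansion + projection."""
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the Wikipedia/Gigapedia corpus and the concept vocabulary.
corpus = [
    "a dog runs to catch a frisbee in the park",
    "people repair a bicycle tire with hand tools",
    "a birthday party with cake, candles and singing",
]
concept_vocabulary = ["dog", "bicycle", "cake", "singing", "tools", "park"]

# Hypothetical 4-d embeddings; in practice these would come from a large
# text corpus such as Gigaword.
embeddings = {w: np.random.RandomState(abs(hash(w)) % 2**31).rand(4)
              for w in concept_vocabulary + ["birthday", "candles", "party"]}

def expand_query(event_name, top_k=5):
    """Return (term, TF-IDF weight) pairs most relevant to the event name."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(corpus)
    query_vec = vectorizer.transform([event_name])
    # Pick the most similar document and keep its highest-weighted terms.
    sims = (doc_matrix @ query_vec.T).toarray().ravel()
    best_doc = doc_matrix[sims.argmax()].toarray().ravel()
    terms = vectorizer.get_feature_names_out()
    order = best_doc.argsort()[::-1][:top_k]
    return [(terms[i], float(best_doc[i])) for i in order if best_doc[i] > 0]

def project_to_concepts(weighted_terms):
    """Map each expansion term to its most similar in-vocabulary concept."""
    projected = {}
    for term, weight in weighted_terms:
        if term not in embeddings:
            continue
        v = embeddings[term]
        scores = {c: float(v @ embeddings[c] /
                           (np.linalg.norm(v) * np.linalg.norm(embeddings[c])))
                  for c in concept_vocabulary}
        best = max(scores, key=scores.get)
        projected[best] = max(projected.get(best, 0.0), weight * scores[best])
    return projected

print(project_to_concepts(expand_query("birthday party")))
```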
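
Training sketch (Section 4). The snippet below illustrates the pattern described in the text for the 100Ex/010Ex conditions: one linear soft-margin SVM per feature, its C parameter chosen by k-fold cross-validation, and late fusion across modalities by combining normalized per-modality scores. The synthetic features, the fusion weights, and the sigmoid score normalization are assumptions, not the BBNVISER settings.

```python
"""Sketch of event-model training for the 100Ex/010Ex conditions (Section 4)."""
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def make_toy_features(n_pos=100, n_neg=5000, dim=256):
    """Synthetic stand-in for one modality's feature vectors."""
    X = np.vstack([rng.normal(0.3, 1.0, (n_pos, dim)),
                   rng.normal(0.0, 1.0, (n_neg, dim))])
    y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    return X, y

def train_event_model(X, y, folds=5):
    """Linear soft-margin SVM with C tuned by k-fold cross-validation."""
    grid = GridSearchCV(LinearSVC(dual=True, max_iter=10000),
                        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
                        cv=folds, scoring="average_precision")
    grid.fit(X, y)
    return grid.best_estimator_

def normalized_scores(model, X):
    """Squash SVM margins into (0, 1) with a sigmoid; a stand-in for the
    posterior score normalization mentioned in the text."""
    return 1.0 / (1.0 + np.exp(-model.decision_function(X)))

# One model per modality (audio-visual, ASR, OCR), then late fusion.
modalities = ["audio_visual", "asr", "ocr"]
fusion_weights = {"audio_visual": 0.6, "asr": 0.25, "ocr": 0.15}  # assumed

models, test_sets = {}, {}
for m in modalities:
    X, y = make_toy_features()
    models[m] = train_event_model(X, y)
    test_sets[m], _ = make_toy_features(n_pos=20, n_neg=200)

fused = sum(fusion_weights[m] * normalized_scores(models[m], test_sets[m])
            for m in modalities)
print("top-5 fused scores:", np.sort(fused)[::-1][:5])
```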
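
Scoring sketch (Section 5). With a linear event model, ranking the test archive reduces to one inner product per video per modality plus score normalization, which is why the search over ~200,000 videos is fast. The weight vector, bias, and detection threshold below are made up, and the archive is shrunk to a toy size; only the structure of the computation is meant to match the description.

```python
"""Sketch of the event search step (Section 5): score, rank, threshold."""
import numpy as np

rng = np.random.default_rng(1)
n_videos, dim = 20_000, 256          # toy size; the real archive is ~200,000

X_test = rng.normal(size=(n_videos, dim))   # test-archive features
w = rng.normal(size=dim)                    # learned SVM weights (assumed)
b = -0.5                                    # learned SVM bias (assumed)
threshold = 0.7                             # detection threshold (assumed)

# Raw margins via a single inner product per video, then a sigmoid
# normalization as a stand-in for the posterior calibration in the text.
scores = 1.0 / (1.0 + np.exp(-(X_test @ w + b)))

ranking = np.argsort(scores)[::-1]               # videos ranked by event score
positives = np.flatnonzero(scores >= threshold)  # detections above threshold

print("best-scoring video index:", ranking[0])
print("detections above threshold:", positives.size)
```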