==== Section 1, System Description ====

----section 1.1 VIREO_MED14_EvalFull_PS_000Ex----

In the 000Ex system, we build a metadata store with 1843 concepts in total, collected from:
<1>. SIN14.346
<2>. Research.Collection.497
<3>. ImageNet.ILSVRC12.1000 [5]
Moreover, 5784 documents collected from Wikipedia are indexed in the metadata store for measuring the inverse document frequency (IDF) score.

Video keyframes are extracted uniformly at a rate of one frame every two seconds. The detection responses of all 1843 concepts are predicted on all keyframes of the test videos. Max pooling is then used to fuse the keyframe responses of each video into an 1843-dimensional video representation, where each dimension is the confidence score of the corresponding concept detector for that video.

In the query phase, our system takes the event description of each event as the query. The queries are parsed by the Stanford CoreNLP parser [6], which analyzes the structure of sentences in terms of phrases and verbs. Note that the event explications are excluded from the queries, since such complex sentences are too difficult to parse reliably. We also apply lemmatization to both the queries and the concept names. The following steps are then executed to generate the semantic query for each event:
<1>. Concepts are automatically selected by keyword matching between the concept names and the query. The selected concepts are weighted by considering the term frequency in the query, the term IDF, the term relevance to the query, and the term specificity, which refers to the depth of the corresponding synset in the WordNet hierarchy.
<2>. All selected concepts are ranked by weight. The top 8 concepts are picked for each event; lower-ranked concepts are also retained if their weights equal the weight of the 8th concept. This normally results in 8-10 concepts being chosen for each event. These chosen concepts form the semantic query automatically generated by our system.

In the search phase, for each event we simply rank all test videos by the weighted sum of the confidence scores of the chosen concepts, using the weights determined during semantic query generation. In addition, videos whose OCR text matches the keywords extracted from the event description receive an extra score (bonus) equal to the tf-idf text score.
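To make the search-phase scoring concrete, the following is a minimal Python sketch of how a ranked list could be produced from the 1843-dimensional video representations, assuming the semantic query is given as a concept-to-weight mapping and the OCR bonus is supplied as a precomputed tf-idf score per video. The function and variable names are illustrative placeholders, not part of our actual implementation.

import numpy as np

def rank_videos(video_scores, concept_index, query_weights, ocr_bonus=None):
    """Rank test videos for one event by the weighted sum of concept scores.

    video_scores : (num_videos, 1843) array of detector confidence scores
                   (max-pooled over keyframes, as described above).
    concept_index: dict mapping concept name -> column index in video_scores.
    query_weights: dict mapping chosen concept name -> weight from the
                   semantic query generation step.
    ocr_bonus    : optional (num_videos,) array of tf-idf text scores for
                   videos whose OCR matches the event keywords (0 otherwise).
    """
    columns = [concept_index[c] for c in query_weights]
    weights = np.array([query_weights[c] for c in query_weights])

    # Weighted sum of the chosen concepts' confidence scores per video.
    scores = video_scores[:, columns] @ weights

    # Bonus for videos whose OCR text matches the event keywords.
    if ocr_bonus is not None:
        scores = scores + ocr_bonus

    # Return video indices sorted from highest to lowest score.
    return np.argsort(-scores), scores

# Toy usage with random data (shapes only; not real detector outputs).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video_scores = rng.random((5, 1843))
    concept_index = {f"concept_{i}": i for i in range(1843)}
    query = {"concept_3": 0.6, "concept_10": 0.3, "concept_42": 0.1}
    ranking, _ = rank_videos(video_scores, concept_index, query,
                             ocr_bonus=rng.random(5) * 0.1)
    print(ranking)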
----section 1.2 VIREO_MED14_EvalFull_PS_010Ex----

In the 010Ex system, we first decompose each video into two forms: keyframes and video clips. The keyframe sampling rate is one frame every two seconds, and the clip duration is five seconds. The metadata language in the metadata store is English. We use tesseract-ocr [7] to extract OCR text (English) at the keyframe level.

We extract the following raw features at the keyframe or clip level:
<1>. SIN14.346, extracted at keyframe level
<2>. Research.Collection.497, extracted at keyframe level
<3>. DCNN7 and ImageNet.ILSVRC12.1000 [4][5], extracted at keyframe level

We use different methods to generate a feature vector for each video. In detail:
<1>. The 346 concept detectors trained on the TRECVID 2014 SIN task are first applied to all keyframes. A video-level feature is then obtained by average pooling the responses over all keyframes of a video. We name this feature SIN14.346.
<2>. We select 497 concepts from the MED14 Research Collection dataset, annotate at most 200 positive keyframes for each concept, and train 497 concept classifiers. Similar in spirit, we apply these 497 concept classifiers to all keyframes and fuse the responses over all keyframes of a video into a video-level feature vector by average pooling. It is named Research.Collection.497.
<3>. A DCNN [4][5] is trained on the ImageNet 2012 dataset. The non-linear outputs of two layers (Layer 7 and Layer 8) are used as our DCNN features. For each of the two, we average pool the keyframe-level features over all keyframes of a video to form a video-level feature vector. For simplicity, the two are named DCNN7 and ImageNet.ILSVRC12.1000, respectively.

In this submission, we use SVM [3] with different kernels for different features (see the sketch after this list):
<1>. For DCNN7, event classifiers are trained using a Chi-Square SVM.
<2>. We concatenate the SIN14.346, Research.Collection.497, and ImageNet.ILSVRC12.1000 features into one feature vector and train event classifiers using a Chi-Square SVM.
Average fusion is used to combine the prediction scores from the SVMs described above with the scores of PS_000Ex to form PS_010Ex. In addition, videos whose OCR text matches the keywords extracted from the event description receive an extra score (bonus) equal to the tf-idf text score.
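As a concrete illustration of the Chi-Square SVM step, the sketch below trains an event classifier on concatenated video-level concept features using the exponential chi-square kernel with a precomputed kernel matrix. Our submission uses LIBSVM [3]; scikit-learn is used here only as a stand-in, and the data shapes, names, and gamma value are illustrative assumptions.

import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

# Illustrative data: video-level concept features are non-negative scores,
# so the chi-square kernel is applicable. Shapes are placeholders.
rng = np.random.default_rng(0)
X_train = rng.random((200, 346 + 497 + 1000))   # concatenated concept features
y_train = (rng.random(200) > 0.9).astype(int)   # 1 = positive exemplar
X_test = rng.random((50, 346 + 497 + 1000))

# Exponential chi-square kernel:
#   k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)).
# gamma would normally be tuned by cross-validation; 1.0 is just a placeholder.
gamma = 1.0
K_train = chi2_kernel(X_train, X_train, gamma=gamma)
K_test = chi2_kernel(X_test, X_train, gamma=gamma)

# Train the event classifier on the precomputed kernel (the equivalent of
# LIBSVM's "-t 4" precomputed-kernel mode when using its command-line tools).
clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y_train)

# Prediction scores for test videos; these are the scores that are later
# combined by average fusion with the other SVMs and the 000Ex scores.
scores = clf.decision_function(K_test)
print(scores[:5])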
----section 1.3 VIREO_MED14_EvalFull_PS_100Ex----

In the 100Ex system, we first decompose each video into two forms: keyframes and video clips. The keyframe sampling rate is one frame every two seconds, and the clip duration is five seconds. The metadata language in the metadata store is English. We use tesseract-ocr [7] to extract OCR text (English) at the keyframe level.

We extract the following raw features at the keyframe or clip level:
<1>. Improved Dense Trajectory [2], extracted at clip level
<2>. MFCC, LPC, LSF, OBSI, extracted at clip level
<3>. SIN14.346, extracted at keyframe level
<4>. Research.Collection.497, extracted at keyframe level
<5>. DCNN7 and ImageNet.ILSVRC12.1000 [4][5], extracted at keyframe level

We use different methods to generate a feature vector for each video. In detail:
<1>. For Improved Dense Trajectory, MFCC, LPC, LSF, and OBSI, Fisher vector encoding [1] is used to generate a feature vector for each video clip. The clip-level features are then pooled into one feature vector per video, using average pooling for Improved Dense Trajectory and max pooling for MFCC, LPC, LSF, and OBSI.
<2>. The 346 concept detectors trained on the TRECVID 2014 SIN task are first applied to all keyframes. A video-level feature is then obtained by average pooling the responses over all keyframes of a video. We name this feature SIN14.346.
<3>. We select 497 concepts from the MED14 Research Collection dataset, annotate at most 200 positive keyframes for each concept, and train 497 concept classifiers. Similar in spirit, we apply these 497 concept classifiers to all keyframes and fuse the responses over all keyframes of a video into a video-level feature vector by average pooling. It is named Research.Collection.497.
<4>. A DCNN [4][5] is trained on the ImageNet 2012 dataset. The non-linear outputs of two layers (Layer 7 and Layer 8) are used as our DCNN features. For each of the two, we average pool the keyframe-level features over all keyframes of a video to form a video-level feature vector. For simplicity, the two are named DCNN7 and ImageNet.ILSVRC12.1000, respectively.

In this submission, we use SVM [3] with different kernels for different features:
<1>. For Improved Dense Trajectory, MFCC, LPC, LSF, and OBSI, event classifiers are trained using a linear SVM.
<2>. For DCNN7, event classifiers are trained using a Chi-Square SVM.
<3>. We concatenate the SIN14.346, Research.Collection.497, and ImageNet.ILSVRC12.1000 features into one feature vector and train event classifiers using a Chi-Square SVM.
Average fusion is used to combine the prediction scores from the SVMs described above to form PS_100Ex. In addition, videos whose OCR text matches the keywords extracted from the event description receive an extra score (bonus) equal to the tf-idf text score.

----section 1.4 VIREO_MED14_EvalFull_PS_SQ----

For the SQ system, we first run the automatic semantic query generation of the 000Ex system and regard the top 30 concepts as candidates. We then manually remove the concepts that are irrelevant or insignificant to the event. This normally results in 7-11 concepts being chosen for each event. We manually classify these concepts into the categories of objects, actions, and scenes. In the search phase, the method is identical to the 000Ex system.

----section 1.5 VIREO_MED14_EvalFull_AH_000Ex----

Same procedure as VIREO_MED14_EvalFull_PS_000Ex.

----section 1.6 VIREO_MED14_EvalFull_AH_010Ex----

Same procedure as VIREO_MED14_EvalFull_PS_010Ex.

----section 1.7 VIREO_MED14_EvalFull_AH_SQ----

Same procedure as VIREO_MED14_EvalFull_PS_SQ.

==== Section 2, Metadata Generation Description ====

We first decompose each video into two forms: keyframes and video clips. The keyframe sampling rate is one frame every two seconds, and the clip duration is five seconds. The metadata language in the metadata store is English. We use tesseract-ocr [7] to extract OCR text (English) at the keyframe level. Moreover, 5784 documents collected from Wikipedia are indexed in the metadata store for measuring the inverse document frequency (IDF) score.

We have 3 sets of concept classifiers: SIN14.346, Research.Collection.497, and ImageNet.ILSVRC12.1000.
<1>. The SIN14.346 concept classifiers are trained on the TRECVID SIN 2014 dataset.
<2>. The Research.Collection.497 concept classifiers are trained on the MED14 Research Collection dataset. We selected 497 concepts from this dataset, annotated at most 200 positive keyframes for each concept, and trained 497 concept classifiers.
<3>. The ImageNet.ILSVRC12.1000 concept classifiers are trained on the ImageNet 2012 dataset. The non-linear outputs of the last layer of the DCNN [5] are used as the concept prediction scores of the ImageNet.ILSVRC12.1000 classifiers.
We use the 3 sets of concept classifiers to extract concept features at the keyframe level (see the sketch below). Other low-level features, such as Improved Dense Trajectory and MFCC, are also stored in the metadata store. The processing procedure is described in Section 1.
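The sketch below illustrates how keyframe-level outputs of the three concept classifier banks could be turned into the video-level concept features described above: average pooling for the classifier features of sections 1.2 and 1.3, or max pooling for the 1843-dimensional semantic representation of section 1.1. The classifier interface and all names are hypothetical placeholders, not our actual implementation.

import numpy as np

def video_concept_features(keyframe_features, banks, pooling="avg"):
    """Pool keyframe-level concept scores into a video-level vector.

    keyframe_features: (num_keyframes, d) array of raw keyframe features.
    banks            : list of callables, each mapping an (n, d) array of
                       keyframe features to an (n, k_i) array of concept
                       scores (here: SIN14.346, Research.Collection.497,
                       ImageNet.ILSVRC12.1000, so sum(k_i) = 1843).
    pooling          : "avg" for the video-level classifier features,
                       "max" for the 1843-d semantic representation.
    """
    # Predict every concept bank on every keyframe, then concatenate.
    per_frame = np.concatenate([bank(keyframe_features) for bank in banks], axis=1)

    # Fuse the keyframe responses into a single video-level vector.
    if pooling == "avg":
        return per_frame.mean(axis=0)
    return per_frame.max(axis=0)

# Toy usage: three stand-in banks returning random scores of the right sizes.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    banks = [lambda X, k=k: rng.random((X.shape[0], k)) for k in (346, 497, 1000)]
    keyframes = rng.random((30, 4096))     # e.g., 30 keyframes of one video
    video_vec = video_concept_features(keyframes, banks, pooling="max")
    print(video_vec.shape)                 # (1843,)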
==== Section 3, Semantic Query Generator Description ====

The semantic query generator takes the event description of each event as input. The event descriptions are parsed by the Stanford CoreNLP parser [6], which analyzes the structure of sentences in terms of phrases and verbs. Note that the event explications are excluded from the queries, since such complex sentences are too difficult to parse reliably. We also apply lemmatization to both the queries and the concept names. The following steps are then executed to generate the semantic query for each event:
<1>. Concepts are automatically selected by keyword matching between the concept names and the query. The selected concepts are weighted by considering the term frequency in the query, the term IDF, the term relevance to the query, and the term specificity, which refers to the depth of the corresponding synset in the WordNet hierarchy.
<2>. All selected concepts are ranked by weight. We then choose between automatic and manual generation depending on the task.
<3>. For automatic semantic query generation, the top 8 concepts are picked for each event; lower-ranked concepts are also retained if their weights equal the weight of the 8th concept. This normally results in 8-10 concepts being chosen for each event. These chosen concepts form the semantic query automatically generated by our system.
<4>. For manual semantic query generation, we first regard the top 30 concepts as candidates and then manually remove the concepts that are irrelevant or insignificant to the event. This normally results in 7-11 concepts being chosen for each event.
<5>. We manually classify these concepts into the categories of objects, actions, and scenes for readability.

==== Section 4, Event Query Generator Description ====

Each event is defined by a description file and several positive exemplars. An event query is a combination of an event detector (trained from the exemplars) and a semantic query. Since semantic query generation is described in the previous section, only the event detector is described here. For EQ_010Ex and EQ_100Ex, we extract features from the positive exemplars and from background exemplars, which are treated as negative exemplars. We then use SVM [3] to train the event detectors.

==== Section 5, Event Search Module Description ====

Generate rank list:
<1>. In ES_SQ, for each event we rank all test videos by the weighted sum of the detector confidences of the chosen concepts, using the weights determined during semantic query generation.
<2>. In ES_000Ex, for each event we rank all test videos by the weighted sum of the detector confidences of the chosen concepts, using the weights determined during semantic query generation; the final score of a video is further adjusted by keyword matching between the keywords selected from the event description and the OCR text extracted from each keyframe.
<3>. In ES_010Ex, we average the prediction score of the event detector and the relevance score obtained in the ES_SQ subtask to generate the ranked list. As mentioned in Section 1, keyword matching between the event description keywords and the OCR text of each keyframe is used to refine the video score.
<4>. In ES_100Ex, only the event detector prediction score is used to rank videos. For videos with OCR text, the scores are refined by keyword matching between the event description keywords and the OCR text of each keyframe.

Generate threshold: In principle, positive and negative videos of an event should follow different distributions of prediction scores, so a threshold can be obtained by maximizing the separation between the positive and negative distributions. Maximum entropy thresholding is used for this purpose in the image segmentation literature, and we directly apply this algorithm to search for the threshold on the prediction scores.
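For illustration, the sketch below applies maximum-entropy histogram thresholding in the style commonly used for image segmentation (Kapur-style entropy maximization) to a set of prediction scores. The report does not specify the exact variant, so details such as the number of histogram bins and the bimodal toy data are assumptions.

import numpy as np

def max_entropy_threshold(scores, num_bins=256):
    """Pick a score threshold by maximum-entropy (Kapur-style) thresholding.

    The score histogram is split at every candidate bin boundary; the
    threshold maximizing the sum of the entropies of the two parts is chosen.
    """
    hist, edges = np.histogram(scores, bins=num_bins)
    p = hist.astype(float) / hist.sum()          # normalized histogram

    best_t, best_h = edges[1], -np.inf
    for k in range(1, num_bins):                 # candidate split after bin k-1
        p_low, p_high = p[:k], p[k:]
        w_low, w_high = p_low.sum(), p_high.sum()
        if w_low == 0 or w_high == 0:
            continue
        # Entropy of each part under its own (renormalized) distribution.
        q_low = p_low[p_low > 0] / w_low
        q_high = p_high[p_high > 0] / w_high
        h = -(q_low * np.log(q_low)).sum() - (q_high * np.log(q_high)).sum()
        if h > best_h:
            best_h, best_t = h, edges[k]
    return best_t

# Toy usage: a bimodal score distribution standing in for negatives/positives.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = np.concatenate([rng.normal(0.2, 0.05, 5000),   # "negative" videos
                             rng.normal(0.7, 0.10, 100)])   # "positive" videos
    print(max_entropy_threshold(scores))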
==== Section 6, Training data and knowledge sources ====

<1>. The TRECVID Semantic Indexing 2014 dataset was used to train the SIN14.346 concept classifiers.
<2>. The ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC12) dataset was used to train the deep convolutional neural networks that form the ImageNet.ILSVRC12.1000 concept classifiers.
<3>. 5784 Wikipedia pages
<4>. WordNet 3.0

==== Section 7, Reference ====

[1] Fisher vector encoding: https://lear.inrialpes.fr/src/inria_fisher/
[2] Improved Trajectory encoding: https://lear.inrialpes.fr/people/wang/download/improved_trajectory_realease.tar.gz
[3] LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[4] DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition: https://github.com/UCB-ICSI-Vision-Group/decaf-release/
[5] Caffe: Deep Learning framework: http://caffe.berkeleyvision.org/
[6] Stanford CoreNLP: http://nlp.stanford.edu/software/corenlp.shtml
[7] tesseract-ocr: https://code.google.com/p/tesseract-ocr/