# Readme: March 2016
# TrecVID 2016 automatic transcriptions
# copyright LIMSI-CNRS / Vocapia Research

Of the 4593 video files of the IACC.3 collection, 41 did not have an audio track. For the 4552 that had one, the audio track was extracted by LIG and sent to LIMSI for language identification and transcription. Of the 4552 wav files provided, the audio partitioner did not detect any speech in 94 files (one of which has a null track length).

The transcripts are in the directory xml.

  noaudio.lst        extractor    (41 files)
  nospeech.lst       partitioner  (94 files)
  nowords_asr2.lst   asr          (63 files)

Since the language of the audio data is unknown, it was automatically identified using the Vocapia Research/LIMSI language identification system v4.2. It is assumed that the audio file contains speech in only one language.

If no STT system was available for the detected language, the file was transcribed using the English STT system.
list: nomodel.lst  1184 files (*.unknown.eng.xml)

If a transcription system exists for the detected language, the processing depends on the language confidence score (lconf in the lid xml files).

If the LID score was 0.75 or higher (2186 files have a lconf >= 0.75), the audio file was transcribed with the detected language.
list: lid_075+.lst

For the files in this category, the distribution in terms of language is:

  1866 eng
   125 spa
    49 ger
    36 por
    31 fre
    19 slo - no STT system
    19 ita
    14 dut
    10 ara
     7 tur
     3 pol
     3 hun - no STT system
     2 gre
     1 swe - no STT system
     1 chi

STT systems are available for all but 3 of the detected languages (slo, hun and swe).

list: lid_075-.lst

Files with an LID confidence of under 0.75 (2271 files) for which a transcription system exists (448 files) were transcribed twice: once with the detected language (if other than English), and once with the English STT system (*.forced.eng.xml).

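The routing rules described above can be summarised in a short sketch. This is an illustration only, not the code used at LIMSI: the function name, the set of STT languages passed in, and the "detected" label are hypothetical, and the handling of files detected as English with a low lconf (a single English pass) is an assumption drawn from the "if other than English" wording.

```python
from typing import List, Set, Tuple

LCONF_THRESHOLD = 0.75  # threshold stated in this README


def plan_transcriptions(
    detected_lang: str,
    lconf: float,
    stt_languages: Set[str],
) -> List[Tuple[str, str]]:
    """Return the (stt_language, label) passes for one audio file.

    "unknown" and "forced" mirror the *.unknown.eng.xml and
    *.forced.eng.xml suffixes; "detected" is a hypothetical label.
    """
    if detected_lang not in stt_languages:
        # No STT system for the detected language: use the English system.
        return [("eng", "unknown")]
    if lconf >= LCONF_THRESHOLD or detected_lang == "eng":
        # Confident detection, or English detected (assumption: one pass).
        return [(detected_lang, "detected")]
    # Low confidence, non-English: detected language plus forced English.
    return [(detected_lang, "detected"), ("eng", "forced")]


if __name__ == "__main__":
    stt = {"eng", "spa", "ger"}  # hypothetical subset of STT languages
    print(plan_transcriptions("spa", 0.90, stt))
    print(plan_transcriptions("spa", 0.40, stt))
    print(plan_transcriptions("hun", 0.95, stt))
```

The availability check comes first because, per the text above, a language without an STT system is sent to the English system regardless of its lconf.
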
The detected language distribution for the files transcribed twice is:

   158 spa
    81 ger
    64 ara
    50 por
    34 fre
    27 dut
     8 ita
     6 pol
     5 rus
     4 tur
     4 gre
     3 rum
     3 lav
     1 chi

list: twice.lst

These transcripts are in the directory xml2.

Sometimes when the audio is transcribed with the detected language, no words are found. Usually the lconf score is low in these cases. (nowords_asr2.lst)

The transcripts include filler words ({fw}), breath ({breath}) and multiple hypotheses from consensus network decoding, and are not filtered to remove low confidence words.

Information about the ASR file format
-------------------------------------

At the start of the xml file is the list of speakers found in the file.
For each speaker the detected gender is given: male (gender="1") or female (gender="2").
tconf is the confidence score for the transcription (full doc or by speaker).
sconf and lconf are the speech/non-speech and language identification confidence scores.
nw is the total number of words (in the full doc and also per speaker).
For each word there is the start time, the duration and the word confidence score.
trs="1" means that it was automatically transcribed.

After the speaker list is the list of segments. Here there is again the speech/non-speech confidence score, the start and end times of the segment, the speaker and the language. Then there is an entry for each word.

Acknowledgment required if you use the transcriptions
-----------------------------------------------------

The models used by the system have been updated with partial support from the Quaero program.

J.-L. Gauvain. The Quaero Program: Multilingual and Multimedia Technologies. IWSLT 2010, Paris, Dec. 2010.

L. Lamel. Multilingual Speech Processing Activities in Quaero: Application to Multimedia Search in Unstructured Data. The Fifth International Conference Human Language Technologies - The Baltic Perspective, Tartu, Estonia, October 4-5, 2012.