# Readme: March 2016
# TrecVID 2016 automatic transcriptions
# copyright LIMSI-CNRS / Vocapia Research
Of the 4593 video files of the IACC.3 collection, 41 did not have an audio
track. For the 4552 that had one, the audio track was extracted by LIG and
sent to LIMSI for language identification and transcription.
Of the 4552 wav files provided, the audio partitioner did not detect any
speech in 94 files (one of which has a null track length). The transcripts
are in the directory xml.
noaudio.lst       extractor    (41 files)
nospeech.lst      partitioner  (94 files)
nowords_asr2.lst  asr          (63 files)
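
The file counts above can be cross-checked with a few lines of arithmetic.
This is only a sketch: the variable names are chosen here for illustration,
and the numbers are copied from this readme.

```python
# Counts taken from this readme; variable names are illustrative only.
total_videos = 4593   # video files in the IACC.3 collection
no_audio = 41         # noaudio.lst: no audio track found by the extractor
no_speech = 94        # nospeech.lst: no speech found by the partitioner

with_audio = total_videos - no_audio   # wav files sent for processing
with_speech = with_audio - no_speech   # files in which speech was detected

print(with_audio, with_speech)
```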
Since the language of the audio data is unknown, it was identified
automatically using the Vocapia Research/LIMSI language identification (LID)
system v4.2. It is assumed that each audio file contains speech in only one
language.
If no speech-to-text (STT) system was available for the detected language,
the file was transcribed using the English STT system.
list: nomodel.lst 1184 files (*.unknown.eng.xml)
If a transcription system exists for the detected language, the processing
depends on the language confidence score (lconf in the lid xml files). If
lconf was 0.75 or higher (2186 files), the audio file was transcribed with
the detected language.
list: lid_075+.lst
For the files in this category, the distribution in terms of language is:
1866 eng
125 spa
49 ger
36 por
31 fre
19 slo - no STT system
19 ita
14 dut
10 ara
7 tur
3 pol
3 hun - no STT system
2 gre
1 swe - no STT system
1 chi
STT systems are available for all but 3 of the detected languages (slo, hun
and swe).
list: lid_075-.lst
Files with an LID confidence under 0.75 (2271 files) for which a
transcription system exists (448 files) were transcribed twice: once with the
detected language (if other than English) and once with the English STT
system (*.forced.eng.xml).
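
The selection rules described in this readme can be summarised as a small
decision function. This is a minimal sketch, not the code used to produce the
transcripts: the function name and the STT_LANGS set are assumptions, with the
set reconstructed from the language lists in this readme (it may be
incomplete).

```python
# Assumption: languages with an STT system, reconstructed from the
# distributions in this readme (slo, hun and swe are marked as having none).
STT_LANGS = {"eng", "spa", "ger", "por", "fre", "ita", "dut", "ara",
             "tur", "pol", "gre", "chi", "rus", "rum", "lav"}
LCONF_THRESHOLD = 0.75  # threshold stated in this readme


def plan_transcription(lang, lconf, stt_langs=STT_LANGS,
                       threshold=LCONF_THRESHOLD):
    """Return the list of STT languages to run for one audio file.

    lang  -- language code detected by the LID system
    lconf -- LID confidence score for that language
    """
    if lang not in stt_langs:
        return ["eng"]        # no model: fall back to English
    if lconf >= threshold:
        return [lang]         # confident: detected language only
    if lang == "eng":
        return ["eng"]        # low confidence, but already English
    return [lang, "eng"]      # low confidence: transcribe twice


print(plan_transcription("spa", 0.92))
print(plan_transcription("spa", 0.40))
print(plan_transcription("swe", 0.90))
```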
The detected language distribution for the files transcribed twice is:
158 spa
81 ger
64 ara
50 por
34 fre
27 dut
8 ita
6 pol
5 rus
4 tur
4 gre
3 rum
3 lav
1 chi
list: twice.lst
These transcripts are in the directory xml2. Sometimes, when the audio is
transcribed with the detected language, no words are found; the lconf score
is usually low in these cases (list: nowords_asr2.lst).
The transcripts include filler words ({fw}), breath noises ({breath}) and
multiple hypotheses from consensus network decoding, and are not filtered to
remove low-confidence words.
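
Since the transcripts are unfiltered, users may want to drop the bracketed
markers and low-confidence words themselves. The sketch below assumes the
words have already been read into (token, confidence) pairs; the function
name and the 0.5 threshold are illustrative choices, not part of this
release, and alternative hypotheses are not handled.

```python
import re

# Tokens written in braces, such as {fw} and {breath}, are not lexical words.
NON_WORD = re.compile(r"^\{.*\}$")


def filter_words(words, min_conf=0.5):
    """Keep lexical words whose confidence is at least min_conf.

    words    -- iterable of (token, confidence) pairs, already parsed
    min_conf -- illustrative threshold; tune it for the task at hand
    """
    return [(tok, conf) for tok, conf in words
            if not NON_WORD.match(tok) and conf >= min_conf]


example = [("hello", 0.9), ("{fw}", 0.8), ("um", 0.2),
           ("{breath}", 1.0), ("world", 0.5)]
print(filter_words(example))
```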
Information about the ASR file format
-------------------------------------
At the start of the xml file is the list of speakers found in the file.
For each speaker, the detected gender is given as male (gender="1") or
female (gender="2"). tconf is the confidence score for the transcription
(for the full document or by speaker); sconf and lconf are the
speech/non-speech and language identification confidence scores. nw is the
total number of words (in the full document and also per speaker).
After the speaker list is the list of segments. Each segment again gives the
speech/non-speech confidence score, the start and end times of the segment,
the speaker and the language.
Within a segment there is an entry for each word, giving its start time,
duration and word confidence score. The attribute trs="1" indicates
automatic transcription.
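
As an illustration of reading the speaker attributes described above, the
sketch below collects gender, tconf and nw from every element that carries a
gender attribute. It deliberately does not rely on element names, which are
not spelled out in this readme; the element names in the sample string are
invented for the example and are not the actual file format.

```python
import xml.etree.ElementTree as ET

GENDER = {"1": "male", "2": "female"}  # codes as described in this readme


def speaker_summary(xml_text):
    """Return gender, tconf and nw for each element with a gender attribute."""
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter():
        if "gender" in elem.attrib:
            rows.append({
                "gender": GENDER.get(elem.get("gender"), "unknown"),
                "tconf": float(elem.get("tconf", "nan")),
                "nw": int(elem.get("nw", "0")),
            })
    return rows


# Hypothetical sample: element names are invented for illustration only.
sample = """<Doc><Speakers>
<Spk gender="1" tconf="0.81" nw="120"/>
<Spk gender="2" tconf="0.64" nw="45"/>
</Speakers></Doc>"""
print(speaker_summary(sample))
```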
Acknowledgment required if you use the transcriptions
-----------------------------------------------------
The models used by the system have been updated with partial support
from the Quaero program.
J.-L. Gauvain. The Quaero Program: Multilingual and Multimedia
Technologies. IWSLT 2010, Paris, Dec. 2010.
L. Lamel. Multilingual Speech Processing Activities in Quaero:
Application to Multimedia Search in Unstructured Data. The Fifth
International Conference Human Language Technologies - The Baltic
Perspective, Tartu, Estonia, October 4-5, 2012.