# Readme:  March 2016
# TrecVID 2016 automatic transcriptions
# copyright LIMSI-CNRS / Vocapia Research 

Of the 4593 video files of the IACC.3 collection, 41 did not have an audio
track. For the 4552 that had one, the audio track was extracted by LIG and
sent to LIMSI for language identification and transcription.

Of the 4552 wav files provided, the audio partitioner detected no speech
in 94 files (one of which has a null track length).  The transcripts
are in the directory xml.

 noaudio.lst       extractor    (41 files)
 nospeech.lst      partitioner  (94 files)
 nowords_asr2.lst  asr          (63 files)

Since the language of the audio data is unknown, it was automatically
identified using the Vocapia Research/LIMSI language identification system
v4.2. Each audio file is assumed to contain speech in only one language.

If no STT system was available for the detected language, the file was transcribed
using the English STT system.

 list: nomodel.lst  1184 files  (*.unknown.eng.xml)

If a transcription system exists for the detected language, the processing
depends on the language confidence score (lconf in the lid xml files).  If the
LID score was 0.75 or higher (2186 files have a lconf >= 0.75), the audio file
was transcribed with the detected language.

 list: lid_075+.lst

For the files in this category, the distribution in terms of language is:
     1866 eng
      125 spa
       49 ger
       36 por
       31 fre
       19 slo - no STT system
       19 ita
       14 dut
       10 ara
        7 tur
        3 pol
        3 hun - no STT system
        2 gre
        1 swe - no STT system
        1 chi

STT systems are available for all but 3 of the detected languages (slo, hun
and swe). 

 list: lid_075-.lst 

Files with an LID confidence under 0.75 (2271 files) for which a
transcription system exists (448 files) were transcribed twice: once with the
STT system for the detected language (if other than English), and once with
the English STT system (*.forced.eng.xml).

The detected language distribution for the files transcribed twice is:
      158 spa
       81 ger
       64 ara
       50 por
       34 fre
       27 dut
        8 ita
        6 pol
        5 rus
        4 tur
        4 gre
        3 rum
        3 lav
        1 chi

  list: twice.lst

These transcripts are in the directory xml2.  Sometimes, when the audio is
transcribed with the detected language, no words are found; the lconf
score is usually low in these cases (see nowords_asr2.lst).
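
The routing rules described above can be sketched as a small Python function.
This is only an illustration of the decision logic as stated in this README,
not the code that was actually run. The function name plan_passes and the
STT_LANGS set are hypothetical; the set is simply collected from the two
language distributions above and may be incomplete. The sketch uses the
three-letter codes of the lists, not the longer lang values (such as eng-usa)
found in the xml files.

```python
LID_THRESHOLD = 0.75  # lconf threshold stated in this README

# Assumption: languages shown with an STT system in the two distributions
# above (slo, hun and swe are listed as having no STT system).
STT_LANGS = {
    "eng", "spa", "ger", "por", "fre", "ita", "dut", "ara",
    "tur", "pol", "gre", "chi", "rus", "rum", "lav",
}


def plan_passes(lang, lconf, stt_langs=STT_LANGS):
    """Return the list of STT languages used to transcribe one file."""
    if lang not in stt_langs:
        # No STT system for the detected language: English fallback.
        return ["eng"]
    if lconf >= LID_THRESHOLD or lang == "eng":
        # Confident detection, or English detected: a single pass.
        return [lang]
    # Low confidence, non-English: detected language, then forced English.
    return [lang, "eng"]
```

For example, plan_passes("spa", 0.50) gives two passes (spa, then eng), while
plan_passes("slo", 0.99) gives a single English pass.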

The transcripts include filler words ({fw}) and breath markers ({breath}), as
well as multiple hypotheses from consensus network decoding, and are not
filtered to remove low confidence words.


Information about the ASR file format
-------------------------------------

At the start of the xml file is the list of speakers found in the file.

For each speaker the detected gender is given: male (gender="1") or female
(gender="2").  tconf is the confidence score for the transcription (for the
full doc or by speaker).  sconf and lconf are the speech/non-speech and
language identification confidence scores.  nw is the total number of words
(in the full doc and also per speaker).  For each word there is the start
time, the duration and the word confidence score.  trs="1" means that the
segment was automatically transcribed.

<SpeakerList> 
<Speaker ch="1" dur="9.36" gender="2" spkid="1" lang="eng-usa" lconf="1.00" nw="12" tconf="0.84"/>
</SpeakerList>

After the speaker list is the list of segments:
<SegmentList>
<SpeechSegment ch="1" sconf="1.00" stime="1.53" etime="10.89" spkid="1" lang="eng-usa" lconf="1.00" trs="1">

Here there is again the speech/non-speech confidence score, the start and end
times of the segment, the speaker and the language.

Then there is an entry for each word:
<Word stime="1.61" dur="0.20" conf="0.741"> and </Word>
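
As a rough illustration of how these elements could be read, here is a short
Python sketch using the standard xml.etree.ElementTree module. The Speaker,
SpeechSegment and Word attributes are the ones shown above; the AudioDoc root
element name, the second and third Word entries, and the helper names
read_speakers and read_words are assumptions made up for the example. Because
the transcripts are not filtered, the sketch shows one possible way to drop
low confidence words and the {fw} / {breath} markers. It does not attempt to
handle the alternative hypotheses from consensus network decoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: root name and last two words are made up.
SAMPLE = """<AudioDoc>
<SpeakerList>
<Speaker ch="1" dur="9.36" gender="2" spkid="1" lang="eng-usa" lconf="1.00" nw="12" tconf="0.84"/>
</SpeakerList>
<SegmentList>
<SpeechSegment ch="1" sconf="1.00" stime="1.53" etime="10.89" spkid="1" lang="eng-usa" lconf="1.00" trs="1">
<Word stime="1.61" dur="0.20" conf="0.741"> and </Word>
<Word stime="1.81" dur="0.15" conf="0.300"> {fw} </Word>
<Word stime="1.96" dur="0.30" conf="0.412"> then </Word>
</SpeechSegment>
</SegmentList>
</AudioDoc>"""

GENDER = {"1": "male", "2": "female"}  # coding given in this README


def read_speakers(root):
    """Map each spkid to its detected gender."""
    return {
        spk.get("spkid"): GENDER.get(spk.get("gender"), "unknown")
        for spk in root.iter("Speaker")
    }


def read_words(root, min_conf=0.0, keep_markers=True):
    """Return (text, start, end, conf, spkid) tuples for the kept words."""
    words = []
    for seg in root.iter("SpeechSegment"):
        for word in seg.iter("Word"):
            text = (word.text or "").strip()
            conf = float(word.get("conf"))
            if conf < min_conf:
                continue  # drop low confidence words
            is_marker = text.startswith("{") and text.endswith("}")
            if is_marker and not keep_markers:
                continue  # drop {fw}, {breath} and similar markers
            start = float(word.get("stime"))
            end = start + float(word.get("dur"))
            words.append((text, start, end, conf, seg.get("spkid")))
    return words


root = ET.fromstring(SAMPLE)
speakers = read_speakers(root)
kept = read_words(root, min_conf=0.5, keep_markers=False)
```

For a real file, ET.parse(path).getroot() would replace ET.fromstring(SAMPLE).
The 0.5 threshold is arbitrary and should be tuned to the intended use.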


Acknowledgment required if you use the transcriptions
-----------------------------------------------------

The models used by the system have been updated with partial support
from the Quaero program.

 J.-L. Gauvain. The Quaero Program: Multilingual and Multimedia
 Technologies. IWSLT 2010, Paris, Dec. 2010.

 L. Lamel. Multilingual Speech Processing Activities in Quaero:
 Application to Multimedia Search in Unstructured Data. The Fifth
 International Conference Human Language Technologies - The Baltic
 Perspective, Tartu, Estonia, October 4-5, 2012.