++ SHoUT file formats ++

This document describes the format of the SHoUT files as they are provided for the 2007 TRECVID evaluation. For each video, the word-based Automatic Speech Recognition (ASR) output is provided in native SHoUT XML format and in MPEG7 format. In addition, a word-based lattice is provided for each segment (see [1]) of each video file.

++ Native SHoUT XML output ++

For each video file, one XML-based automatic speech recognition output file is created. The root element is "shout_metadata". The elements of "shout_metadata" are:

- model_info: contains the names of the models used during decoding (AM, DCT, LM and VTLN)
- decoding_settings: contains the parameter settings for decoding
- segments: contains the actual ASR output
- statistics: contains timing information

The ASR output is in the "segments" element, which consists of a list of "speaker" elements. Each "speaker" element has a unique "label" attribute and a list of "speech" elements. Each "speech" element contains the speech from a single segment. Note that the speech is not ordered by time across the whole file: it is grouped per speaker and, within each speaker, ordered by time.

The "speech" element contains the following elements and attributes:

- label (attribute): a unique label for the speech segment. The format of the label is: [speaker label]-[incrementing number]
- begintime (attribute): start time of this sentence in seconds (relative to the beginning of the file)
- endtime (attribute): end time of this sentence in seconds (relative to the beginning of the file)
- real_time: information on the time it took to decode this segment
- score: the overall score of this speech segment (AM and LM scores combined)
- wordsequence: the actual sequence of words

The "wordsequence" element contains a list of "word" elements. Each "word" element contains:

- wordID (attribute): the recognized word (*)
- begintime (attribute): start time of this word in seconds (relative to the beginning of the file)
- endtime (attribute): end time of this word in seconds (relative to the beginning of the file)
- score: the score of this word: the AM likelihood, the LM prior and the combined score

(*) The content of each wordID is the actual word. A special 'word' is the "[s]" word. The decoder outputs this word when it recognizes silence. Because of this 'silence' word, there are no gaps in time between consecutive words.

++ Lattice output ++

The lattices of each video are stored in a separate directory. This directory contains the lattice output of each speech segment (each "speech" element in the native SHoUT XML output) of that video. The name of each lattice is identical to the "label" attribute of the corresponding "speech" element.

The lattices are stored in PFSG format. This is the native format for the language model toolkit SRILM. See [2] for more information on this file format. According to the standard, the 'transition costs' in the lattice files should be normalized. This is not the case here: the 'transition cost' score is identical to the acoustic score of the word (no language model score is incorporated).

++ MPEG7 format ++

The Wikipedia page is a good place to start reading about this standard [3].

++ References ++

[1] "Speech-based Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition" (added to this release)
[2] http://www.speech.sri.com/projects/srilm/manpages/pfsg-format.html
[3] http://en.wikipedia.org/wiki/MPEG-7
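
++ Example: reading the native SHoUT XML output ++

Because the "speech" elements are grouped per speaker rather than ordered by time, a reader of the native XML output usually has to re-sort them. The following is a minimal sketch of how that could be done in Python with the standard library. It is not part of SHoUT. The sample document, the speaker labels, the time values and the function name read_speech_segments are all made up for illustration, and the sketch assumes that wordID, begintime and endtime are stored as XML attributes, as listed in the description of the native XML output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample that follows the element and attribute names given
# in this document; labels and times are invented for illustration.
SAMPLE = """<shout_metadata>
  <segments>
    <speaker label="SPK01">
      <speech label="SPK01-001" begintime="12.50" endtime="13.40">
        <wordsequence>
          <word wordID="[s]" begintime="12.50" endtime="12.70"/>
          <word wordID="hello" begintime="12.70" endtime="13.40"/>
        </wordsequence>
      </speech>
    </speaker>
    <speaker label="SPK02">
      <speech label="SPK02-001" begintime="3.00" endtime="3.90">
        <wordsequence>
          <word wordID="good" begintime="3.00" endtime="3.40"/>
          <word wordID="morning" begintime="3.40" endtime="3.90"/>
        </wordsequence>
      </speech>
    </speaker>
  </segments>
</shout_metadata>"""

SILENCE = "[s]"  # the special silence 'word' emitted by the decoder


def read_speech_segments(xml_text):
    """Return all speech segments sorted by begin time, without silence words."""
    root = ET.fromstring(xml_text)  # root is the shout_metadata element
    segments = []
    for speaker in root.findall("./segments/speaker"):
        for speech in speaker.findall("speech"):
            # Keep (word, begin, end) tuples and drop the silence marker.
            words = [
                (w.get("wordID"), float(w.get("begintime")), float(w.get("endtime")))
                for w in speech.findall("./wordsequence/word")
                if w.get("wordID") != SILENCE
            ]
            segments.append({
                "speaker": speaker.get("label"),
                "label": speech.get("label"),
                "begintime": float(speech.get("begintime")),
                "endtime": float(speech.get("endtime")),
                "words": words,
            })
    # The file groups speech per speaker; re-order it on time here.
    segments.sort(key=lambda s: s["begintime"])
    return segments


if __name__ == "__main__":
    for seg in read_speech_segments(SAMPLE):
        text = " ".join(word for word, _, _ in seg["words"])
        print(seg["label"], seg["begintime"], seg["endtime"], text)
```

Since each lattice is named after the "label" attribute of its "speech" element, the "label" value collected here can also be used to look up the lattice that belongs to a segment. For a real file, ET.parse(path).getroot() can replace ET.fromstring in the sketch.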