++ SHoUT file formats ++

This document describes the format of the SHoUT files as they are
provided for the 2007 TRECVID evaluation. For each video, the
word-based Automatic Speech Recognition (ASR) output is provided in
native SHoUT XML format and in MPEG7 format. In addition, a
word-based lattice is provided for each segment (see [1]) of each
video file.

++ Native SHoUT XML output ++

For each video file, one XML-based automatic speech recognition output file is created. The root element is "shout_metadata". The elements of "shout_metadata" are:
- model_info                  contains the names of the models used during decoding (AM, DCT, LM and VTLN)
- decoding_settings           contains the parameter settings for decoding
- segments                    contains the actual ASR output
- statistics                  contains timing information

The ASR output is in the "segments" element, which consists of a list of "speaker" elements. Each "speaker" element has a unique "label" attribute and a list of "speech" elements. Each "speech" element contains the speech from a single segment. Note that the speech segments are not ordered by time across the whole file: they are grouped per speaker and ordered by time within each speaker. The "speech" element contains the following elements and attributes:
- label       (attribute)     contains a unique label for the speech segment. The format of the label is: [speaker label]-[incrementing number]
- begintime   (attribute)     start time of this speech segment in seconds (relative to the beginning of the file)
- endtime     (attribute)     end time of this speech segment in seconds (relative to the beginning of the file)
- real_time                   information on the time it took to decode this segment
- score                       the overall score of this speech segment (AM and LM scores combined)
- wordsequence                the actual sequence of words

The "wordsequence" element contains a list of "word" elements. Each "word" element contains:
- wordID      (attribute)     the recognized word (*)
- begintime   (attribute)     start time of this word in seconds (relative to the beginning of the file)
- endtime     (attribute)     end time of this word in seconds (relative to the beginning of the file)
- score                       the score of this word: the AM likelihood, LM prior and the combined score
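
The structure described above can be read with a standard XML parser. The sketch below is illustrative only: the sample document is hypothetical and was built from the element and attribute names listed in this section, the "real_time" and "score" elements are left out because their inner layout is not described here, and it is assumed that the time attributes are plain decimal numbers. Because the segments are grouped per speaker, the sketch sorts them by start time to obtain the order in which they occur in the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample built from the description above; real files contain
# more elements (model_info, decoding_settings, statistics, scores).
SAMPLE = """<shout_metadata>
  <segments>
    <speaker label="SPK01">
      <speech label="SPK01-1" begintime="0.00" endtime="1.20">
        <wordsequence>
          <word wordID="[s]" begintime="0.00" endtime="0.30"/>
          <word wordID="hello" begintime="0.30" endtime="1.20"/>
        </wordsequence>
      </speech>
      <speech label="SPK01-2" begintime="5.00" endtime="5.80">
        <wordsequence>
          <word wordID="goodbye" begintime="5.00" endtime="5.80"/>
        </wordsequence>
      </speech>
    </speaker>
    <speaker label="SPK02">
      <speech label="SPK02-1" begintime="1.20" endtime="5.00">
        <wordsequence>
          <word wordID="welcome" begintime="1.20" endtime="5.00"/>
        </wordsequence>
      </speech>
    </speaker>
  </segments>
</shout_metadata>"""


def read_segments(root):
    """Collect all speech segments and return them sorted by start time."""
    segments = []
    for speaker in root.iter("speaker"):
        for speech in speaker.iter("speech"):
            # Each word becomes a (wordID, begintime, endtime) tuple.
            words = [
                (w.get("wordID"), float(w.get("begintime")), float(w.get("endtime")))
                for w in speech.iter("word")
            ]
            segments.append({
                "speaker": speaker.get("label"),
                "label": speech.get("label"),
                "begintime": float(speech.get("begintime")),
                "endtime": float(speech.get("endtime")),
                "words": words,
            })
    # The file groups segments per speaker, so restore the time order here.
    segments.sort(key=lambda seg: seg["begintime"])
    return segments


segments = read_segments(ET.fromstring(SAMPLE))
for seg in segments:
    print(seg["label"], seg["begintime"], [w[0] for w in seg["words"]])
```

To read a real output file, ET.parse(path).getroot() can be passed to read_segments() instead of the parsed sample string.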

(*) The value of each wordID attribute is the actual word. A special
'word' is "[s]". The decoder outputs this word when it recognizes
silence. Because of this 'silence' word, there are no gaps in time
between consecutive words.
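
When a plain transcript is needed, the silence word can simply be dropped. The following is a minimal sketch, assuming the words of one segment have already been read into (wordID, begintime, endtime) tuples; the sample values and the function names are hypothetical. It also shows how the 'no gaps' property can be checked.

```python
SILENCE = "[s]"

# Hypothetical words of one speech segment: (wordID, begintime, endtime).
WORDS = [
    ("[s]", 0.0, 0.25),
    ("good", 0.25, 0.6),
    ("[s]", 0.6, 0.7),
    ("morning", 0.7, 1.3),
]


def transcript(words):
    """Join the recognized words, leaving out the silence word."""
    return " ".join(word for word, _, _ in words if word != SILENCE)


def find_gaps(words, tolerance=1e-6):
    """Return (end, next_begin) pairs where consecutive words do not touch."""
    gaps = []
    for (_, _, end), (_, next_begin, _) in zip(words, words[1:]):
        if abs(next_begin - end) > tolerance:
            gaps.append((end, next_begin))
    return gaps


print(transcript(WORDS))
print(find_gaps(WORDS))
```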

++ Lattice output ++

The lattices of each video are stored in a separate directory. This
directory contains the lattice output of each speech segment (each
"speech" element in the native SHoUT XML output) of that file. The
name of each lattice is identical to the "label" attribute of the
"speech" element. The lattices are stored in PFSG format. This is the
native format for the language model toolkit SRILM. See [2] for more
information on this file format. According to the format description,
the 'transition costs' in the lattice files should be normalized.
This is not the case here: each 'transition cost' is identical to the
acoustic score of the word (no language model score is incorporated).
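
A lattice file can be loaded with a few lines of code. The sketch below assumes the keyword layout given in [2] (name, nodes, initial, final, transitions, followed by one "from to cost" triple per transition); the sample lattice and its cost values are hypothetical and not taken from a real SHoUT file. The costs are kept exactly as read, because in these files they are unnormalized acoustic scores.

```python
def parse_pfsg(text):
    """Parse one PFSG lattice into a dictionary (layout assumed from [2])."""
    tokens = text.split()
    pos = 0

    def take():
        # Return the next whitespace-separated token.
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expect(keyword):
        token = take()
        if token != keyword:
            raise ValueError(f"expected {keyword!r}, found {token!r}")

    expect("name")
    name = take()
    expect("nodes")
    words = [take() for _ in range(int(take()))]  # one word label per node
    expect("initial")
    initial = int(take())
    expect("final")
    final = int(take())
    expect("transitions")
    count = int(take())
    # Each transition is (from_node, to_node, cost); the cost is not rescaled.
    transitions = [(int(take()), int(take()), float(take())) for _ in range(count)]
    return {"name": name, "words": words, "initial": initial,
            "final": final, "transitions": transitions}


# Hypothetical lattice for a segment labelled SPK01-1.
SAMPLE = """name SPK01-1
nodes 4 NULL hello world NULL
initial 0
final 3
transitions 3
0 1 -1200
1 2 -950
2 3 0
"""

lattice = parse_pfsg(SAMPLE)
for src, dst, cost in lattice["transitions"]:
    print(lattice["words"][src], "->", lattice["words"][dst], cost)
```

Since the lattice name equals the "label" attribute of the "speech" element, the returned "name" value can be used to match a lattice with its segment in the XML output.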

++ MPEG7 format ++

The Wikipedia page [3] is a good place to start reading about the MPEG-7 standard.

++ References ++

[1] "Speech-based Annotation of Heterogeneous Multimedia Content Using Automatic Speech Recognition" (added to this release)
[2] http://www.speech.sri.com/projects/srilm/manpages/pfsg-format.html
[3] http://en.wikipedia.org/wiki/MPEG-7