The 2018 Video-to-Text pilot task evaluation was run with four main metrics:

1- BLEU metric:

   mteval-v14/mteval-v14.pl -r tv18.ref.5.bleu.xml -s tv18.src.5.bleu.xml -t runSubmissionFile.xml -d 3 --metricsMATR

Where:
   mteval : the NIST MT evaluation tool: ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v14.pl
   tv18.ref.5.bleu.xml : the reference (ground truth, GT) textual descriptions, in an XML format compatible with the mteval tool, using 5 references per video (a sketch of this layout is given at the end of this section).
   tv18.src.5.bleu.xml : the source-language sentences to be translated. In the VTT task there is no source language, so its contents are the same as tv18.ref.5.bleu.xml; the mteval tool uses this file only to determine the number of source sentences (videos) to be translated.
   runSubmissionFile.xml : the run submission file, in an XML format compatible with the mteval tool.
   -d 3 : the output detail level (document/video level, system level, and detailed segment-level scores). Using -d 3 produces three files: a -doc.scr file, a -sys.scr file, and a detailed scores file.

The tool also reports the NIST metric, which addresses some shortcomings of the BLEU metric but is unbounded and, as a result, not very popular. We did not report any NIST scores as official results at this point.

2- METEOR metric:

   java -Xmx2G -jar meteor-*.jar runSubmissionFile.txt tv18.ref.5.meteor -l en -norm -r 5 -t adq

Where:
   runSubmissionFile.txt : the run submission file, in a format compatible with the METEOR tool.
   tv18.ref.5.meteor : the reference (GT) descriptions, in a plain-text format compatible with the METEOR tool, using 5 references per video (a sketch of this layout is given at the end of this section).
   -r : the number of references used, which is 5 for the 2018 VTT task.

The tool can be downloaded from http://www.cs.cmu.edu/~alavie/METEOR/

3- STS experimental metric (Semantic Textual Similarity):

To score the similarity between two textual descriptions, use this API call:

   wget "http://swoogle.umbc.edu/StsService/GetStsSim?operation=api&phrase1=$s1&phrase2=$s2" --wait=5 --user-agent='NIST - TRECVID' -q -O - 2>&1

Where:
   $s1 : a submitted textual description sentence for a single video.
   $s2 : the GT textual description sentence.

All GT files, covering the different subsets of testing URLs, are available as tv18.vtt.ref.X.sts, where X (2, 3, 4, or 5) is the subset of the total references available. A scripted sketch of this scoring loop is given at the end of this section.

4- CIDEr metric:
   - Download the CIDEr code available at https://github.com/vrama91/cider.
   - The file cidereval.py scores the submission against the provided ground truth.
   - The code reads the params.json file to determine the input and output files. Edit params.json to set "refName" to "vtt.gt.cider.json" and "candName" to your JSON submission filename, and ensure that "idf" is set to "corpus" (an illustrative params.json is sketched at the end of this section).
   - For further information on how to run the sample code, please refer to README.md.
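For reference, an mteval-compatible XML file has roughly the shape sketched below. This is an illustrative skeleton only: the setid, docid, language attributes, and segment text are invented here, and the authoritative format is the DTD shipped with mteval-v14. With 5 references per video, the reference file would typically carry five refset sections, refid="ref1" through refid="ref5":

   <?xml version="1.0" encoding="UTF-8"?>
   <mteval>
   <refset setid="tv18.vtt" srclang="any" trglang="eng" refid="ref1">
   <doc docid="video_1" genre="vtt">
   <seg id="1"> a man rides a bicycle down a city street </seg>
   </doc>
   </refset>
   <!-- ... refset sections for ref2 through ref5 ... -->
   </mteval>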
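The METEOR reference file is plain text. Assuming the multi-reference layout described in the METEOR README (all references for the first segment, then all references for the second segment, and so on), a 5-reference file would look roughly like the sketch below; the sentences are invented:

   a man rides a bicycle down a city street        (video 1, reference 1)
   someone is biking along a road                  (video 1, reference 2)
   ...                                             (video 1, references 3-5)
   a dog catches a frisbee in a park               (video 2, reference 1)
   ...

The run submission file then contains one sentence per video, in the same video order.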
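The wget call above scores a single sentence pair. A minimal Python sketch of looping it over a whole run is given below. It is not the official scoring code: the filenames run.txt and ref.txt are hypothetical line-aligned files (one sentence per video), and it assumes the service returns the similarity as a plain decimal number.

   # Minimal sketch, not the official scoring code.
   import time
   import urllib.parse
   import urllib.request

   STS_URL = "http://swoogle.umbc.edu/StsService/GetStsSim"

   def sts_score(s1, s2):
       # urlencode percent-encodes spaces and special characters in the sentences
       query = urllib.parse.urlencode({"operation": "api", "phrase1": s1, "phrase2": s2})
       request = urllib.request.Request(STS_URL + "?" + query,
                                        headers={"User-Agent": "NIST - TRECVID"})
       with urllib.request.urlopen(request) as response:
           # assumption: the response body is a bare number such as "0.85"
           return float(response.read().decode("utf-8").strip())

   scores = []
   with open("run.txt") as run_file, open("ref.txt") as ref_file:
       for submitted, reference in zip(run_file, ref_file):
           scores.append(sts_score(submitted.strip(), reference.strip()))
           time.sleep(5)  # pause between requests, mirroring --wait=5 above

   print("mean STS over %d videos: %.4f" % (len(scores), sum(scores) / len(scores)))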
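For illustration, a params.json edited as described might look like the sketch below. The "pathToData" and "resultFile" values here are placeholders, and key names other than "refName", "candName", and "idf" may differ across versions of the repository, so check the copy shipped with the code:

   {
     "pathToData" : "data/",
     "refName" : "vtt.gt.cider.json",
     "candName" : "myRunSubmission.json",
     "resultFile" : "results.json",
     "idf" : "corpus"
   }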