TRECVID 2017 guidelines

Video data

A number of datasets are available for use in TRECVID 2017 and are described below.

Once you know which tasks you will be participating in, you can determine which data sets you need.
Then, for each dataset you need, see below for information on how to get permission to use the data and how it will be distributed.

IACC.3

The IACC.3 dataset consists of approximately 4600 Internet Archive videos (144 GB, 600 h) with Creative Commons licenses in MPEG-4/H.264 format, with durations ranging from 6.5 min to 9.5 min and a mean duration of almost 7.8 min. Most videos have some donor-provided metadata available, e.g., title, keywords, and description.

Data use agreements and Distribution: Download for active participants from NIST/mirror servers. See Data use agreements

Master shot reference: Available by download from the TRECVID Past Data page

Automatic speech recognition (for English): Available by download from the TRECVID Past Data page

IACC.2.A-C

Three datasets (A, B, C) totaling approximately 7300 Internet Archive videos (144 GB, 600 h) with Creative Commons licenses in MPEG-4/H.264 format, with durations ranging from 10 s to 6.4 min and a mean duration of almost 5 min. Most videos have some donor-provided metadata available, e.g., title, keywords, and description.

NOTE: Be sure to reload the relevant collection.xml files (A, B, C) in the master shot reference and remove files with a "use" attribute set to "dropped" - these are no longer available under a Creative Commons license and are not part of the test collection.
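
The filtering step in the note above might be sketched as follows. This is a minimal sketch, not the official procedure: the layout of collection.xml is not described here, so the code assumes (hypothetically) that each entry is an element carrying either a `use` attribute or a `use` child element. The `VideoFileList`, `VideoFile`, and `filename` names in the sample are made up for illustration, and XML namespaces, if present, would need extra handling.

```python
import xml.etree.ElementTree as ET


def is_dropped(elem):
    """Return True if elem is marked "dropped".

    Assumption: the marker is either a use="dropped" attribute or a
    <use>dropped</use> child element. Check the real collection.xml
    before relying on either form.
    """
    if elem.get("use", "").strip() == "dropped":
        return True
    child = elem.find("use")
    return child is not None and (child.text or "").strip() == "dropped"


def remove_dropped(root):
    """Remove dropped entries from the tree in place; return how many were removed."""
    removed = 0
    # Snapshot the elements first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if is_dropped(child):
                parent.remove(child)
                removed += 1
    return removed


# Hypothetical sample; element names are illustrative only.
SAMPLE = """<VideoFileList>
  <VideoFile use="test"><filename>a.mp4</filename></VideoFile>
  <VideoFile use="dropped"><filename>b.mp4</filename></VideoFile>
  <VideoFile><filename>c.mp4</filename><use>dropped</use></VideoFile>
</VideoFileList>"""

root = ET.fromstring(SAMPLE)
removed = remove_dropped(root)
kept = [e.findtext("filename") for e in root.iter("VideoFile")]
print(removed, kept)  # prints: 2 ['a.mp4']
```

For a real file, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`, once per collection.xml (A, B, C).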

Data use agreements and Distribution: Download for active participants from NIST/mirror servers. See Data use agreements

Master shot reference: Available by download from the TRECVID Past Data page

Automatic speech recognition (for English): Available by download from the TRECVID Past Data page

IACC.1.A-C

Three datasets (A, B, C) totaling approximately 8000 Internet Archive videos (160 GB, 600 h) with Creative Commons licenses in MPEG-4/H.264 format, with durations between 10 s and 3.5 min. Most videos have some donor-provided metadata available, e.g., title, keywords, and description.

Data use agreements and Distribution: Available by download from the Internet Archive (see the TRECVID Past Data page). Alternatively, download from the copy on the Dublin City University server, but use the collection.xml files (see the TRECVID Past Data page) for instructions on how to check the current availability of each file.

Master shot reference: Available by download from the TRECVID Past Data page

Automatic speech recognition (for English): Available by download from the TRECVID Past Data page

IACC.1.tv10.training

Approximately 3200 Internet Archive videos (50 GB, 200 h) with Creative Commons licenses in MPEG-4/H.264 format, with durations between 3.6 and 4.1 min. Most videos have some donor-provided metadata available, e.g., title, keywords, and description.

Data use agreements and Distribution: Available by download from the Internet Archive (see the TRECVID Past Data page). Alternatively, download from the copy (see the tv2010 directory) on the Dublin City University server, but use the collection.xml files (see the TRECVID Past Data page) for instructions on how to check the current availability of each file.

Master shot reference: Available by download from the TRECVID Past Data page

Common feature annotation: Available by download from the TRECVID Past Data page

Automatic speech recognition (for English): Available by download from the TRECVID Past Data page

Gatwick and i-LIDS MCT airport surveillance video

The data consist of about 150 h of airport surveillance video data (courtesy of the UK Home Office). The Linguistic Data Consortium has provided event annotations for the entire corpus. The corpus was divided into development and evaluation subsets. Annotations for 2008 development and test sets are available.

Data use agreements and Distribution:

* Gatwick development data (2008 DevSet and 2008 EvalSet): by download from password-protected servers at NIST and mirror sites. See Data use agreements.
* 2009 i-LIDS test data from the United Kingdom's Centre for Applied Science and Technology (CAST): can be downloaded from NIST, but only after CAST has received the required information and issued a userid/password. See here for details.

Development data annotations: available by download.

BBC EastEnders

Approximately 244 video files (totaling 300 GB, 464 h) with associated metadata, each containing a week's worth of BBC EastEnders programs in MPEG-4/H.264 format.

Data use agreements and Distribution: Download and fill out the data permission agreement from the active participants' area of the TRECVID website. After the agreement has been processed by NIST and the BBC, the applicant will be contacted by Dublin City University with instructions on how to download from their servers. See Data use agreements

Master shot reference: Will be available to active participants by download from the TRECVID 2017 active participant's area.

Automatic speech recognition (for English): Will be available to active participants by download from Dublin City University.

Blip10000 data set for Video Hyperlinking

The data set consists of:

* Videos, shot segmentation, metadata, and ASR transcript (version 2013): available at http://skuld.cs.umass.edu/traces/mmsys/2013/blip/Blip10000.html
* ASR transcript (version 2016) and visual concept extraction: released by the task organisers from a server at the University of Twente in the Netherlands.

YFCC100M

The Yahoo Flickr Creative Commons 100M dataset (YFCC100M) is a large collection of images and video available on Yahoo! Flickr. All photos and videos listed in the collection are licensed under one of the Creative Commons copyright licenses.

The YFCC100M dataset is comprised of:
* 99.3 million images
* 0.7 million videos

Data use agreements and Distribution:
The YFCC100M dataset can be obtained directly from Yahoo! from this link

HAVIC

HAVIC is a large collection of Internet multimedia constructed by the Linguistic Data Consortium and NIST. Participants will receive training corpora, event training resources, and two development test collections.

Data use agreements and Distribution: Data licensing and distribution will be handled by the Linguistic Data Consortium. The MED'17 website is up and operational. Currently, only the data license agreement is on the site. All teams (even past participants) must submit a license agreement to the LDC.

Twitter Vine Videos

The 2016 pilot VTT testing data (a set of about 2000 Vine URLs and their ground truth descriptions) is available as training data.

In 2017, NIST will distribute to active participants a list of new URLs for testing data. Please consult the general schedule for the data release and results submission dates.

Data use agreements handled by NIST (Gatwick (2008), IACC.2, IACC.3, BBC EastEnders)

To be eligible to receive the data, you must have applied for participation in TRECVID. Your application will be acknowledged by NIST with a team ID, an active participant's password, and information about how to obtain the data.

If you will be using i-LIDS (2009), HAVIC data, or Blip10000 for Hyperlinking, NIST will NOT be handling the data use agreements. See the "Data Use Agreements and Distribution" section for i-LIDS, HAVIC, or Blip10000 for Hyperlinking.

If you will be using IACC.1 video, the data use agreements are available from the "Past data" webpage. You will be downloading the data from the Dublin City University server (see above) or the Internet Archive. See the "Data Use Agreements and Distribution" section for IACC.1.

If you need a copy of the Gatwick (2008), IACC.2, IACC.3, or BBC EastEnders data, you will need to complete the relevant permission forms (from the active participant's area) and email the scanned page images for each form as a single Adobe Acrobat PDF of the document to Angela Ellis.
Note that all of the IACC.2 and IACC.3 data was made available last year, so if you signed the permission form last year and do not need to replace your original copy, you do not need to submit another permission form this year.
In your email include the following:
```
As Subject: "TRECVID data request"
In the body: your name
             your short team ID (given when you applied to participate)
             the kinds of data you will be using - one or more of the following:
             Gatwick (2008), IACC.2, IACC.3, and/or BBC EastEnders
```
You will receive instructions on how to download the data.
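
For teams that script their correspondence, the request email described above could be assembled roughly as follows. This is only an illustrative sketch using Python's standard `email` package: the function name, the field labels in the body, the sample name and team ID, the attachment file name, and the recipient address are all hypothetical placeholders (no address is given here), and actually sending the message is left out.

```python
from email.message import EmailMessage

# The kinds of data whose agreements NIST handles, as listed above.
ALLOWED_DATA = ("Gatwick (2008)", "IACC.2", "IACC.3", "BBC EastEnders")


def build_data_request(name, team_id, data_kinds, recipient, pdf_attachments=()):
    """Build (but do not send) a data request email.

    pdf_attachments is an iterable of (filename, pdf_bytes) pairs,
    one scanned permission form per PDF.
    """
    if not data_kinds:
        raise ValueError("request at least one kind of data")
    unknown = [kind for kind in data_kinds if kind not in ALLOWED_DATA]
    if unknown:
        raise ValueError(f"not covered by this request process: {unknown}")

    msg = EmailMessage()
    msg["To"] = recipient  # placeholder; use the address NIST provides
    msg["Subject"] = "TRECVID data request"
    msg.set_content(
        f"Name: {name}\n"
        f"Team ID: {team_id}\n"
        f"Data requested: {', '.join(data_kinds)}\n"
    )
    for filename, pdf_bytes in pdf_attachments:
        msg.add_attachment(
            pdf_bytes, maintype="application", subtype="pdf", filename=filename
        )
    return msg


# Hypothetical values for illustration only.
request = build_data_request(
    name="Jane Doe",
    team_id="EXAMPLE_TEAM",
    data_kinds=["IACC.3", "BBC EastEnders"],
    recipient="recipient@example.org",
    pdf_attachments=[("iacc3_form.pdf", b"%PDF-1.4 placeholder")],
)
print(request["Subject"])  # prints: TRECVID data request
```

Restricting `data_kinds` to the four names above reflects the rule that other datasets (i-LIDS, HAVIC, Blip10000, IACC.1) follow separate agreement processes.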

Please ask only for the test data (and optional development data) required for the task(s) you apply to participate in and intend to complete.

Requests are handled in the order they are received. Please allow 5 business days for NIST to respond to your request. To download the Gatwick or IACC data, you need to use the access codes sent to you by email and the information about data servers in the active participant's area.

Requests for the EastEnders data are forwarded within 5 business days to the BBC and from there to DCU, who will contact you with the download information. This process may take up to 3 weeks.