Guidelines for the TRECVID 2006 Evaluation

(last updated: Tuesday, 16-Jan-2007 18:50:56 UTC)

0. Table of Contents:

  1. Goal
  2. Tasks
  3. Video data
  4. Topics
  5. Submissions and Evaluations
  6. Milestones
  7. Outstanding 2006 guideline work items
  8. Results and submissions
  9. Information for active participants
  10. Contacts

1. Goal:

The main goal of the TREC Video Retrieval Evaluation (TRECVID) is to promote progress in content-based retrieval from digital video via open, metrics-based evaluation.


2. Tasks:

TRECVID is a laboratory-style evaluation that attempts to model real world situations or significant component tasks involved in such situations. In 2006 TRECVID will complete a 2-year cycle on English, Arabic, and Chinese news video. There will be three system tasks and associated tests and one exploratory task. Participants must complete at least one of the following 4 tasks in order to attend the workshop:

  1. shot boundary determination
  2. high-level feature extraction
  3. search (interactive, manually-assisted, and/or fully automatic)
  4. rushes exploitation (exploratory)

For past participants, here are some changes to note:

  1. Although the amount of test data from sources used in 2006 will be as large as or larger than in 2005, there will be significant additional data from channels and/or programs not included in the data for 2005. This is expected to provide interesting information on the extent to which detectors generalize.
  2. Participants in the feature task will be required to submit results for all features from the common annotation data for 2005 listed below. A subset of 20 of those will be chosen by NIST and evaluated. This is intended to encourage generic methods for development of detectors.
  3. Additional effort will be made to ensure that interactive search runs come from experiments designed to allow comparison of runs within a site independent of the main effect of the human in the loop. This will be worked out with the participants before the guidelines are complete.
  4. Following the VACE III goals, topics asking for video of events will be much more frequent this year - exploring the limits of one-keyframe-per-shot approaches for this kind of topic and encouraging exploration beyond those limits.
  5. Most of the data will be distributed again on hard drives, but at the suggestion of LDC the file system will be ReiserFS to help avoid the data corruption and access problems encountered in 2005.

2.1 Shot boundary detection:

Shots are fundamental units of video, useful for higher-level processing. The task is as follows: identify the shot boundaries, with their location and type (cut or gradual), in the given video clip(s).
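
As an illustration only (not part of the task definition), the following minimal sketch shows one naive way abrupt changes (cuts) might be flagged by comparing grayscale histograms of consecutive frames. It assumes the OpenCV Python package, uses an arbitrary threshold, and makes no attempt to detect gradual transitions.

import cv2

def detect_cuts(video_path, threshold=0.5):
    """Return frame indices where an abrupt shot change (cut) is suspected."""
    cap = cv2.VideoCapture(video_path)
    cuts = []
    prev_hist = None
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Coarse grayscale histogram of the current frame.
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        if prev_hist is not None:
            # Low correlation between consecutive frames suggests a cut.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                cuts.append(frame_idx)
        prev_hist = hist
        frame_idx += 1
    cap.release()
    return cuts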

2.2 High-level feature extraction:

Various high-level semantic features, concepts such as "Indoor/Outdoor", "People", "Speech" etc., occur frequently in video databases. The proposed task will contribute to work on a benchmark for evaluating the effectiveness of detection methods for semantic concepts.

The task is as follows: given the feature test collection, the common shot boundary reference for the feature extraction test collection, and the list of feature definitions (see below), participants will return for each feature a list of at most 2000 shots from the test collection, ranked according to the probability that the feature is present (highest first). Each feature is assumed to be binary, i.e., it is either present or absent in the given reference shot.
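
For illustration, here is a minimal sketch of how per-shot detector confidences might be turned into the required ranked lists, truncated to the 2000-shot limit. The data structures and shot identifiers are hypothetical; the official submission format is defined by the DTD supplied by NIST.

MAX_SHOTS_PER_FEATURE = 2000

def rank_shots(scores):
    """scores: dict mapping a shot id (e.g. 'shot1_3') to a detector confidence.
    Returns shot ids ordered from most to least likely to contain the feature,
    truncated to the 2000-shot limit."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:MAX_SHOTS_PER_FEATURE]

# Hypothetical example: feature_scores[feature_number][shot_id] = confidence
feature_scores = {1: {"shot1_1": 0.93, "shot1_2": 0.10, "shot2_7": 0.55}}
submission = {f: rank_shots(s) for f, s in feature_scores.items()}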

All feature detection submissions will be made available to all participants for use in the search task - unless the submitter explicitly asks NIST before submission not to do this.

Description of high-level features to be detected:

The descriptions are those used in the common annotation effort. They are meant for humans, e.g., assessors/annotators creating truth data and system developers attempting to automate feature detection. They are not meant to indicate how automatic detection should be achieved.

If the feature is true for some frame (sequence) within the shot, then it is true for the shot; and vice versa. This is a simplification adopted for the benefits it affords in pooling of results and approximating the basis for calculating recall.
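
Stated as a (hypothetical) predicate over per-frame labels for one shot:

def shot_has_feature(frame_labels):
    """A feature is true for the shot if it is true for at least one frame."""
    return any(frame_labels)

print(shot_has_feature([False, False, True, False]))  # -> True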

NOTE: In the following, "contains x" is short for "contains x to a degree sufficient for x to be recognizable as x to a human". This means among other things that unless explicitly stated, partial visibility or audibility may suffice.

Selection of high-level features to be detected:

In 2006 participants in the high-level feature task must submit results for all of the following features. NIST will then choose 20 of the features and evaluate submissions for those. Use the following numbers when submitting the features.

  1. Sports: Shots depicting any sport in action
  2. Entertainment: Shots depicting any entertainment segment in action
  3. Weather: Shots depicting any weather related news or bulletin
  4. Court: Shots of the interior of a court-room location
  5. Office: Shots of the interior of an office setting
  6. Meeting: Shots of a Meeting taking place indoors
  7. Studio: Shots of the studio setting including anchors, interviews and all events that happen in a news room
  8. Outdoor: Shots of Outdoor locations
  9. Building: Shots of an exterior of a building
  10. Desert: Shots with the desert in the background
  11. Vegetation: Shots depicting natural or artificial greenery, vegetation, woods, etc.
  12. Mountain: Shots depicting a mountain or mountain range with the slopes visible
  13. Road: Shots depicting a road
  14. Sky: Shots depicting sky
  15. Snow: Shots depicting snow
  16. Urban: Shots depicting an urban or suburban setting
  17. Waterscape_Waterfront: Shots depicting a waterscape or waterfront
  18. Crowd: Shots depicting a crowd
  19. Face: Shots depicting a face
  20. Person: Shots depicting a person (the face may or may not be visible)
  21. Government-Leader: Shots of a person who is a governing leader, e.g., president, prime-minister, chancellor of the exchequer, etc.
  22. Corporate-Leader: Shots of a person who is a corporate leader, e.g., CEO, CFO, Managing Director, Media Manager, etc.
  23. Police_Security: Shots depicting law enforcement or private security agency personnel
  24. Military: Shots depicting military personnel
  25. Prisoner: Shots depicting a captive person, e.g., imprisoned, behind bars, in jail or in handcuffs, etc.
  26. Animal: Shots depicting an animal, not counting a human as an animal
  27. Computer_TV-screen: Shots depicting a television or computer screen
  28. Flag-US: Shots depicting a US flag
  29. Airplane: Shots of an airplane
  30. Car: Shots of a car
  31. Bus: Shots of a bus
  32. Truck: Shots of a truck
  33. Boat_Ship: Shots of a boat or ship
  34. Walking_Running: Shots depicting a person walking or running
  35. People-Marching: Shots depicting many people marching as in a parade or a protest
  36. Explosion_Fire: Shots of an explosion or a fire
  37. Natural-Disaster: Shots depicting the happening or aftermath of a natural disaster such as earthquake, flood, hurricane, tornado, tsunami
  38. Maps: Shots depicting regional territory graphically as a geographical or political map
  39. Charts: Shots depicting any artificially generated graphics, such as bar graphs, line charts, etc. (maps should not be included)

NOTE: NIST will instruct the assessors during the manual evaluation of the feature task submissions as follows. The fact that a segment contains video of physical objects representing the topic target, such as photos, paintings, models, or toy versions of the topic target, should NOT be grounds for judging the feature to be true for the segment. Containing video of the target within video may be grounds for doing so.

2.3 Search:

Search is a high-level task which includes at least query-based retrieval and browsing. The search task models that of an intelligence analyst or analogous worker, who is looking for segments of video containing persons, objects, events, locations, etc. of interest. These persons, objects, etc. may be peripheral or accidental to the original subject of the video. The task is as follows: given the search test collection, a multimedia statement of information need (topic), and the common shot boundary reference for the search test collection, return a ranked list of at most 1000 common reference shots from the test collection which best satisfy the need. Please note the following restrictions for this task:

  1. TRECVID 2006 will accept fully automatic search submissions (no human input in the loop) as well as manually-assisted and interactive submissions as illustrated graphically below.

  2. [Graphic: diagram of the search run types (fully automatic, manually-assisted, interactive)]

  3. Because the choice of features and their combination for search is an open research question, no attempt will be made to restrict groups with respect to their use of features in search. However, groups making manually-assisted runs should report their queries, query features, and feature definitions.

  4. Every submitted run must contain a result set for each topic.

  5. One baseline run will be required of every manually-assisted system as well as one for every automatic system:

    • A run based only on the text from the English ASR/MT output provided by NIST and on the text of the topics. (A minimal sketch of such a text-only baseline appears after this list.)

  6. In order to maximize comparability within and across participating groups, all manually-assisted runs within any given site must be carried out by the same person.

  7. An interactive run will contain one result for each and every topic, each such result using the same system variant. Each result for a topic can come from only one searcher, but the same searcher does not need to be used for all topics in a run. Here are some suggestions for interactive experiments.

  8. The searcher should have no experience of the topics beyond the general world knowledge of an educated adult.

  9. The search system cannot be trained, pre-configured, or otherwise tuned to the topics.

  10. The maximum total elapsed time limit for each topic (from the time the searcher sees the topic until the time the final result set for that topic is returned) in an interactive search run will be 15 minutes. For manually-assisted runs the manual effort (topic to query translation) for any given topic will be limited to 15 minutes.

  11. All groups submitting search runs must include the actual elapsed time spent as defined in the videoSearchRunResult.dtd.

  12. Groups carrying out interactive runs should measure user characteristics and satisfaction as well and report this with their results, but they need not submit this information to NIST. Here is some information about the questionnaires the Dublin City University team used in 2004 to collect search feedback and demographics from all groups doing interactive searching. Something similar will be done again this year, with details to be determined once participation is known.

  13. In general, groups are reminded to use good experimental design principles. These include among other things, randomizing the order in which topics are searched for each run so as to balance learning effects.

  14. Supplemental interactive search runs, i.e., runs which do not contribute to the pools but are evaluated by NIST, will again be allowed to enable groups to fill out an experimental design. Such runs must not be mixed in the same submission file with non-supplemental runs. This is the only sort of supplemental run that will be accepted.
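
The sketch below illustrates the general shape of the required text-only baseline: shots are ranked by the similarity of their associated English ASR/MT text to the topic text. It assumes the scikit-learn package and hypothetical inputs (a mapping from shot id to its ASR/MT text); it is not a prescribed implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

MAX_RESULTS = 1000

def text_baseline(shot_texts, topic_text):
    """shot_texts: dict mapping shot id to the English ASR/MT text associated
    with that shot; topic_text: the text of the topic. Returns a ranked list
    of at most 1000 shot ids."""
    shot_ids = list(shot_texts)
    vectorizer = TfidfVectorizer()
    shot_matrix = vectorizer.fit_transform([shot_texts[s] for s in shot_ids])
    topic_vec = vectorizer.transform([topic_text])
    scores = cosine_similarity(topic_vec, shot_matrix).ravel()
    ranked = sorted(zip(shot_ids, scores), key=lambda x: x[1], reverse=True)
    return [shot_id for shot_id, _ in ranked[:MAX_RESULTS]]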

2.4 Rushes exploitation:

Rushes are the raw material (extra video, B-roll footage) used to produce a video. 20 to 40 times as much material may be shot as actually becomes part of the finished product. Rushes usually have only natural sound. Actors are only sometimes present. So very little if any information is encoded in speech. Rushes contain many frames or sequences of frames that are highly repetitive, e.g., many takes of the same scene redone due to errors (e.g., an actor gets his lines wrong, a plane flies over, etc.), long segments in which the camera is fixed on a given scene or barely moving, etc. A significant part of the material might qualify as stock footage - reusable shots of people, objects, events, locations, etc. Rushes may share some characteristics with "ground reconnaissance" video.

Groups participating in this task will develop and demonstrate at least a basic toolkit for support of exploratory search on highly redundant rushes data. The toolkit should include the ability to ingest video and subject it to whatever analysis your subsequent toolkit activities will require. The toolkit need not be a complete, polished, fully interactive product yet.
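
By way of illustration only (and not as a required capability), one analysis such a toolkit might perform on highly redundant rushes is to sample frames and group visually similar ones, so that repeated takes and long static segments can be collapsed in a browsing view. This sketch assumes OpenCV and NumPy; the sampling rate and distance threshold are arbitrary.

import cv2
import numpy as np

def sample_histograms(video_path, every_n_frames=25):
    """Return (frame index, normalized color histogram) pairs sampled from the video."""
    cap = cv2.VideoCapture(video_path)
    samples = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                                [0, 256, 0, 256, 0, 256]).flatten()
            samples.append((idx, hist / (hist.sum() + 1e-9)))
        idx += 1
    cap.release()
    return samples

def group_similar(samples, max_dist=0.25):
    """Greedy grouping: a sampled frame joins the most recent group whose first
    histogram is within max_dist (L1 distance), otherwise it starts a new group."""
    groups = []
    for idx, hist in samples:
        if groups and np.abs(groups[-1][0][1] - hist).sum() < max_dist:
            groups[-1].append((idx, hist))
        else:
            groups.append([(idx, hist)])
    return groups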

The minimal required goals of the toolkit are the ability to:

Groups may add additional functionality as they are able.

Participants will be required to perform their own evaluation and present the results. No standard keyframes or shot boundaries will be provided as we would like to encourage innovation in approach.

The results of this exploratory task will provide input to planning for future TRECVID workshops about the feasibility of shifting to work on unproduced video - both with respect to what systems can do and how able we are to evaluate them. Both involve difficult research issues.


3. Video data:

A number of MPEG-1 datasets are available for use in TRECVID 2006. We describe them here and then indicate below which data will be used for development versus test for each task.

Television news from November 2004

The Linguistic Data Consortium (LDC) collected the following video material and secured rights for research use. Note that although the amount of test data from sources used in 2006 will be as large or larger than in 2005, there will be significant additional data from channels and/or programs not included in the data for 2005. This is expected to provide interesting information on the extent to which detectors generalize.

TRECVID-use   Lang  Source   Program        Hours

tv5  tv6      Eng   NBC      NIGHTLYNEWS      9.0
tv5  tv6      Eng   CNN      LIVEFROM        14.5
     tv6      Eng   CNN      COOPER           8.3
     tv6      Eng   MSN      NEWSLIVE        14.5
-----------------------------------------------------   46.3

tv5  tv6      Chi   CCTV4    DAILY_NEWS       9.2
     tv6      Chi   PHOENIX  GOODMORNCN       7.5
     tv6      Chi   NTDTV    ECONFRNT         7.8
     tv6      Chi   NTDTV    FOCUSINT         5.2
-----------------------------------------------------   29.7

tv5  tv6      Ara   LBC      LBCNAHAR        35.3
tv5  tv6      Ara   LBC      LBCNEWS         39.5
     tv6      Ara   ALH      HURRA_NEWS       7.8
-----------------------------------------------------   82.6
                                                       -----
                                                       158.6 hours
                                                       (136,026,556,416 bytes)

Rushes exploitation

About 50 hours of rushes have been provided by the BBC Archive.

3.1 Development versus test data

Groups that participated in 2005 should already have the 2005 video and the common annotation on the 2005 development data as training data and will not get a new copy. New participants may request a copy of that data for training. It is expected that additional annotation of the 2005 development data will be donated by the LSCOM workshop (a set of 500 or more features), by the MediaMill team at the University of Amsterdam (101 features, as part of their larger donation of a common discriminative baseline), and by the Centre for Digital Video Processing at Dublin City University (MPEG-7 XM features).

A random sample of about 7.5 hours will be removed from the 2006 television news data and the resulting data set used as shot boundary test data. The remaining hours of television news will be used as the test data for the search and high-level feature tasks. The partitioning of the rushes will be decided as part of the rushes task definition.

3.2 Data distribution

The shot boundary test data will be express shipped by NIST to participants on DVDs (DVD+R).

Distribution of all other development data and the remaining test data will be handled by LDC using 250 GB loaner IDE drives formatted with ReiserFS. These must be returned or purchased within 3 weeks of loading on your system unless you have gotten approval for a delay from LDC in advance. The only charge to participants for test data will be the cost of shipping the drive(s) back to LDC. Please be sure to use a shipper who allows you to track the drive until it reaches LDC and your responsibility ends. More information about the data will be provided on the TRECVID website starting in March as we know more.

3.3 Ancillary data associated with the test data

Provided with the broadcast news test data (*.mpg) on the loaner drive will be a number of other datasets.

Output of an automatic speech recognition (ASR) system

The automatic speech recognition output we expect to provide will be the output of an off-the-shelf product, but probably with no tuning to the TRECVID data. LDC will run the recognizer and provide the ASR output. Any mention of commercial products is for information only; it does not imply recommendation or endorsement by NIST.

Output of a machine translation system (X->English)

The machine translation output we expect to provide will be the output of an off-the-shelf product, but probably with no tuning to the TRECVID data. BBN will provide the MT (Chinese/Arabic -> English) output in April. We will probably distribute this material by download if it does not arrive in time to be included on the hard drives. Any mention of commercial products is for information only; it does not imply recommendation or endorsement by NIST.

Common shot boundary reference and keyframes:

Christian Petersohn at the Fraunhofer (Heinrich Hertz) Institute in Berlin will once again provide the master shot reference. Please use the following reference in your papers:

C. Petersohn. "Fraunhofer HHI at TRECVID 2004:  Shot Boundary Detection System", 
TREC Video Retrieval Evaluation Online Proceedings, TRECVID, 2004
URL: www-nlpir.nist.gov/projects/tvpubs/tvpapers04/fraunhofer.pdf

The Dublin City University team will again format the reference and create a common set of keyframes. Our thanks to both. The following paragraphs describe the method used in 2005, to be repeated with the data for 2006.

To create the master list of shots, the video was segmented. The results of this pass are called subshots. Because the master shot reference is designed for use in manual assessment, a second pass over the segmentation was made to create master shots of at least 2 seconds in length. These master shots are the ones to be used in submitting results for the feature and search tasks. In the second pass, starting at the beginning of each file, the subshots were aggregated, if necessary, until the current shot was at least 2 seconds in duration, at which point the aggregation began anew with the next subshot.
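
A sketch of that second pass, assuming each subshot is represented simply as a (start, end) pair of times in seconds (a simplification of the actual reference data):

MIN_SHOT_SECONDS = 2.0

def aggregate_subshots(subshots):
    """Merge consecutive subshots, in file order, until each aggregate reaches
    the 2-second minimum; the aggregates are the master shots."""
    master_shots = []
    current_start = None
    current_end = None
    for start, end in subshots:
        if current_start is None:
            current_start = start
        current_end = end
        if current_end - current_start >= MIN_SHOT_SECONDS:
            master_shots.append((current_start, current_end))
            current_start = None
    # How a trailing aggregate shorter than 2 seconds is handled by the actual
    # reference is not specified here; this sketch simply keeps it.
    if current_start is not None:
        master_shots.append((current_start, current_end))
    return master_shots

print(aggregate_subshots([(0.0, 0.8), (0.8, 1.5), (1.5, 3.0), (3.0, 6.0)]))
# -> [(0.0, 3.0), (3.0, 6.0)]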

The keyframes were selected by going to the middle frame of the shot, then searching left and right of that frame to locate the nearest I-frame. This then became the keyframe and was extracted. Keyframes have been provided at both the subshot (NRKF) and master shot (RKF) levels.

In a small number of cases (all of them subshots) there was no I-frame within the subshot boundaries. When this occurred, the middle frame was selected. (One anomaly: at the end of the first video in the test collection, a subshot occurs outside a master shot.)
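
As an illustration, the selection rule might be sketched as follows, where frame_types is a list of picture types ('I', 'P', 'B') for the frames of one shot, indexed from the start of the shot. Such a list could be obtained, for example, with ffprobe (e.g., "ffprobe -select_streams v -show_frames -show_entries frame=pict_type -of csv file.mpg"). The middle-frame fallback covers the no-I-frame case noted above.

def pick_keyframe(frame_types):
    """Return the index (within the shot) of the keyframe: the I-frame nearest
    to the middle frame, or the middle frame itself if the shot contains no I-frame."""
    middle = len(frame_types) // 2
    best = None
    for i, ftype in enumerate(frame_types):
        if ftype == 'I' and (best is None or abs(i - middle) < abs(best - middle)):
            best = i
    return best if best is not None else middle

print(pick_keyframe(['P', 'B', 'I', 'B', 'P', 'P', 'I']))  # -> 2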

The emphasis in the common shot boundary reference will be on the shots, not the transitions. The shots are contiguous. There are no gaps between them. They do not overlap. The media time format is based on the Gregorian day time (ISO 8601) norm. Fractions are defined by counting pre-specified fractions of a second. In our case, the frame rate will likely be 29.97. One fraction of a second is thus specified as "PT1001N30000F".

The video id has the format "XXX" and the shot id "shotXXX_YYY". The "XXX" is the sequence number of the video onto which the video file name is mapped; this mapping is listed in the "collection.xml" file. The "YYY" is the sequence number of the shot. Keyframes are identified by the suffix "_RKF" for the main keyframe (one per shot) or "_NRKF" for additional keyframes derived from subshots that were merged so that shots have a minimum duration of 2 seconds.

The common shot boundary directory will contain these file(type)s:

3.5 Restrictions on use of development and test data

Each participating group is responsible for adhering to the letter and spirit of these rules, the intent of which is to make the TRECVID evaluation realistic, fair, and maximally informative about system effectiveness as opposed to other confounding effects on performance. Submissions which, in the judgment of the coordinators and NIST, do not comply will not be accepted.

Test data

The test data shipped by LDC cannot be used for system development and system developers should have no knowledge of it until after they have submitted their results for evaluation to NIST. Depending on the size of the team and tasks undertaken, this may mean isolating certain team members from certain information or operations, freezing system development early, etc.

Participants may use donated feature extraction output from the test collection but incorporation of such features should be automatic so that system development is not affected by knowledge of the extracted features. Anyone doing searches must be isolated from knowledge of that output.

Participants cannot use the knowledge that the test collection comes from news video recorded during a known time period in the development of their systems. This would be unrealistic.

Development data

The development data is intended for the participants' use in developing their systems. It is up to the participants how the development data is used, e.g., divided into training and validation data, etc.

Other data sets created by LDC for earlier evaluations and derived from the same original videos as the test data cannot be used in developing systems for TRECVID 2006.

If participants use the output of an ASR/MT system, they must submit at least one run using the English ASR/MT provided by NIST. They are free to use the output of other ASR/MT systems in additional runs.

Participants may use other development resources not excluded in these guidelines. Such resources should be reported at the workshop. Note that use of other resources will change the submission's status with respect to system development type, which is described next.

In order to help isolate system development as a factor in system performance each feature extraction task submission, search task submission, or donation of extracted features must declare its type:

3.6 Data license agreements for active participants

In order to be eligible to receive the test data, you must have applied for participation in TRECVID, be acknowledged as an active participant, have completed the relevant permission forms (from the active participant's area) and faxed them (Attention: Lori Buckland) to the fax number in the US. Include a cover sheet with your fax that identifies you, your organization, your email address, and the fact that you are requesting the TRECVID 2005 and/or 2006 data.


4. Topics:

4.1 Example types of video needs

I'm interested in video material containing:

Topics may target commercials as well as news content.

4.2 Topics:

The topics, formatted multimedia statements of information need, will be developed by NIST, who will control their distribution. The topics will express the need for video concerning people, things, events, locations, etc. and combinations of these. Candidate topics (text only) will be created at NIST by examining a large subset of the test collection videos without reference to the audio, looking for candidate topic targets. Note: Following the VACE III goals, topics asking for video of events will be much more frequent this year - exploring the limits of one-keyframe-per-shot approaches for this kind of topic and encouraging exploration beyond those limits. Accepted topics will be enhanced with non-textual examples from the Web if possible and from the development data if need be. The goal is to create 24 topics.

* Note: The identification of any commercial product or trade name does not imply endorsement or recommendation by the National Institute of Standards and Technology


5. Submissions and Evaluations:

Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., the one at Brown University, Xerces-J, etc.
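
For example, a submission could be checked locally with the lxml Python package (one option among the checkers mentioned above); the file names below are placeholders.

from lxml import etree

def validates_against_dtd(xml_path, dtd_path):
    """Return True if the XML file is valid with respect to the DTD,
    printing the validity errors otherwise."""
    dtd = etree.DTD(open(dtd_path, "rb"))
    doc = etree.parse(xml_path)
    ok = dtd.validate(doc)
    if not ok:
        for error in dtd.error_log.filter_from_errors():
            print(error)
    return ok

print(validates_against_dtd("mySearchRun.xml", "videoSearchRunResult.dtd"))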

The results of the evaluation will be made available to attendees at the TRECVID workshop and will be published in the final proceedings and/or on the TRECVID website within six months after the workshop. All submissions will likewise be available to interested researchers via the TRECVID website within six months of the workshop.

5.1 Shot boundary detection

5.2 High-level feature extraction

Submissions
Evaluation

5.3 Search

Submissions
Evaluation

5.5 Rushes exploitation

This is an exploratory task in 2006 and participants will perform their own evaluations. There will be no submissions. Results should be presented in the notebook paper and in a demonstration, poster, or possibly a talk at the workshop.


6. Milestones:

The following are the target dates for 2006.

1. Feb
NIST sends out Call for Participation in TRECVID 2006
20. Feb
Applications for participation in TRECVID 2006 due at NIST
  1. Mar
Final versions of TRECVID 2005 papers due at NIST
15. Mar
LDC begins shipping 2005 data to new participants for use in training
  1. Apr
BBC rushes proposal complete
Guidelines complete
18. Apr
LDC begins shipping hard drives with 2006 data to all participants
25. Apr
ASR/MT output for feature/search test data available for download
14. Jul
Shot boundary test collection DVDs shipped by NIST
11. Aug
Search topics available from TRECVID website.
15. Aug
Shot boundary detection submissions due at NIST for evaluation.
21. Aug
Feature extraction tasks submissions due at NIST for evaluation.
Feature extraction donations due at NIST
25. Aug
Feature extraction donations available for active participants
25. Aug
Results of shot boundary evaluations returned to participants
29. Aug - 13. Oct
Search and feature assessment at NIST
15. Sep
Search task submissions due at NIST for evaluation
19. Sep
Results of feature extraction evaluations returned to participants
18. Oct
Results of search evaluations returned to participants
23. Oct
Speaker proposals due at NIST
29. Oct
Notebook papers due at NIST
  6. Nov
Workshop registration closes
  8. Nov
Copyright forms due back at NIST (see Notebook papers for instructions)
13,14 Nov
TRECVID Workshop at NIST in Gaithersburg, MD(Registration, agenda, etc)
30. Nov
Workshop papers publicly available (slides added as they arrive)
  1. Mar 2007
Final versions of TRECVID 2006 papers due at NIST

7. Outstanding 2006 guideline work items

Here is a list of work items that must be completed before the guidelines are considered final, along with the parties responsible for them.


8. Results and submissions

For future use ...


9. Information for active participants



10. Contacts:

