Guidelines for the TRECVID 2008 Evaluation

(last updated: Monday, 05-Jan-2009 13:30:36 EST)

0. Table of Contents:


1. Introduction:

The main goal of the TREC Video Retrieval Evaluation (TRECVID) is to promote progress in content-based analysis of and retrieval from digital video via open, metrics-based evaluation. TRECVID is a laboratory-style evaluation that attempts to model real world situations or significant component tasks involved in such situations.

In 2006 TRECVID completed the second two-year cycle devoted to automatic segmentation, indexing, and content-based retrieval of digital video - broadcast news in English, Arabic, and Chinese. It also completed two years of pilot studies on exploitation of unedited video (rushes). Some 70 research groups have been provided with the TRECVID 2005-2006 broadcast news video and many resources created by NIST and the TRECVID community are available for continued research on this data independent of TRECVID. See the "Past data" section of the TRECVID website for pointers.

In 2007 TRECVID began exploring new data (cultural, news magazine, documentary, and education programming) and an additional, new task - video rushes summarization. In 2008 that work will continue with the exception of the shot boundary detection task, which will be retired. In addition TRECVID plans to organize two new task evaluations.

TRECVID 2008 will test systems on the following tasks: surveillance event detection, high-level feature extraction, search, rushes summarization, and content-based copy detection.

For past participants, here are some changes to note:

  1. We expect to increase the number of topics for automatic search runs to ~ 50. Manual and interactive runs will use a subset of 24 of the 50. For automatic runs only, this will entail evaluating all search runs using a 50% sample of the pooled submissions and inferred average precision rather than average precision - as has been done for the feature task since 2006.
  2. The upper limit for the duration of each interactive or manual search will be reduced from 15 minutes to 10 minutes.
  3. The upper limit for video summaries will be reduced from 4% of the duration of the full video to 2%.
  4. TRECVID will continue to emphasize search for events (object+action) not easily captured in a single frame as opposed to searching for static objects.
  5. While mastershots will be defined as units of evaluation, keyframes or annotation of keyframes will not be provided by NIST. This will require groups to look afresh at how best to train their systems, weighing tradeoffs among processing speed, effectiveness, and the amount of video processed. As in the past, participants may want to team up to create training resources.
  6. The degree to which systems trained on broadcast news generalize with varying amounts of training data to a related but different genre will be a focus of TRECVID 2008.

2. Video data:

A number of datasets are available for use in TRECVID 2008. We describe them here and then indicate below which data will be used for development versus test for each task.

Sound and Vision - cultural, news magazine, documentary, and education programming, used for the high-level feature extraction and search tasks.

BBC rushes - unedited footage used for the rushes summarization task.

TRECVID 2008 surveillance video - approximately 100 hours of surveillance video used for the event detection task. Further information about the data is available here.

MUSCLE-VCD-2007 - copy detection benchmark data available for development in the content-based copy detection task.


3. Data license agreements for active participants

In order to be eligible to receive the data, you must have applied for participation in TRECVID. Your application will be acknowledged by NIST with information about how to obtain the data. Then you will need to complete the relevant permission forms (from the active participant's area) and fax them (Attention: Lori Buckland) to the fax number in the US. Include a cover sheet with your fax that identifies you, your organization, your email address, and each kind of data you are requesting. Alternatively, you may email a well-identified pdf of each signed form. Please ask only for the test data (and optional development data) required for the task(s) you apply to participate in and intend to complete. One permission form will cover 2007 and 2008 BBC data. One permission form will cover 2007 and 2008 Sound and Vision data.


4. System task details:

4.1 Surveillance event detection:

The guidelines for this task have been developed with input from the research community. Given 100 hours of surveillance video (50 hours training, 50 hours test), the task is to detect 3 or more events from the required event set and identify their occurrences temporally. Systems can make multiple passes before outputting a list of putative event observations (i.e., this is a retrospective detection task). Besides the retrospective task, participants may alternatively choose to do a "free style" analysis of the data. Further information about the tasks may be found at the following web sites:

4.2 High-level feature extraction:

Various high-level semantic features, concepts such as "Indoor/Outdoor", "People", "Speech", etc., occur frequently in video databases. The proposed task will contribute to work on a benchmark for evaluating the effectiveness of detection methods for semantic concepts.

The task is as follows: given the feature test collection, the common shot boundary reference for the feature extraction test collection, and the list of feature definitions (see below), participants will return for each feature a list of at most 2000 shots from the test collection, ranked by the likelihood that the feature is present (highest first). Each feature is assumed to be binary, i.e., it is either present or absent in the given reference shot.
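For illustration only, here is a minimal sketch (in Python) of how per-shot detector scores might be turned into the ranked list of at most 2000 shots described above; the shot identifiers and scores are hypothetical, and the actual submission must follow the DTD provided by NIST.

    MAX_RESULTS = 2000

    def rank_shots(scores):
        """scores: dict mapping a shot id to the detector's confidence that the feature is present."""
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        return [shot_id for shot_id, _ in ranked[:MAX_RESULTS]]

    # Hypothetical scores for three shots of one feature:
    example = {"shot123_4": 0.91, "shot123_5": 0.12, "shot200_1": 0.77}
    print(rank_shots(example))   # ['shot123_4', 'shot200_1', 'shot123_5']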

All feature detection submissions will be made available to all participants for use in the search task - unless the submitter explicitly asks NIST before submission not to do this.

Description of high-level features to be detected:

The descriptions are those used in the common annotation effort. They are meant for humans, e.g., assessors/annotators creating truth data and system developers attempting to automate feature detection. They are not meant to indicate how automatic detection should be achieved.

If the feature is true for some frame (sequence) within the shot, then it is true for the shot; and vice versa. This is a simplification adopted for the benefits it affords in pooling of results and approximating the basis for calculating recall.

NOTE: In the following, "contains x" is short for "contains x to a degree sufficient for x to be recognizable as x to a human" . This means among other things that unless explicitly stated, partial visibility or audibility may suffice.

NOTE: NIST will instruct the assessors during the manual evaluation of the feature task submissions as follows. The fact that a segment contains video of physical objects representing the topic target, such as photos, paintings, models, or toy versions of the topic target, should NOT be grounds for judging the feature to be true for the segment. Containing video of the target within video may be grounds for doing so.

Selection of high-level features to be detected:

In 2008, participants in the high-level feature task must submit results for all 20 of the following features. NIST will then choose 10-20 of the features and evaluate submissions for those. The features were drawn from the large LSCOM feature set so as to be appropriate to the Sound and Vision data used in the feature and search tasks. Some feature definitions were enhanced for greater clarity, so it is important that the TRECVID feature descriptions be used and not the LSCOM descriptions.

Here is the final list of features for evaluation together with their brief descriptions and some general rules of interpretation.

4.3 Search:

Search is a high-level task which includes at least query-based retrieval and browsing. The search task models that of an intelligence analyst or analogous worker who is looking for segments of video containing persons, objects, events, locations, etc. of interest. These persons, objects, etc. may be peripheral or accidental to the original subject of the video. The task is as follows: given the search test collection, a multimedia statement of information need (topic), and the common shot boundary reference for the search test collection, return a ranked list of at most 1000 common reference shots from the test collection which best satisfy the need. Please note the following restrictions for this task:

  1. TRECVID 2008 will accept fully automatic search submissions (no human input in the loop) as well as manually-assisted and interactive submissions, as illustrated graphically below.

  2. [Figure: graphic description of run types]

  3. Because the choice of features and their combination for search is an open research question, no attempt will be made to restrict groups with respect to their use of features in search. However, groups making manually-assisted runs should report their queries, query features, and feature definitions.

  4. Every submitted run must contain a result set for each topic.

  5. One baseline run will be required of every manually-assisted system, as well as one for every automatic system.


  6. In order to maximize comparability within and across participating groups, all manually-assisted runs within any given site must be carried out by the same person.

  7. An interactive run will contain one result for each and every topic, each such result using the same system variant. Each result for a topic can come from only one searcher, but the same searcher does not need to be used for all topics in a run. Here are some suggestions for interactive experiments.

  8. The searcher should have no experience of the topics beyond the general world knowledge of an educated adult.

  9. The search system cannot be trained, pre-configured, or otherwise tuned to the topics.

  10. The maximum total elapsed time limit for each topic (from the time the searcher sees the topic until the time the final result set for that topic is returned) in an interactive search run will be 10 minutes. For manually-assisted runs the manual effort (topic to query translation) for any given topic will be limited to 10 minutes.

  11. All groups submitting search runs must include the actual elapsed time spent as defined in the videoSearchRunResult.dtd.

  12. Groups carrying out interactive runs should measure user characteristics and satisfaction as well and report this with their results, but they need not submit this information to NIST. Here is some information about the questionnaires the Dublin City University team used in 2004 to collect search feedback and demographics from all groups doing interactive searching. Something similar will be done again this year, with details to be determined once participation is known.

  13. In general, groups are reminded to use good experimental design principles. These include among other things, randomizing the order in which topics are searched for each run so as to balance learning effects.

  14. Supplemental interactive search runs, i.e., runs which do not contribute to the pools but are evaluated by NIST, will be allowed to enable groups to fill out an experimental design. Such runs must not be mixed in the same submission file with non-supplemental runs. This is the only sort of supplemental run that will be accepted.

4.4 Rushes summarization:

Rushes are the raw material (extra video, B-roll footage) used to produce a video. 20 to 40 times as much material may be shot as actually becomes part of the finished product. Rushes usually have only natural sound. Actors are only sometimes present. So very little if any information is encoded in speech. Rushes contain many frames or sequences of frames that are highly repetitive, e.g., many takes of the same scene redone due to errors (e.g., an actor gets his lines wrong, a plane flies over, etc.), long segments in which the camera is fixed on a given scene or barely moving, etc. A significant part of the material might qualify as stock footage - reusable shots of people, objects, events, locations, etc. Rushes may share some characteristics with "ground reconnaissance" video.

The system task in rushes summarization will be, given a video from the rushes test collection, to automatically create an MPEG-1 summary clip less than or equal to a maximum duration (to be determined) that shows the main objects (animate and inanimate) and events in the rushes video to be summarized. The summary should minimize the number of frames used and present the information in ways that maximize the usability of the summary and the speed of object/event recognition.

Such a summary could be returned with each video found by a video search engine, much as text search engines return short lists of keywords (in context) for each document found, to help the searcher decide whether to explore a given item further without viewing the whole item. It might also serve as input to a larger system for filtering, exploring, and managing rushes data.

Although in this task we limit the notion of visual summary to a single clip that will be evaluated using simple play and pause controls, there is still room for creativity in generating the summary. Summaries need not be series of frames taken directly from the video to be summarized and presented in the same order. Summaries can contain picture-in-picture, split screens, and results of other techniques for organizing the summary. Such approaches will raise interesting questions of usability.

The summarization of BBC rushes will be run as a workshop at the ACM Multimedia Conference in Vancouver, Canada during the last week of October 2008.

4.5 Content-based copy detection:

As used here, a copy is a segment of video derived from another video, usually by means of various transformations such as addition, deletion, modification (of aspect, color, contrast, encoding, ...), camcording, etc. Detecting copies is important for copyright control, business intelligence and advertisement tracking, law enforcement investigations, etc. Content-based copy detection offers an alternative to watermarking. The TRECVID copy detection task will be carried out in collaboration with members of the IMEDIA team at INRIA and will build on work demonstrated at CIVR 2007.

Required task

The required system task will be as follows: given a test collection of videos and a set of about 2000 queries (video-only segments), determine for each query whether, and if so where, some part of the query occurs (possibly transformed) in the test collection. The set of possible transformations will be based, to the extent possible, on transformations that actually occur in practice.

Each query will be constructed using tools developed by IMEDIA to include some randomization at various decision points in the construction of the query set. For each query, the tools will take a segment from the test collection, optionally transform it, embed it in some video segment which does not occur in the test collection, and then finally apply one or more transformations to the entire query segment. Some queries may contain no test segment; others may be composed entirely of the test segment. Video transformations to be used are documented in the general plan for query creation and in the final video transformations document with examples.
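Purely as an illustration of the construction steps described above (extract a test segment, embed it in unrelated footage, transform the whole), the following Python sketch drives ffmpeg through the same stages. It is not the IMEDIA tool; the file names, offsets, and the particular transformation are hypothetical, and it assumes a reasonably recent ffmpeg.

    import subprocess

    def run(*args):
        subprocess.run(list(args), check=True)

    # 1. Take a roughly 15-second segment from a test-collection video.
    run("ffmpeg", "-y", "-ss", "120", "-t", "15", "-i", "test_video.mpg", "segment.mpg")

    # 2. Embed it in footage that does not occur in the test collection.
    with open("parts.txt", "w") as f:
        f.write("file 'unrelated.mpg'\nfile 'segment.mpg'\n")
    run("ffmpeg", "-y", "-f", "concat", "-safe", "0", "-i", "parts.txt", "-c", "copy", "combined.mpg")

    # 3. Apply a transformation to the entire query segment (here a contrast change).
    run("ffmpeg", "-y", "-i", "combined.mpg", "-vf", "eq=contrast=1.3", "query.mpg")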

Optional tasks

Videos often contain audio. Sometimes the original audio is retained in the copied material, sometimes it is replaced by a new soundtrack. Nevertheless, audio is an important and strong feature for some application scenarios of video copy detection. Since detection of untransformed audio copies is relatively easy, and the primary interest of the TRECVID community is in video analysis, it was decided to model the required copy detection task with video-only queries. However, since audio is of importance for practical applications, there will be two additional optional tasks: a task using transformed audio-only queries and one using transformed audio+video queries.

The audio-only queries will be generated along the same lines as the video-only queries: a set of 201 base audio-only queries is transformed by several techniques that are intended to be typical of those that would occur in real reuse scenarios: (1) bandwidth limitation, (2) other coding-related distortion (e.g., subband quantization noise), and (3) variable mixing with unrelated audio content. The transformed queries will be downloadable from NIST.

The audio+video queries will consist of the aligned versions of transformed audio and video queries, i.e., they will be various combinations of transformed audio and transformed video from a given base audio+video query. In this way sites can study the effectiveness of their systems for individual audio and video transformations and their combinations. These queries will not be downloadable. Rather, NIST will provide a list of how to construct each audio+video test query so that given the audio-only queries and the video-only queries, sites can use a tool such as ffmpeg to construct the audio+video queries.
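As a concrete example of the kind of muxing step the guidelines suggest, the sketch below calls ffmpeg from Python to combine one transformed video-only query with one transformed audio-only query; the file names are hypothetical, and the real pairings will come from the construction list NIST provides.

    import subprocess

    def mux(video_path, audio_path, out_path):
        # Take the video stream from the video-only query and the audio stream
        # from the audio-only query, and write them to a single file.
        subprocess.run([
            "ffmpeg", "-y",
            "-i", video_path,
            "-i", audio_path,
            "-map", "0:v", "-map", "1:a",
            "-c:v", "copy",        # keep the video stream as delivered
            "-c:a", "mp2",         # re-encode audio to an MPEG-1-compatible codec
            out_path,
        ], check=True)

    mux("query0042.video.mpg", "query0042.audio.wav", "query0042.av.mpg")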

Please watch the schedule for information soon about the sequence of query releases and results due dates.


5. Submissions and Evaluations:

Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission before submitting it. NIST reserves the right to reject any submission which does not parse correctly against the provided DTD(s). Various checkers exist, e.g., Xerces-J: java sax.SAXCount -v YourSubmission.xml.
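As one possible alternative to the Xerces-J check, the following Python sketch validates a submission file against a DTD using lxml; the file names below are placeholders for whichever DTD and submission apply to your run.

    from lxml import etree

    with open("videoSearchRunResult.dtd", "rb") as f:
        dtd = etree.DTD(f)

    doc = etree.parse("YourSubmission.xml")
    if dtd.validate(doc):
        print("Submission parses correctly against the DTD.")
    else:
        for error in dtd.error_log:
            print(error)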

The results of the evaluation will be made available to attendees at the TRECVID workshop and will be published in the final proceedings and/or on the TRECVID website within six months after the workshop. All submissions will likewise be available to interested researchers via the TRECVID website within six months of the workshop.

5.1 Surveillance event detection pilot

Submissions

The guidelines for submission are currently being developed.

Further information on submissions may be found here.

Evaluation

Output from systems will first be aligned to the ground truth annotations and then scored for misses and false alarms. Since detection error is a tradeoff between the probability of a miss and the rate of false alarms, this task will use the Normalized Detection Cost Rate (NDCR) measure for evaluating system performance. NDCR is a weighted linear combination of the system's missed detection probability and false alarm rate (measured per unit time).
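A minimal sketch of that combination is given below; the cost and target-rate constants are placeholders, and the official values are specified in the evaluation plan referenced in this section.

    COST_MISS = 10.0      # placeholder cost of a missed detection
    COST_FA = 1.0         # placeholder cost of a false alarm
    RATE_TARGET = 20.0    # placeholder target event rate, in events per hour

    def ndcr(n_missed, n_true, n_false_alarms, hours_of_video):
        p_miss = n_missed / n_true                 # missed detection probability
        r_fa = n_false_alarms / hours_of_video     # false alarm rate per hour
        beta = COST_FA / (COST_MISS * RATE_TARGET)
        return p_miss + beta * r_fa

    # e.g., 12 of 40 true observations missed, 30 false alarms over 50 hours:
    print(round(ndcr(12, 40, 30, 50.0), 4))        # 0.303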

Further information about the evaluation measures may be found here.

5.2 High-level feature extraction

Submissions
Evaluation

5.3 Search

Submissions
Evaluation

5.4 Rushes summarization

Submissions
Evaluation

Carnegie Mellon University will again provide a simple baseline system to produce summaries within the 2% maximum. The baseline algorithm simply presents the entire video at 50x normal speed.
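For orientation only, here is a minimal way to produce a whole-video 50x summary of that kind with ffmpeg driven from Python; it is not the CMU implementation, and the file names are hypothetical.

    import subprocess

    SPEEDUP = 50   # playing the full video at 50x keeps the summary near the 2% limit

    def speedup_summary(src, dst, factor=SPEEDUP):
        subprocess.run([
            "ffmpeg", "-y", "-i", src,
            "-vf", f"setpts=PTS/{factor}",   # compress presentation timestamps
            "-an",                           # drop the (mostly natural-sound) audio
            dst,
        ], check=True)

    speedup_summary("rushes_test_video.mpg", "rushes_test_video.summary.mpg")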

5.5 Content-based copy detection pilot

Submissions
Evaluation

6. Schedule:

The following are the target dates for 2008.

The schedule for the surveillance event detection task is listed at the end of this document.

Just below is the proposed schedule for work on the BBC rushes summarization task, which will be held as a workshop at the ACM Multimedia Conference in Vancouver, Canada during the last week of October 2008. Results will be summarized at the TRECVID workshop in November. Papers reporting participants' summarization work that are not included in the ACM Multimedia Workshop proceedings should be submitted for inclusion in the TRECVID workshop notebook.

  1   Apr  test data available for download
  5   May  system output submitted to NIST for judging at DCU
  1   Jun  evaluation results distributed to participants
 28   Jun  papers (max 5 pages) due in ACM format 
           The organizers will provide an intro paper with information 
           about the data, task, groundtruthing, evaluation, measures, etc.
 15   Jul  acceptance notification  
  1   Aug  camera-ready papers due via ACM process
 31   Oct  video summarization workshop at ACM Multimedia '08, Vancouver, BC, Canada
  1. Feb
NIST sends out Call for Participation in TRECVID 2008
22. Feb
Applications for participation in TRECVID 2008 due at NIST
  1. Mar
Final versions of TRECVID 2007 papers due at NIST
  1. Apr
Guidelines complete
11. Apr
Extended participant decision deadline for event detection task
  April
Download of feature/search development data
  June
Download of feature/search test data
30. June
Video-only copy detection queries available for download
  1. Aug
Video-only copy detection submissions due at NIST for evaluation
Audio-only copy detection queries available for download
  8. Aug
Search topics available from TRECVID website.
15. Aug
Feature extraction tasks submissions due at NIST for evaluation.
Feature extraction donations due at NIST
22. Aug
Feature extraction donations available for active participants
25. Aug - 12. Sep
Feature assessment at NIST
29. Aug
Audio-only copy detection submissions due at NIST
Audio+video copy detection query plans available for download
12. Sep
Search task submissions due at NIST for evaluation
19. Sep
Results of feature extraction evaluations returned to participants
22. Sep - 10. Oct
Search assessment at NIST
  1. Oct
Audio+video copy detection submissions due at NIST for evaluation
Video-only and audio-only copy detection results returned to participants
  9. Oct
Audio+video copy detection results returned to participants
TRECVID workshop registration opens
15. Oct
Results of search evaluations returned to participants
20. Oct
Speaker proposals due at NIST
27. Oct
Notebook papers due at NIST
  1. Nov
Copyright forms due back at NIST (see Notebook papers for instructions)
10. Nov
TRECVID 2008 Workshop registration closes
17,18 Nov
TRECVID Workshop at NIST in Gaithersburg, MD
15. Dec
Workshop papers publicly available (slides added as they arrive)
  1. Mar 2009
Final versions of TRECVID 2008 papers due at NIST

7. Outstanding 2008 guideline work items

Here is a list of work items that must be completed before the guidelines are considered final.


8. Information for active participants



9. Contacts:


For further information contact