Guidelines for the TRECVID 2003 Evaluation

(last updated: Tuesday, 15-Jun-2004 09:43:10 EDT)

1. Goal:

The main goal of the TREC Video Retrieval Evaluation (TRECVID) is to promote progress in content-based retrieval from digital video via open, metrics-based evaluation.

2. Tasks:

TRECVID is a laboratory-style evaluation that attempts to model real world situations or significant component tasks involved in such situations.

There are four main tasks, each with an associated test. Participants must complete at least one of them in order to attend the workshop.

2.1 Shot detection:

The task is as follows: identify the shot boundaries with their location and type (cut or gradual) in the given video clip(s).

2.2 Story segmentation:

The task is as follows: given the story boundary test collection, identify the story boundaries with their location (time) and type (miscellaneous or news) in the given video clip(s). This is a new task for 2003.

A story can be composed of multiple shots, e.g. an anchorperson introduces a reporter and the story is finished back in the studio-setting. On the other hand, a single shot can contain story boundaries, e.g. an anchorperson switching to the next news topic.

The task is based on manual story boundary annotations made by LDC for the TDT-2 project. Therefore, LDC's definition of a story will be used in the task: A news story is defined as a segment of a news broadcast with a coherent news focus which contains at least two independent, declarative clauses. Other coherent segments are labeled as miscellaneous. These non-news stories cover a mixture of footage: commercials, lead-ins and reporter chit-chat. Guidelines that were used for annotating the TDT-2 dataset are available. Other useful documents are the guidelines document for the annotation of the TDT4 corpus and a similar document on TDT3, which discuss the annotation guidelines for the different corpora. Section 2 in the TDT4 document is of particular interest for the story segmentation task.
Note: adjacent non-news stories are merged together and annotated as one single story classified as "miscellaneous".
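
The merging rule in the note above can be illustrated with a minimal sketch. The tuple layout (start, end, type) and the literal labels "news" and "miscellaneous" are assumptions made for the example; they are not the LDC annotation format.

```python
from typing import List, Tuple

# A story is (start_seconds, end_seconds, story_type). This layout and the
# labels "news" / "miscellaneous" are assumptions for illustration only.
Story = Tuple[float, float, str]


def merge_adjacent_misc(stories: List[Story]) -> List[Story]:
    """Merge runs of consecutive 'miscellaneous' stories into one story."""
    merged: List[Story] = []
    for start, end, kind in sorted(stories):
        if merged and kind == "miscellaneous" and merged[-1][2] == "miscellaneous":
            # Extend the previous miscellaneous story instead of starting a new one.
            prev_start = merged[-1][0]
            merged[-1] = (prev_start, end, "miscellaneous")
        else:
            merged.append((start, end, kind))
    return merged


segments = [
    (0.0, 30.0, "news"),
    (30.0, 45.0, "miscellaneous"),   # e.g. a lead-in
    (45.0, 105.0, "miscellaneous"),  # e.g. commercials
    (105.0, 160.0, "news"),
]
print(merge_adjacent_misc(segments))
```

Adjacent news stories are left untouched; only consecutive non-news segments collapse into a single "miscellaneous" story.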

Differences with the TDT-2 story segmentation task:

  1. TRECVID 2003 uses a subset of TDT2 dataset: only video sources.
  2. Video stream is available to enhance story segmentation.
  3. The task is modeled as a retrospective action, so it is allowed to use global data.
  4. TRECVID 2003 has a story classification task (which is optional).

There are several required and recommended runs:

  1. Required: Video + Audio (no ASR/CC)
  2. Required: Video + Audio + ASR
  3. Required: ASR (no Video + Audio)
  4. Recommended: story segmentation runs complemented with story classification.

The ASR in the required and recommended runs is the ASR provided by LIMSI. We have dropped the use of the CC data on the hard drive and adopted the LIMSI ASR rather than the ASR provided on the hard drive, because the LIMSI ASR is based on the MPEG-1 version of the video and requires no alignment. Additional runs can use other ASR systems.

With TRECVID 2003's story segmentation task, we hope to show how video information can enhance story segmentation algorithms.

2.3 Feature extraction:

Various high-level semantic features, concepts such as "Indoor/Outdoor", "People", "Speech" etc., occur frequently in video databases. The proposed task will contribute to work on a benchmark for evaluating the effectiveness of detection methods for semantic concepts.

The task is as follows: given the feature test collection, the common shot boundary reference for the feature extraction test collection, and the list of feature definitions (see below), participants will return for each feature the list of at most 2000 shots from the test collection, ranked according to the highest possibility of detecting the presence of the feature. Each feature is assumed to be binary, i.e., it is either present or absent in the given reference shot.
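
As a rough illustration of the ranked-list requirement, the sketch below sorts shots by a detector score and keeps at most 2000 of them. The score dictionary, the tie-breaking rule, and the example shot ids are assumptions; the guidelines only fix the ranking requirement and the 2000-shot limit.

```python
from typing import Dict, List

MAX_SHOTS_PER_FEATURE = 2000  # limit stated in the task definition


def ranked_shot_list(scores: Dict[str, float],
                     limit: int = MAX_SHOTS_PER_FEATURE) -> List[str]:
    """Return shot ids ordered from most to least likely to contain the feature."""
    # Sort by descending score; break ties by shot id so the order is reproducible
    # (the tie-breaking rule is an assumption, not part of the guidelines).
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [shot_id for shot_id, _ in ordered[:limit]]


# Hypothetical detector scores for one feature.
example_scores = {"shot12_3": 0.91, "shot12_1": 0.15, "shot40_7": 0.91, "shot12_2": 0.60}
print(ranked_shot_list(example_scores, limit=3))
```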

Participants are encouraged to make their feature detection submission available to other participants for use in the search task. Donors should provide the donated detection over the search test collection in the feature exchange format by the date indicated in the schedule below.

Description of features to be detected:

These descriptions are meant to be clear to humans, e.g., assessors/annotators creating truth data and system developers attempting to automate feature detection. They are not meant to indicate how automatic detection should be achieved.

If the feature is true for some frame (sequence) within the shot, then it is true for the shot; and vice versa. This is a simplification adopted for the benefits it affords in pooling of results and approximating the basis for calculating recall.
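
The shot-level rule above amounts to a logical OR over frame-level decisions. The sketch below shows that rule, plus one possible way to derive a shot-level ranking score from frame scores; taking the maximum frame score is an assumption chosen for the example and is not prescribed by these guidelines.

```python
from typing import Sequence


def shot_has_feature(frame_flags: Sequence[bool]) -> bool:
    """A shot is positive if the feature is present in at least one frame."""
    return any(frame_flags)


def shot_score(frame_scores: Sequence[float]) -> float:
    """Hypothetical shot-level score: the highest frame-level score (assumption)."""
    return max(frame_scores, default=0.0)


print(shot_has_feature([False, False, True, False]))
print(shot_score([0.1, 0.8, 0.3]))
```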

NOTE: In the following, "contains x" is short for "contains x to a degree sufficient for x to be recognizable as x to a human". This means among other things that unless explicitly stated, partial visibility or audibility may suffice.

  1. Outdoors: segment contains a recognizably outdoor location, i.e., one outside of buildings. Should exclude all scenes that are indoors or are close-ups of objects (even if the objects are outdoors).

  2. News subject face: segment contains the face of at least one human news subject. The face must be of someone who is not an anchor person, news reporter, correspondent, commentator, news analyst, or other sort of news person.

  3. People: segment contains at least THREE humans.

  4. Building: segment contains a building. Buildings are walled structures with a roof.

  5. Road: segment contains part of a road - any size, paved or not.

  6. Vegetation: segment contains living vegetation in its natural environment

  7. Animal: segment contains an animal other than a human

  8. Female speech: segment contains a female human voice uttering words while the speaker is visible.

  9. Car/truck/bus: segment contains at least one automobile, truck, or bus exterior.

  10. Aircraft: segment contains at least one aircraft of any sort.

  11. News subject monologue: segment contains an event in which a single person, a news subject not a news person, speaks for a long time without interruption by another speaker. Pauses are ok if short.

  12. Non-studio setting: segment is not set in a tv broadcast studio

  13. Sporting event: segment contains video of one or more organized sporting events

  14. Weather news: segment reports on the weather

  15. Zoom in: camera zooms in during the segment

  16. Physical violence: segment contains violent interaction between people and/or objects

  17. Person x: segment contains video of person x (x = Madeleine Albright)

2.4 Search:

The task is as follows: given the search test collection, a multimedia statement of information need (topic), and the common shot boundary reference for the search test collection, return a ranked list of at most 1000 common reference shots from the test collection, which best satisfy the need. Please note the following restrictions for this task:

  1. TRECVID 2003 will set aside the challenging problem of fully automatic topic analysis and query generation. Submissions will be restricted to those with a human in the loop, i.e., manual or interactive runs as defined below.

  2. graphic description of run types

  3. Because the choice of features and their combination for search is an open research question, no attempt will be made to restrict groups with respect to their use of features in search. However, groups making manual runs should report their queries, query features, and feature definitions.

  4. Every submitted run must contain a result set for each topic.

  5. One baseline run will be required of every manual system:

    1. A run based only on the text from the LIMSI ASR output and on the text of the topics.

  6. In order to maximize comparability within and across participating groups, all manual runs within any given site must be carried out by the same person.

  7. An interactive run will contain one result for each and every topic, each such result using the same system variant. Each result for a topic can come from only one searcher, but the same searcher does not need to be used for all topics in a run. Here are some suggestions for interactive experiments.

  8. The searcher should have no experience of the topics beyond the general world knowledge of an educated adult.

  9. The search system cannot be trained, pre-configured, or otherwise tuned to the topics.

  10. The maximum total elapsed time limit for each topic (from the time the searcher sees the topic until the time the final result set for that topic is returned) in an interactive search run will be 15 minutes. For manual runs the manual effort (topic to query translation) for any given topic will be limited to 15 minutes.

  11. All groups submitting search runs must include the actual elapsed time spent as defined in the videoSearchRunResult.dtd.

  12. Groups carrying out interactive runs are encouraged to measure user characteristics and satisfaction as well and report this with their results, but they need not submit this information to NIST. Here are some examples of instruments for collection of user characteristics/satisfaction data developed and used by the TREC Interactive Track for several years.

  13. In general, groups are reminded to use good experimental design principles. These include, among other things, randomizing the order in which topics are searched for each run so as to balance learning effects.
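
One simple way to follow the randomization advice in the last item is to draw an independent random topic order for each run. The sketch below is only one possible design (a Latin-square arrangement would be another); the topic and run names and the fixed seed are assumptions for the example.

```python
import random
from typing import Dict, List


def topic_orders(topics: List[str], runs: List[str], seed: int = 2003) -> Dict[str, List[str]]:
    """Assign each run its own randomly shuffled copy of the topic list."""
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    orders: Dict[str, List[str]] = {}
    for run in runs:
        order = list(topics)   # copy, so the caller's list is not modified
        rng.shuffle(order)
        orders[run] = order
    return orders


# Hypothetical topic and run identifiers.
for run, order in topic_orders(["t1", "t2", "t3", "t4"], ["run_a", "run_b"]).items():
    print(run, order)
```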

3. Video data:

NOTE: TRECVID 2003 is now over. Unless indicated, the 2003 test and development data is fully available only to TRECVID participants. This includes the basic MPEG-1 files, and derived files such as ASR, story segmentation, and transcript files. LDC may make some of the data generally available.

As you will see below, other data such as topics, feature donations, the results of the collaborative annotation, and various sorts of truth data created by/for NIST is freely available from this page.


The total identified collection comprises

Associated textual data (with associated file extensions)

Provided with the ABC/CNN MPEG-1 data (*.mpg) will be the output of an automatic speech recognition system (*.as1) and a closed-captions-based transcript. The transcript will be available in two forms: simple tokens (*.tkn) with no other information for the development and test data; tokens grouped into stories (*.src_sgm) with story start times and type for the development collection. The times in the ASR and transcript data are based on the analogue version of the video and so are offset from the current MPEG-1 digital version.

LDC has provided alignment tables so that the old times can be used with the new video. Here is a table for the development collection based on LDC's manual examination of three points in each file.
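
The format of the alignment tables is not described here, so the following is only a sketch of how three measured points per file might be used: it assumes each point is a pair (old time, new time) in seconds and interpolates linearly between neighbouring points. Both the pair layout and the use of linear interpolation are assumptions, and the anchor values are invented.

```python
from bisect import bisect_right
from typing import List, Tuple


def align_time(old_time: float, anchors: List[Tuple[float, float]]) -> float:
    """Map a time on the old (analogue-based) timeline to the MPEG-1 timeline."""
    points = sorted(anchors)
    if len(points) < 2:
        raise ValueError("need at least two anchor points")
    old_times = [old for old, _ in points]
    # Pick the segment that contains old_time; clamp the index so that times
    # outside the anchors are extrapolated from the first or last segment.
    i = bisect_right(old_times, old_time) - 1
    i = max(0, min(i, len(points) - 2))
    (x0, y0), (x1, y1) = points[i], points[i + 1]
    return y0 + (old_time - x0) * (y1 - y0) / (x1 - x0)


# Invented (old_time, new_time) pairs for one file.
anchors = [(10.0, 12.0), (900.0, 902.5), (1700.0, 1703.0)]
print(align_time(455.0, anchors))
```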

Additional ASR output from LIMSI-CNRS:

Jean-Luc Gauvain of the Spoken Language Processing Group at LIMSI has graciously donated ASR output for the entire collection. Be sure to credit them for this contribution by a non-participant.

   J.L. Gauvain, L. Lamel, and G. Adda.
   The LIMSI Broadcast News Transcription System.
   Speech Communication, 37(1-2):89-108, 2002.

Development versus test data

About 6 hours of data were selected from the total collection to be used solely as the shot boundary test collection.

The remainder was sorted more or less chronologically (C-SPAN covers a slightly different period than the ABC/CNN data). The first half was designated the feature / search / story segmentation development collection. The second is the feature / search / story segmentation test collection. Note that the story segmentation task will not use the C-SPAN files for development or test.
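
For groups that want to apply a similar chronological split within their own development data, a minimal sketch follows. It halves the collection by file count after sorting by broadcast date; whether the original split was made by file count or by duration is not stated, so the halving rule, the file names, and the dates are all assumptions.

```python
from datetime import date
from typing import List, Tuple


def split_chronologically(files: List[Tuple[str, date]]) -> Tuple[List[str], List[str]]:
    """Sort files by broadcast date and return (first half, second half)."""
    ordered = sorted(files, key=lambda item: (item[1], item[0]))
    middle = len(ordered) // 2
    first_half = [name for name, _ in ordered[:middle]]
    second_half = [name for name, _ in ordered[middle:]]
    return first_half, second_half


# Hypothetical file names and broadcast dates.
files = [
    ("c.mpg", date(1998, 3, 1)),
    ("a.mpg", date(1998, 1, 1)),
    ("d.mpg", date(1998, 4, 1)),
    ("b.mpg", date(1998, 2, 1)),
]
development, held_out = split_chronologically(files)
print(development, held_out)
```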

All of the development and test data with the exception of the shot boundary test data will be shipped by the Linguistic Data Consortium (LDC) on an IDE hard disk to each participating site at no cost to the participants. Each such site will need to offload the data onto local storage and pay to return the disk to LDC. The size of data on the hard drive will be a little over 100 gigabytes. The shot boundary test data (~ 5 gigabytes) will be shipped by NIST to participants on DVDs (DVD+R).

Restrictions on use of development and test data

Each participating group is responsible for adhering to the letter and spirit of these rules, the intent of which is to make the TRECVID evaluation realistic, fair and maximally informative about system effectiveness as opposed to other confounding effects on performance. Submissions that, in the judgment of the coordinators and NIST, do not comply will not be accepted.

Test data

The test data shipped by LDC cannot be used for system development and system developers should have no knowledge of it until after they have submitted their results for evaluation to NIST. Depending on the size of the team and tasks undertaken, this may mean isolating certain team members from certain information or operations, freezing system development early, etc.

Participants may use donated feature extraction output from the test collection but incorporation of such features should be automatic so that system development is not affected by knowledge of the extracted features. Anyone doing searches must be isolated from knowledge of that output.

Participants cannot use the knowledge that the test collection comes from news video recorded during the first half of 1998 in the development of their systems. This would be unrealistic.

Development data

The development data shipped by LDC is intended for the participants' use in developing their systems. It is up to the participants how the development data is used, e.g., divided into training and validation data, etc.

Other data sets created by LDC for earlier evaluations and derived from the same original videos as the test data cannot be used in developing systems for TRECVID 2003.

If participants use the output of an ASR system, they must submit at least one run using that provided on the loaner drive from LDC. They are free to use the output of other ASR systems in additional runs.

If participants use a closed-captions-based transcript, they must use only that provided on the loaner drive from LDC.

Participants may use other development resources not excluded in these guidelines. Such resources should be reported at the workshop. Note that use of other resources will change the submission's status with respect to system development type, which is described next.

There is a group of participants creating and sharing annotation of the development data. See the Video Collaborative Annotation Forum webpage for details. Here is the set of collaborative annotations created for TRECVID 2003.

In order to help isolate system development as a factor in system performance each feature extraction task submission, search task submission, or donation of extracted features must declare its type:

3.1 Common shot boundary reference and keyframes:

A common shot boundary reference has again kindly been provided by Georges Quenot at CLIPS-IMAG. Keyframes have also been selected for use in the search and feature extraction tasks. NIST can provide the keyframes on DVD+R with some delay to participating groups unable to extract the keyframes themselves.

The emphasis in the common shot boundary reference will be on the shots, not the transitions. The shots are contiguous. There are no gaps between them. They do not overlap. The media time format is based on the Gregorian day time (ISO 8601) norm. Fractions are defined by counting pre-specified fractions of a second. In our case, the frame rate will likely be 29.97. One fraction of a second is thus specified as "PT1001N30000F".
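
To make the fraction notation concrete, the sketch below parses a duration string of the form shown above, where "N" gives a count of fractions and "F" gives the number of fractions per second, so "PT1001N30000F" is 1001/30000 of a second (one frame at roughly 29.97 frames per second). The parser is an assumption based on that single example; it accepts optional H, M and S fields but does not cover the full media time syntax.

```python
import re
from fractions import Fraction

# Optional hours, minutes, seconds, then an optional count (N) of
# fractions and the number of fractions per second (F).
_DURATION = re.compile(
    r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?(?:(\d+)N)?(?:(\d+)F)?$"
)


def duration_seconds(text: str) -> Fraction:
    """Convert a duration such as 'PT1001N30000F' to exact seconds."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError("unrecognized duration: %r" % text)
    hours, minutes, seconds, count, per_second = (
        int(group) if group else 0 for group in match.groups()
    )
    total = Fraction(hours * 3600 + minutes * 60 + seconds)
    if count:
        if per_second == 0:
            raise ValueError("fraction count given without fractions per second")
        total += Fraction(count, per_second)
    return total


frame = duration_seconds("PT1001N30000F")
print(frame)             # exact duration of one frame, in seconds
print(float(1 / frame))  # corresponding frame rate
```

Using exact fractions avoids accumulating rounding error when many frame durations are added to locate a shot boundary.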

The video id has the format of "XXX" and shot id "shotXXX_YYY". The "XXX" is the sequence number of video onto which the video file name is mapped, this will be listed in the "collection.xml" file. The "YYY" is the sequence number of the shot. Keyframes are identified by a suffix "_RKF" for the main keyframe (one per shot) or "_NKRF" for additional keyframes derived from subshots that were merged so that shots have a minimum duration of 2 seconds.
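
The identifier scheme can be handled with a small parser such as the sketch below. It assumes that a keyframe identifier is simply the shot id followed by one of the two suffixes exactly as written above, and that "XXX" and "YYY" are decimal numbers; the actual file names in the distribution may differ.

```python
import re
from typing import NamedTuple, Optional


class ShotRef(NamedTuple):
    video: int               # "XXX": sequence number of the video
    shot: int                # "YYY": sequence number of the shot
    keyframe: Optional[str]  # "RKF", "NKRF", or None for a plain shot id


# Suffix spellings are copied from the text above; appending them directly
# to the shot id is an assumption.
_SHOT_ID = re.compile(r"^shot(\d+)_(\d+)(?:_(RKF|NKRF))?$")


def parse_shot_ref(text: str) -> ShotRef:
    """Split a shot or keyframe identifier into its numeric parts."""
    match = _SHOT_ID.match(text)
    if match is None:
        raise ValueError("not a shot or keyframe identifier: %r" % text)
    video, shot, keyframe = match.groups()
    return ShotRef(int(video), int(shot), keyframe)


print(parse_shot_ref("shot12_34"))
print(parse_shot_ref("shot12_34_RKF"))
```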

The common shot boundary directory contains these file(type)s:

4. Information needs and topics:

4.1 Example types of information needs

I'm interested in video material / information about:

As an experiment, NIST may create a topic of the form "I'm looking for video that tells me the name of the person/place/thing/event in the image/video example".

Topics may target commercials as well as news content.

4.2 Topics:

The topics, formatted multimedia statements of information need, will be developed by NIST who will control their distribution. The topics will express the need for video concerning people, things, events, locations, etc. and combinations of the former. Candidate topics (text only) will be created at NIST by mining various news sources from the time period of the test collection and a log of actual queries provided by the BBC. The test collection will then be examined to see how frequently relevant shots occur for each topic. Each topic will then either be accepted or rejected. Accepted topics will be enhanced with non-textual examples from the Web if possible and from the development data if need be. Current plans are to use an InforMedia* client as part of the testing to see that some relevant shots occur in the test collection. The goal is to create 25 topics.

* Note: The identification of any commercial product or trade name does not imply endorsement or recommendation by the National Institute of Standards and Technology.

5. Submissions and Evaluation:

Please note: Only submissions which are valid when checked against the supplied DTDs will be accepted. You must check your submission. Various checkers exist, e.g., the one at Brown University, Xerces-J, etc.

The results of the evaluation will be made available to attendees at the TRECVID 2003 workshop and will be published in the final proceedings and/or on the TRECVID website within six months after the workshop. All submissions will likewise be available to interested researchers via the TRECVID website within six months of the workshop.

5.1 Shot boundary detection

5.2 Feature extraction


5.3 Story segmentation

Comparability with TDT-2 Results

Results of the TRECVID 2003 story segmentation task cannot be directly compared to TDT-2 results because the evaluation datasets differ and different evaluation measures are used. TRECVID 2003 participants have shown a preference for a precision/recall oriented evaluation, whereas TDT used (and is still using) normalized detection cost. Finally, TDT was modeled as an on-line task, whereas TRECVID examines story segmentation in an archival setting, permitting the use of global information. However, the TRECVID 2003 story segmentation task provides an interesting testbed for cross-resource experiments. In principle, a TDT system could be used to produce an ASR+CC or ASR+CC+Audio run.
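
A precision/recall-oriented measure for story boundaries can be sketched as follows. The matching tolerance, the greedy one-to-one matching, and the example times are assumptions made for illustration; this is not the official scoring procedure.

```python
from typing import List, Tuple


def boundary_precision_recall(system: List[float],
                              reference: List[float],
                              tolerance: float) -> Tuple[float, float]:
    """Score system boundaries (seconds) against reference boundaries.

    A system boundary counts as correct if it lies within `tolerance`
    seconds of a reference boundary that has not already been matched.
    """
    unmatched = sorted(reference)
    hits = 0
    for time in sorted(system):
        # Greedily take the earliest unmatched reference boundary in range.
        for i, ref in enumerate(unmatched):
            if abs(time - ref) <= tolerance:
                del unmatched[i]
                hits += 1
                break
    precision = hits / len(system) if system else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Invented boundary times and tolerance.
precision, recall = boundary_precision_recall(
    system=[10.0, 62.0, 130.0, 200.0],
    reference=[12.0, 60.0, 125.0],
    tolerance=5.0,
)
print(precision, recall)
```

Because each reference boundary can be matched only once, a system cannot raise its precision by emitting several boundaries around the same reference point.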

5.4 Search


6. Milestones:

10 Jun
Video data express-shipped to participants by LDC
11 Jun
Guidelines complete
16? Jun
Common shot reference and key frames available
30 Jun
Common annotation complete
18 Jul
Shot boundary test collection DVDs shipped by NIST
15 Aug
Search topics available from TRECVID website to active participants.
The needed clips from the MPEG-1 videos used as examples in the topics (to save downloading the entire example videos) are available from the team at Dublin City University. The file is under 17 Megabytes as opposed to about 1 Gigabyte for the complete files in which the clips are contained. In some cases the files may contain a frame or two more than specified in the topic description, but participants can further trim the clips if necessary.
17 Aug
Shot boundary detection submissions due at NIST for evaluation.
22 Aug
Feature extraction task submissions due at NIST for evaluation.
Feature extraction donations due at NIST
25 Aug
Feature extraction donations available for active participants
  7 Sep
Story segmentation/typing submissions due at NIST for evaluation
12 Sep
Results of shot boundary evaluations returned to participants
24 Sep
Search task submissions due at NIST for evaluation
24 Sep - 15 Oct
Search and feature assessment at NIST
10 Oct
Results of story segmentation evaluations returned to participants
17 Oct
Results of search and feature extraction evaluations returned to participants
24 Oct
Speaker proposals due at NIST (Instructions)
  2 Nov
Notebook papers due at NIST (Instructions)
  3 Nov
Workshop registration closes
17-18 Nov 2003
TRECVID Workshop at NIST in Gaithersburg, Md.
15 Dec
Workshop papers and slides publicly available
c. 12 Jan 2004
Call for participation in TRECVID 2004 sent out
c. 16 Feb 2004
Applications for participation in TRECVID 2004 due at NIST
  1 Mar 2004
Final versions of TRECVID 2003 papers due at NIST

7. Guideline issues and resolutions:

8. Results, submissions, and evaluated runs for active participants

Submissions and evaluated submissions can be found in the "Past results" section of the TREC website. Access requires that "fair use guidelines" forms be filled out. Instructions are on the Past Results webpage. Other products of the evaluation generally available can be found using the following links:

9. Contacts:

National Institute of Standards and Technology
Date created: Monday, 19-Nov-01
For further information contact