Guidelines for the TREC-2002 Video Track

(last updated: Monday, 13-Dec-2004 09:39:08 EST)

1. Goal

Promote progress in content-based retrieval from digital video via open, metrics-based evaluation.

2. Tasks

There are three main tasks with tests associated and participants in the Video Track must complete at least one of these three:

There is one voluntary auxilliary task with no direct test involved:

Many systems will accomplish the auxilliary task by running part of their search system on the search test set early enough to produce feature output to share. Or they may run a stand-alone feature extraction system for the same purpose.

The feature extraction, search, and auxilliary tasks come in two flavors, depending on whether system development and processing:

Each feature extraction task submission, search task submission, or donation of extracted features must declare its type: A or B. Search submissions of type B cannot make use of feature donations of type A. Search submission of type A can make use of feature donations of either type.

Available feature donations:

In both the feature extraction and search tasks, systems may be compared within and/or across types.

Available keyframe donation:

The team at Dublin City University has selected a keyframe for each shot from the common shot reference in the search test collection and these are available for download.

2.1 Shot detection:

Identify the shot boundaries with their location and type (cut or gradual) in the given video clip(s) -

2.2 Feature extraction:

Various semantic features concepts such as "Indoor/Outdoor", "People", "Speech" etc. occur frequently in video databases. The proposed task has the following objectives:

  1. To begin work on a benchmark for evaluating the effectiveness of detection methods for semantic concepts
  2. To allow exchange of feature detection output based on the TREC Video Track search test set prior to the search task results submission date so that participants can explore innovative ways of leveraging those detectors in answering the search task queries.

The task is as follows. Given a standard set of shot boundaries for the feature extraction test collection and a list of feature definitions, participants will return for each feature the list of at most the top 1000 video shots from the standard set, ranked according to the highest possibility of detecting the presence of the feature. Each feature is assumed to be binary, i.e., it is either present or absent in the given standard video shot.

Some participants may wish to make their feature detection output available to participants in the search task. By August 1, these participants should provide the detection over the search test collection in the feature exchange format.

Description of features to be detected:

These descriptions are meant to be clear to humans, e.g., assessors/annotators creating truth data and system developers attempting to automate feature detection. They are not meant to indicate how automatic detection should be achieved.

If the feature is true for some frame (sequence) within the shot, then it is true for the shot; and vice versa. This is a simplifaction adopted for the benefits it affords in pooling of results and approximating the basis for calculating recall.

  1. Outdoors: Segment contains a recognizably outdoor location, i.e., one outside of buildings. Should exclude all scenes that are indoors or are close-ups of objects (even if the objects are outdoor).
  2. Indoors: Segment contains a recognizably indoor location, i.e., inside a building. Should exclude all scenes that are outdoors or are close-ups of objects (even if the objects are indoor).
  3. Face: Segment contains at least one human face with the nose, mouth, and both eyes visible. Pictures of a face meeting the above conditions count.
  4. People: Segment contains a group of two more humans, each of which is at least partially visible and is recognizable as a human.
  5. Cityscape: Segment contains a recognizably city/urban/suburban setting.
  6. Landscape: Segment contains a predominantly natural inland setting, i.e., one with little or no evidence of development by humans. For example, scenes consisting mostly of plowed/planted fields, pastures, orchards would be excluded. Some buildings, if small features on the overall landscape, should be OK. Scenes with bodies of water that are clearly inland may be included.
  7. Text Overlay: Segment contains superimposed text large enough to be read.
  8. Speech: A human voice uttering words is recognizable as such in this segment
  9. Instrumental Sound: Sound produced by one or more musical instruments is recognizable as such in this segment. Included are percussion instruments.
  10. Monologue: Segment contains an event in which a single person is at least partially visible and speaks for a long time without interruption by another speaker. Pauses are ok if short.

2.3 Search:

Given a multimedia statement of information need and the common shot boundary reference for the search test collection, return a ranked list of 100 shots from the standard set, which best satisfy the need. Please note the following restrictions for this task:

  1. The only manually created collection information search systems are allowed to use will be the video titles, brief descriptions, and descriptors accessible on the Internet Archive and Open Video websites.
  2. For TREC 2002 the track will set aside the challenging problem of fully automatic topic analysis and query generation. Submissions will be restricted to those with a human in the loop, i.e., manual or interactive runs as defined below. Once we have a better idea what features make sense for the TREC 2002 video topics and collection, attempts to automate the extraction of such features can be more efficiently investigated.

  3. graphic description of run types
  4. Because the choice of features and their combination for search is an open research question, no attempt will be made in TREC 2002 to restrict groups with respect to their use of features in search. However, groups making manual runs must report their queries, query features, and feature definitions. These might serve as a basis for a common query language next year, if such a language is justified.
  5. Systems are allowed to use transcripts created by automatic speech recognition (ASR). But it is recommended that if an X+ASR run is submitted, then either an ASR-only or X-only run must also be submitted so that the effect of ASR can be measured. All runs count in determining how many runs a group submits.
  6. Groups submitting interactive runs must include a measure of time spent in browsing/searching for each search - minutes of elapsed time from the time the searcher is given the topic to the time the search effort ends.
    Such groups are encouraged to measure user characteristics and satisfaction as well and report this with their results, but they need not submit this information to NIST. Here are some examples of instruments for collection of user characteristics/satisfaction data developed and used by the TREC Interactive Track for several years.

3. Framework for the Video Track

The TREC 2002 Video Track is a laboratory-style evaluation that attempts to model real world situations.

3.1 Search

In the case of the search task, the evaluation attempts to model the operational situation in which a search system, tuned to a largely static collection, is confronted with an as yet unseen query - much as would happen when an end-user with an information need steps up to a operational search system connected to a large, static video archive and begins a search.

In this model the system is adapted before the test to the test collection and to some basic assumptions about what kinds of information needs it may encounter (e.g., requests for video material on specific people, examples of classes of people, specific objects, examples of classes of object and so analogously for events, locations, etc. - as noted in the track guidelines). The system is at no point adapted to particular test topics.

In the TREC Video Track the amount of time allowed to adapt the system to the test collection is a covariant, not an independent variable. Research groups interested in determining how well a system adapted solely to data from outside the search test collection performs, can use the feature extraction development/test collection or data from the TREC-2001 Video Track rather than the TREC-2002 search test collection to develop their system.

For TREC 2002 the topics will be developed by NIST who will control their distribution. Once the topics are received and before query construction begins, the system must be frozen. It cannot be adapted to the topics/queries. The time allowed for query construction will be narrowly limited only if the participants want this. Even in this case the topics will be available in time to give groups considerable freedom in when they actually freeze their system, construct the queries, and run the test.

3.2 Feature extraction and shot boundary determination

In the case of feature extraction and shot boundary determination, the set of features or shot types is known during system development. The test data themselves are unknown until test time, although high-level information about it is assumed to be available. Other data, believed similar enough to the test data to support system training/validation, are available before the test. Such data could be provided by a customer for whom the system is being developed, drawn from other sources of judged similar on the basis of high-level comparison, etc.

Video data

The files from the Open Video collection will be used as they exist on the site in MPEG-1 format. The Internet Archive (IA) videos available in VCD (MPEG-1) format as they exist on the site in VCD format. NIST assumes participating groups will be able to download the files from the IA and OV.

For video data, all experiments will use only the MPEG-1/VCD files listed in the collection definition. Additional experiments using other formats are not discouraged, but will be outside the evaluation framework of the track for 2002.

4.1 Common shot boundary reference:

Here is the common shot boundary reference for use in the search and feature extraction tasks. The emphasis here is on the shots, not the transitions.

The shots are contiguous. There are no gaps between them. They do not overlap. The media time format is based on the Gregorian day time (ISO 8601) norm. Fractions are defined by counting pre-specified fractions of a second. In our case, the frame rate is 29.97. One fraction of a second is thus specified as "PT1001N30000F".

TREC video id has the format of "XXX" and shot id "shotXXX_YYY". The "XXX" is the sequence number of video onto which the video file name is mapped, this is based on the "collection.xml" file. The "YYY" is the sequence number of the shot.

The directory contains these file(type)s:

4.2 Data for the search task:

NIST randomly choose about 40 hours from the identified collection to be used as the search test collection. This is approximately four times the amount of data used for the search test collection in TREC-2001.

With the exception of the feature extraction test, search task participants are free to use any video material for developing their search systems (including the search test collection and the feature extraction development collection). The sole use of the feature extraction test set in TREC-2002 is for testing in the feature extraction task.

4.3 Data for the feature extraction task:

The approximately 30 hours of the identified collection left after the search test collection has been removed has been designated the feature extraction collections. It was partitioned randomly into a feature development (training and validation) collection and a feature test collection.

The feature development collection will be available for participants to train their feature extractors for the feature extraction task as well as feature-based search systems for the search task. With the exception of the feature extraction test set, feature extraction task participants are free to use any video (including the search test collection) for developing their systems. None of the feature extraction test set should in any way be used to create feature extractors. The sole use of the feature extraction test set in TREC-2002 is for testing in the feature extraction task.

4.4 Data for the shot boundary detection task:

The shot boundary (SB) test collection will be about 5 hours of videos similar to but from outside the those in the 68 hours of the identified collection. It will be chosen by NIST and announced in time for participants to download it but not before.

4.5 Summary of tasks and available data

Graphic summary of tasks and available data

5. Information needs and topics

5.1 Example types of informations needs

I'm interested in video material / information about:
  1. a specific person

  2. e.g., I want all the information you have on Ronald Reagan.
  3. one or more instances of a category of people

  4. e.g., Find me some footage of men wearing hardhats.
  5. a specific thing

  6. e.g., I'm interested in any material on Hoover Dam.
  7. one or more instances of a category of things

  8. e.g., I need footage of helicopters.
  9. a specific event/activity

  10. e.g., I'm looking for a clip of Ronald Reagan reading a speech about the space shuttle
  11. one or more instances of a category of events/activities

  12. e.g., I want to include several different clips of rockets taking off. I need to explain what cavitation is all about.
  13. other?

5.2 Topics:

6. Submissions and Evaluation

6.1 Shot boundary detection

6.2 Feature extraction

6.3 Search

7. Milestones for TREC-2002 Video Track

14. Feb
Short application due at NIST. See bottom of Call for Participation
  1. May
Guidelines complete (includes definition of search, feature extraction development, and feature extraction test collections)
  20. May
Common shot boundary reference for the search test collection available
  1. Jul
ASR transcripts available from groups willing to share in this MPEG-7 format.
22. Jul
Shot boundary test collection defined
  1. Aug
Any donated MPEG-7 output of feature extraction on the search test collection due/available in this format.
  8. Aug
Search topics available for active participants
11. Aug
Shot boundary detection submissions due at NIST for evaluation.
  2. Sep
Feature extraction task submission due at NIST for evaluation. See section 6.2 above for details
  8. Sep
Search task submission due at NIST for evaluation. See section 6.3 above for details
11. Oct
Results of evaluations returned to participants
  3. Nov
Conference notebook papers due at NIST
19.-22. Nov
TREC 2002 Conference at NIST in Gaithersburg, Md.

8. Results (evaluated submissions and notebook papers for submitters)

The following links lead to information on the results of the three evaluations. Use the password provided to those who submitted runs for evaluation.

9. Guideline issues to be resolved

10. Contacts:

National Institute of
Standards and Technology Home Last updated: Monday, 13-Dec-2004 09:39:08 EST
Date created: Monday, 19-Nov-01
For further information contact Paul Over (