---------------------------------------------------------------------
TRECVID 2007: Instructions for creation of video summary ground truth
Version 7

B a c k g r o u n d:

A good video summary shows the viewer segments containing examples of
the main objects and events depicted in the video it summarizes,
filtering out the *unclear* and the *predictable*. One way to evaluate
such a summary is to have a human summarizer create a filtered list of
such segments, each identified uniquely in terms of an object or
event. Then the summary can be compared to the list to see how many of
the desired objects/events (i.e., segments) it contains.
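
As a rough illustration of that comparison, the sketch below counts how
many ground truth items a judge marked as seen in a summary. The
function name, the sample items, and the use of a simple fraction as the
score are assumptions made for illustration; they are not part of the
official evaluation procedure.

```python
def fraction_found(ground_truth, judged_found):
    """Return the fraction of ground truth items the judge marked as seen."""
    # One item per line; ignore blank lines.
    items = [line.strip() for line in ground_truth if line.strip()]
    if not items:
        return 0.0
    found = sum(1 for item in items if item in judged_found)
    return found / len(items)


# Hypothetical ground truth list and judge's marks for one summary.
ground_truth = [
    "woman with glasses",
    "woman with red hair",
    "pan across room",
    "antique car",
]
judged_found = {"woman with glasses", "pan across room", "antique car"}

score = fraction_found(ground_truth, judged_found)
```

Here three of the four listed items were marked as seen, so the score
is 0.75.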

Your task is to watch a video, select desirable segments, and then
identify each uniquely by noting an object (animate or inanimate) or
event (i.e., one or more objects involved in some action) occurring in
the segment. The number of segments will vary with the video. That is
OK.

It is the nature of rushes that some scenes and parts of scenes will
be shot multiple times. The variations in such retakes, while important
to the director, will likely be below the level that matters to a
highly compressed summary. That is, the summary need only include one
instance. An exception might be something that goes wrong and might
have a separate use from other takes that proceed mostly as expected.

A desirable segment should not cross shot boundaries. You may identify
multiple such segments within a single shot. Try not to include
extremely short segments separately unless they seem very interesting.

You can include segments from the unscripted portion of the video if
they are substantial enough and seem as though they might be
reusable. However, DO NOT include the starting/ending clap boards of
scenes and takes or the color bars at the beginning.

The object/event cue for each desired segment should be as simple as
possible while still identifying the segment uniquely within the
video. Uniqueness is primary. For example, if there are two women in
the video and you want to include two segments (a closeup of each),
you will need to specify some distinguishing modifiers in your list,
e.g., "woman with glasses" versus "woman with red hair", so the person
judging the summary against your list can tell when s/he has seen each
of the women you designated. Use clear, concrete language - no
specialized terminology - in each item.

Each item needs to be independent of context: it should not refer to
another item, e.g., "view of road from different angle". Each item
should be clear even if we randomized the order of the list or used
only a subset.

Many videos contain alternate shots of some object/person at different
ranges. Be sure to make clear which is which - this may mean
mentioning what is visible (shoulders and head vs. head only).


Each item should take one of the following forms.

  - object (no event or camera event)
    e.g., antique car
          old woman

  - object(s) + event
    e.g., red hot air balloon ascending
          people talking

  - object(s) + camera event
    e.g., pan across room
          zoom in on newspaper page

  - object(s) + event + camera event*
    e.g., zoom in on red hot air balloon ascending
          zoom in on blimp's cabin touching the water

  *The set of allowable camera events is limited to the following:
   zoom in, zoom out, or pan. Remember a zoom or pan is an event.
   A closeup is a state.
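
The restriction on camera events can be checked mechanically. The
sketch below is only an illustration: it assumes the camera event, when
present, is the leading phrase of the item, as in the examples above.
The function name is hypothetical.

```python
# The only camera events these instructions allow.
CAMERA_EVENTS = ("zoom in", "zoom out", "pan")


def leading_camera_event(item):
    """Return the allowed camera event the item starts with, or None."""
    words = item.lower().split()
    for event in CAMERA_EVENTS:
        event_words = event.split()
        # Compare whole words so that "panda" is not mistaken for "pan".
        if words[:len(event_words)] == event_words:
            return event
    return None


examples = [
    "antique car",
    "red hot air balloon ascending",
    "pan across room",
    "zoom in on newspaper page",
]
events = [leading_camera_event(item) for item in examples]
```

An item such as "closeup of old woman" yields None, which matches the
rule that a closeup is a state rather than a camera event.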

In your annotation, list one segment, i.e., one object/event
per line.

P r o c e d u r e:

Play the video at normal speed through one take of the scene, pick the
distinct segments you want to select, and enter them on the list as
described above. Rewatch the scene to supplement/check the list.
Fast-forward through the other takes of the scene unless something
really different and interesting happens. Continue in the same fashion
with any remaining scenes.

C h e c k l i s t:

1. Is each line in your groundtruth UNIQUE? (as no two lines should be
the same)
2. Is each line in your groundtruth INDEPENDENT? (as each line should
stand on its own, e.g., "view of road from different angle" is NOT
independent as it assumes you know what the original angle was before it
became "different")
3. Is each line/event you have listed SIGNIFICANT? (don't list something
unless it is clear and complete enough to be useful once found, except
if its presence is surprising enough to trump its obscurity or
incompleteness)
4. Is there ONE OBJECT/EVENT per line? (there should be no more than 1)
5. Does any line have any UNNECESSARY DETAIL? (only the minimum amount
of detail that is needed to uniquely describe a line should be given)
6. Is there any line with only CAMERA MOVEMENT? (e.g., "Camera Pans
Right" probably needs more substance as it is unlikely to be the only
time in the video when the camera pans right; something like "Camera
Pans Right onto an object" gives a more accurate description)

In short:

1. UNIQUE?
2. INDEPENDENT?
3. SIGNIFICANT?
4. ONE OBJECT/EVENT?
5. UNNECESSARY DETAIL?
6. CAMERA MOVEMENT?
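
Some of these checks can be partly automated before a human review. The
sketch below is a minimal illustration, not an official tool: it flags
duplicate lines (UNIQUE), lines containing words that usually imply
context (INDEPENDENT), and lines that consist only of a camera movement
(CAMERA MOVEMENT). The word list, the pattern, and the function name
are assumptions, and the word list in particular will produce false
alarms. SIGNIFICANT, ONE OBJECT/EVENT, and UNNECESSARY DETAIL still
need human judgment.

```python
import re

# Assumed pattern for a line that is nothing but a camera movement.
CAMERA_ONLY = re.compile(
    r"^(camera\s+)?(zooms?\s+(in|out)|pans?)(\s+(left|right|up|down))?$"
)
# Assumed words that often signal dependence on another line.
CONTEXT_WORDS = {"different", "another", "other", "same", "again"}


def check_ground_truth(lines):
    """Return (line_number, problem) pairs for mechanically checkable issues."""
    problems = []
    seen = {}
    for number, raw in enumerate(lines, start=1):
        item = " ".join(raw.lower().split())  # normalize case and spacing
        if not item:
            continue
        if item in seen:
            problems.append((number, f"not unique: same as line {seen[item]}"))
        else:
            seen[item] = number
        if CONTEXT_WORDS & set(item.split()):
            problems.append((number, "may not be independent"))
        if CAMERA_ONLY.match(item):
            problems.append((number, "camera movement only"))
    return problems


sample = [
    "woman with glasses",
    "Camera pans right",
    "view of road from different angle",
    "Woman with  glasses",
    "zoom in on newspaper page",
]
problems = check_ground_truth(sample)
```

For this sample, lines 2, 3, and 4 are flagged; line 5 passes because
the zoom is tied to an object.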