TREC 2005 Video Retrieval Evaluation
Inputs to TRECVID 2006 planning and some initial responses

o High-level features
   * Push development of generic approaches to detector development by requiring groups to submit results for all features in the common annotation. NIST to choose a subset for evaluation?
     -> It seems time to adopt this requirement. Almost all groups participating in the 2005 high-level feature task submitted runs for all features.
     ? There remains the issue of deciding, well in advance of the evaluation, how NIST will choose the subset of features to be evaluated manually.
   * Collect and report information on computational effort?
     -> At this point in the maturity of the technologies, we don't want to discourage approaches that are slower but effective. Also, training time is difficult to define because training may involve multiple cycles of training, testing, and system revision. Nevertheless, reporting computation time during the final system test as supporting information about a run seems useful enough to require it.
   * Require a run based on a donated set of low-level features (CMU's, MPEG-7's, ...) for at least master shot keyframes?
     -> It's not clear that the benefit of requiring such a run from everyone would outweigh the cost (yet another pre-defined run type) or how many participants are really interested. We could allow submissions to let us know if they restricted their use of low-level features to the "official set(s)".
   * Provide a common discriminative modeling baseline system?
     -> No proposal

o Camera motion
     -> Although we had included camera motion in a few search topics or features in previous years, in 2005 we devoted a full task to it and learned a lot - not least about the difficulty of creating the truth data. As a result, we do not intend to continue this task as a separate focus in 2006, though camera motion can continue to be part of topics in the search task and perhaps find a natural place in a task using unproduced video.
   * Use finer granularity than master shots? Perhaps a random sample of arbitrary segments of some small arbitrary length?
     -> Seems the length should be tied to the lengths of real segments people could expect to find in an archive they are searching for video material.
   * Change the task significantly
      * Make this a quantification task, not a binary classification task?
        -> It is already very expensive to create truth data; this would make it worse. Also, it is pretty clear from real queries and input from some real users that this is not how real users specify movement in their requests.
      * Define minimum motion?
        -> Not appropriate or practical. See above.
      * Include search for "change of focus" etc.?
        -> While "change of focus" is ill-defined for the purposes of evaluation, it might be interesting to use a real archival system's set of shot type features in place of simple motion features. However, this runs into the cost of creating truth data (if it is not given), and we have no idea how many systems might be able to participate.

o Rushes
   * There was considerable interest in continuing work on unproduced video in some form. There is significant commercial interest, and it is hoped some of the research results may apply to surveillance video.
     -> We hope to have a significantly larger amount of rushes from the BBC, but probably without metadata. Possibly we will also have similar extra video material from other sources. Beyond getting the data in a usable format with permission for its use by TRECVID, there are MANY questions to be answered about the evaluation:
     what is the real task being modeled, where can we get training and ground truth or judgment resources, etc.
   * Task
      * Stock shot finding?
        -> Known to be a useful task, but it is not clear how to evaluate it, since the concept of a "stock shot" is not well enough defined.
      * Known-item search using skimming/summarization/clustering/exploration?
        -> Topic creation is probably doable for NIST if we can convince ourselves there is only one answer for each topic, but we would need a large number of topics. But what is the real task we are modeling? A person who WAS familiar with the material but has forgotten how to find the segment s/he remembers? What do the systems return? A point in time in a video such that the known item occurs no more than x secs after the given starting point?
      * Direct evaluation of clustering analysis (a la TDT)?
        ? Less like clustering, more like assignment of labels to (existing?) clusters?
        -> Details unclear
      * Given a clip from outside the collection, find similar ones in the test collection?
        -> Problem: deciding whether two clips are similar is extremely subjective - much more so than deciding whether a clip contains an object, event, person, etc.
      * Other comments: Look at similarities to UAV data; look for events? Need info on real user task characteristics. Will someone donate camera motion detection data? No common unit of retrieval?

o Search
   * Rename "manual" as "automated"
     -> Not a problem
   * More info in the overview on the effect of various factors (e.g., text-only vs text+video)
     -> Trying this already in the 2005 overview paper
   * Drop the manual search type?
     -> The harm is not clear, since manual runs are not required but provide a way of estimating the topic-to-query translation effect within search.
   * Evaluate automatic runs using precision/recall at fixed depth?
     -> We already include such partial measures in the run results pages, but we do not use them as the featured overall (precision and recall) measures because they do not reflect recall and do not average well over topics with varying numbers of relevant shots. (A small illustration follows at the end of this section.)
   * Use the LSCOM use cases to generate TV6 topics?
     -> It is problematic to start from a list of interesting events, people, etc. and then test them to see which occur often enough in the development and test data. We have no efficient means to do this. We tried something similar before, starting from the NYT daily news digest, and spent most of our time chasing down dead ends. It would be possible to have the LSCOM topics in mind while watching the development/test video in search of candidate topic targets.
   * Elicit multi-valued relevance judgments from assessors?
     -> No clear benefit presented
   * Need a volunteer to define master shot boundaries
     -> Christian Petersohn at the Fraunhofer (Heinrich Hertz) Institute is willing to do this again for continuity. Thanks!
   * Need a volunteer to complete the master shot reference and extract keyframes.
     -> DCU is willing to do this again for continuity. Thanks!
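     A minimal illustration of the averaging issue (hypothetical shot IDs and rankings, not TRECVID data): with a fixed cutoff of 100, a topic with only two relevant shots can never score above 0.02 even for a perfect ranking, while non-interpolated average precision reflects recall and stays comparable across topics.

        # Precision at a fixed depth vs. average precision for one topic.
        # All shot IDs and rankings below are made up for illustration.

        def precision_at_depth(ranked_shots, relevant, depth):
            """Fraction of the top `depth` results that are relevant."""
            return sum(1 for s in ranked_shots[:depth] if s in relevant) / depth

        def average_precision(ranked_shots, relevant):
            """Mean of the precision at the rank of each relevant shot found,
            normalized by the total number of relevant shots, so the measure
            reflects recall and is comparable across topics."""
            hits, precisions = 0, []
            for rank, shot in enumerate(ranked_shots, start=1):
                if shot in relevant:
                    hits += 1
                    precisions.append(hits / rank)
            return sum(precisions) / len(relevant) if relevant else 0.0

        # Topic A: 40 relevant shots, all ranked first.
        # Topic B: only 2 relevant shots, also ranked first (both rankings are "perfect").
        ranked_a = ["A%03d" % i for i in range(100)]
        relevant_a = set(ranked_a[:40])
        ranked_b = ["B000", "B001"] + ["junk%03d" % i for i in range(98)]
        relevant_b = {"B000", "B001"}

        print(precision_at_depth(ranked_a, relevant_a, 100))  # 0.40
        print(precision_at_depth(ranked_b, relevant_b, 100))  # 0.02 - capped by the cutoff
        print(average_precision(ranked_a, relevant_a))        # 1.0
        print(average_precision(ranked_b, relevant_b))        # 1.0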
o General
   * Assess the impact of "near duplicates"? Remove commercials from the test sets?
     -> This has been an open issue for some time. NIST is looking at this question. It's not clear that removing commercials from the development/test collections is the right answer, if only because real systems will need to deal with commercials in some way or other.
   * Require a rerun of the 2005 system on 2006 data to assess the data effect? Or a rerun of the 2006 system on 2005 data?
     -> While it would be great if all participants could do this, we hesitate to require it of everyone. At least, though, one might expect VACE/DTO contractors to do this, and information from their runs might help estimate the data effect for all systems.
   * Emphasize video aspects (vs static)
     -> Not clear what was intended here. 1) Is this a change (in emphasis?) from what representatives of the intelligence community have told us previously - namely, that they are very interested in being able to find objects, people, and locations, not exclusively or mainly events? 2) Is this a suggestion or requirement for researchers to approach the task using more than keyframes to represent the video?
   * New automatic "near duplicate" detection task
     -> Not sure how to define this. Find clips that are very similar at a low level because they contain some of the very same decoded frames (in the same order?)? Or are we talking (too) about higher-order similarity, e.g., shots that film the same event from a different angle? (A sketch of the low-level reading appears at the end of these notes.)
   * Location?
      * NIST pre-TREC on 13-14 November, in Gaithersburg, MD
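     On the "low-level" reading of near-duplicate detection above (clips sharing some of the very same decoded frames in the same order), one possible formulation - offered only as a sketch, with the frame-signature choice and threshold as arbitrary assumptions rather than a task definition - is to compare per-frame signatures and flag clip pairs that share a sufficiently long contiguous run of matching signatures.

        # Sketch: flag two clips as near-duplicates if they share a long run of
        # identical per-frame signatures in the same order. Signatures are assumed
        # to be precomputed, hashable values (e.g., a coarse hash of a downsampled
        # frame); the 25-frame threshold (~1 second at 25 fps) is arbitrary.

        def shared_frame_run(sigs_a, sigs_b):
            """Length of the longest common contiguous run of frame signatures
            (a standard longest-common-substring dynamic program)."""
            best = 0
            prev = [0] * (len(sigs_b) + 1)
            for a in sigs_a:
                cur = [0] * (len(sigs_b) + 1)
                for j, b in enumerate(sigs_b, start=1):
                    if a == b:
                        cur[j] = prev[j - 1] + 1
                        if cur[j] > best:
                            best = cur[j]
                prev = cur
            return best

        def near_duplicates(sigs_a, sigs_b, min_shared_frames=25):
            """True if the two clips share at least min_shared_frames
            consecutive identical frames."""
            return shared_frame_run(sigs_a, sigs_b) >= min_shared_frames

        # Toy example: clip B re-uses 30 consecutive frames of clip A.
        sigs_a = list(range(1000))                                              # stand-in signatures
        sigs_b = [9000 + i for i in range(50)] + sigs_a[100:130] + [8000 + i for i in range(40)]
        print(shared_frame_run(sigs_a, sigs_b))                                 # 30
        print(near_duplicates(sigs_a, sigs_b))                                  # True

     The higher-order reading (the same event filmed from a different angle) would clearly need a different treatment.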