Ideas for a rushes task in 2006

(last updated: )

Goals

The purpose of this email discussion list is to come up with a proposal for incorporating additional rushes video into an evaluation in TRECVID 2006. Rushes are, we have heard from the BBC and others, a potentially very valuable source of video for reuse, but it is largely untapped because it is very difficult to find out what is there. A case can perhaps also be made that rushes share some characteristics of the "ground reconnaissance" data of interest to various intelligence organizations.

This marks the beginning of the discussion among interested 2005 groups. More people may be added once the call for TRECVID 2006 is out next week. The proposal will go to the entire trecvid2006 list when it exists and must be complete well before the end of February.

Here are some facts we can't lose sight of:

1) We have another 50 hours of BBC rushes about the "French Experience" in MPEG-1. Alan Smeaton can say more about the content and any metadata that might be included.

2) We *may* get some data, at least 50 hours of "extra" or "B-roll" video with mostly only natural sound (poorer quality than we are used to), 300-500 kbps in Windows Media format (320x240), 1-5 min clips (avg. 2 mins), with detailed manually-created scene descriptions (which we could possibly mine for feature training and evaluation truth data), and covering a very wide range of topics such as science, travel, adventure, culture, health, history, military, natural history, biography, etc.

3) An evaluation that is part of TRECVID needs to have all of the following:
- a real world task/scenario we are trying to model and that's important
- a simplified, laboratory version of the above: the system task
- a procedure for evaluating the system's performance of the task against a known reference that represents desirable system performance

We can start off brainstorming about just real world tasks or the system versions but will very soon have to answer the questions about the other facets of the evaluation.

4) The structure of the evaluation means we usually need
- development data
- reference data for development
- test data (possibly including problem statements, queries, etc.)
- reference data for test (created independent of system output or based on judging submissions)

5) Right now, all the NIST judging capability is reserved for the search and high-level feature task assessments. But it's possible that about 200 hours of additional judging of some kind could be done in the mid-summer time frame - that would be the usual 10 NIST assessors working 4 hrs/day for 5 days. (For reference, they work for 14 half-days judging feature submissions and the same amount of effort is spent in judging search results.)

6) In designing a rushes evaluation we'd like to do something that is of interest to researchers and to the donors and the TRECVID sponsor (the US Intelligence Community). E.g., some demonstrated ability to assist searching, browsing, summarization, or even just assigning useful keywords would, as many have heard from various sources in various forums, be much appreciated.
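
To make the roughly 200 hours of additional judging mentioned above more concrete, here is a rough back-of-the-envelope sketch. The 10 assessors, 4 hours per day, and 5 days come from the text; the 15 seconds of assessor time per submitted segment is purely an illustrative assumption, not a NIST figure.

```python
# Rough judging-capacity estimate for the possible extra assessment week.
ASSESSORS = 10       # from the text
HOURS_PER_DAY = 4    # from the text
DAYS = 5             # from the text

# Hypothetical value, chosen only for illustration.
SECONDS_PER_JUDGMENT = 15  # assumed assessor time per submitted segment


def judging_hours(assessors: int, hours_per_day: int, days: int) -> int:
    """Total assessor-hours available."""
    return assessors * hours_per_day * days


def judgments_possible(total_hours: int, seconds_per_judgment: int) -> int:
    """How many segments could be judged once in the available time."""
    return (total_hours * 3600) // seconds_per_judgment


total = judging_hours(ASSESSORS, HOURS_PER_DAY, DAYS)
print(total)                                            # 200
print(judgments_possible(total, SECONDS_PER_JUDGMENT))  # 48000
```

Changing the assumed time per judgment, or judging each segment more than once, scales the capacity accordingly.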

Suggestions

Features for Discovery data

Here is one possible task for discussion. It involves the Discovery data - let's assume for discussion purposes we can get the use of it. (I don't know how to include the BBC data without adding in manual work for training data creation and results judging.)

It's based on the idea that we could (semi-)automatically create training data and truth data for testing by analyzing the given Discovery verbal clip descriptions to determine which test feature names (from LSCOM-lite (39)?, MediaMill (101)?, full LSCOM (?00) ?) are present and therefore which features should be present in each clip.
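
As a minimal sketch of the matching step described above, the following checks each clip description for whole-word occurrences of feature names. The feature names, clip ids, and descriptions are invented for illustration; a real run would load one of the feature lists mentioned above, and would need to handle plurals, synonyms, and multi-word feature names, which this sketch does not.

```python
import re
from typing import Dict, List, Set

# Hypothetical feature names, for illustration only.
FEATURES = ["boat", "mountain", "animal"]


def features_in_description(description: str, features: List[str]) -> Set[str]:
    """Return the single-word feature names that occur as whole words."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return {f for f in features if f.lower() in words}


def build_truth(descriptions: Dict[str, str],
                features: List[str]) -> Dict[str, Set[str]]:
    """Map each feature name to the clip ids whose description mentions it."""
    truth: Dict[str, Set[str]] = {f: set() for f in features}
    for clip_id, text in descriptions.items():
        for f in features_in_description(text, features):
            truth[f].add(clip_id)
    return truth


# Invented example descriptions.
descriptions = {
    "clip001": "A small boat crosses the bay with a mountain behind it.",
    "clip002": "Close-up of an animal drinking.",
}
truth = build_truth(descriptions, FEATURES)
print(sorted(truth["boat"]))    # ['clip001']
print(sorted(truth["animal"]))  # ['clip002']
```

Counting the clips per feature in the resulting mapping would show which features have enough examples to be usable for training and testing.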

Real world task - archivist writes short verbal description (can be several phrases, even sentences) of each clip's content - an enormous effort for a large clip library.

System task - assist the archivist by automatically assigning applicable features to each test clip as in the TRECVID 2006 high-level feature task against broadcast news. (Build foundation for better browsing/search)

Evaluation - measure how well (precision/recall, average precision) systems do at finding the clips with each feature. This could be done automatically using existing software given the truth data derived from the content descriptions for the test clips.
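
For reference, a minimal sketch of non-interpolated average precision for one feature is given below. It is not the existing software the text refers to; the clip ids are invented, and the ranked list is assumed to contain no duplicates.

```python
from typing import List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision of a ranked clip list.

    `ranked` is a system's ranked list of clip ids for one feature;
    `relevant` is the set of clip ids known to contain the feature.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, clip_id in enumerate(ranked, start=1):
        if clip_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant clip
    return precision_sum / len(relevant)


# Relevant clips at ranks 1 and 3: (1/1 + 2/3) / 2
print(round(average_precision(["c1", "c2", "c3", "c4"], {"c1", "c3"}), 4))  # 0.8333
```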

Search/browse for relevant segments given text-only query

Here is a suggestion for the rushes task. Interested sites develop a system to allow search, browse and identify relevant shots based on a text-only topic ... mirrors the situation whereby somebody sticks their head in your office and asks you to find shots which are about "X" without having any illustrative examples, oh and by the way, I need them in 5 minutes.

Sites can run whatever analysis they want on the rushes, features, shot bounds, keyframe selection, ASR if they can get a French version.

Sites run a cut-down version of the interactive search task with a 5-minute limit per topic-search, and the task is to find as many as possible. Sites then submit only the "found" shots by giving the video file name and frame offsets for start/end of the "relevant" shots - yes, we could have problems with frame offset numbering here. Immediate disadvantage here is that we lose the common shot boundaries, making actual judgments more difficult. This also raises the issue of how much of a shot is needed for it to be relevant, and how relevant is my retrieval of 10 seconds of a shot vs. somebody else's 5 seconds of the same shot?

NIST then assess but are assessing only the (likely to be) relevant shots, which might require less judgment than the pooling of top-submitted. This might even allow greater than the usual 25 topics.

Might even be possible to do graded relevance judgments here ... relevant/very relevant/not relevant ? Perhaps this could be in terms of how much (how many seconds) of the clip a run has identified ?

Evaluation is in terms of how many relevant (or relevant/very relevant) shots a site finds. We could normalise this per-topic/per-site score for the more difficult topics vs. easier in terms of numbers of relshots found by ALL groups. Might seem crude but it rewards sites/systems which are good at finding a greater number of relshots in a fixed, short amount of time.
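
One possible reading of the normalisation suggested above, as a sketch only: for a single topic, divide each site's count of relevant shots by the number of distinct relevant shots found by all groups together. The site names and shot ids are invented, and the choice of denominator is an assumption rather than an agreed measure.

```python
from typing import Dict, Set


def normalised_scores(found: Dict[str, Set[str]]) -> Dict[str, float]:
    """Normalise per-site counts of relevant shots for one topic.

    `found` maps a site name to the set of shot ids that site submitted
    and that were judged relevant. Each count is divided by the number
    of distinct relevant shots found by all sites together.
    """
    pool: Set[str] = set().union(*found.values())
    if not pool:
        # No site found anything relevant for this topic.
        return {site: 0.0 for site in found}
    return {site: len(shots) / len(pool) for site, shots in found.items()}


scores = normalised_scores({
    "siteA": {"s1", "s2", "s3"},
    "siteB": {"s2", "s4"},
})
print(scores)  # {'siteA': 0.75, 'siteB': 0.5}
```

With this denominator, a topic where all groups together find few relevant shots does not drag down a site's average across topics as much as raw counts would.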

Without the usual pre-topic formation work done by NIST whereby evaluators view the content a priori and ascertain that there are actually shots in the collection which are relevant, we might find some topics have zero relevant shots, and some have very many but that's the real world.

Advantages are that this would force us to deal with video in our systems, force us to cut out and mark boundaries of video clips as opposed to pre-defined, pre-bound keyframes.

Emails

Starting with the earliest...

Hi All,

The purpose of this email discussion list is to come up with a proposal
for incorporating additional rushes video into an evaluation in TRECVID
2006. Rushes are, we have heard from the BBC and others a potentially
very valuable source of video for reuse but it is largely untapped because
it is very difficult to find out what is there. A case can perhaps also
be made that rushes share some characteristics of the "ground reconnaissance"
data of interest to various intelligence organizations.

This marks the beginning of the discussion among interested 2005
groups. More people may be added once the call for TRECVID 2006 is out
next week. The proposal will go to the entire trecvid2006 list when it
exists and must be complete well before the end of February.

Here are some facts we can't lose sight of:

1) We have another 50 hours of BBC rushes about the "French Experience" in
MPEG-1. Alan Smeaton can say more about the content and any metadata that
might be included.

2) We *may* get some data from Discovery Communications, Inc - at least
50 hours of what they call "extra" or "B-roll" video with mostly only natural
sound, (poorer quality than we are used to) 300-500 kbps in Windows Media
format, 1-5 mins clips, with detailed manually-created scene descriptions
(which we could possibly mine for feature training and evaluation truth
data), and covering a very wide range of topics such as science, travel,
adventure, culture, health, history, military, natural history, biography,
etc.

3) An evaluation that is part of TRECVID needs to have all of the following:
    - a real world task/scenario we are trying to model and that's important
    - a simplified, laboratory version of the above: the system task
    - a procedure for evaluating the system's performance of the task
          against a known reference that represents desirable system
      performance

We can start off brainstorming about just real world tasks or the system
versions but will very soon have to answer the questions about the other
facets of the evaluation.

4) The structure of the evaluation means we usually need
    - development data
    - reference data for development
    - test data (possibly including problem statements, queries, etc.)
    - reference data for test (created independent of system output or
          based on judging submissions)

5) Right now, all the NIST judging capability is reserved for the search and
high-level feature task assessments. But it's possible that about 200 hours
of additional judging of some kind could be done in the mid-summer time frame
- that would be the usual 10 NIST assessors working 4 hrs/day for 5 days (For
reference they work for 14 halfdays judging feature submissions and the same
amount of effort is spent in judging search results.)

6) In designing a rushes evaluation we'd like to do something that is of
interest to researchers and to BBC and Discovery and the TRECVID sponsor (the
US Intelligence Community). E.g., some demonstrated ability to assist searching,
browsing, summarization, or even just assigning useful keywords would, as many
have heard from various sources in various forums, be much appreciated.

So having thrown all that onto the table.... let the discussion begin. Just
reply-all to keep everyone in the loop.

- Paul Over

Folks

On the BBC rushes data ... BBC sent me 115 CD-Rs of which:

- 101 had proper video files (total 34.2G, about 53 hours);
- 6 are blank;
- 8 contain inappropriate data (not recognised files, some don't  
    play, some look like some applications).

We've downloaded them all, and are putting on a HDD and shipping to
Paul next week.

Each CD has one video file only, with the file name not meaningful in
any way and each video begins with a few seconds of a testcard
(horizontal stripes of different rainbow-like colours), and then the
video. There are some shot cuts in some sequences, but long sequences
are using a single camera ... very like the stuff used in 2005. The
content is as varied as:

- a group of female jazz singers in a studio recording a song;
- a group of people tasting bananas in a lab;
- a farmer cross-pollinating bananas;
- a subject talking to an interviewer in an office setting;

Most of the people in the video are of African origin, much of the
content appears shot in a tropical climate, and ... wait for it ...
for all the content we've looked at the speech is in French.

There is no metadata or description at all and the quality of the
video seems to be very good indeed, though we haven't looked at the
statistics.

This is certainly challenging, forcing a focus on visual aspects, very
real-world.

- Alan Smeaton

Hi,

Here is one possible task for discussion. It involves the Discovery
data - let's assume for discussion purposes we can get the use of it.
(I don't know how to include the BBC data without adding in manual
work for training data creation and results judging.)

It's based on the idea that we could (semi-)automatically create
training data and truth data for testing by analyzing the given
Discovery verbal clip descriptions to determine which test feature
names (from LSCOM-lite (39)?, MediaMill (101)?, full LSCOM (?00) ?)
are present and therefore which features should be present in each
clip.

Real world task - archivist writes short verbal description (can be
several phrases, even sentences) of each clip's content - an enormous
effort for a large clip library.

System task - assist the archivist by automatically assigning
applicable features to each test clip as in the TRECVID 2006
high-level feature task against broadcast news. (Build foundation for
better browsing/search)

Evaluation - measure how well (precision/recall, average precision)
systems do at finding the clips with each feature. This could be done
automatically using existing software given the truth data derived
from the content descriptions for the test clips.

Any thoughts?

- Paul Over

Folks

Here is a suggestion for the rushes task.

Interested sites develop a system to allow search, browse and identify
relevant shots based on a text-only topic ... mirrors the situation
whereby somebody sticks their head in your office and asks you to find
shots which are about "X" without having any illustrative examples, oh
and by the way, I need them in 5 minutes.

Sites can run whatever analysis they want on the rushes, features,
shot bounds, keyframe selection, ASR if they can get a French version.

Sites run a cut-down version of the interactive search task with a 5-
minute limit per topic-search, and the task is to find as many as
possible. Sites then submit only the "found" shots by giving the video
file name and frame offsets for start/end of the "relevant" shots -
yes, we could have problems with frame offset numbering here.
Immediate disadvantage here is that we lose the common shot
boundaries making actual judgments more difficult. Also raises the
issue of how much of a shot is needed for it to be relevant, and how
relevant is my retrieval of 10 seconds of a shot vs. somebody else's 5
seconds of the same shot ?

NIST then assess but are assessing only the (likely to be) relevant
shots, which might require less judgment than the pooling of top-
submitted. This might even allow greater than the usual 25 topics.

Might even be possible to do graded relevance judgments here ...
relevant/very relevant/not relevant ?  Perhaps this could be in terms
of how much (how many seconds) of the clip a run has identified ?

Evaluation is in terms of how many relevant (or relevant/very
relevant) shots a site finds. We could normalise this per-topic/per-
site score for the more difficult topics vs. easier in terms of
numbers of relshots found by ALL groups. Might seem crude but it
rewards sites/systems which are good at finding a greater number of
relshots in a fixed, short amount of time.

Without the usual pre-topic formation work done by NIST whereby
evaluators view the content a priori and ascertain that there are
actually shots in the collection which are relevant, we might find
some topics have zero relevant shots, and some have very many but
that's the real world.

Advantages are that this would force us to deal with video in our
systems, force us to cut out and mark boundaries of video clips as
opposed to pre-defined, pre-bound keyframes.

- Alan Smeaton

Thanks Alan. I'm going to try to keep track of the suggestions here:

    http://www-nlpir.nist.gov/projects/tv2006/rushes06.html

Some comments interspersed below - mostly just to itemize gaps we
would need to fill in eventually if we go down this road.

> Interested sites develop a system to allow search, browse and
> identify relevant shots based on a text-only topic ... mirrors the
> situation whereby somebody sticks their head in your office and asks
> you to find shots which are about "X" without having any illustrative
> examples, oh and by the way, I need them in 5 minutes.  Sites can
> run whatever analysis they want on the rushes, features, shot bounds,
> keyframe selection, ASR if they can get a French version.

So we would potentially be measuring more than just the system code/
algorithms.  :-(

> Sites run a cut-down version of the interactive search task with a
> 5- minute limit per topic-search, and the task is to find as many as
> possible.

Assume only interactive runs allowed? - since getting from text to
video-without-speech will require a human in the loop, interactive
system will return less junk and so take pressure off assessing, and
want to encourage interactive systems generally.

> Sites then submit only the "found" shots by giving the video file
> name and frame offsets for start/end of the "relevant" shots - yes, we
> could have problems with frame offset numbering here.  Immediate
> disadvantage here is that we loose the common shot boundaries making
> actual judgments more difficult. Also raises the issue of how much of
> a shot is needed for it to be relevant, and how relevant is my
> retrieval of 10 seconds of a shot vs.  somebody else's 5 seconds of
> the same shot ?

Maybe use start time of segment that meets the topic's need to avoid
frame number variation

Maybe use a strict length for each returned item: e.g., 5 secs. For
longer relevant segments, return multiple items.

> NIST then assess but are assessing only the (likely to be) relevant
> shots, which might require less judgment than the pooling of top-
> submitted. This might even allow greater than the usual 25 topics.

Who at NIST assesses, when, using what system??? (See goals section 5)

> Might even be possible to do graded relevance judgments here ...
> relevant/very relevant/not relevant ?  Perhaps this could be in terms
> of how much (how many seconds) of the clip a run has identified ? 
> Evaluation is in terms of how many relevant (or relevant/very
> relevant) shots a site finds. We could normalise this per-topic/ per-
> site score for the more difficult topics vs. easier in terms of
> numbers of relshots found by ALL groups. Might seem crude but it
> rewards sites/systems which are good at finding a greater number of
> relshots in a fixed, short amount of time.  Without the usual
> pre-topic formation work done by NIST whereby evaluators view the
> content a priori and ascertain that there are actually shots in the
> collection which are relevant, we might find some topics have zero
> relevant shots, and some have very many but that's the real world.

Who will make up the topics?

> Advantages are that this would force us to deal with video in our
> systems, force us to cut out and mark boundaries of video clips as
> opposed to pre-defined, pre-bound keyframes.

- Paul Over

Hi all,

Regarding the Rushes task for 2006, last year the plan for the 2005 Rushes
task was to explore options and based on our experiences, to have a
well-defined task this year.  Thus far, it seems that the task will be
similar to the Interactive Search task of TrecVid 2005, which is good.  From
our point of view in CDVP, DCU, we hope to further develop our object
segmentation tools used in the rushes task in 2005 and evaluate using an
interactive system.



>> Assume only interactive runs allowed? - since getting from text to
>> video-without-speech will require a human in the loop, interactive 
>> system will return less junk and so take pressure off assessing, and 
>> want to encourage interactive systems generally.


I think that focusing on interactive search is a good idea.  If we are
modeling a real-world scenario where a user is seeking a number of video
clips then the top 1,000 does not make much sense and as stated, fewer
submitted results (high precision) will help the rushes task to be evaluated
in the 200 hours available.  It could be possible that the judgments become
a shared task among the participants?  I am not sure how participants feel
about this, but with fewer submitted runs for evaluation, this would require
significantly less effort for evaluation.



>> Who will make up the topics?


Can we as participants suggest candidate topics once the data has been
distributed, and a subset of these selected by NIST for test and development
collections?  Participants have suggested topics before in a previous
TrecVid.  Regarding the nature of the topics, Alan's suggestion of text only
topics is sensible.  Participants can allow their interactive users to
formulate their queries whichever way they see fit.  Last year for example,
we used Google Image search to aid query formulation, or some participants
could rely on ASR through French.



>>>> Sites then submit only the "found" shots by giving the  video file 
>>>> name and frame offsets for start/end of the "relevant"  shots - yes, 
>>>> we could have problems with frame offset numbering here.


Regarding the use of start time / end time of a video segment in the result
submissions, are we not better off still relying on predefined shot
boundaries?  It will make for easier system comparison, pooled evaluations
and lower the development effort by participants.  The results submitted
could be comprised of either a ranked list or non-ranked set of sequential
shot clusters for each topic.  Or alternatively, provide the SB definitions
and keyframes to lower the required  development effort, but accept result
submissions where the start time / end time of submitted result video
segments are defined.  In any case, I think that the provision of SB
definitions and keyframes will be useful for participants.

regards
Cathal Gurrin

Hi all,

I do support the tasks of annotation and search on rushes. But should we
assume that the users of rushes are more likely the experts (eg filmmaker,
someone finding useful segments for composing new videos)? If this is the
case, we should include more queries/topics with camera setting (eg,
camera range, camera angle, camera motion, focus/defocus object, lighting
source). The examples may be like:

- Find X appeared in close-up shot (or medium, long-distance shot)
- Find establishing shot
- Find shot with camera looking up/down something
- Find objects X and Y, with X in focus, and Y is defocused
- Find shot with one object being tracked by camera

This could probably make the rushes task different from traditional
high-level feature extraction and search tasks.

In addition, since we are dealing with unedited videos, we may need the
sub-shot boundary as well. Example: a clip may only have one shot, and one
shot may last for more than 10 min -- containing segments with fast
zoom/pan, shaking artifact..... It would be useful if we can have, for
instance, one keyframe for each sub-shot. This can ease annotation,
search, as well as system evaluation.

Regards,
CW Ngo

Paul
      Pardon me for joining this list late. I am looking at the notes so
far. It seems that if you want to replicate some broadcast domain tasks on
the rushes data and then add some new rushes specific tasks. For feature
detection, training annotation and evaluation will be an issue, and for
search, evaluation will be an issue. Do you have any feel for how likely
people are to go through another annotation task this year? Applying models
built from some other domain to Discovery to bootstrap the semi-automatic
annotation may not work given the diverse nature of the content. So before
we go for any task over any other task, it is necessary to get your opinion
on how much work you feel people are willing to put in for another round of
voluntary annotation. This is better understood by us in context of the
scope of the broadcast part of TRECVID 2006 because people will have to
split their time between the tasks on these two domains. With a rather
fixed pool of resources at all sites which I assume will not double from
2005 to 2006, we have to find out from NIST, what your priorities will be.
These answers will significantly impact the tasks that can be feasibly framed
and evaluated.

Again, my apologies if these questions have already been answered earlier.

Thanks
Sincerely
Milind Naphade

Hi Milind,

Good to hear from you.

Milind Naphade wrote:

>       Pardon me for joining this list late. I am looking at the notes so
> far. It seems that if you want to replicate some broadcast domain tasks on
> the rushes data and then add some new rushes specific tasks. For feature
> detection, training annotation and evaluation will be an issue, and for
> search, evaluation will be an issue. 

The proposal for a high-level feature task against Discovery data asks whether
we can get the training and truth data for evaluation semi-automatically
by matching some set of feature names against the words in the Discovery verbal
content descriptions. If this does not yield a set of features with sufficient
examples, then clearly the proposal fails and it's back to the drawing board.

As for the search proposal, yes, there are more questions about who will do
what. Alan addressed some issues. I raised some more questions. Cathal made
some suggestions. We have not heard from anyone else. I make no assumptions.
The point of the discussion is to see if we can come up with a task *and*
evaluation  - including all the pieces. (In my initial note I mentioned NIST
might be able to do a week of manual assessments but this would have to be
earlier than the usual assessments.)

> Do you have any feel for how likely
> people are to go through another annotation task this year? 

I am assuming no new annotation.

> Applying models
> built from some other domain to Discovery to bootstrap the semi-automatic
> annotation may not work given the diverse nature of the content.

OK, but this was not part of the proposal as I saw it.

> So before
> we go for any task over any other task, it is necessary to get your opinion
> on how much work you feel people are willing to put in for another round of
> voluntary annotation. 

I am still assuming no new annotation.

> This is better understood by us in context of the
> scope of the broadcast part of TRECVID 2006 because people will have to
> split their time between the tasks on these two domains. 

Some groups may decide not to do everything and so do not need to split their
time.

> With a rather
> fixed pool of resources at all sites which I assume will not double from
> 2005 to 2006, we have to find out from NIST, what your priorities will be.

NIST's top priority has to be the completion of the 2-year cycle on news video
but we are trying within limited resources to help the community start work on
some other kinds of video also of interest to the TRECVID sponsors.

- Paul Over

Hi all,

here are some additional comments from our group, but at first a question:

Are there any examples of manual descriptions available and in
particular, are these descriptions about the content (who and what is
recorded) or about the recording itself (how was it recorded,
close-ups, tracking objects etc)? "Verbal" - means textual?

Overall, we would prefer queries and topic searches as proposed by CW
Ngo (find close-up of person X, find object X being tracked by camera,
person X is speaking, Y listens). If the descriptions mentioned above
are in this way this kind of search could be easily combined with
Paul's first suggestion for a task. And, to make use of these manual
descriptions might help to save evaluation time. But we are not sure
whether they are suitable for training purposes (supposing a great
diversity in the recordings/genres)?

Regarding Alan's suggestion:

It is a good idea/scenario as well. But, in the first run, how do
systems come from a text query to visual content assuming that there
is eventually no ASR? Is it intended that this kind of task forces the
use of knowledge databases/ontologies? What about adding visual
queries of a certain object/person of interest? Alternatively it would
be difficult to assume a generic object detector.

Some more ideas for queries/search:

Retrieve sequences according to their recording quality (so that a
content producer is willing to use it), quality e.g. in terms of:

- camera is absolutely still (no camera shaking), respectively
- there is a smooth camera movement
- light, sharpness, audio quality

Some further ideas:
- Retrieve sequences where person X is present
- Retrieve sequences where person X is (not) speaking.
- Retrieve sequences where person X (or any person) shows a certain emotional, facial expression.
- retrieve sequences with certain audio features (speech, music, silence, etc.)

Evaluation:
Since shots in the rushes are probably very long we suggest to use a
predefined sub-shot length (of about 2-5 seconds) as the retrieval
unit. A retrieved long sequence would be (virtually) divided into
these sub-shots and thus retrieving long relevant/irrelevant shot
would enhance/degrade the precision measure.

Ralph Ewerth
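
A minimal sketch of the sub-shot evaluation suggested in the message above follows. The 5-second unit is taken from the suggested 2-5 second range; treating a sub-shot as relevant when it overlaps any relevant reference interval is an assumption made here for illustration, and the intervals are invented.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def split_into_subshots(segment: Interval, length: float = 5.0) -> List[Interval]:
    """Virtually divide a returned segment into fixed-length sub-shots."""
    start, end = segment
    subshots = []
    t = start
    while t < end:
        subshots.append((t, min(t + length, end)))
        t += length
    return subshots


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]


def subshot_precision(returned: List[Interval],
                      relevant: List[Interval],
                      length: float = 5.0) -> float:
    """Fraction of returned sub-shots that overlap a relevant interval."""
    units = [u for seg in returned for u in split_into_subshots(seg, length)]
    if not units:
        return 0.0
    hits = sum(1 for u in units if any(overlaps(u, r) for r in relevant))
    return hits / len(units)


# A 20-second returned sequence of which only the first 10 seconds are relevant.
print(subshot_precision([(0.0, 20.0)], [(0.0, 10.0)]))  # 0.5
```

Returning a long sequence that is mostly irrelevant therefore lowers precision, while a long relevant sequence raises it, as described in the suggestion.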

Hi,

Some comments from UEA on the BBC rushes scenario:

- we like Alan Smeaton's idea of using 5-second chunks rather than
shots with predefined boundaries

- relevance should be not relevant/somewhat relevant/relevant (we
tried looking at some of last year's BBC rushes cut into 5 second
chunks and believe we need the middle category for chunks where a
small part is relevant),

- if the evaluation was based on precision and ignored recall it
should make the job easier,

- if we focus on precision, evaluation could be a community task with
each run evaluated by 2+ others, preferably with an evolving list of
known results (e.g. if 3 judges independently agree, a segment can be
automatically classified without the need for further judgements).

 Dan Smith

Ralph

Quick answers to some of your questions

> > Are there any examples of manual descriptions available and in
> particular, are these descriptions about the content (who and what is
> recorded) or about the recording itself (how was it recorded,
> close-ups, tracking objects etc)? "Verbal" - means textual?


Nope. There is no indication in the 50 hours we have got from BBC last
month of what the content is. There are only about 100+ individual
MPEG-1 files of 20 minutes or more each, with no hint in the filename
either.

> Regarding Alan's suggestion:

> It is a good idea/scenario as well. But, in the first run, how do
> systems come from a text query to visual content assuming that there
> is eventually no ASR? Is it intended that this kind of task forces the
> use of knowledge databases/ontologies? What about adding visual
> queries of a certain object/person of interest?  Alternatively it
> would be difficult to assume a generic object detector.


The content doesn't appear to have any people of notable interest -
there are no George Bush or Tony Blair people, it appears to be
"ordinary" people so the searches could not be for named individuals
but might be for a person in front of a banana tree where the person
could be anybody. To kickstart a visual-only search, one approach
would be to have the video analysed and classified a priori into an
ontology of features, and another could be to source a representative
image from an outside resource -- like Google images. If you do a
Google image search for "person banana tee" you will get 5 screens of
images and the 2nd and 3rd pages have pictures of a person (smiling
face of a child) in front of a banana tree, so you could use that as a
seed for an image-only. [Paul won't like that idea because it means
the search is not repeatable and in fact the pictures of that smiling
child do not appear to be at their original URL any more so you would
have to use the Google cache to retrieve the full image, but if we
insisted that search runs also submit any outside resources like query
images that should be OK.]

> Some more ideas for queries/search:
>
> Retrieve sequences according to their recording quality (so that a  content producer is willing to use it), quality e.g. in terms of:
> - camera is absolutely still (no camera shaking), respectively
> - there is a smooth camera movement
> - light, sharpness, audio quality


It seems the original recording quality on this content was good,
using good cameras so there isn't much camera shake and the movement
appears *mostly* smooth, so it was probably recorded in high quality
and digitised to MPEG-1.

>
> Some further ideas:
> - Retrieve sequences where person X is present
> - Retrieve sequences where person X is (not) speaking.
> - Retrieve sequences where person X (or any person) shows a certain  emotional, facial expression.


These would work if person X wasn't a famous person but was somebody
who was known to appear in the footage. One issue is that the people
in the rushes seem to be ordinary people, passers-by, and so don't re-
occur across different video files, so once you have found one
instance of the person fertilizing the banana tree that person's other
appearances are localised and clustered.

> - retrieve sequences with certain audio features (speech, music,  silence, etc.)
>
> Evaluation:
> Since shots in the rushes are probably very long we suggest to use a
> predefined sub-shot length (of about 2-5 seconds) as the retrieval  unit. A retrieved long sequence would be (virtually) divided into  these sub-shots and thus retrieving long relevant/irrelevant shot  would
> enhance/degrade the precision measure.


That would remove the contentious issue of having a master shot
reference I think.

- Alan

Hi All,

I fully agree with Alan's idea to have BBC rushes cut into, say 5
second chunks and we can perhaps classify these chunks into meaningful
categories. I also agree with Dan that we should give importance to
precision and ignore recall that would make the job easier for both
NIST and the participants.

Lekha Chaisorn

Ralph Ewerth wrote:

> Are there any examples of manual descriptions available and in
> particular, are these descriptions about the content (who and what is
> recorded) or about the recording itself (how was it recorded,
> close-ups, tracking objects etc)? "Verbal" - means textual?

Ralph,

Were you thinking about the Discovery data when you asked the above question?

If so,... No, I don't have any examples yet. But I have seen some and
they contained natural language text - a short paragraph, sentences,
phrases, describing what you see when you view the clip. In some cases
there are descriptions of the camera position and movement as well.

Discovery is currently working on a script to check the MediaMill 101
feature names against the text of their content descriptions to
see how many hits they get for each feature.

- Paul Over

Hi,

Regarding evaluation measures, I would like to plug the T2I framework
that was published at RIAO 2004
(http://www.cwi.nl/~arjen/pub/t2i.pdf). It addresses the problems
caused by 'overlap' between result and reference items.

The idea here is that systems return only entry points into the videos
(the starting time of their returned segments); the ground truth
should have labeled correct segments (start+end). The model assumes
that users view video until their so-called tolerance to irrelevance
(T2I) has been reached, and then proceed to the next suggested entry
point.

Success can then be measured given a fixed amount of time that the
user would want to waste (this principle dates back to the expected
search length introduced by Cooper). For example, one could count the
number of relevant fragments found before reaching this wasted effort
(this is similar to precision at N); we also gave some alternative
measures in the same framework.

Best regards,

Arjen de Vries


Last updated:
Date created: Monday, 30-Jan-06
For further information contact