Guidelines for the TRECVID 2008 Evaluation
(last updated: Wednesday, 14-May-08 09:48:42)
The main goal of the TREC Video Retrieval Evaluation (TRECVID) is to
promote progress in content-based analysis of and retrieval from
digital video via open, metrics-based evaluation. TRECVID is a
laboratory-style evaluation that attempts to model real world
situations or significant component tasks involved in such
situations.
In 2006 TRECVID completed the second two-year cycle devoted to
automatic segmentation, indexing, and content-based retrieval of
digital video - broadcast news in English, Arabic, and Chinese. It
also completed two years of pilot studies on exploitation of unedited
video (rushes). Some 70 research groups have been provided with the
TRECVID 2005-2006 broadcast news video and many resources created by
NIST and the TRECVID community are available for continued research on
this data independent of TRECVID. See the "Past data" section of the
TRECVID website for pointers.
In 2007 TRECVID began exploring new data (cultural, news magazine,
documentary, and education programming) and an additional, new task -
video rushes summarization. In 2008 that work will continue with the
exception of the shot boundary detection task, which will be retired.
In addition TRECVID plans to organize two new task evaluations.
TRECVID 2008 will test systems on the following tasks:
- surveillance event detection
- high-level feature extraction
- search (interactive, manually-assisted, and/or fully automatic)
- rushes summarization
- content-based copy detection
For past participants, here are some changes to note:
- We expect to increase the number of topics for automatic search
runs to ~ 50. Manual and interactive runs will use a subset of 24 of
the 50. For automatic runs only, this will entail evaluating all
search runs using a 50% sample of the pooled submissions and inferred
average precision rather than average precision - as has been done for
the feature task since 2006.
- The upper limit for the duration of each interactive or manual
search will be reduced from 15 minutes to 10 minutes.
- The upper limit for video summaries will be reduced from 4% of
the duration of the full video to 2%.
- TRECVID will continue to emphasize search for events
(object+action) not easily captured in a single frame as opposed to
searching for static objects.
- While mastershots will be defined as units of evaluation,
keyframes or annotation of keyframes will not be provided by
NIST. This will require groups to look afresh at how best to train
their systems - tradeoffs between processing speed, effectiveness,
amount of the video processed. As in the past, participants may want
to team up to create training resources.
- The degree to which systems trained on broadcast news generalize
with varying amounts of training data to a related but different genre
will be a focus of TRECVID 2008.
A number of datasets are available for use in TRECVID 2008. We
describe them here and then indicate below which data will be used for
development versus test for each task.
Sound and Vision
-
The Netherlands Institute for
Sound and Vision has generously provided news
magazine, science news, news reports, documentaries, educational
programming, and archival video in MPEG-1 for use within TRECVID.
-
In 2007 we used about 50 hours for development and 50 hours for search
and feature test. These 100 hours will be available as development
data for the search and feature tasks in 2008. There will be about
another 100 hours for use as test data for the feature and search
tasks.
- ~ 100 hours for development of search and feature detection
- ~ 100 hours for test of search and feature detection
The copy detection task will use all 200 hours as test data; for
development data, see "MUSCLE-VCD-2007" below.
-
Tasks: search, feature extraction, and copy detection
-
Distribution: by download from password-protected
servers at NIST and elsewhere.
-
Training truth data for search and feature tasks:
- feature annotations of the 2007, 2005, and 2003
data are available from the "Past data" section of the TRECVID
website.
- a community effort to annotate the 2008 development data (for different features than in 2007) may be organized if there is sufficient interest.
-
Master shot reference:
Christian Petersohn at the Fraunhofer (Heinrich Hertz) Institute in
Berlin has again provided the master shot reference. Please use the
following reference in your papers:
C. Petersohn. "Fraunhofer HHI at TRECVID 2004: Shot Boundary Detection System",
TREC Video Retrieval Evaluation Online Proceedings, TRECVID, 2004
URL: www-nlpir.nist.gov/projects/tvpubs/tvpapers04/fraunhofer.pdf
Code developed by Peter Wilkins and Kirk Zhang at Dublin City
University will be used to format the reference. The method used in
2005/6 and to be repeated with the data for 2008 is described here.
-
Automatic speech recognition (Dutch): The University of Twente has offered to provide the output of an
automatic speech recognition system on the Sound and Vision data. Please use the
following reference in your papers:
Marijn Huijbregts, Roeland Ordelman and Franciska de Jong, Annotation
of Heterogeneous Multimedia Content Using Automatic Speech
Recognition. in Proceedings of SAMT, December 5-7 2007, Genova, Italy
-
Machine translation (Dutch to English): Christof Monz of Queen
Mary, University of London will again contribute machine translation
(Dutch to English) for the Sound and Vision video (ASR output or
speech).
-
Keyframes: NIST will not be supplying keyframes for the Sound and Vision
video. This will require groups to look afresh at how best to train
their systems - tradeoffs between processing speed, effectiveness,
amount of the video processed.
-
Restrictions on the use of development and test data: You must read this
BBC rushes
-
The BBC Archive has provided unedited material in
MPEG-1 from about five dramatic series for use within TRECVID.
-
In 2007 we used about 18 hours (43 videos) for development and about
17 hours (42 videos) for testing. All of these videos and the
submitted summaries will be available as development data for
2008. There will be another 18 hours (40 videos) for use as test data
for video summarization.
- ~ 35 hours (57 clips) for rushes summarization development
- ~ 18 hours (40 clips) for rushes summarization test
-
Tasks: video summarization
-
Distribution: by download from password-protected
servers at NIST and elsewhere.
-
Training truth data: the submitted summaries from 2007 and the
ground truth for 2007 are available to participants.
TRECVID 2008 surveillance video
-
The UK Home Office Scientific Development Branch has provided
surveillance video for use by TRECVID in 2008. It comprises about 100
hours - the output of 5 cameras from the same period of 20 hours (2
hours per day over 10 days).
-
Tasks: surveillance event detection.
-
Distribution: on hard drives; possibly by download
-
Training truth data: annotations for training data will be provided
to participants.
Further information about the data is available here.
MUSCLE-VCD-2007
- For development data, the copy detection task will use the
MUSCLE-VCD-2007 data. This is the data that was used for the copy
detection evaluation at CIVR 2007
In order to be eligible to receive the data, you must have
applied for participation in TRECVID. Your application will be
acknowledged by NIST with information about how to obtain the
data. Then you will need to complete the relevant permission forms
(from the active participant's area) and fax them (Attention: Lori
Buckland) to
in the US. Include a cover sheet with your fax that
identifies you, your organization, your email address, and each kind
of data you are requesting. Alternatively you may email a
well-identified pdf of each signed form to
Please ask only for the test data (and optional development data)
required for the task(s) you apply to participate in and intend to
complete. One permission form will cover 2007 and 2008 BBC data. One
permission form will cover 2007 and 2008 Sound and Vision data.
- Surveillance event detection -> TRECVID 2008 airport surveillance video
- development data
- test data
- Search, feature extraction, copy detection -> Sound and Vision
- test data
- optional for search and feature extraction: development data (= 2007 Sound and Vision development + test data)
- optional for copy detection: development data (available separately from the MUSCLE-VCD-2007 webpage; do not request from NIST)
- Rushes summarization -> BBC rushes
- test data
- optional: 2008 development data (= 2007 BBC development + test data)
The guidelines for this task have been developed with input from the
research community. Given 100 hours of surveillance video (50 hours
training, 50 hours test) the task is to detect 3 or more events from
the required event set and identify their occurrences
temporally. Systems can make multiple passes before outputting a list
of putative event observations (i.e., this is a retrospective
detection task). Besides the retrospective task, participants may
alternatively choose to do a "free style" analysis of the
data. Further information about the tasks may be found at the
following web sites:
Various high-level semantic features, concepts such as
"Indoor/Outdoor", "People", "Speech" etc., occur frequently in video
databases. The proposed task will contribute to work on a benchmark
for evaluating the effectiveness of detection methods for semantic
concepts.
The task is as follows: given the feature test collection, the common
shot boundary reference for the feature extraction test collection,
and the list of feature definitions (see below), participants will return for each feature
a list of at most 2000 shots from the test collection, ranked
in decreasing order of the likelihood that the feature is
present. Each feature is assumed to be binary, i.e., it is either
present or absent in the given reference shot.
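As an illustration of the result format described above, here is a minimal Python sketch that turns per-shot detector scores into a ranked list capped at 2000 shots. The score dictionary, the shot IDs, and the tie-breaking rule are assumptions made for the example; the guidelines do not prescribe how scores are produced.

```python
from typing import Dict, List

MAX_SHOTS_PER_FEATURE = 2000  # upper limit stated in the guidelines


def rank_shots(scores: Dict[str, float],
               limit: int = MAX_SHOTS_PER_FEATURE) -> List[str]:
    """Return shot IDs ordered from most to least likely, truncated to `limit`.

    `scores` maps a (hypothetical) shot ID to a detector confidence.
    """
    # Highest score first; ties are broken by shot ID so the output is stable.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [shot_id for shot_id, _ in ordered[:limit]]


if __name__ == "__main__":
    example = {"shot1_3": 0.91, "shot1_1": 0.15, "shot2_7": 0.64}
    print(rank_shots(example))
```

Returning fewer than 2000 shots is allowed by the wording "at most"; the sketch simply never returns more.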
All feature detection submissions will be made available to all
participants for use in the search task - unless the submitter
explicitly asks NIST before submission not to do this.
Description of high-level features to be detected:
The descriptions are those used in the common annotation effort. They
are meant for humans, e.g., assessors/annotators creating truth data
and system developers attempting to automate feature detection. They
are not meant to indicate how automatic detection should be
achieved.
If the feature is true for some frame (sequence) within the shot, then
it is true for the shot; and vice versa. This is a simplification
adopted for the benefits it affords in pooling of results and
approximating the basis for calculating recall.
NOTE: In the following, "contains x" is short for "contains x to a
degree sufficient for x to be recognizable as x to a human". This
means among other things that unless explicitly stated, partial
visibility or audibility may suffice.
NOTE: NIST will instruct the assessors during the manual evaluation
of the feature task submissions as follows. The fact that a segment
contains video of physical objects representing the topic target, such
as photos, paintings, models, or toy versions of the topic target,
should NOT be grounds for judging the feature to be true for the
segment. Containing video of the target within video may be grounds
for doing so.
Selection of high-level features to be detected:
In 2008, participants in the high-level feature
task must submit results for all 20 of the following features. NIST will
then choose 10-20 of the features and evaluate submissions for
those.
The features were drawn from the large LSCOM feature set so as to be
appropriate to the Sound and Vision data used in the feature and
search tasks. Some feature definitions were enhanced for greater
clarity, so it is important that the TRECVID feature descriptions be
used and not the LSCOM descriptions.
Here is the final list of
features for evaluation together with their brief descriptions and
some general rules of interpretation
Search is a high-level task which includes at least query-based
retrieval and browsing. The search task models that of an intelligence
analyst or analogous worker, who is looking for segments of video
containing persons, objects, events, locations, etc. of
interest. These persons, objects, etc. may be peripheral or accidental
to the original subject of the video. The task is as follows: given
the search test collection, a multimedia
statement of information need (topic), and the common shot
boundary reference for the search test collection, return a ranked
list of at most 1000 common reference shots from the test collection,
which best satisfy the need. Please note the following restrictions
for this task:
-
TRECVID 2008 will accept fully automatic search submissions (no human
input in the loop) as well as manually-assisted and interactive submissions as
illustrated graphically below
-
Because the choice of features and their combination for search is an
open research question, no attempt will be made to
restrict groups with respect to their use of features in
search. However, groups making manually-assisted runs should report
their queries, query features, and feature definitions.
- Every submitted run must contain a result set for each topic.
-
One baseline run will be required of every manually-assisted
system as well as one for every automatic system:
- A run based only on the text from the (English and/or Dutch)
ASR/MT output provided by NIST and on the text of the topics.
-
In order to maximize comparability within and across participating
groups, all manually-assisted runs within any given site must be
carried out by the same person.
-
An interactive run will contain one result for each and every topic, each
such result using the same system variant. Each result for a topic can come
from only one searcher, but the same searcher does not need to be used
for all topics in a run. Here are some suggestions for interactive experiments.
-
The searcher should have no experience of the topics beyond the general
world knowledge of an educated adult.
-
The search system cannot be trained, pre-configured, or otherwise tuned to the topics.
- The maximum total elapsed time limit for each topic (from the time
the searcher sees the topic until the time the final result set for
that topic is returned) in an interactive search run will be 10
minutes. For manually-assisted runs the manual effort (topic to query
translation) for any given topic will be limited to 10 minutes.
-
All groups submitting search runs must include the actual elapsed
time spent as defined in the videoSearchRunResult.dtd.
- Groups carrying out interactive runs should measure user
characteristics and satisfaction as well and report this with their
results, but they need not submit this information to NIST.
Here is some information about the questionnaires the Dublin City
University team used in 2004 to collect search feedback and
demographics from all groups doing interactive searching. Something
similar will be done again this year, with
details to be determined once participation is known.
In general, groups are reminded to use good experimental design
principles. These include among other things, randomizing the order
in which topics are searched for each run so as to balance learning
effects.
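One simple way to randomize topic order per run is sketched below in Python. The function name, the run labels, and the use of a fixed seed (so that the design can be reproduced and reported) are assumptions for illustration, not requirements of the guidelines.

```python
import random
from typing import Dict, List


def topic_orders(topics: List[str], runs: List[str],
                 seed: int = 0) -> Dict[str, List[str]]:
    """Give each run its own independently shuffled topic order."""
    rng = random.Random(seed)  # seeded so the assignment can be reproduced
    orders: Dict[str, List[str]] = {}
    for run in runs:
        order = list(topics)   # copy so the input list is left untouched
        rng.shuffle(order)
        orders[run] = order
    return orders


if __name__ == "__main__":
    for run, order in topic_orders(["T1", "T2", "T3", "T4"],
                                   ["run_A", "run_B"]).items():
        print(run, order)
```

A plain shuffle only balances learning effects on average; a Latin-square style design would balance them exactly, at the cost of more bookkeeping.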
Supplemental interactive search runs, i.e., runs which do not
contribute to the pools but are evaluated by NIST, will be
allowed to enable groups to fill out an experimental design. Such runs
must not be mixed in the same submission file with non-supplemental
runs. This is the only sort of supplemental run that will be
accepted.
Rushes are the raw material (extra video, B-roll footage) used to
produce a video. 20 to 40 times as much material may be shot as
actually becomes part of the finished product. Rushes usually have
only natural sound. Actors are only sometimes present. So very little
if any information is encoded in speech. Rushes contain many frames or
sequences of frames that are highly repetitive, e.g., many takes of
the same scene redone due to errors (e.g. an actor gets his lines
wrong, a plane flies over, etc.), long segments in which the camera is
fixed on a given scene or barely moving, etc. A significant part of the
material might qualify as stock footage - reusable shots of people,
objects, events, locations, etc. Rushes may share some characteristics
with "ground reconnaissance" video.
The system task in rushes summarization will be, given a video from
the rushes test collection, to automatically create an MPEG-1 summary
clip less than or equal to a maximum duration (to be determined) that
shows the main objects (animate and inanimate) and events in the
rushes video to be summarized. The summary should minimize the number
of frames used and present the information in ways that maximize the
usability of the summary and speed of objects/event recognition.
Such a summary could be returned with each video found by a video
search engine, much as text search engines return short lists of keywords
(in context) for each document found - to help the searcher decide
whether to explore a given item further without viewing the whole
item. It might be input to a larger system for filtering, exploring
and managing rushes data.
Although in this task we limit the notion of visual summary to a
single clip that will be evaluated using simple play and pause
controls, there is still room for creativity in generating the
summary. Summaries need not be series of frames taken directly from
the video to be summarized and presented in the same order. Summaries
can contain picture-in-picture, split screens, and results of other
techniques for organizing the summary. Such approaches will raise
interesting questions of usability.
The summarization of BBC rushes will be run as a workshop at
the ACM Multimedia Conference in Vancouver, Canada during the last
week of October 2008.
As used here, a copy is a segment of video derived from another video,
usually by means of various transformations such as addition,
deletion, modification (of aspect, color, contrast, encoding, ...),
camcording, etc. Detecting copies is important for copyright control,
business intelligence and advertisement tracking, law enforcement
investigations, etc. Content-based copy detection offers an
alternative to watermarking. The TRECVID copy detection task will be
carried out in collaboration with members of the IMEDIA team at INRIA and will build on work
demonstrated at CIVR 2007.
Required task
The required system task will be as follows: given a test collection
of videos and a set of about 2000 queries (video-only segments),
determine for each query the place, if any, that some part of the
query occurs, with possible transformations, in the test collection.
The set of possible transformations will be based to the extent
possible on actually occurring transformations and will be published
by the time the guidelines are final.
Each query will be constructed using tools developed by IMEDIA to include some
randomization at various decision points in the construction of the
query set. For each query, the tools will take a segment from the test
collection, optionally transform it, embed it in some video segment
which does not occur in the test collection, and then finally apply
one or more transformations to the entire query segment. Some queries
may contain no test segment; others may be composed entirely of the
test segment. Transformations to be used will be published as part of
the guidelines after discussion among participants. Here is the
current plan for query
creation.
Optional tasks
Videos often contain audio. Sometimes the original audio is retained
in the copied material, sometimes it is replaced by a new
soundtrack. Nevertheless, audio is an important and strong feature for
some application scenarios of video copy detection. Since detection of
untransformed audio copies is relatively easy, and the primary
interest of the TV community is in video analysis, it was decided to
model the required CD task with video-only queries. However, since
audio is of importance for practical applications, there will be two
additional optional tasks: a task using transformed audio-only queries
and one using transformed audio+video queries.
The audio-only queries will be generated along the same lines as the
video-only queries: a set of 201 base audio-only queries is
transformed by several techniques that are intended to be typical of
those that would occur in real reuse scenarios: (1) bandwidth
limitation (2) other coding-related distortion (e.g. subband
quantization noise) (3) variable mixing with unrelated audio
content. The transformed queries will be downloadable from NIST.
The audio+video queries will consist of the aligned versions of
transformed audio and video queries, i.e., they will be various
combinations of transformed audio and transformed video from a given
base audio+video query. In this way sites can study
the effectiveness of their systems for individual audio and video
transformations and their combinations. These queries will not be
downloadable. Rather, NIST will provide a list of how to construct
each audio+video test query so that given the audio-only queries and
the video-only queries, sites can use a tool such as ffmpeg to
construct the audio+video queries.
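As a rough illustration of combining one audio-only query with one video-only query, the Python sketch below only builds an ffmpeg argument list; it does not run ffmpeg. The file names are hypothetical, and the choices to copy the video stream unchanged, let ffmpeg pick the audio encoder, and stop at the shorter input are assumptions. The actual construction list from NIST should take precedence.

```python
from typing import List


def build_mux_command(video_query: str, audio_query: str,
                      output: str) -> List[str]:
    """Build an ffmpeg command that pairs one video stream with one audio stream."""
    return [
        "ffmpeg",
        "-i", video_query,   # input 0: transformed video-only query
        "-i", audio_query,   # input 1: transformed audio-only query
        "-map", "0:v:0",     # take the first video stream of input 0
        "-map", "1:a:0",     # take the first audio stream of input 1
        "-c:v", "copy",      # assumption: keep the transformed video bit-exact
        "-shortest",         # assumption: end output with the shorter input
        output,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_mux_command("video_q0001.mpg",
                                     "audio_q0001.wav",
                                     "av_q0001.mpg")))
```

The list could be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.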
Please watch the schedule for upcoming information about the sequence of query
releases and results due dates.
Please note: Only submissions which are valid when checked against
the supplied DTDs will be accepted. You must check your
submission before submitting it. NIST reserves the right to reject any
submission which does not parse correctly against the provided
DTD(s). Various checkers exist, e.g., Xerces-J: java sax.SAXCount -v YourSubmision.xml.
The results of the evaluation will be made available to attendees at
the TRECVID workshop and will be published in the final proceedings
and/or on the TRECVID website within six months after the
workshop. All submissions will likewise be available to interested
researchers via the TRECVID website within six months of the workshop.
Submissions
The guidelines for submission are currently being developed.
Further
information on submissions may be found here.
Evaluation
Output from systems will first be aligned to ground truth annotations,
then scored for misses / false alarms. Since error is a tradeoff
between probability of miss vs. rate of false alarms, this task will
use the Normalized Detection Cost Rate (NDCR) measure for evaluating
system performance. NDCR is a weighted linear combination of the
system's Missed Detection Probability and False Alarm Rate (measured
per unit time).
Further information about the evaluation measures may be found here.
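To make the shape of the measure concrete, here is a small Python sketch of a cost rate formed as miss probability plus a weighted false alarm rate. The parameter name `beta` and its value in the example are assumptions; the official weights and the alignment procedure are defined in the evaluation plan, not here.

```python
def ndcr(n_missed: int, n_true: int, n_false_alarms: int,
         hours: float, beta: float) -> float:
    """Miss probability plus beta times the false alarm rate per hour.

    `beta` is a placeholder weight (an assumption in this sketch).
    """
    if n_true <= 0:
        raise ValueError("n_true must be positive")
    if hours <= 0:
        raise ValueError("hours must be positive")
    p_miss = n_missed / n_true     # fraction of true events not detected
    r_fa = n_false_alarms / hours  # false alarms per unit time (hour)
    return p_miss + beta * r_fa


if __name__ == "__main__":
    # 2 of 10 events missed, 5 false alarms in 10 hours, illustrative beta.
    print(ndcr(n_missed=2, n_true=10, n_false_alarms=5, hours=10.0, beta=0.1))
```

With this form, a system that outputs nothing scores 1.0, so values below 1.0 indicate some benefit over doing nothing.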
Submissions
-
Each group may submit up to 6 total runs. All runs must be
prioritized and all will be evaluated.
-
TRECVID 2008 will require a feature run treating the new
video as if no automatic speech recognition (ASR) or machine
translation (MT) for the languages of the videos (mostly Dutch)
existed - as might occur in the case of video in other less well known
languages.
- For each feature in a run, participants will return at most
2000 shots.
-
Here is a DTD for
feature extraction results of one run, one for results from multiple
runs, and a
small example of what a site would send to NIST for
evaluation. Please check your submission to see that it is well-formed.
-
Submissions will be transmitted to NIST via a webpage. Details will
be available well before the submission deadline.
-
Each run must contain results for all features listed
above.
Evaluation
-
The unit of testing and performance assessment will be the video shot
as defined by the track's common shot boundary reference. The
submitted ranked shot lists for the detection of each feature will be
judged manually as follows. We will take all shots down to some fixed
depth (in ranked order) from the submissions for a given feature -
using some fixed number of runs from each group in priority sequence
up to the median of the number of runs submitted by any group. We will
then merge the resulting lists and create a list of unique
shots. These will be judged manually down to some depth to be
determined by NIST based on available assessor time and number of
correct shots found. NIST will maximize the number of shots judged
within practical limits. We will then evaluate each submission to its
full depth based on the results of assessing the merged subsets. This
process will be repeated for each feature.
-
If the feature is perceivable by the
assessor for some frame (sequence), however short or long, then
we'll assess it as true; otherwise false. We'll rely on the complex
thresholds built into the human perceptual systems. Search and feature
extraction applications are likely - ultimately - to face the complex
judgment of a human with whatever variability is inherent in
that.
- Runs will be compared based on a sample of the submission pools.
Precision-recall curves based on the sample will be used as well as inferred
average precision, which provides a good estimate of average
precision - a single-valued combination of precision, recall, and
ranking ability.
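The pooling step described in the first item above can be sketched in Python as follows. The data layout (each group's runs listed in priority order, each run a ranked list of shot IDs for one feature) and the parameter names are assumptions for illustration; the actual depth and number of runs are chosen by NIST.

```python
from typing import Dict, List, Set


def build_pool(runs_by_group: Dict[str, List[List[str]]],
               depth: int, runs_per_group: int) -> Set[str]:
    """Merge the top `depth` shots of each selected run into a set of unique shots."""
    pool: Set[str] = set()
    for runs in runs_by_group.values():
        # Runs are assumed to be listed in priority order already.
        for ranked_shots in runs[:runs_per_group]:
            pool.update(ranked_shots[:depth])
    return pool


if __name__ == "__main__":
    submissions = {
        "groupA": [["s1", "s2", "s3"], ["s9"]],
        "groupB": [["s2", "s4"]],
    }
    print(sorted(build_pool(submissions, depth=2, runs_per_group=1)))
```

Only the pooled shots (or a sample of them) are judged; every submission is then scored to its full depth against those judgments.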
Submissions
-
Each group may submit up to 6 total runs. All runs must be prioritized
and all will be evaluated.
Each interactive run will contain one result for each and every topic
using the system variant for that run. Each result for a topic can
come from only one searcher, but the same searcher does not need to be
used for all topics in a run. If a site has more than one searcher's
result for a given topic and system variant, it will be up to the site
to determine which searcher's result is included in the submitted
result. NIST will try to make provision for the evaluation of
supplemental results, i.e., ones NOT chosen for the submission
described above. Details on this will be available by the time the
topics are released.
-
For manual and automatic systems, TRECVID 2008 will require:
- A run based only on the text from the (English and/or Dutch)
ASR/MT output provided by NIST and on the text of the topics.
- A run using no text from ASR/MT output - as though we were dealing
with video in a language for which ASR/MT was not available.
-
For each topic in a run, participants will return the list of at most
1000 shots.
Here is a DTD for
search results of one run, one for results from multiple
runs, and a
small example of what a site would send to NIST for
evaluation. Please check your submission to see that it is well-formed.
-
Submissions will be transmitted to NIST via a webpage. Details will
be available well before the submission deadline.
Evaluation
- The unit of testing and performance
assessment will be the video shot as defined by the track's common
shot boundary reference. The submitted ranked lists of shots found
relevant to a given topic will be judged manually as follows. We will
take all shots down to some fixed depth (in ranked order) from the
submissions for a given topic - using some fixed number of runs from
each group in priority sequence up to the median of the number of runs
submitted by any group. We will then merge the resulting lists and
create a list of unique shots. A random sample of these pools will be
taken and this sample will be judged manually to some depth to be
determined by NIST based on available assessor time and number of
correct shots found. NIST will maximize the number of shots judged
within practical limits. We will then evaluate each submission to its
full depth based on the results of assessing the merged subsets. This
process will be repeated for each topic.
- Per-search measures:
- inferred average precision (definition below)
- elapsed time (for all runs)
- Per-run measure:
-
mean inferred average precision:
Runs will be compared based on a sample of the submission pools.
Precision-recall curves based on the sample will be used as well as inferred
average precision, which provides a good estimate of average
precision - a single-valued combination of precision, recall, and
ranking ability.
Non-interpolated average precision corresponds to the area
under an ideal (non-interpolated) recall/precision curve. To compute
this average, a precision average for each topic is first calculated.
This is done by computing the precision after every retrieved relevant
shot and then averaging these precisions over the total number of
retrieved relevant/correct shots in the collection for that
topic/feature or the maximum allowed result set (whichever is
smaller). Average precision favors highly ranked relevant
documents. It allows comparison of different size result
sets. Submitting the maximum number of items per result set can never
lower the average precision for that submission. The topic averages
are combined (averaged) across all topics in the appropriate set
to create the non-interpolated mean average precision (MAP) for
that set. (See the TREC-10
Proceedings appendix on common evaluation measures for more
information.)
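The computation just described can be sketched in Python as below. This is the plain (non-inferred) measure with complete judgments; the function and variable names are assumptions, and ranked lists are assumed to contain no duplicate shots.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str],
                      max_results: int) -> float:
    """Non-interpolated average precision for one topic or feature."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, shot in enumerate(ranked[:max_results], start=1):
        if shot in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant shot
    # Normalize by the number of relevant shots in the collection or the
    # maximum allowed result set size, whichever is smaller.
    return precision_sum / min(len(relevant), max_results)


def mean_average_precision(results: Dict[str, List[str]],
                           judgments: Dict[str, Set[str]],
                           max_results: int) -> float:
    """Average the per-topic values over all judged topics."""
    if not judgments:
        return 0.0
    total = sum(average_precision(results.get(topic, []), rel, max_results)
                for topic, rel in judgments.items())
    return total / len(judgments)


if __name__ == "__main__":
    print(average_precision(["a", "x", "b", "y"], {"a", "b", "c"}, 1000))
```

Because unretrieved relevant shots contribute zero to the sum while still counting in the denominator, appending more items to a result set can only leave the value unchanged or raise it.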
Submissions
-
Each participating group will be allowed to submit up to two runs, one
with priority 1 and one with priority 2. Depending on the number of
submissions, we may not be able to judge all submitted runs. If so we
will do them in priority order. If you do not need to submit two
distinct runs, please do not. Each run will contain one MPEG-1 summary
clip (same frame size and frame rate as the original videos) for each
of the test rushes videos along with the system time (in seconds)
needed to create the summary starting only with the video to be
summarized.
For practical reasons in planning the assessment we need an upper
limit on the size of the summaries. Also, some very long summaries
make no sense for a given use scenario. But you can imagine many
scenarios to motivate various answers. One might involve passing the
summary to downstream applications that support clustering,
filtering, sophisticated browsing for rushes exploration, management,
reuse. Minimal emphasis on compression.
Assuming we want the summary to be directly usable by a human, then at
least the summary should be usable by a professional, looking for
reusable material, and willing to watch a summary longer than someone
with more recreational goals.
Therefore we'll allow longer summaries than a recreational user
would tolerate but score results so that systems that can meet a
higher goal (much shorter summary) get rewarded - e.g., present
mean-fraction-of-ground-truth-items-included versus
duration-of-the-summary or calculate ratio.
Each submitted summary will have a duration which is at most 2% of the
video to be summarized. Remember 2% is not a goal - it is just an
UPPER limit on size.
-
The primary method for submitting summaries to NIST will be as
follows. Each group will create one file containing a list of URLs -
one URL per line for each summary they are submitting. If the group is
submitting only one run then the URL file will contain 40 URL lines; if
two runs, then it will contain 80 URL lines.
The first two lines of the URL file will contain the userid (on line
1) and the password (on line 2) to be used to access the summaries. We
expect the summaries to be in a protected (non-spidered) directory.
The scheme in each URL can be "http" or "ftp". For example:
http://HOST/PATH/TO/SUMMARIES/1.MS237650.sum.mpg
Please name your test summaries *exactly* the same as the file
containing the video being summarized *except* with the priority ( "1"
or "2") and ".sum" inserted before the ".mpg". For example, the
priority 1 summary of test file MS237650.mpg should be called
1.MS237650.sum.mpg by every group. NIST will add a unique group prefix
here.
NIST will provide a webpage each group can use to identify itself,
provide a contact email address, and type in the name of their URL
file for upload to NIST. Once the URL file has been uploaded, it will
be checked for simple errors and a message sent to the browser. After
that NIST will proceed to use the URLs to upload each summary. An
email will be sent to the submitting person as each summary is
uploaded. This will allow the submitter to see the progress and
provide detailed information about which, if any, uploads failed.
Although a little more complicated than last year, we hope the method
will be less labor-intensive than last year's method which required
each summary's name to be entered individually and uploaded before
going on to the next.
If you cannot make use of the primary submission method described
above you must notify NIST immediately so we can arrange for you to
use last year's method for submission. In that case, you will need
to leave more time for submission.
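As an illustration only, the naming rule and the URL file layout
described above could be produced with a short Python sketch like the
one below. The function names, the placeholder userid and password,
and the base URL are assumptions made for the example; they are not
part of the NIST submission procedure.

```python
from pathlib import PurePosixPath


def summary_name(test_video: str, priority: int) -> str:
    """Build the summary file name for a test video, e.g.
    ('MS237650.mpg', 1) -> '1.MS237650.sum.mpg'."""
    if priority not in (1, 2):
        raise ValueError("priority must be 1 or 2")
    path = PurePosixPath(test_video)
    if path.suffix != ".mpg":
        raise ValueError("expected a .mpg file name")
    return f"{priority}.{path.stem}.sum{path.suffix}"


def url_file_lines(userid, password, base_url, test_videos, priorities=(1,)):
    """Return the lines of the URL file: userid, password, then one URL
    per summary. base_url is a hypothetical protected directory."""
    lines = [userid, password]
    for priority in priorities:
        for video in test_videos:
            name = summary_name(video, priority)
            lines.append(f"{base_url.rstrip('/')}/{name}")
    return lines


if __name__ == "__main__":
    for line in url_file_lines("USERID", "PASSWORD",
                               "http://HOST/PATH/TO/SUMMARIES",
                               ["MS237650.mpg"], priorities=(1, 2)):
        print(line)
```

With one test video and two priorities the sketch emits the two
credential lines followed by two URL lines.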
-
In the body of an email with your short team name in the
subject, please send to
the timing information for your summaries. At the top of the file
place the following information:
Operating system
CPU type
Memory
Then include a line for each summary with the elapsed time
in seconds to create that summary. For example:
Short_team_ID Time(s) Priority Video_being_summarized
Brno 469.36 1 MRS035126.mpg
Brno 443.74 1 MRS042538.mpg
Brno 573.94 1 MRS043405.mpg
Brno 665.83 1 MRS044497.mpg
Brno 470.94 1 MRS044499.mpg
Brno 869.20 1 MRS044725.mpg
Brno 369.14 1 MRS044728.mpg
...
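A minimal sketch of how the timing lines in the example above could be
generated is shown below. The function names are hypothetical, and the
"label: value" layout of the three header lines is an assumption,
since the guidelines only name the items to report.

```python
def timing_line(team, seconds, priority, video):
    """One line per summary: team ID, elapsed seconds, priority, video."""
    return f"{team} {seconds:.2f} {priority} {video}"


def timing_report(os_name, cpu, memory, rows):
    """rows: iterable of (team, seconds, priority, video) tuples.
    The header layout is an assumption made for this sketch."""
    lines = [
        f"Operating system: {os_name}",
        f"CPU type: {cpu}",
        f"Memory: {memory}",
        "Short_team_ID Time(s) Priority Video_being_summarized",
    ]
    lines.extend(timing_line(*row) for row in rows)
    return "\n".join(lines)


if __name__ == "__main__":
    print(timing_report("Linux", "x86_64", "4 GB",
                        [("Brno", 469.36, 1, "MRS035126.mpg")]))
```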
Evaluation
-
At Dublin City University, with support from the European Commission
under contract FP6-027026 (K-Space), all the summary clips for a given
video will be viewed using mplayer on Linux in a window 125mm x 102mm
@ 25 fps in a randomized order by a single human judge. In a timed
process, the judge will play / pause the video as needed to determine
as quickly as possible which of the objects and events listed in the
ground truth for the video to be summarized are present in the
summary.
The judge will also be asked to assess the usability/quality of the
summary. Included will be at least something like the following with 5
possible answers for each - where only the extremes are labeled:
"Strongly agree" and "strongly disagree".
- The summary contains nearly identical segments.
- The summary contains color bars, clapboards, and/or totally black or totally white frames.
- The summary is presented in a pleasant tempo and rhythm.
This process will be repeated for each test video. If possible we will
have more than one human evaluate at least some of the videos.
- Per-summary measures:
- fraction of the ground truth objects/events found in the summary (to estimate recall)
- time (in seconds) needed to check summary against ground truth
- duration of the summary (to estimate precision and act as a normalization factor for other measures)
- system time (in seconds) to generate the summary
- usability/quality scores
- Per-system measures:
- Means of the above across all test videos (in relation to median/max for all systems)
Carnegie Mellon University will again provide a simple baseline system to produce summaries within the 2% maximum. The baseline algorithm simply presents the entire video at 50x normal speed.
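To make the arithmetic concrete: playing a video at 50x normal speed
yields a summary whose duration is 1/50 = 2% of the original, i.e.
exactly at the upper limit. The sketch below checks a summary duration
against that limit. The function names are assumptions made for this
example, and rounding to whole frames is ignored.

```python
def max_summary_seconds(video_seconds, limit_fraction=0.02):
    """Upper limit on summary duration (2% of the source video)."""
    return video_seconds * limit_fraction


def speedup_summary_seconds(video_seconds, speedup=50.0):
    """Duration of a summary made by playing the whole video faster."""
    return video_seconds / speedup


def fits_limit(summary_seconds, video_seconds, limit_fraction=0.02):
    """True if the summary does not exceed the allowed duration."""
    return summary_seconds <= max_summary_seconds(video_seconds, limit_fraction)


if __name__ == "__main__":
    video = 1800.0  # a 30-minute video, chosen only for illustration
    print(speedup_summary_seconds(video), "seconds at 50x")
    print(fits_limit(30.0, video))
```

For a 30-minute video the limit is 36 seconds, which is also the
duration of the 50x baseline summary.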
Submissions
A run is the output of a system (with a given set of parameters,
training, etc) executed against all of the queries appropriate for the
run type (video-only, audio-only, video+audio). A run will contain the
following information in the following order, all in ASCII, one item
per line unless otherwise indicated.
-
runId - an ASCII string of not more
than 10 characters and containing no whitespace, chosen by the
submitting group, identifying the run uniquely for the submitting
group. Note, that this Id DOES NOT identify the participating
group. NIST will add a separate group identifier. For example, a group
could simply use a digit ("1", "2", ...) or something more
explanatory ("sampled","full",...).
-
target detection metric - either
"F0.5" or "F2.0".
Participating groups must submit runs in pairs such that the 2 runs
in each pair differ only in the target measure - one run emphasizing
recall over precision (using the F2.0 measure for detection) and one
emphasizing precision over recall (using the F0.5 measure for
detection).
Each participating group may submit at most 3 pairs of runs for each
of the 3 run types.
- name of the operating system(s) used
- model of cpu used
- amount of memory available
-
one optimal confidence score threshold for all queries - a
confidenceScore, chosen for the run, such that results evaluated using
this threshold should be optimal in terms of the planned detection
metric (emphasizing recall (F2.0) or emphasizing precision (F0.5))
associated with the run.
-
table of processing times - one line for each query, where the
elapsed time to process the query and complete the search is an
integer representing seconds. Time spent analyzing the test collection
video before query processing begins will not be included.
queryId elapsedQueryProcessingTimeInSeconds
-
table of result items - the table will
comprise at most 329 result items for each final query (at most one
match per 328 reference videos + one "NONE found" match).
Each result item will include the following, in ASCII, arranged from
left to right, separated by one or more spaces, on one line of the run
file:
- queryId - a string, assigned by NIST, denoting the number of the
query and its type (video-only, audio-only, or audio+video)
- videoId - the file name of the reference video exactly as found in
the test data, e.g., BG_12332.mpg, not "bg_12332.MPG" or "BG_12332"
etc. The string "NONE" will be used once in each result to indicate
the system does not believe the query contains a copy from any
reference file. This allows the evaluation to distinguish between
empty results and results that found no copy.
- firstRefFrameTimeCode - the time code in the reference video of
the first frame of the found copy; if not found then set to 0
- lastRefFrameTimeCode - the time code in the reference video of
the last frame of the found copy; if not found then set to 0
- confidenceScore - a real number, indicating the relative confidence
that the copy was (not) found for a given query. The higher the number,
the higher the confidence. Scores should be normalized so the meaning
of a given value is consistent across queries for the run. Such scores
may NOT be consistent in meaning across systems.
- firstQueryFrameTimeCode is the time code in the query of the
first frame of the found copy; if not found then set to 0
Note 1: Time codes will be expressed using just digits (0-9) and one
decimal point ("."), no other characters, and represent the
number of elapsed seconds since the start of the reference or
query video.
Note 2: Within a given result set for a given query, no videoId may
appear more than once. In other words, a run can return at
most one matching segment per reference video for each query.
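The result item rules above lend themselves to a simple pre-submission
check. The following is only a sketch under stated assumptions: the
function name and the query ID "q1" are hypothetical, and the time-code
pattern assumes that a bare "0" (no decimal point) is acceptable. It is
not NIST's validator.

```python
import re
from collections import Counter

MAX_ITEMS_PER_QUERY = 329  # 328 reference videos + one "NONE found" match
# Assumption: digits with at most one decimal point, e.g. "0" or "12.48".
TIME_CODE = re.compile(r"\d+(?:\.\d+)?")


def check_result_items(lines):
    """Return a list of problems found; an empty list means none."""
    problems = []
    per_query = Counter()
    seen = set()
    for n, line in enumerate(lines, 1):
        fields = line.split()  # fields are separated by one or more spaces
        if len(fields) != 6:
            problems.append(f"line {n}: expected 6 fields, got {len(fields)}")
            continue
        query_id, video_id, first_ref, last_ref, confidence, first_query = fields
        for name, value in (("firstRefFrameTimeCode", first_ref),
                            ("lastRefFrameTimeCode", last_ref),
                            ("firstQueryFrameTimeCode", first_query)):
            if not TIME_CODE.fullmatch(value):
                problems.append(f"line {n}: bad {name} {value!r}")
        try:
            float(confidence)
        except ValueError:
            problems.append(f"line {n}: confidenceScore {confidence!r} is not a number")
        # Note 2: a videoId may appear at most once per query.
        if (query_id, video_id) in seen:
            problems.append(f"line {n}: videoId {video_id} repeated for {query_id}")
        seen.add((query_id, video_id))
        per_query[query_id] += 1
    for query_id, count in per_query.items():
        if count > MAX_ITEMS_PER_QUERY:
            problems.append(f"query {query_id}: {count} items (max {MAX_ITEMS_PER_QUERY})")
    return problems


if __name__ == "__main__":
    print(check_result_items(["q1 BG_12332.mpg 10.5 20.0 0.93 3.2",
                              "q1 NONE 0 0 0.10 0"]))
```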
Submissions will be transmitted to NIST via a webpage. Details will
be available well before the submission deadline.
Evaluation
- Systems will be evaluated on:
- How many queries they find the reference data for or correctly tell us
there is none to find. The reference data has been found if and only if:
- the asserted test video ID is correct and
- the extent of the asserted copy overlaps at least 50% of the extent of the
actual copy in the reference video
- When a copy is detected, how accurately they locate the reference data
in the test data
- How much elapsed time is required for query processing
- The following measures will be used:
- On whether the system detects the copy
For each run, results for each transformation will be handled
separately. For each transformation, all results will be concatenated
and sorted by confidence. Then precision and recall curves will be
created and the maximum F0.5/2.0 score found across standard recall
points. F2.0 weights recall twice as heavily as precision, while
F0.5 weights precision twice as heavily as recall. Depending on the
application, either measure may be considered more appropriate. The
maximum F0.5/2.0 score will represent the optimal detection
performance of a run for a given transformation. F0.5/2.0 will also be
calculated at the threshold submitted with the run. Runs will be
submitted in pairs - one run trying to maximize F0.5, the other trying
to maximize F2.0.
- On how accurately the system finds the copy in the reference data
The asserted and actual extents of the copy in the reference data
will be compared using precision and recall and these two numbers
will be combined using the F1 measure.
- On how long it takes the system to process a query
Mean time (in seconds) to process a query
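The measures above can be sketched in a few lines of Python. This is an
illustration under stated assumptions, not the official scoring code:
the function names and the total_copies parameter are hypothetical, the
F measure uses the standard definition
F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), and the maximum is
taken at every rank cutoff of the confidence-sorted list rather than at
standard recall points.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def reference_found(asserted, actual):
    """asserted/actual: (videoId, start, end). True if the IDs match and
    the asserted extent covers at least 50% of the actual copy."""
    (a_id, a_start, a_end), (t_id, t_start, t_end) = asserted, actual
    if a_id != t_id or t_end <= t_start:
        return False
    return overlap(a_start, a_end, t_start, t_end) / (t_end - t_start) >= 0.5


def f_beta(precision, recall, beta=1.0):
    """Standard F measure; beta=2.0 favors recall, beta=0.5 precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def extent_f1(a_start, a_end, t_start, t_end):
    """F1 of the asserted extent (a) against the actual extent (t)."""
    common = overlap(a_start, a_end, t_start, t_end)
    if common == 0.0 or a_end <= a_start or t_end <= t_start:
        return 0.0
    return f_beta(common / (a_end - a_start), common / (t_end - t_start), 1.0)


def best_f_beta(scored, total_copies, beta):
    """scored: list of (confidenceScore, is_correct) pairs.
    Returns (best score, confidence at the best cutoff). Simplification:
    ties in confidence are not grouped into a single cutoff."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    best_score, best_threshold, hits = 0.0, None, 0
    for rank, (confidence, correct) in enumerate(ranked, 1):
        hits += 1 if correct else 0
        score = f_beta(hits / rank, hits / total_copies, beta)
        if score > best_score:
            best_score, best_threshold = score, confidence
    return best_score, best_threshold


if __name__ == "__main__":
    results = [(0.9, True), (0.8, False), (0.7, True)]
    print(best_f_beta(results, total_copies=2, beta=2.0))
    print(extent_f1(10.0, 30.0, 20.0, 40.0))
```

In this sketch an asserted extent of 10-30 s against an actual copy at
20-40 s overlaps by 10 s, giving precision 0.5, recall 0.5, and F1 0.5;
it also just meets the 50% overlap criterion.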
The following are the target dates for 2008.
The schedule for the surveillance event detection task is listed
at the end of this document.
Just below is the proposed schedule for work on the BBC rushes
summarization task that will be held as a workshop at the ACM
Multimedia Conference in Vancouver, Canada during the last week of
October 2008. Results will be summarized at the TRECVID workshop in
November. Papers reporting participants' summarization work that are not
included in the ACM Multimedia Workshop proceedings should be
submitted for inclusion in the TRECVID workshop notebook.
1 Apr test data available for download
5 May system output submitted to NIST for judging at DCU
1 Jun evaluation results distributed to participants
28 Jun papers (max 5 pages) due in ACM format
The organizers will provide an intro paper with information
about the data, task, groundtruthing, evaluation, measures, etc.
15 Jul acceptance notification
1 Aug camera-ready papers due via ACM process
31 Oct video summarization workshop at ACM Multimedia '08, Vancouver, BC, Canada
- 1. Feb
- NIST sends out Call for Participation in TRECVID 2008
- 22. Feb
- Applications for participation in TRECVID 2008 due at NIST
- 1 Mar
- Final versions of TRECVID 2007 papers due at NIST
- 1. Apr
- Guidelines complete
- 11. Apr
- Extended participant decision deadline for event detection task
- April
- Download of feature/search development data
- June
- Download of feature/search test data
- 30. June
- Video-only copy detection queries available for download
- 1. Aug
- Video-only copy detection submissions due at NIST for evaluation
Audio-only copy detection queries available for download
- 8. Aug
- Search topics available from TRECVID website.
- 15. Aug
- Feature extraction tasks submissions due at NIST for evaluation.
Feature extraction donations due at NIST
- 22. Aug
- Feature extraction donations available for active participants
- 25. Aug - 12. Sep
- Feature assessment at NIST
- 29. Aug
- Results of video-only copy detection evaluations returned to participants
Audio-only copy detection submissions due at NIST
Audio+video copy detection query plans available for download
- 12. Sep
- Search task submissions due at NIST for evaluation
- 19. Sep
- Results of feature extraction evaluations returned to participants
- 22. Sep - 10. Oct
- Search assessment at NIST
- 1. Oct
-
Audio+video copy detection submissions due at NIST for evaluation
Video-only and audio-only copy detection results returned to participants
- 9. Oct
-
Audio+video copy detection results returned to participants
- 15. Oct
- Results of search evaluations returned to participants
- 20. Oct
- Speaker proposals due at NIST
- 27. Oct
- Notebook papers due at NIST
- 1. Nov
- Copyright forms due back at NIST (see Notebook papers for instructions)
- 10. Nov
- TRECVID 2008 Workshop registration closes
- 17,18 Nov
- TRECVID Workshop at NIST in Gaithersburg, MD
- 15. Dec
- Workshop papers publicly available (slides added as they arrive)
- 1. Mar 2009
- Final versions of TRECVID 2008 papers due at NIST
Here is a list of work items that must be completed before the guidelines
are considered to be final.
- Choose features to evaluate - more appropriate to S&V data, better definitions.
DONE. See above
- Poll participants for interest in community effort for creation of more/better S&V training data
UNDERWAY.
- Agree on final measures for copy-based detection task
DONE. See above.
- Decide how to handle search systems, if any, outside the rules (collaborative, etc.)
DONE. Few, if any, likely. Handle early on an individual basis.
- Should interactive and manual search results mark shots a human has directly selected?
DONE. No interest pro or con was expressed
so no change will be made to the search submissions.
- Decide whether for interactive search runs to require reporting of a search time offset for each shot found by a human searcher.
DONE. This is already available as an optional attribute for each search result time (see the search results dtd).
- Decide on best way to collect run metadata (in submission XML, separate web form, structured paper abstract,...)
DEFERRED until summer
-
Coordinators:
-
NIST contact:
-
Email lists:
- Information and discussion for active workshop participants
-
trecvid2008@nist.gov
-
archive open to active participants only
-
NIST will subscribe the contact listed in your application to
participate when we have received it. Additional members of active
participant teams will be subscribed by NIST if they send email to
indicating they want to be subscribed, the
email address to use, their name, and providing the TRECVID 2008
active participant's password. Groups may combine the information
for multiple team members in one email.
Once subscribed, you can post to this list by sending your thoughts as
email to trecvid2008@nist.gov, where they will be sent out to everyone
subscribed to the list, i.e., the other active participants.
- Information and discussion on the surveillance event detection task
Last updated: Wednesday, 14-May-08 09:48:42
Date created: Tuesday, 3-Dec-07
For further information contact