TREC 2005 Video Retrieval Evaluation
Inputs to TRECVID 2006 planning and some initial responses

o High-level features
   * Push development of generic approaches to detector development by requiring groups to submit results for all features in the common annotation. NIST to choose a subset for evaluation?
     -> It seems time to adopt this requirement. Almost all groups participating in the 2005 high-level feature task submitted runs for all features.
     ? There remains the issue of deciding, well in advance of the evaluation, how NIST will choose the subset of features to be evaluated manually.
   * Collect and report information on computational effort?
     -> At this point in the maturity of the technologies, we don't want to discourage approaches that are slower but effective. Also, training time is difficult to define because training may involve multiple cycles of training, testing, and system revision. Nevertheless, reporting computation time during the final system test as supporting information about a run seems useful enough to require it.
   * Require a run based on a donated set of low-level features (CMU's, MPEG-7's, ...) for at least master shot keyframes?
     -> It's not clear that the benefit of requiring such a run from everyone would outweigh the cost (yet another pre-defined run type) or how many participants are really interested. We could allow submissions to let us know if they restricted their use of low-level features to the "official set(s)".
   * Provide a common discriminative modeling baseline system?
     -> No proposal

o Camera motion
     -> Although we had included camera motion in a few search topics or features in previous years, in 2005 we devoted a full task to it and learned a lot - not least about the difficulty of creating the truth data. As a result, we do not intend to continue this task as a separate focus in 2006, though camera motion can continue to be part of topics in the search task and perhaps find a natural place in a task using unproduced video.
   * Use finer granularity than master shots? Perhaps a random sample of arbitrary segments of some small arbitrary length?
     -> Seems the length should be tied to the lengths of real segments people could expect to find in an archive they are searching for video material.
   * Change the task significantly
      * Make this a quantification task, not a binary classification task?
        -> It is already very expensive to create truth data; this would make it worse. Also, it is pretty clear from real queries and input from some real users that this is not how real users specify movement in their requests.
      * Define minimum motion?
        -> Not appropriate or practical. See above.
      * Include search for "change of focus" etc.?
        -> While "change of focus" is ill-defined for the purposes of evaluation, it might be interesting to use a real archival system's set of shot type features in place of simple motion features. However, this runs into the cost of creating truth data (if it is not given), and we have no idea how many systems might be able to participate.

o Rushes
   * There was considerable interest in continuing work on unproduced video in some form. There is significant commercial interest, and it is hoped some of the research results may apply to surveillance video.
     -> We hope to have a significantly larger amount of rushes from the BBC, but probably without metadata. Possibly we will also have similar extra video material from other sources. Beyond getting the data in a usable format with permission for its use by TRECVID, there are MANY questions to be answered about the evaluation:
     what is the real task being modeled, where can we get training and ground truth or judgment resources, etc.
   * Task
      * Stock shot finding?
        -> Known to be a useful task, but it is not clear how to evaluate it, since the concept of a "stock shot" is not well enough defined.
      * Known-item search using skimming/summarization/clustering/exploration?
        -> Topic creation is probably doable for NIST if we can convince ourselves there is only one answer for each topic, but we would need a large number of topics. But what is the real task we are modeling? A person who WAS familiar with the material but has forgotten how to find the segment s/he remembers? What do the systems return? A point in time in a video such that the known item occurs no more than x secs after the given starting point?
      * Direct evaluation of clustering analysis (a la TDT)?
        ? Less like clustering, more like assignment of labels to (existing?) clusters?
        -> Details unclear
      * Given a clip from outside the collection, find similar ones in the test collection?
        -> Problem: deciding whether two clips are similar is extremely subjective - much more so than deciding whether a clip contains an object, event, person, etc.
      * Other comments: Look at similarities to UAV data; look for events? Need info on real user task characteristics. Will someone donate camera motion detection data? No common unit of retrieval?

o Search
   * Rename "manual" as "automated"
     -> Not a problem
   * More info in the overview on the effect of various factors (e.g., text-only vs text+video)
     -> Trying this already in the 2005 overview paper
   * Drop the manual search type?
     -> The harm is not clear, since manual runs are not required but provide a way of estimating the topic-to-query translation effect within search.
   * Evaluate automatic runs using precision/recall at fixed depth?
     -> We already include such partial measures in the run results pages, but we do not use them as the featured overall (precision and recall) measures because they do not reflect recall and do not average well over topics with varying numbers of relevant shots. (A small illustration follows at the end of this section.)
   * Use the LSCOM use cases to generate TV6 topics?
     -> It is problematic to start from a list of interesting events, people, etc. and then test them to see which occur often enough in the development and test data. We have no efficient means to do this. We tried something similar before, starting from the NYT daily news digest, and spent most of our time chasing down dead ends. It would be possible to have the LSCOM topics in mind while watching the development/test video in search of candidate topic targets.
   * Elicit multi-valued relevance judgments from assessors?
     -> No clear benefit presented
   * Need a volunteer to define master shot boundaries
     -> Christian Petersohn at the Fraunhofer (Heinrich Hertz) Institute is willing to do this again for continuity. Thanks!
   * Need a volunteer to complete the master shot reference and extract keyframes.
     -> DCU is willing to do this again for continuity. Thanks!
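     A minimal illustration of the averaging issue (hypothetical shot IDs and rankings, not TRECVID data): with a fixed cutoff of 100, a topic with only two relevant shots can never score above 0.02 even for a perfect ranking, while non-interpolated average precision reflects recall and stays comparable across topics.

        # Precision at a fixed depth vs. average precision for one topic.
        # All shot IDs and rankings below are made up for illustration.

        def precision_at_depth(ranked_shots, relevant, depth):
            """Fraction of the top `depth` results that are relevant."""
            return sum(1 for s in ranked_shots[:depth] if s in relevant) / depth

        def average_precision(ranked_shots, relevant):
            """Mean of the precision at the rank of each relevant shot found,
            normalized by the total number of relevant shots, so the measure
            reflects recall and is comparable across topics."""
            hits, precisions = 0, []
            for rank, shot in enumerate(ranked_shots, start=1):
                if shot in relevant:
                    hits += 1
                    precisions.append(hits / rank)
            return sum(precisions) / len(relevant) if relevant else 0.0

        # Topic A: 40 relevant shots, all ranked first.
        # Topic B: only 2 relevant shots, also ranked first (both rankings are "perfect").
        ranked_a = ["A%03d" % i for i in range(100)]
        relevant_a = set(ranked_a[:40])
        ranked_b = ["B000", "B001"] + ["junk%03d" % i for i in range(98)]
        relevant_b = {"B000", "B001"}

        print(precision_at_depth(ranked_a, relevant_a, 100))  # 0.40
        print(precision_at_depth(ranked_b, relevant_b, 100))  # 0.02 - capped by the cutoff
        print(average_precision(ranked_a, relevant_a))        # 1.0
        print(average_precision(ranked_b, relevant_b))        # 1.0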
o General
   * Assess the impact of "near duplicates"? Remove commercials from the test sets?
     -> This has been an open issue for some time. NIST is looking at this question. It's not clear that removing commercials from the development/test collections is the right answer, if only because real systems will need to deal with commercials in some way or other.
   * Require a rerun of the 2005 system on 2006 data to assess the data effect? Or a rerun of the 2006 system on 2005 data?
     -> While it would be great if all participants could do this, we hesitate to require it of everyone. At least, though, one might expect VACE/DTO contractors to do this, and information from their runs might help estimate the data effect for all systems.
   * Emphasize video aspects (vs static)
     -> Not clear what was intended here. 1) Is this a change (in emphasis?) from what representatives of the intelligence community have told us previously - namely, that they are very interested in being able to find objects, people, and locations, not exclusively or mainly events? 2) Is this a suggestion or requirement for researchers to approach the task using more than keyframes to represent the video?
   * New automatic "near duplicate" detection task
     -> Not sure how to define this. Find clips that are very similar at a low level because they contain some of the very same decoded frames (in the same order?)? Or are we talking (too) about higher-order similarity, e.g., shots that film the same event from a different angle? (A sketch of the low-level reading appears at the end of these notes.)
   * Location?
      * NIST pre-TREC on 13-14 November, in Gaithersburg, MD
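     On the "low-level" reading of near-duplicate detection above (clips sharing some of the very same decoded frames in the same order), one possible formulation - offered only as a sketch, with the frame-signature choice and threshold as arbitrary assumptions rather than a task definition - is to compare per-frame signatures and flag clip pairs that share a sufficiently long contiguous run of matching signatures.

        # Sketch: flag two clips as near-duplicates if they share a long run of
        # identical per-frame signatures in the same order. Signatures are assumed
        # to be precomputed, hashable values (e.g., a coarse hash of a downsampled
        # frame); the 25-frame threshold (~1 second at 25 fps) is arbitrary.

        def shared_frame_run(sigs_a, sigs_b):
            """Length of the longest common contiguous run of frame signatures
            (a standard longest-common-substring dynamic program)."""
            best = 0
            prev = [0] * (len(sigs_b) + 1)
            for a in sigs_a:
                cur = [0] * (len(sigs_b) + 1)
                for j, b in enumerate(sigs_b, start=1):
                    if a == b:
                        cur[j] = prev[j - 1] + 1
                        if cur[j] > best:
                            best = cur[j]
                prev = cur
            return best

        def near_duplicates(sigs_a, sigs_b, min_shared_frames=25):
            """True if the two clips share at least min_shared_frames
            consecutive identical frames."""
            return shared_frame_run(sigs_a, sigs_b) >= min_shared_frames

        # Toy example: clip B re-uses 30 consecutive frames of clip A.
        sigs_a = list(range(1000))                                              # stand-in signatures
        sigs_b = [9000 + i for i in range(50)] + sigs_a[100:130] + [8000 + i for i in range(40)]
        print(shared_frame_run(sigs_a, sigs_b))                                 # 30
        print(near_duplicates(sigs_a, sigs_b))                                  # True

     The higher-order reading (the same event filmed from a different angle) would clearly need a different treatment.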