Professional and user-generated multimedia content is stored in abundance by broadcasting companies and internet sharing platforms. The traditional way to provide access to these collections is via query-based search. However, in order to fully appreciate the content available to them in large archives, users also need explorative ways of accessing it. The concept of Video Hyperlinking is proposed as a technology to enable this type of explorative access. In the longer term it might form the basis of a visual web that allows users to browse information in videos in the same manner as they now do in the textual web, jumping from one video to another.
A hyperlink originates from a video segment that a user is currently watching. We call this starting segment an anchor; it is defined by a start and an end time within a video. The goal in video hyperlinking is to suggest relevant or related target video segments based on the multimodal content of the anchor video segment.
As an example giving an impression of video hyperlinking, consider a video segment on tourism in London: a part of the video in which a Fish & Chips restaurant and its logo are shown and discussed could be linked to a cooking programme describing a recipe for Fish & Chips, while a later segment of the video that shows the London Parliament and refers to the British Royal family could be linked to segments about England's Queen.
Relevance of a link target can be based upon topical information, the events or activities depicted, the people present in the videos, etc. Note, however, that merely finding target video segments that are similar to the anchor video segment is not the aim of video hyperlinking.
The Video Hyperlinking task in 2017 investigates videomaker verbal-visual information in semi-professional user-generated (SPUG) content, which is commonly found on the Web.
We introduce the notion of "verbal-visual information", together with the notions of "videomaker" and "intent", in order to focus the task:
Definition of Verbal-Visual Information: Sometimes people communicate information verbally, using spoken language. Sometimes they communicate information visually, by showing something. However, sometimes information is communicated by a combination of speaking and showing. "Verbal-visual information" is defined as information whose communication depends on the exploitation of both the audio and the video channel of the video. If someone only listens to the video, some of the information will not be fully communicated; conversely, if someone only watches the video, the information will not be fully communicated. Communicating verbal-visual information critically requires both modalities.
Definition of Videomaker: A "videomaker" is a user who creates video, sometimes also called the creator or the uploader. A videomaker is a semi-professional user if s/he has the goal of communicating a certain message to the audience and makes use of conventional video production/editing techniques to do so. The videomaker does not necessarily make a living from creating video, and the content might also be less polished than professional content.
Definition of Intent: "Intent" is defined as the goal or purpose with which someone undertakes something. In this case, we are interested in the goal the videomaker was trying to achieve by creating the video. The motivation for considering videomaker intent to be important derives from the investigation of uploader intent on YouTube that was carried out in [Kofler2015]. Among the intent classes identified by that study are "convey knowledge", "teach practice", and "illustrate", which are all related to explanation, i.e., the communication of information. In [Kofler2015], some initial evidence is also uncovered for a symmetry between the intent of users uploading video and the intent of users searching for video. The implication is that any research progress we make towards techniques oriented to uploader intent will also directly benefit users who are searching for video.
Anchor Definition process: A person who knew the video collection well queried the collection for linguistic cues, and then inspected all the resulting videos via their keyframes. We attempted to keep a balance in the number of anchors that showed something happening with software on a computer screen. If a video had exactly the same kind of uploader intent as one already used, we skipped it. Otherwise, we made an anchor for every viable video in which we felt we could identify something that was meant to be shown and for which the visual and the spoken channel both contributed to communicating the information. We tried to make sure that the description of an anchor was concretely connected to its audio-visual content. However, if this connection was too literal, there was a chance that no target segments would be found in the collection. The anchor selectors worked on getting the appropriate balance, but it was necessarily a judgment call. Many of the anchors will be very challenging, but we hope that they are still interesting and will move the task into a new area.
Given a collection of videos with rich metadata and a set of anchor video segments, each defined by a video ID, a start time, and an end time, return for each anchor a ranked list of potentially relevant target video segments, likewise defined by a video ID, a start time, and an end time.
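The task input and output can be modelled very simply. The following Python sketch illustrates one way to represent anchors, targets, and the ranked list returned per anchor; the class and field names are illustrative assumptions, not a prescribed submission format.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A video segment, used for both anchors and link targets."""
    video_id: str   # identifier of the video within the collection
    start: float    # segment start time, in seconds
    end: float      # segment end time, in seconds

@dataclass
class Link:
    """One entry in the ranked list produced for an anchor."""
    anchor_id: str  # identifier of the anchor this target is proposed for
    target: Segment # proposed target segment
    score: float    # system confidence, used for ranking

def ranked_list(links):
    """Sort the proposed targets for one anchor by descending score."""
    return sorted(links, key=lambda link: link.score, reverse=True)
```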
Approaches for video hyperlinking typically operate in two steps [Ordelman2015b]: 1) anchor representation, which selects a set of important features from an anchor, and 2) target search, which searches for the occurrence of these features in other video segments of the collection. For example, in the anchor representation step, a system might select the word sequence "fish and chips", or the appearance of the visual concept "red telephone booth", as important features of the anchor (the former could be found using named entity extraction and the latter by a corresponding visual concept detector). In the target search step, the system then searches the collection for segments in which these features occur.
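As a toy illustration of these two steps, the sketch below treats transcript words and detected visual concept labels as a simple bag of features and ranks candidate segments by feature overlap. It is a minimal example of the general pipeline, not the method of any particular system.

```python
from collections import Counter

def represent_anchor(transcript_words, visual_concepts, top_n=10):
    """Step 1: anchor representation.
    Select salient anchor features: here simply the most frequent
    transcript words plus all detected visual concept labels."""
    frequent_words = [w for w, _ in Counter(transcript_words).most_common(top_n)]
    return set(frequent_words) | set(visual_concepts)

def search_targets(anchor_features, collection_segments):
    """Step 2: target search.
    Score each candidate segment (a mapping segment_id -> feature set)
    by how many anchor features it contains, and rank by that score."""
    scored = []
    for segment_id, segment_features in collection_segments.items():
        overlap = len(anchor_features & segment_features)
        if overlap > 0:
            scored.append((segment_id, overlap))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```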
We would like to make the following remarks about this general approach. Current approaches often use a single modality for search ("fish and chips" is searched in the spoken content and "red telephone booth" in the visual content). In 2017, we encourage multimodal approaches to hyperlinking, e.g., searching for the mention and/or visual appearance of fish and chips, or the mention of red telephone booths and/or their appearance, as this is how the anchors were selected and defined. Similarity in important features is often a clue for the relevance of a link target. Note, however, that highly similar, especially (near-)duplicate, segments are likely to be non-relevant, as they are of no utility in the considered use scenarios.
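One way to act on these remarks is to fuse per-modality similarities into a single score and to filter out (near-)duplicates of the anchor. The sketch below assumes each segment is described by a set of spoken terms and a set of visual concept labels; the fusion weights and the duplicate threshold are illustrative choices, not values prescribed by the task.

```python
def jaccard(a, b):
    """Set overlap used as a simple per-modality similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def multimodal_score(anchor, candidate, w_text=0.5, w_visual=0.5):
    """Late fusion of spoken and visual similarity (weights are illustrative)."""
    text_sim = jaccard(anchor["spoken_terms"], candidate["spoken_terms"])
    visual_sim = jaccard(anchor["visual_concepts"], candidate["visual_concepts"])
    return w_text * text_sim + w_visual * visual_sim

def rank_without_near_duplicates(anchor, candidates, max_sim=0.95):
    """Rank candidates by fused score, dropping (near-)duplicates of the
    anchor, which are of no utility in the considered use scenarios."""
    scored = [(multimodal_score(anchor, c), c) for c in candidates]
    kept = [(score, c) for score, c in scored if score < max_sim]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept
```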
This year we will continue using the Blip10000 data set, which consists of 14,838 videos, for a total of 3,288 hours, from blip.tv. The videos cover a broad range of topics and styles. Automatic speech recognition transcripts are provided by LIMSI, user-contributed metadata and shot boundaries by TU Berlin, and visual concepts based on the MediaMill MED Caffe models by EURECOM. The complete data set is made available in one package by the task organisers. Of course, task participants are welcome to use (and share) other metadata that they create themselves.
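The different resources can be combined into a single per-video record before building a hyperlinking system. The sketch below shows one possible way to do this; the directory names and the assumption of one JSON file per video per resource are hypothetical, and the actual layout is defined by the released package.

```python
import json
from pathlib import Path

def load_video_record(data_root, video_id):
    """Assemble ASR, shot boundary, and visual concept data for one video.
    Paths and formats are illustrative assumptions about the release layout."""
    root = Path(data_root)
    record = {"video_id": video_id}
    # LIMSI automatic speech recognition transcript (assumed one file per video)
    asr_path = root / "asr" / f"{video_id}.json"
    if asr_path.exists():
        record["transcript"] = json.loads(asr_path.read_text())
    # TU Berlin shot boundaries (assumed list of shot start/end times)
    shots_path = root / "shots" / f"{video_id}.json"
    if shots_path.exists():
        record["shots"] = json.loads(shots_path.read_text())
    # EURECOM visual concept scores (assumed one score vector per keyframe)
    concepts_path = root / "concepts" / f"{video_id}.json"
    if concepts_path.exists():
        record["concepts"] = json.loads(concepts_path.read_text())
    return record
```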
Anchors are defined by content creators such as media researchers. Besides the video and the start and end times of an anchor, the content creators also provide assessment details of why they selected the anchor and what kind of targets they expect. This information will be used for assessment only and will not be provided to participants until the evaluation results for the submissions are released.
We will evaluate the relevance of the top-ranked results, and will also look into the diversity of the provided hyperlinks in the result list up to rank 1000.
The evaluation of the submissions will follow the authored hyperlinking use scenario: top-ranked targets of participant submissions will be assessed by Mechanical Turk (MT) workers in two stages, described in detail in [Eskevich2017] and [HITsVH2016]. The primary reported effectiveness measures will be precision at a fixed rank (5 or 10) and MAiSP [Racca2015]. Additionally, we will report several traditional precision-oriented measures, adapted to unconstrained time segments, see [Aly2013].
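For local development, precision at a rank cutoff can be approximated as in the sketch below. It treats relevance as a per-target yes/no judgement (e.g., a stand-in for the crowd assessments) and ignores the segment-overlap subtleties handled by the adapted measures in [Aly2013] and by MAiSP.

```python
def precision_at_k(ranked_targets, is_relevant, k=5):
    """Fraction of the top-k proposed targets judged relevant.
    `is_relevant` is a callable returning True/False for a target segment;
    segment-overlap handling is deliberately omitted in this sketch."""
    top_k = ranked_targets[:k]
    if not top_k:
        return 0.0
    return sum(1 for target in top_k if is_relevant(target)) / len(top_k)
```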
[Aly2013] Aly, R. and Eskevich, M. and Ordelman, R.J.F. and Jones, G.J.F., Adapting binary information retrieval evaluation metrics for segment-based retrieval tasks. arXiv preprint arXiv:1312.1913, 2013
[Aly2013b] Aly, R. and Ordelman, R.J.F. and Eskevich, M. and Jones, G.J.F. and Chen, S., Linking Inside a Video Collection - What and How to Measure? LiME workshop at the 22nd International Conference on World Wide Web Companion, IW3C2 2013, May 13-17, 2013, Rio de Janeiro, pp. 457-460.
[Eskevich2017] Eskevich, M. and Larson, M. and Aly, R. and Sabetghadam, S. and Jones, G.J.F. and Ordelman, R. and Huet, B., Multimodal Video-to-Video Linking: Turning to the Crowd for Insight and Evaluation. 23rd International Conference on MultiMedia Modeling (MMM), Reykjavík, Iceland, 2017.
[Jarvelin2002] Järvelin, K. and Kekäläinen, J., Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems (TOIS) 20.4 (2002): 422-446.
[Kofler2015] Kofler, C. and Bhattacharya, S. and Larson, M. and Chen, T. and Hanjalic, A. and Chang, S.F., Uploader Intent for Online Video: Typology, Inference, and Applications, in IEEE Transactions on Multimedia, vol. 17, no. 8, pp. 1200-1212, Aug. 2015.
[Ordelman2015] Ordelman, R.J.F. and Aly, R. and Eskevich, M. and Huet, B. and Jones, G.J.F., Convenient discovery of archived video using audiovisual hyperlinking. (2015).
[Ordelman2015b] Ordelman, R.J.F. and Eskevich, M. and Aly, R. and Huet, B. and Jones, G.J.F., Defining and Evaluating Video Hyperlinking for Navigating Multimedia Archives. In Proceedings of the 24th International Conference on World Wide Web Companion (pp. 727-732). International World Wide Web Conferences Steering Committee. (2015, May).
[Racca2015] Racca, D. N. and Jones, G.J.F., Evaluating Search and Hyperlinking: an example of the design, test, refine cycle for metric development. In Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, 2015.
[SHEval2015] https://github.com/robinaly/sh_eval
[HITsVH2016] https://github.com/meskevich/Crowdsourcing4Video2VideoHyperlinking/