GLOBAL PROTOTYPE ENCODING FOR INCREMENTAL VIDEO HIGHLIGHTS DETECTION

Anonymous

Abstract

Video highlights detection (VHD) is an active research field in computer vision, aiming to locate the most user-appealing clips in raw video inputs. However, most VHD methods are based on the closed world assumption, i.e., a fixed number of highlight categories is defined in advance and all training data are available beforehand. Consequently, existing methods scale poorly as the number of highlight domains and the amount of training data grow. To address the above issues, we propose a novel video highlights detection method named Global Prototype Encoding (GPE), which learns incrementally to adapt to new domains via parameterized prototypes. To facilitate this new research direction, we collect a finely annotated dataset termed LiveFood, comprising over 5,100 live gourmet videos spanning four domains: cooking, eating, ingredients, and presentation. To the best of our knowledge, this is the first work to explore video highlights detection in the incremental learning setting, opening up new ground for applying VHD in practical scenarios where both the concerned highlight domains and the training data grow over time. We demonstrate the effectiveness of GPE through extensive experiments. Notably, GPE surpasses popular domain-incremental learning methods on LiveFood, achieving significant mAP improvements on all domains. The code and dataset will be made publicly available.

1. INTRODUCTION

The popularization of portable devices with cameras has greatly promoted the creation and broadcasting of online videos. This abundance of video data serves as an essential prerequisite for related research, e.g., video summarization (Potapov et al., 2014; Song et al., 2015; Zhang et al., 2018; Fajtl et al., 2018; Zhu et al., 2021), video highlights detection (VHD) (Yang et al., 2015; Xiong et al., 2019; Lei et al., 2021; Bhattacharya et al., 2021), and moment localization (Liu et al., 2018; Zhang et al., 2020; Rodriguez et al., 2020), to name a few. Currently, most VHD methods are developed under the closed world assumption, which requires both the number of highlight domains and the size of the training data to be fixed in advance. However, as stated in Rebuffi et al. (2017), natural vision systems are inherently incremental, consistently receiving new data from different domains or categories. Taking gourmet videos as an example, a viewer may initially be attracted by clips of eating food, but later develop a new interest in cooking and want to check out the detailed cooking steps in the same video. This indicates that the target set a model needs to handle is flexible in the open world. Under this practical setting, all existing VHD methods suffer from a scalability issue: they are unable to predict both the old and the newly added domains unless retrained on the complete dataset. Since the training cost on videos is prohibitive, it is imperative to develop new methods that address this incremental learning problem. Broadly speaking, two major obstacles hinder the development of incremental VHD: the lack of a high-quality VHD dataset with domain annotations, and the absence of strong models tailored for this task.
Existing datasets widely used in VHD research, including SumMe (Gygli et al., 2014), TVSum (Song et al., 2015), Video2GIF (Gygli et al., 2016), PHD (Garcia del Molino & Gygli, 2018), and QVHighlights (Lei et al., 2021), all suffer from three drawbacks: (1) only the feature representations of video frames are accessible instead of the raw videos, restricting the application of more powerful end-to-end models; (2) most datasets contain only a limited number of videos with short durations and coarse annotations, which are insufficient for training deep models; (3) none of them provides video highlight domain or category labels, so they cannot be directly used for incremental learning. To bridge the gap between VHD and incremental learning, we first collect a high-quality gourmet dataset from live videos, namely LiveFood. It contains over 5,100 carefully selected videos totaling 197 hours. Four domains are finely annotated, i.e., cooking, eating, ingredients, and presentation. These related but distinctive domains provide a new test bed for incremental VHD tasks. To solve this new task, we propose a competitive model, Global Prototype Encoding (GPE), which learns new highlight concepts incrementally while retaining knowledge learned from previous video domains and data. Specifically, GPE first extracts frame-wise features using a CNN, then employs a transformer encoder to aggregate temporal context into each frame feature, obtaining temporal-aware representations. Each frame is then classified by two groups of learnable prototypes: highlight prototypes and vanilla prototypes. With these prototypes, GPE optimizes a distance-based classification loss under the L2 metric and encourages incremental learning by confining the prototypes learned in new domains to be close to those previously observed. We systematically compare GPE with different incremental learning methods on LiveFood.
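The prototype-based components described above can be sketched at a high level. The following is a simplified NumPy illustration, not the authors' implementation: it shows nearest-prototype classification under the L2 metric and a proximity penalty that keeps new-domain prototypes close to previously learned ones. All function and variable names are our own, and the CNN backbone and transformer encoder that produce the frame features are omitted.

```python
import numpy as np

def classify_frames(frame_feats, hl_protos, va_protos):
    """Label each frame as highlight (1) or vanilla (0) via its nearest
    prototype under the squared L2 metric (simplified sketch)."""
    # Stack all prototypes: highlight first, then vanilla.  Shape (P, D).
    protos = np.concatenate([hl_protos, va_protos], axis=0)
    # Pairwise squared L2 distances between frames and prototypes: (T, P).
    dists = ((frame_feats[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    nearest = dists.argmin(axis=1)
    # A frame is a highlight iff its nearest prototype is a highlight one.
    return (nearest < hl_protos.shape[0]).astype(int)

def proto_proximity_penalty(new_protos, old_protos):
    """Mean squared L2 distance from each new-domain prototype to its
    nearest previously learned prototype (hypothetical regularizer)."""
    d = ((new_protos[:, None, :] - old_protos[None, :, :]) ** 2).sum(axis=-1)
    return d.min(axis=1).mean()
```

For instance, with one highlight prototype at (1, 0) and one vanilla prototype at (0, 1), a frame feature (0.9, 0.1) is labeled highlight while (0.2, 0.8) is labeled vanilla. In training, the proximity penalty would be added to the classification loss so that prototypes for a new domain cannot drift arbitrarily far from those of earlier domains.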
Experimental results show that GPE outperforms other methods in highlight detection accuracy (mAP) with much better training efficiency, using no complex exemplar selection or complicated replay schemes, strongly evidencing its effectiveness. The main contributions of this paper are summarized as follows:

• We introduce a new task, incremental video highlights detection, which has important applications in practical scenarios. We collect a high-quality dataset, LiveFood, to facilitate research in this direction. LiveFood comprises over 5,100 carefully selected gourmet videos in high resolution, providing a new test bed for video highlights detection and domain-incremental learning tasks.

• We propose a novel end-to-end model for incremental VHD, Global Prototype Encoding (GPE). GPE can incrementally identify highlight and vanilla frames in new highlight domains by learning extensible, parameterized highlight/vanilla prototypes. GPE achieves superior performance compared with other incremental learning methods, improving detection performance (mAP) by 1.57% on average. These results suggest that GPE can serve as a strong baseline for future research.

• We provide comprehensive analyses of both LiveFood and the proposed GPE model to deepen the understanding of each and to offer helpful insights for future development. We hope our work inspires more researchers to work on incremental VHD, ultimately pushing forward the application of VHD in practical scenarios.

2. RELATED WORK

Video Highlights Detection (VHD) is an important video-related task. This line of research can be roughly divided into two groups, namely the ranking-based and regression-based



Figure 1: The LiveFood dataset. The rows, from top to bottom, illustrate examples of vanilla clips, ingredients, and presentation, respectively. More samples are provided in Appendix A.2.

