GLOBAL PROTOTYPE ENCODING FOR INCREMENTAL VIDEO HIGHLIGHTS DETECTION

Anonymous

Abstract

Video highlights detection (VHD) is an active research field in computer vision, aiming to locate the most user-appealing clips in raw input videos. However, most VHD methods are based on the closed world assumption, i.e., a fixed number of highlight categories is defined in advance and all training data are available beforehand. Consequently, existing methods scale poorly as the highlight domains and training data grow. To address these issues, we propose a novel video highlights detection method named Global Prototype Encoding (GPE), which learns incrementally to adapt to new domains via parameterized prototypes. To facilitate this new research direction, we collect a finely annotated dataset termed LiveFood, comprising over 5,100 live gourmet videos covering four domains: cooking, eating, ingredients, and presentation. To the best of our knowledge, this is the first work to explore video highlights detection in the incremental learning setting, opening up new ground for applying VHD to practical scenarios where both the concerned highlight domains and the training data grow over time. We demonstrate the effectiveness of GPE through extensive experiments. Notably, GPE surpasses popular domain-incremental learning methods on LiveFood, achieving significant mAP improvements on all domains. The code and dataset will be made publicly available.

1. INTRODUCTION

The popularization of portable devices with cameras has greatly promoted the creation and broadcasting of online videos. These abundant video data serve as an essential prerequisite for related research, e.g., video summarization (Potapov et al., 2014; Song et al., 2015; Zhang et al., 2018; Fajtl et al., 2018; Zhu et al., 2021), video highlights detection (VHD) (Yang et al., 2015; Xiong et al., 2019; Lei et al., 2021; Bhattacharya et al., 2021), and moment localization (Liu et al., 2018; Zhang et al., 2020; Rodriguez et al., 2020), to name a few. Currently, most VHD methods are developed under the closed world assumption, which requires both the number of highlight domains and the size of the training data to be fixed in advance. However, as stated in Rebuffi et al. (2017), natural vision systems are inherently incremental, consistently receiving new data from different domains or categories. Taking gourmet videos as an example: at first, one may be attracted by clips of eating food, but later he/she may develop a new interest in cooking and want to check out the detailed cooking steps in the same video. This indicates that the target set the model needs to handle is flexible in the open world. Under this practical setting, all existing VHD methods suffer from a scalability issue: they are unable to predict both the old and the newly added domains unless the models are retrained on the complete dataset. Since the training cost on videos is prohibitive, it is imperative to develop new methods to address these incremental learning issues.

Broadly speaking, two major obstacles hinder the development of incremental VHD: the lack of a high-quality VHD dataset with domain annotations, and the lack of strong models tailored for this task.
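To make the appeal of prototype-based incremental learning concrete, the following is a minimal illustrative sketch (not the paper's actual GPE implementation) of the general idea: each highlight domain is represented by its own prototype vector, a clip is scored by its similarity to a domain's prototype, and registering a new domain only adds a new prototype without modifying the old ones, so previously learned domains are left intact. The class name, the mean-feature "training" step, and the cosine-similarity scoring are all assumptions made for illustration.

```python
import numpy as np


class PrototypeHighlightScorer:
    """Illustrative sketch of prototype-based domain-incremental scoring.

    Each domain owns one prototype vector. Adding a new domain creates a
    new prototype and never touches existing ones, which is the property
    that sidesteps retraining on the complete dataset.
    """

    def __init__(self, feat_dim: int, seed: int = 0):
        self.feat_dim = feat_dim
        self.prototypes = {}  # domain name -> prototype vector
        self.rng = np.random.default_rng(seed)

    def add_domain(self, name: str) -> None:
        # Randomly initialize a prototype for the new domain only;
        # prototypes of previously seen domains are not modified.
        self.prototypes[name] = self.rng.normal(size=self.feat_dim)

    def fit_domain(self, name: str, clip_feats: np.ndarray) -> None:
        # Toy stand-in for training: set the prototype to the mean
        # feature of highlight clips from this domain.
        self.prototypes[name] = clip_feats.mean(axis=0)

    def score(self, clip_feat: np.ndarray, name: str) -> float:
        # Highlight score = cosine similarity between the clip feature
        # and the requested domain's prototype.
        p = self.prototypes[name]
        denom = np.linalg.norm(clip_feat) * np.linalg.norm(p) + 1e-8
        return float(clip_feat @ p / denom)
```

For example, one could fit an "eating" prototype first and later call `add_domain("cooking")`; the "eating" prototype is untouched, while a real method would additionally train the new prototype on the new domain's data.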
Recalling the existing datasets widely used in VHD research, including SumMe (Gygli et al., 2014), TVSum (Song et al., 2015), Video2GIF (Gygli et al., 2016), PHD (Garcia del Molino & Gygli, 2018), and QVHighlights (Lei et al., 2021), we find that all of them suffer from three drawbacks: (1) only the feature representations of video frames are accessible instead of the raw videos, which restricts the application of more powerful end-to-end models; (2) most datasets contain only a limited number of videos with short durations and coarse annotations, which are insufficient for training deep models; (3) none of them provides video highlight domain or category labels, and thus they cannot be directly used in

