AUTOSHOT: A SHORT VIDEO DATASET AND STATE-OF-THE-ART SHOT BOUNDARY DETECTION

Anonymous

Abstract

Short-form videos have exploded in popularity and now dominate new social media trends. Prevailing short-video platforms, e.g., TikTok, Instagram Reels, and YouTube Shorts, have changed the way we consume and create content. For video content creation and understanding, shot boundary detection (SBD) is one of the most essential components in various scenarios. In this work, we release a new public Short video sHot bOundary deTection dataset, named SHOT, consisting of 853 complete short videos and 11,606 shot annotations, with 2,716 high-quality shot boundary annotations in 200 test videos. Leveraging this new data wealth, we propose to optimize the model design for video SBD by conducting neural architecture search in a search space encapsulating various advanced 3D ConvNets and Transformers. Our proposed approach, named AutoShot, achieves higher F1 scores than previous state-of-the-art approaches, e.g., outperforming TransNetV2 by 4.2%, when derived and evaluated on our newly constructed SHOT dataset. Moreover, to validate the generalizability of the AutoShot architecture, we directly evaluate it on three other public datasets, ClipShots, BBC and RAI, where the F1 scores of AutoShot outperform previous state-of-the-art approaches by 1.1%, 0.9% and 1.2%, respectively. The SHOT dataset and code will be released.
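The F1 comparisons reported above are boundary-level: a predicted shot boundary counts as a true positive if it matches an unmatched ground-truth boundary. A minimal sketch of such a metric follows; the frame tolerance and greedy matching rule here are illustrative assumptions, not the paper's exact evaluation protocol:

```python
def boundary_f1(pred, gt, tol=1):
    """Boundary-level F1 for shot boundary detection.

    A predicted boundary frame matches an as-yet-unmatched ground-truth
    boundary if the two are within `tol` frames of each other. The
    tolerance value and greedy matching are illustrative choices.
    """
    matched = set()
    tp = 0
    for p in pred:
        for i, g in enumerate(gt):
            if i not in matched and abs(p - g) <= tol:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Three predictions, three ground-truth boundaries; only frame 10 matches
# within the tolerance, so precision = recall = F1 = 1/3.
print(boundary_f1([10, 50, 90], [10, 52, 120]))
```

Precision at a fixed recall, as reported later for AutoShot versus TransNetV2, is obtained from the same matching by sweeping the detection threshold.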

1. INTRODUCTION

Short-form videos are widely consumed across all age groups worldwide. The share of short videos and video-form ads has grown explosively in the era of 5G, owing to the richer content, better delivery, and greater persuasiveness of short videos compared with the image and text modalities (Wang et al., 2021). This strong trend creates a significant and urgent demand for temporally accurate and comprehensive video analysis, beyond a single video-level classification label. Shot boundary detection is a fundamental component of temporally comprehensive video analysis and serves as a building block for various tasks, e.g., scene boundary detection (Rao et al., 2020; Chen et al., 2021), video structuring (Wang et al., 2021), and event segmentation (Shou et al., 2021). For instance, reward videos of desired lengths can be created automatically for different platforms by leveraging accurate shot boundary detection in intelligent video creation.

To accelerate the development of video temporal boundary detection, several datasets have been collected with laborious manual annotation. Conventional shot boundary detection datasets, e.g., the BBC Planet Earth documentary series (Baraldi et al., 2015a) and RAI (Baraldi et al., 2015b), consist only of documentaries or talk shows where the scenes are relatively static. Tang et al. (2018) further contribute a large-scale video shot database, ClipShots, consisting of different types of videos collected from YouTube and Weibo and covering more than 20 categories, including sports, TV shows, animals, etc. Shou et al. (2021) construct a generic event boundary detection (GEBD) dataset, Kinetics-GEBD, which defines a clip as the moment where humans naturally perceive an event boundary. Since the lengths of short and conventional videos differ extensively, i.e., 90% of short videos are less than one minute long versus 2-60 minutes for videos in other datasets, as shown in Table 1 and Fig. 1 Right, this gap leads to significant differences in content, display, temporal dynamics, and shot transitions, as shown in Fig. 1 Left. A short video dataset is therefore necessary to accelerate the development and proper evaluation of short-video-based shot boundary detection.

On the other hand, several endeavors have been made to improve the accuracy of video shot boundary detection (SBD). DeepSBD (Hassanien et al., 2017) first applies a deep spatio-temporal ConvNet to SBD, and subsequent approaches such as TransNetV2 achieve strong results on ClipShots (Tang et al., 2018) and BBC (Baraldi et al., 2015a).

In this work, we first collect a short video dataset, named SHOT, consisting of 853 short videos with 11,606 manually annotated shot boundaries. The 200 test videos, with 2,716 shot boundary annotations, are labeled by experts in two rounds. Leveraging this new data wealth, we aim to improve the accuracy of video shot boundary detection by conducting neural architecture search in a search space encapsulating various advanced 3D ConvNets (Qiu et al., 2017) and Transformers (Vaswani et al., 2017), employing the single-path one-shot SuperNet strategy (Guo et al., 2020) and Bayesian optimization (Shahriari et al., 2015). The searched model, named AutoShot, outperforms TransNetV2 on our SHOT by 4.2% in terms of F1 score, and by 3.5% in terms of precision at the same recall as TransNetV2. We further evaluate the searched AutoShot architecture on ClipShots, BBC and RAI, where the F1 scores of AutoShot surpass previous state-of-the-art approaches by 1.1%, 0.9% and 1.2%, respectively.

Our contributions are summarized as follows:

• We collect a short video shot boundary detection dataset (SHOT), which consists of 853 short videos and 11,606 shot boundary annotations. SHOT will be released and can be employed to advance the development of various short video understanding tasks.
• We design a video shot boundary detection search space encapsulating various advanced 3D ConvNets and Transformers, and build a neural architecture search pipeline for shot boundary detection.

• The searched model, named AutoShot, proves to be a highly competitive shot boundary detection architecture that significantly outperforms previous state-of-the-art approaches, not only on the SHOT dataset on which it is derived but also on other public benchmarks.
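The single-path one-shot SuperNet strategy used in our search pipeline samples exactly one candidate operator per searchable layer at each training step, so each step trains one sub-architecture of the weight-sharing SuperNet. The following is a minimal, framework-free sketch of the sampling idea; the candidate operators and layer count are hypothetical stand-ins, not AutoShot's actual search space:

```python
import random

# Hypothetical candidate operators for one searchable layer. Each is a
# stand-in for a real block (2D conv, 3D conv, attention, ...) and here
# merely tags the input so the chosen path is visible in the output.
CANDIDATE_OPS = {
    "conv2d": lambda x: ("conv2d", x),
    "conv3d": lambda x: ("conv3d", x),
    "attention": lambda x: ("attention", x),
}

def sample_single_path(num_layers, rng=random):
    """Single-path one-shot sampling: choose exactly one candidate op
    per layer, uniformly at random (in the style of Guo et al., 2020)."""
    return [rng.choice(sorted(CANDIDATE_OPS)) for _ in range(num_layers)]

def forward(path, x):
    """Run the input through the sampled sub-architecture of the SuperNet."""
    for op_name in path:
        x = CANDIDATE_OPS[op_name](x)
    return x

random.seed(0)
path = sample_single_path(num_layers=4)  # one op name per layer
out = forward(path, "frames")
print(path)
```

After SuperNet training, candidate paths are scored with the shared weights, and Bayesian optimization steers which paths to evaluate next; the best-scoring path is then retrained from scratch as the final architecture.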



Figure 1: Left: Detecting shot boundaries can be challenging in short videos. A shot transition can be a combination of several complicated gradual transitions (first row) or a quick transition of the subject between two shots (second row). Intra-shot visual effects can vary greatly in game videos (third row). Right: Video and shot length (s) comparison of the test sets of ClipShots and our collected SHOT. There is little overlap in video length between short videos in SHOT and test videos in ClipShots (top). The shot lengths of short videos are within six seconds, while the shot lengths of ClipShots range from two seconds to 30 seconds (bottom).

