MOTION-INDUCTIVE SELF-SUPERVISED OBJECT DISCOVERY IN VIDEOS

Abstract

In this paper, we consider the task of unsupervised object discovery in videos. Previous works have shown promising results by processing optical flow to segment objects. However, taking flow as input brings two drawbacks. First, flow cannot capture sufficient cues when objects remain static or are partially occluded. Second, it is challenging to establish temporal coherence from flow-only input, due to the missing texture information. To tackle these limitations, we propose a model that directly processes consecutive RGB frames and infers the optical flow between any pair of frames using a layered representation, with the opacity channels treated as the segmentation. Additionally, to enforce object permanence, we apply a temporal consistency loss on the masks inferred from randomly paired frames, which correspond to motions at different paces, encouraging the model to segment objects even if they do not move at the current time point. Experimentally, we demonstrate superior performance over previous state-of-the-art methods on three public video segmentation datasets (DAVIS2016, SegTrackv2, and FBMS-59), while being computationally efficient by avoiding the overhead of computing optical flow as input.

1. INTRODUCTION

Representing the visual scene with objects as the basic elements has long been considered a fundamental cognitive ability of intelligent agents, as it enables more efficient understanding of and interaction with the world, for example, through combinatorial generalization in novel settings (Tenenbaum et al., 2011). Although it remains somewhat obscure at the neurophysiological level exactly how humans discover the objects in a visual scene in the first place, there is a consensus that motion plays an indispensable role in defining and discovering objects in a scene. For example, in 1923, Wertheimer introduced the common-fate principle, that elements moving together tend to be perceived as a group (Wertheimer, 1923); later, Gibson argued that independent motion can even serve as a defining attribute of a visual object (Gibson & Carmichael, 1966). Grounded on these assumptions, the recent literature has witnessed numerous models proposed for segmenting moving objects via unsupervised learning (Yang et al., 2019; 2021b;a; Liu et al., 2021). However, exploiting optical flow for object discovery incurs two critical limitations. First, objects in videos may stop moving or become partially occluded at any time point, leaving no effective cues for their existence in the flow field. Second, computing optical flow from a pair of frames is a lossy encoding procedure that poses a significant challenge for establishing temporal coherence, due to the lack of effective texture information. In contrast, adopting RGB frame sequences offers a few clear advantages.
The most obvious is that, while objects do not necessarily move all the time, the temporal coherence of the RGB space naturally provides a preliminary sense of object permanence. Additionally, the rich textures in the appearance stream give more distinctive patterns than those in motion, allowing the model to better identify and distinguish different objects. Last but not least, processing RGB streams enables faster inference than first computing optical flow.

In this paper, our goal is to train a video segmentation model that discovers the moving objects within a sequence of RGB frames, in the form of segmentation masks. Specifically, our proposed model first encodes consecutive frames independently into a set of frame-wise visual features, followed by temporal fusion with a Transformer encoder. To localise the moving objects, we randomly pair the visual features from two frames and pass them into a frame comparator module, effectively establishing the relative motion between frames. Inspired by Yang et al. (2021a), we decode the motion features into optical flows with a dual-layered representation, with the opacity weight of each layer treated as the segmentation mask. At training time, we exploit an off-the-shelf optical flow estimator, e.g., RAFT (Teed & Deng, 2020), as the inductive signal for flow reconstruction. To develop the property of object permanence, we enforce temporal consistency on the inferred segmentation masks, which encourages the model to mine effective texture information from the RGB sequence and keep track of objects even if they are static at the current time point.

In short, we summarize the contributions of this paper as follows. First, we introduce the Motion-inductive Object Discovery (MOD) model, a simple architecture for discovering moving objects in videos by directly processing a set of consecutive RGB frames. Second, we propose a self-supervised proxy task that trains the architecture without relying on any manual annotation; to overcome the challenge faced by flow-based methods, i.e., that objects may stay static or move slowly, we adopt a random-pairing policy and enforce temporal consistency. Third, we conduct a series of ablation studies to validate each key component of our method, such as the temporal consistency of randomly paired flows. Evaluating on three public benchmarks, we demonstrate superior performance over existing approaches on DAVIS2016 (Perazzi et al., 2016), SegTrackv2 (Li et al., 2013), and FBMS-59 (Ochs et al., 2013), with a considerable speed-up during inference.

2. RELATED WORK

Video Object Segmentation. Segmenting objects coherently in a video sequence extends the topic of instance segmentation in images. A great amount of work has addressed video object segmentation (VOS) in recent decades (Caelles et al., 2017; Hu et al., 2017; Fan et al., 2019; Dutt Jain et al., 2017; Lai & Xie, 2019; Maninis et al., 2018; Oh et al., 2019; Voigtlaender et al., 2019; Perazzi et al., 2017; Hu et al., 2018; Li & Loy, 2018; Bao et al., 2018; Johnander et al., 2019). Recently, research on dispensing with dense annotation and designing more effective self-supervised algorithms has attracted increasing interest in the computer vision community, including for VOS (Xu & Wang, 2021; Jabri et al., 2020; Lai et al., 2020; Li et al., 2019; Vondrick et al., 2018; Lu et al., 2020; Wang et al., 2019; Kipf et al., 2022). For VOS, there are two mainstream evaluation protocols: semi-supervised and unsupervised video object segmentation. Given the first-frame mask of the objects of interest, semi-supervised VOS tracks those objects in subsequent frames, while unsupervised VOS directly segments the most salient objects from the background without any reference. These two protocols are defined only in the inference phase, meaning that they constrain the evaluation rather than the training procedure.

Figure 1: Illustration of the Bicycle Motocross (BMX) sequence from SegTrackv2 (Li et al., 2013). The red boxes and the yellow boxes mark the arm and the back of the player, respectively. The flow-only method (Yang et al., 2021a) fails to track the same region in a temporally consistent fashion, since it derives the foreground region directly from the current optical flow. In contrast, our approach of processing an RGB video clip develops a sense of object permanence and resolves the issue.
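The dual-layered flow decoding and the two training objectives described above can be sketched as follows. This is a minimal NumPy illustration under assumed array shapes and function names, not the authors' implementation:

```python
import numpy as np

def composite_flow(fg_flow, bg_flow, alpha):
    """Blend the per-layer flow fields with the foreground opacity.

    fg_flow, bg_flow: (H, W, 2) flow fields of the two layers.
    alpha:            (H, W, 1) opacity in [0, 1]; this map doubles as
                      the object segmentation mask.
    """
    return alpha * fg_flow + (1.0 - alpha) * bg_flow

def reconstruction_loss(pred_flow, teacher_flow):
    # L1 regression toward the flow from an off-the-shelf estimator
    # (e.g. RAFT) computed on the same pair of frames.
    return np.abs(pred_flow - teacher_flow).mean()

def temporal_consistency_loss(alpha_a, alpha_b):
    # Masks inferred for the same reference frame from two different
    # random pairings (i.e. motions at different paces) should agree,
    # so the object is segmented even when one pairing shows no motion.
    return np.abs(alpha_a - alpha_b).mean()

# Toy example: a foreground region moving right over a static background.
H, W = 8, 8
fg_flow = np.tile(np.array([1.0, 0.0]), (H, W, 1))  # (H, W, 2)
bg_flow = np.zeros((H, W, 2))
alpha = np.zeros((H, W, 1))
alpha[2:6, 2:6] = 1.0                               # opaque object region
pred = composite_flow(fg_flow, bg_flow, alpha)
```

At inference time, only the opacity map is kept: thresholding it yields the final segmentation, without computing any flow as input.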

