MOTION-INDUCTIVE SELF-SUPERVISED OBJECT DISCOVERY IN VIDEOS

Abstract

In this paper, we consider the task of unsupervised object discovery in videos. Previous works have shown promising results by processing optical flow to segment objects. However, taking flow as input brings two drawbacks. First, flow cannot capture sufficient cues when objects remain static or are partially occluded. Second, it is challenging to establish temporal coherency from flow-only input, due to the missing texture information. To tackle these limitations, we propose a model that directly processes consecutive RGB frames and infers the optical flow between any pair of frames using a layered representation, treating the opacity channels as the segmentation. Additionally, to enforce object permanence, we apply a temporal consistency loss on the masks inferred from randomly-paired frames, which reflect motions at different paces, encouraging the model to segment objects even when they are not moving at the current time point. Experimentally, we demonstrate superior performance over previous state-of-the-art methods on three public video segmentation datasets (DAVIS2016, SegTrackv2, and FBMS-59), while being computationally efficient by avoiding the overhead of computing optical flow as input.
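To make the layered representation concrete, the following is a minimal sketch (not the paper's actual implementation; all function and variable names are hypothetical) of how per-layer flow fields can be composed into a single flow field using the layers' opacity (alpha) channels, with the foreground alphas doubling as segmentation masks:

```python
import numpy as np

def compose_layered_flow(layer_flows, layer_alphas):
    """Compose a full-frame flow field from per-layer flows.

    layer_flows:  (L, H, W, 2) array, one flow field per layer.
    layer_alphas: (L, H, W) array of opacities; normalised per pixel so the
                  layers form a soft partition of the image.
    Returns an (H, W, 2) flow field.
    """
    alphas = layer_alphas / layer_alphas.sum(axis=0, keepdims=True)
    return (alphas[..., None] * layer_flows).sum(axis=0)

# Toy example: a static background layer plus one foreground layer
# translating by (2, 2); the foreground alpha is the segmentation mask.
H, W = 4, 4
bg_flow = np.zeros((H, W, 2))
fg_flow = np.full((H, W, 2), 2.0)
alpha_fg = np.zeros((H, W))
alpha_fg[1:3, 1:3] = 1.0           # object occupies the centre 2x2 region
alpha_bg = 1.0 - alpha_fg
flow = compose_layered_flow(np.stack([bg_flow, fg_flow]),
                            np.stack([alpha_bg, alpha_fg]))
```

Pixels covered by the foreground layer inherit its motion, while the rest of the frame stays static, so reading off the opacity channel directly yields an object mask.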

1. INTRODUCTION

Representing the visual scene with objects as the basic elements has long been considered a fundamental cognitive ability of intelligent agents, as it enables more efficient understanding of and interaction with the world, for example, combinatorial generalization in novel settings (Tenenbaum et al., 2011). Although exactly how humans first discover objects in a visual scene remains somewhat obscure at the neurophysiological level, it is a consensus that motion plays an indispensable role in defining and discovering objects in the scene. For example, in 1923, Wertheimer introduced the common fate principle that elements moving together tend to be perceived as a group (Wertheimer, 1923); later, Gibson argued that independent motion can itself serve as a defining attribute of a visual object (Gibson & Carmichael, 1966). Grounded on the above assumptions, the recent literature has witnessed numerous works proposing different models for segmenting moving objects via unsupervised learning (Yang et al., 2019; 2021b; a; Liu et al., 2021). However, exploiting optical flow for object discovery incurs two critical limitations. First, objects in videos may stop moving or become partially occluded at any time point, leaving no effective cues for their existence in the flow field. Second, computing optical flow from a pair of frames is a lossy encoding procedure, which poses a significant challenge for establishing temporal coherence due to the lack of effective texture information. In contrast, adopting RGB frame sequences offers a few clear advantages.
The most obvious one is that, while objects do not necessarily move all the time, the property of temporal coherence in RGB space naturally guarantees a preliminary understanding of object permanence. Additionally, the rich textures in the appearance stream provide more distinctive patterns than motion alone, allowing the model to better identify and distinguish different objects. Last but not least, processing RGB streams enables faster processing than using optical flow as input. In this paper, our goal is to train a video segmentation model that can discover the moving objects within a sequence of RGB frames, in the form of segmentation. Specifically, our proposed model first encodes consecutive frames independently into a set of frame-wise visual features, followed by temporal fusion with a Transformer encoder. To localise the moving objects, we randomly
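The temporal consistency idea described above can be sketched as follows. This is an illustrative assumption about the loss form, not the paper's exact objective: the mask predicted for a reference frame t should not depend on which partner frame it was paired with, so masks inferred from two randomly sampled frame gaps (i.e., motions at different paces) are penalised for disagreeing:

```python
import numpy as np

def consistency_loss(mask_a, mask_b):
    """L1 disagreement between two soft masks predicted for the same frame.

    mask_a, mask_b: (H, W) arrays in [0, 1], each inferred for reference
    frame t but paired with a different second frame (t+k1 vs. t+k2).
    Identical predictions give zero loss; static objects that only show up
    in one pairing are thus still encouraged to be segmented in both.
    """
    return np.abs(mask_a - mask_b).mean()

# Masks for the same frame t from two random pairings (hypothetical values).
mask_k1 = np.array([[0.9, 0.1],
                    [0.8, 0.2]])
mask_k2 = np.array([[0.7, 0.1],
                    [0.9, 0.3]])
loss = consistency_loss(mask_k1, mask_k2)
```

Because the two pairings correspond to different motion magnitudes, minimising this term pushes the model to keep segmenting an object even in a pairing where it barely moves, which is how the loss enforces object permanence.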

