R-MONET: REGION-BASED UNSUPERVISED SCENE DECOMPOSITION AND REPRESENTATION VIA CONSISTENCY OF OBJECT REPRESENTATIONS

Abstract

Decomposing a complex scene into multiple objects is a natural instinct of an intelligent vision system. Interest in unsupervised scene representation learning has recently emerged, and many previous works tackle it by decomposing scenes into object representations, either as segmentation masks or as position and scale latent variables (i.e. bounding boxes). We observe that these two types of representation both contain object geometric information and should be consistent with each other. Inspired by this observation, we propose an unsupervised generative framework called R-MONet that generates objects' geometric representations as bounding boxes and segmentation masks simultaneously. While bounding boxes provide the region of interest (ROI) for generating foreground segmentation masks, the foreground segmentation masks can in turn supervise bounding-box learning via the Multi-Otsu thresholding method. Through experiments on the CLEVR and Multi-dSprites datasets, we show that ensuring the consistency of the two types of representation helps the model decompose the scene and learn better object geometric representations.



Introduction

Recent progress in unsupervised scene decomposition and representation learning shows that a complex visual scene containing many objects can be properly decomposed without human labels, and that much useful information can still be discovered in unlabeled data. Recent approaches to unsupervised scene decomposition and representation learning can be categorized into two groups: models which explicitly acquire disentangled position and scale (i.e. bounding box) representations of objects (Eslami et al. (2016); Crawford & Pineau (2019)), and models which implicitly encode objects' geometric representation into segmentation masks (Burgess et al. (2019); Greff et al. (2019); Engelcke et al. (2020)). In the former type of model, the scene is explicitly encoded into object-oriented spatial and appearance encodings, and a decoder regenerates the scene from the explicitly defined object encodings for representation learning. Such models cannot use rectangular bounding boxes to fully represent complex objects with flexible morphology. In the latter type, the scene is decomposed into a finite number of object segmentation masks, which can better represent complex objects through pixel-to-pixel alignment. However, these models only use the segmentation masks as pixel-wise object mixture weights: they do not utilize the geometric information in the masks and still entangle object position and appearance representations in the scene generation step. They also tend to decompose the entire scene at once, which forgoes the locality benefit of objects. Inspired by the observation that foreground segmentation masks and bounding boxes both contain object geometric information and should be consistent with each other, this paper proposes R-MONet (Region-based Multiple Object Net). R-MONet follows the spirit of MONet (Burgess et al. (2019)) and S4Net (Fan et al. (2019)) by using a single-stage, non-iterative network (the spatial attention module) to generate object geometric representations as both bounding boxes and segmentation masks.
Then, a variational autoencoder (VAE) (Kingma & Welling (2013)) encodes object appearance representations and regenerates the scene for training. To ensure consistency between bounding boxes and foreground segmentation masks, the bounding boxes generated by the spatial attention module are supervised with pseudo bounding boxes obtained by applying the Multi-Otsu thresholding method (Liao et al. (2001)) to the foreground segmentation masks. Moreover, foreground instance segmentation is performed only inside the bounding box area rather than over the full image, which exploits spatial locality and makes scene generation less complex. The contributions of this paper are:

- We introduce an effective single-stage, non-iterative framework that generates object geometric representations as both bounding boxes and segmentation masks for unsupervised scene decomposition and representation learning.
- We propose a self-supervised method that better utilizes object geometric information by ensuring consistency between bounding boxes and foreground segmentation masks, improving scene decomposition performance compared with the state of the art.
- We design a new segmentation head that preserves global context and prevents coordinate misalignment in small feature maps, which improves foreground segmentation performance.

Among prior work, SPACE (Lin et al. (2020)) is the closest to ours in spirit. It leverages an encoder similar to SPAIR to process foreground objects in parallel with explicit positional encoding, and adapts segmentation masks for background modeling. However, unlike R-MONet, it does not use the information shared between bounding boxes and segmentation masks.
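The pseudo-box supervision step can be illustrated with a small sketch. The NumPy-only code below is a hypothetical illustration, not the paper's implementation: a brute-force two-threshold Multi-Otsu over a soft foreground mask, followed by extraction of a tight bounding box from the pixels above the upper threshold. The function names (`multi_otsu_thresholds`, `pseudo_bbox`) and the 64-bin histogram are our assumptions.

```python
import numpy as np

def multi_otsu_thresholds(img, bins=64):
    # Brute-force two-threshold Multi-Otsu: choose (t1, t2) maximizing the
    # between-class variance (equivalently, sum_k w_k * mu_k^2) over a
    # coarse histogram of pixel values in [0, 1].
    hist, edges = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    centers = (edges[:-1] + edges[1:]) / 2.0
    best_score, best = -1.0, (0.0, 0.0)
    for i in range(1, bins - 1):
        for j in range(i + 1, bins):
            score, valid = 0.0, True
            for lo, hi in ((0, i), (i, j), (j, bins)):
                w = p[lo:hi].sum()
                if w == 0.0:
                    valid = False
                    break
                mu = (p[lo:hi] * centers[lo:hi]).sum() / w
                score += w * mu * mu
            if valid and score > best_score:
                best_score, best = score, (edges[i], edges[j])
    return best  # (lower threshold, upper threshold)

def pseudo_bbox(mask, threshold):
    # Tight (x_min, y_min, x_max, y_max) box around pixels above threshold.
    ys, xs = np.nonzero(mask > threshold)
    if xs.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

The upper threshold separates confident foreground from ambiguous mask values, so the resulting box hugs the object core rather than the soft mask halo; such boxes could then serve as regression targets for the spatial attention module.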



Related Work

In recent years, supervised object detection and segmentation (He et al. (2017); Ren et al. (2015); Fan et al. (2019); Liao et al. (2001); Lin et al. (2017); Ronneberger et al. (2015)) have made great progress with extensive human labels. However, these supervised methods are still unable to take advantage of massive unlabeled vision data, and unsupervised learning of scene representations has become a key challenge in computer vision. Breakthrough works (Burgess et al. (2019); Greff et al. (2019); Eslami et al. (2016); Crawford & Pineau (2019); Engelcke et al. (2020); Greff et al. (2017); Van Steenkiste et al. (2018); Pathak et al. (2016); Lin et al. (2020)) either explicitly acquire disentangled position and scale representations of objects (Eslami et al. (2016); Crawford & Pineau (2019); Lin et al. (2020)) or implicitly encode objects' geometric representation into segmentation masks, entangling it with object appearance representations (Burgess et al. (2019); Greff et al. (2019); Engelcke et al. (2020); Greff et al. (2017); Van Steenkiste et al. (2018)).

There have been many influential works (Burgess et al. (2019); Greff et al. (2019); Eslami et al. (2016); Crawford & Pineau (2019); Engelcke et al. (2020); Greff et al. (2017); Van Steenkiste et al. (2018); Pathak et al. (2016); Lin et al. (2020)) in unsupervised scene decomposition in recent years. Some models explicitly factor an object representation into spatial and appearance encodings such as 'what', 'where', and 'presence' with the help of a VAE (Kingma & Welling (2013)). Influential models of this kind include AIR (Eslami et al. (2016)) and its successor SPAIR (Crawford & Pineau (2019)). AIR uses a recurrent neural network as the encoder to decompose a complex scene into object representations, but it suffers from slow iterative inference. SPAIR improves both bounding-box average precision and running speed by using a convolutional neural network as the encoder to generate object representations in parallel. However, these models have not been tested on photorealistic 3D object datasets, and bounding boxes cannot fully represent flexible morphology the way foreground segmentation masks can. The other type of model decomposes each object into its own representation without explicit positional encoding and uses segmentation masks to mix object reconstructions. An influential example is MONet (Burgess et al. (2019)), which leverages a UNet (Ronneberger et al. (2015)) variant as an iterative attention network for segmentation mask generation and a Spatial Broadcast Decoder (Watters et al. (2019)) for representation learning via scene reconstruction. The Spatial Broadcast Decoder replaces a deconvolutional network by tiling (broadcasting) the latent vector across space and concatenating fixed X- and Y-coordinate channels; this provides better disentanglement between positional and non-positional features in the latent distribution. IODINE (Greff et al. (2019)) tackles the problem with amortized iterative refinement of foreground and background representations.
However, its iterative refinement process heavily impacts training and inference speed. GENESIS (Engelcke et al. (2020)) uses a similar idea to MONet but with different latent encodings at different iterative steps. These models all focus on decomposing the entire scene, which does not leverage the spatial locality around each object. SPACE (Lin et al. (2020)), discussed in the introduction, combines explicit positional encoding for foreground objects with segmentation masks for background modeling.
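The Spatial Broadcast Decoder's tiling step described above is easy to sketch. The snippet below is an illustrative NumPy version (the function name and the [-1, 1] coordinate convention are our assumptions, not the original code): the latent vector is tiled over an H×W grid and two fixed coordinate channels are appended, producing the tensor a shallow convolutional decoder would consume.

```python
import numpy as np

def spatial_broadcast(z, height, width):
    # Tile ("broadcast") a latent vector z of shape (D,) across a
    # height x width grid, then append fixed Y and X coordinate channels
    # in [-1, 1], yielding a (D + 2, height, width) tensor.
    tiled = np.broadcast_to(z[:, None, None], (z.shape[0], height, width))
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, height),
        np.linspace(-1.0, 1.0, width),
        indexing="ij",
    )
    return np.concatenate([tiled, ys[None], xs[None]], axis=0)
```

Because every spatial position receives the same latent vector, positional information can only enter through the fixed coordinate channels, which is what encourages the disentanglement of position from appearance noted above.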

