R-MONET: REGION-BASED UNSUPERVISED SCENE DECOMPOSITION AND REPRESENTATION VIA CONSISTENCY OF OBJECT REPRESENTATIONS

Abstract

Decomposing a complex scene into multiple objects is a natural instinct of an intelligent vision system. Interest in unsupervised scene representation learning has recently emerged, and many previous works tackle it by decomposing scenes into object representations in the form of either segmentation masks or position and scale latent variables (i.e., bounding boxes). We observe that these two types of representation both contain object geometric information and should be consistent with each other. Inspired by this observation, we propose an unsupervised generative framework called R-MONet that generates geometric representations of objects in the form of bounding boxes and segmentation masks simultaneously. While the bounding boxes provide the regions of interest (ROIs) for generating foreground segmentation masks, the foreground segmentation masks can in turn supervise bounding-box learning via the Multi-Otsu thresholding method. Through experiments on the CLEVR and Multi-dSprites datasets, we show that enforcing the consistency of the two types of representation helps the model decompose the scene and learn better object geometric representations.
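To make the mask-to-box supervision concrete, the following is a minimal illustrative sketch (not the paper's implementation) of how a predicted soft foreground mask could be converted into a bounding-box pseudo-target via Multi-Otsu thresholding. It assumes the mask is an array of values in [0, 1] and uses scikit-image's threshold_multiotsu; the helper name mask_to_bbox and the choice of three classes are our assumptions.

```python
# Minimal sketch: derive a bounding box from a soft foreground mask
# using Multi-Otsu thresholding (scikit-image) and NumPy.
import numpy as np
from skimage.filters import threshold_multiotsu

def mask_to_bbox(soft_mask: np.ndarray, classes: int = 3):
    """Threshold a soft mask in [0, 1] of shape (H, W) and return
    (x_min, y_min, x_max, y_max), or None if no foreground survives."""
    # Multi-Otsu returns (classes - 1) thresholds; pixels above the
    # highest threshold are treated as confident foreground.
    thresholds = threshold_multiotsu(soft_mask, classes=classes)
    foreground = soft_mask > thresholds[-1]
    ys, xs = np.nonzero(foreground)
    if ys.size == 0:  # empty mask: nothing to supervise
        return None
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1
```

The resulting box could then act as a pseudo-target for the predicted position/scale latents, e.g. through an L1 or IoU-style consistency loss.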



Introduction

Supervised object detection and segmentation (He et al. (2017); Ren et al. (2015); Fan et al. (2019); Liao et al. (2001); Lin et al. (2017); Ronneberger et al. (2015)) have made great progress with extensive human labels. However, these supervised methods are still unable to take advantage of the massive amount of unlabeled vision data, and unsupervised learning of scene representations has become a key challenge in computer vision. The breakthroughs in unsupervised scene decomposition and representation learning (Burgess et al. (2019); Greff et al. (2019); Eslami et al. (2016); Crawford & Pineau (2019); Engelcke et al. (2020); Greff et al. (2017); Van Steenkiste et al. (2018); Pathak et al. (2016); Lin et al. (2020)) prove that a complex visual scene containing many objects can be properly decomposed without human labels, and that much useful information can still be discovered in unlabeled data. Recent approaches to unsupervised scene decomposition and representation learning fall into two groups: models that explicitly acquire disentangled position and scale (i.e., bounding box) representations of objects (Eslami et al. (2016); Crawford & Pineau (2019); Lin et al. (2020)), and models that implicitly encode objects' geometric representations into segmentation masks or entangle them with object appearance representations (Burgess et al. (2019); Greff et al. (2019); Engelcke et al. (2020); Greff et al. (2017); Van Steenkiste et al. (2018)).

In the former type of model, the scene is explicitly encoded into object-oriented spatial and appearance encodings, and a decoder generates the scene from these explicitly defined object encodings for representation learning. Such models cannot use rectangular bounding boxes to fully represent complex objects with flexible morphology. In the latter type of model, the scene is decomposed into a finite number of object segmentation masks, which can better represent complex objects through pixel-to-pixel alignment. However, these models only use the segmentation masks as pixel-wise object mixture weights: they do not utilize the geometric information in the segmentation masks and still entangle object position and appearance representations in the scene generation step. Moreover, such models tend to decompose the entire image at once, which forgoes the locality benefit of objects.
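For the latter family (MONet-style models), the scene is typically reconstructed as a pixel-wise mixture of K component reconstructions weighted by their masks, roughly of the following form (notation ours; $m_k$ denotes the $k$-th segmentation mask and $\hat{x}_k$ the corresponding component reconstruction):

$$\hat{x} = \sum_{k=1}^{K} m_k \odot \hat{x}_k, \qquad \sum_{k=1}^{K} m_k = \mathbf{1}.$$

Here the masks serve only as mixture weights, which is precisely the limitation noted above: the geometry they encode is never exposed as an explicit object representation.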

