RETHINKING CONTENT AND STYLE: EXPLORING BIAS FOR UNSUPERVISED DISENTANGLEMENT

Abstract

Content and style (C-S) disentanglement aims to decompose the underlying explanatory factors of objects into two independent latent spaces. Aiming for unsupervised disentanglement, we introduce an inductive bias into our formulation by assigning different and independent roles to content and style when approximating the real data distribution. The content embeddings of individual images are forced to share a common distribution, while the style embeddings, which encode instance-specific features, are used to customize the shared distribution. Experiments on several popular datasets demonstrate that our method achieves state-of-the-art disentanglement compared to other unsupervised approaches, and comparable or even better results than supervised methods. Furthermore, as a new application of C-S disentanglement, we propose to generate multi-view images from a single-view image for 3D reconstruction.

1. INTRODUCTION

The disentanglement task aims to recover the underlying explanatory factors of natural images into different dimensions of a latent space and to provide an informative representation for tasks such as image translation (Wu et al., 2019b; Kotovenko et al., 2019), domain adaptation (Li et al., 2019; Zou et al., 2020), and geometric attribute extraction (Wu et al., 2019c; Xing et al., 2019). Previous methods (Kim & Mnih, 2018; Higgins et al., 2017; Burgess et al., 2018a; Kumar et al., 2017) learn disentangled factors by optimizing the total correlation in an unsupervised manner. However, Locatello et al. (2019) prove that unsupervised disentanglement is fundamentally impossible without inductive biases on both the model and the data. In this paper, we focus on content and style (C-S) disentanglement, where content and style represent two separate groups of factors. The main novelty of our work is that we assign different roles to content and style in modeling the image distribution instead of treating the factors equally; this is the inductive bias introduced in our method. Most previous C-S disentanglement works (Denton & Birodkar, 2017; Jha et al., 2018; Bouchacourt et al., 2018; Gabbay & Hoshen, 2020) rely on supervision, which is hard to obtain for real data. E.g., Gabbay & Hoshen (2020) leverage group observations to achieve disentanglement by forcing images from the same group to share a common embedding. To the best of our knowledge, the only exception is Wu et al. (2019c); however, this method forces the content path to learn geometric structure limited by 2D landmarks. Our definition of content and style is similar to that of Gabbay & Hoshen (2020), where content includes the information that can be transferred among groups and style is image-specific information. When group observation is not available, we define content as the factors shared across the whole dataset, such as pose.
Take the human face dataset CelebA (Liu et al., 2015) as an example: the content encodes pose and the style encodes identity, so multiple views of the same identity share the same style embedding but have different content embeddings, i.e., poses. Based on the above definitions, we propose a new problem formulation and network architecture by introducing an inductive bias: assigning different and independent roles to content and style when approximating the real data distribution. Specifically, as shown in Figure 1, we force the content embeddings of individual images to share a common distribution, and the style embeddings are used to scale and shift this common distribution to match the target image distribution via a generator.
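The scale-and-shift customization described above can be illustrated with a minimal sketch. This is not the paper's actual architecture: the linear maps `W_gamma` and `W_beta`, the dimensions, and the function names are all hypothetical, standing in for whatever network predicts the affine parameters from the style embedding before the result is fed to the generator.

```python
import numpy as np

rng = np.random.default_rng(0)

def style_modulation(content, style, W_gamma, W_beta):
    """Customize shared content codes with image-specific style.

    The style embedding predicts a per-dimension scale (gamma) and
    shift (beta); the content code, drawn from a distribution shared
    by all images, is then affinely transformed before generation.
    """
    gamma = style @ W_gamma   # per-dimension scale from style
    beta = style @ W_beta     # per-dimension shift from style
    return gamma * content + beta

d_content, d_style = 8, 4

# Content embeddings of all images share one common distribution
# (a standard normal here, purely for illustration).
content = rng.standard_normal((2, d_content))

# Style embeddings are instance-specific.
style = rng.standard_normal((2, d_style))

# Hypothetical linear maps from style to affine parameters.
W_gamma = rng.standard_normal((d_style, d_content)) * 0.1
W_beta = rng.standard_normal((d_style, d_content)) * 0.1

z = style_modulation(content, style, W_gamma, W_beta)
print(z.shape)  # one modulated latent per image, ready for the generator
```

Note that two images with the same style but different content codes would be modulated by the same affine parameters, which is the mechanism that lets identity stay fixed while pose varies.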

