RETHINKING CONTENT AND STYLE: EXPLORING BIAS FOR UNSUPERVISED DISENTANGLEMENT

Abstract

Content and style (C-S) disentanglement aims to decompose the underlying explanatory factors of objects into two independent latent spaces. Aiming for unsupervised disentanglement, we introduce an inductive bias to our formulation by assigning different and independent roles to content and style when approximating the real data distributions. The content embeddings of individual images are forced to share a common distribution. The style embeddings, encoding instance-specific features, are used to customize the shared distribution. The experiments on several popular datasets demonstrate that our method achieves state-of-the-art disentanglement compared to other unsupervised approaches and comparable or even better results than supervised methods. Furthermore, as a new application of C-S disentanglement, we propose to generate multi-view images from a single-view image for 3D reconstruction.

1. INTRODUCTION

The disentanglement task aims to recover the underlying explanatory factors of natural images into different dimensions of a latent space, providing an informative representation for tasks such as image translation (Wu et al., 2019b; Kotovenko et al., 2019), domain adaptation (Li et al., 2019; Zou et al., 2020), and geometric attribute extraction (Wu et al., 2019c; Xing et al., 2019). Previous methods (Kim & Mnih, 2018; Higgins et al., 2017; Burgess et al., 2018a; Kumar et al., 2017) learn disentangled factors by optimizing the total correlation in an unsupervised manner. However, Locatello et al. (2019) prove that unsupervised disentanglement is fundamentally impossible without inductive bias on both the model and the data. In this paper, we focus on content and style (C-S) disentanglement, where content and style represent two separate groups of factors. The main novelty of our work is that we assign different roles to content and style in modeling the image distribution instead of treating the factors equally; this is the inductive bias introduced in our method. Most previous C-S disentanglement works (Denton & Birodkar, 2017; Jha et al., 2018; Bouchacourt et al., 2018; Gabbay & Hoshen, 2020) rely on supervision, which is hard to obtain for real data. E.g., Gabbay & Hoshen (2020) leverage group observation to achieve disentanglement by forcing images from the same group to share a common embedding. To the best of our knowledge, the only exception is Wu et al. (2019c); however, this method forces the content path to learn geometric structure limited by 2D landmarks. Our definition of content and style is similar to that of Gabbay & Hoshen (2020), where content includes the information that can be transferred among groups and style is image-specific information. When group observation is not available, we define content as the factors shared across the whole dataset, such as pose.
Take the human face dataset CelebA (Liu et al., 2015) as an example: the content encodes pose and the style encodes identity, so multiple views of the same identity have the same style embedding but different content embeddings, i.e., poses. Based on the above definitions, we propose a new problem formulation and network architecture by introducing an inductive bias: assigning different and independent roles to content and style when approximating the real data distributions. Specifically, as shown in Figure 1, we force the content embeddings of individual images to share a common distribution, and the style embeddings are used to scale and shift the common distribution to match the target image distribution via a generator. We follow Bojanowski et al. (2018) and Gabbay & Hoshen (2020) to apply latent optimization to optimize the embeddings and the parameters of the generator. We also propose to use instance discrimination as a complementary constraint to assist the disentanglement. Please note that we only use the image reconstruction loss as supervision; no extra labeling is needed. As content and style perform different and independent roles when modeling the data, they are disentangled to encode the shared and instance-specific features respectively after the optimization. The contributions of our work are as follows: we achieve unsupervised C-S disentanglement by introducing an inductive bias in our formulation, assigning different and independent roles to content and style when modeling the real data distributions. Furthermore, we achieve better C-S disentanglement by leveraging instance discrimination. The experiments on several popular datasets demonstrate that our method achieves state-of-the-art unsupervised C-S disentanglement and comparable or even better results than supervised methods. Besides, we propose to apply C-S disentanglement to a new task: single-view 3D reconstruction.
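The scale-and-shift customization described above can be sketched as follows. This is a minimal, hypothetical sketch: the dimensions, the layer names `f_sigma` and `f_mu`, and the sampled inputs are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class StyleModulation(nn.Module):
    """Customize a shared content distribution with per-image style.

    Two fully-connected layers map the style embedding to a per-dimension
    scale (f_sigma) and shift (f_mu) applied to the content embedding,
    whose result would then be fed to a generator.
    """
    def __init__(self, content_dim=128, style_dim=256):
        super().__init__()
        self.f_sigma = nn.Linear(style_dim, content_dim)  # predicts scale
        self.f_mu = nn.Linear(style_dim, content_dim)     # predicts shift

    def forward(self, content, style):
        # content: embeddings drawn from the shared distribution Psi
        # style:   instance-specific embeddings
        sigma = self.f_sigma(style)
        mu = self.f_mu(style)
        return sigma * content + mu  # customized latent for the generator

mod = StyleModulation()
c = torch.randn(4, 128)  # content sampled from the shared distribution
s = torch.randn(4, 256)  # per-image style embeddings
z = mod(c, s)            # shape: (4, 128)
```

Because the same content embedding can be modulated by different style embeddings, this factorization directly supports the row/column recombination shown in Figure 1.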

2. RELATED WORK

Unsupervised Disentanglement. A disentangled representation can be defined as one where individual latent units are sensitive to changes in individual generative factors. There have been many studies on unsupervised disentangled representation learning (Higgins et al., 2017; Burgess et al., 2018a; Kumar et al., 2017; Kim & Mnih, 2018; Chen et al., 2018). These models learn disentangled factors by factorizing the aggregated posterior. They can also be used for C-S disentanglement: the learned factors can be divided into two categories, one content-related and the other style-related. However, Locatello et al. (2019) proved that unsupervised disentanglement is impossible without introducing inductive bias on both models and data. Therefore, these models are currently unable to obtain a promising disentangled representation. Motivated by Locatello et al. (2019), we revisit and formulate the unsupervised C-S disentanglement problem to introduce inductive bias.

C-S Disentanglement. Originating from style transfer, most prior works on C-S disentanglement divide latent variables into two spaces relying on supervision. To achieve disentanglement, Mathieu et al. (2016) and Szabó et al. (2018) combine an adversarial constraint with auto-encoders. Meanwhile, the VAE (Kingma & Welling, 2014) is used with non-adversarial constraints, such as cycle consistency (Jha et al., 2018) and evidence accumulation (Bouchacourt et al., 2018). Furthermore, latent optimization is shown to be superior to amortized inference (Gabbay & Hoshen, 2020). Unlike the above works, Wu et al. (2019c) propose a variational U-Net with structure learning for disentanglement in an unsupervised manner; however, this method is limited by the learning of 2D landmarks. In our paper, we formulate C-S disentanglement and explore inductive bias for unsupervised disentanglement.
Note that style transfer aims at modifying the domain style of an image while preserving its content, and its formulation focuses on the relation between domains (Huang et al., 2018a) . Our formulation is defined in a single domain but can be extended to cross-domain, as presented in Appendix G.



Figure 1: Overview of our framework. c_i, c_j, c_k, labelled with different shapes, are embeddings sampled from a shared distribution Ψ. s_i, s_j, s_k, labelled with different colors, are embeddings from the style latent space. f_σ and f_μ are two fully-connected layers predicting the statistics used to scale and shift Ψ respectively, approximating the target image distributions via a generator. For each generated image in the 3 × 3 grid, the content and style embeddings are taken from its column and row respectively.
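The latent optimization used here, following Bojanowski et al. (2018) and Gabbay & Hoshen (2020), treats the per-image embeddings themselves as trainable parameters rather than predicting them with an encoder. The toy sketch below illustrates the idea under stated assumptions: a linear stand-in generator, made-up dimensions, and plain concatenation of content and style, none of which reflect the actual architecture.

```python
import torch

def latent_optimize(images, c_dim=16, s_dim=32, steps=50, lr=1e-2):
    """Jointly optimize per-image embeddings and generator weights.

    Only a reconstruction loss supervises the optimization; no labels
    are used. Returns the per-step losses.
    """
    n, d = images.shape
    # One trainable content and style embedding per image.
    content = torch.randn(n, c_dim, requires_grad=True)
    style = torch.randn(n, s_dim, requires_grad=True)
    generator = torch.nn.Linear(c_dim + s_dim, d)  # toy stand-in generator

    opt = torch.optim.Adam(
        [content, style, *generator.parameters()], lr=lr
    )
    losses = []
    for _ in range(steps):
        recon = generator(torch.cat([content, style], dim=1))
        loss = ((recon - images) ** 2).mean()  # reconstruction loss only
        opt.zero_grad()
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses
```

In practice the generator is a deep network and the content/style embeddings enter through the scale-and-shift mechanism of Figure 1; the sketch only shows how embeddings and generator parameters can share a single optimizer.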

