DECOUPLING GLOBAL AND LOCAL REPRESENTATIONS VIA INVERTIBLE GENERATIVE FLOWS

Abstract

In this work, we propose a new generative model that is capable of automatically decoupling global and local representations of images in an entirely unsupervised setting, by embedding a generative flow in the VAE framework to model the decoder. Specifically, the proposed model utilizes the variational auto-encoding framework to learn a (low-dimensional) vector of latent variables that captures the global information of an image, which is fed as a conditional input to a flow-based invertible decoder whose architecture is borrowed from the style transfer literature. Experimental results on standard image benchmarks demonstrate the effectiveness of our model in terms of density estimation, image generation, and unsupervised representation learning. Importantly, this work demonstrates that with only architectural inductive biases, a generative model with a likelihood-based objective is capable of learning decoupled representations, requiring no explicit supervision. The code for our model is available at https://github.com/XuezheMax/wolf.

1. INTRODUCTION

Unsupervised learning of probabilistic models and meaningful representation learning are two central yet challenging problems in machine learning. Formally, let $X \in \mathcal{X}$ be the random variable of the observed data, e.g., $X$ is an image. One goal of generative models is to learn the parameters $\theta$ such that the model distribution $P_\theta(X)$ best approximates the true distribution $P(X)$. Throughout the paper, uppercase letters represent random variables and lowercase letters their realizations. Besides data distribution estimation and data generation, unsupervised (disentangled) representation learning is also a principal goal of generative models. The aim is to identify and disentangle the underlying causal factors, teasing apart the underlying dependencies of the data so that it becomes easier to understand, to classify, or to perform other tasks (Bengio et al., 2013). Unsupervised representation learning has spawned significant interest, and a number of techniques (Chen et al., 2017a; Devlin et al., 2019; Hjelm et al., 2019) have emerged over the years to address this challenge. Among these generative models, VAEs (Kingma & Welling, 2014; Rezende et al., 2014) and Generative (Normalizing) Flows (Dinh et al., 2014) have stood out for their simplicity and effectiveness.

1.1. VARIATIONAL AUTO-ENCODERS (VAES)

The VAE, as a member of the family of latent variable models (LVMs), has gained popularity for its capability of automatically learning meaningful (low-dimensional) representations from raw data. In the VAE framework, a set of latent variables $Z \in \mathcal{Z}$ is introduced, and the model distribution $P_\theta(X)$ is defined as the marginal of the joint distribution between $X$ and $Z$:
$$p_\theta(x) = \int_{\mathcal{Z}} p_\theta(x, z)\,d\mu(z) = \int_{\mathcal{Z}} p_\theta(x|z)\,p_\theta(z)\,d\mu(z), \quad \forall x \in \mathcal{X},$$
where the joint distribution $p_\theta(x, z)$ is factorized as the product of a prior $p_\theta(z)$ over the latent $Z$ and the "generative" distribution $p_\theta(x|z)$, and $\mu(z)$ is the base measure on the latent space $\mathcal{Z}$.
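To make the marginal-likelihood factorization above concrete, the following is a minimal numerical sketch (not the paper's model): we estimate $p_\theta(x) = \int p_\theta(x|z)\,p_\theta(z)\,d\mu(z)$ by Monte Carlo, using a one-dimensional standard-normal prior and a Gaussian "decoder" whose mean function `mu(z) = 2z` is a hypothetical stand-in for a generative network. With this toy choice the exact marginal is available in closed form ($\mathcal{N}(0, 2^2 + 1)$), so we can check the estimate against it.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def marginal_likelihood(x, n_samples=200_000, seed=0):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[ p(x|z) ].

    Toy model (illustrative assumptions, not the paper's architecture):
      prior   p(z)   = N(0, 1)
      decoder p(x|z) = N(mu(z), 1) with mu(z) = 2z
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_samples)      # draw z ~ p(z)
    mu = 2.0 * z                            # decoder mean mu(z); toy stand-in
    return gaussian_pdf(x, mu, 1.0).mean()  # average of p(x|z) over samples

# Exact marginal for this toy model: x ~ N(0, 5), since Var = 2^2 * 1 + 1.
estimate = marginal_likelihood(0.0)
exact = gaussian_pdf(0.0, 0.0, np.sqrt(5.0))
print(estimate, exact)
```

In a real VAE this integral is intractable because the decoder is a deep network, which is why training instead maximizes the evidence lower bound with an amortized inference network; the sketch only illustrates what quantity the marginalization defines.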

