ADIS-GAN

Abstract

This paper proposes Affine Disentangled GAN (ADIS-GAN), a Generative Adversarial Network that explicitly disentangles affine transformations in a self-supervised and rigorous manner. The objective is inspired by InfoGAN, with an additional affine regularizer acting as the inductive bias. The affine regularizer is rooted in the affine transformation properties of images: each transformation changes certain properties of the underlying image while leaving all others invariant. We derive the affine regularizer by decomposing the affine matrix into separate transformation matrices and inferring the transformation parameters by maximum likelihood estimation. Unlike the disentangled representations learned by existing approaches, the features learned by ADIS-GAN are axis-aligned and scalable: transformations such as rotation, horizontal and vertical zoom, horizontal and vertical skew, and horizontal and vertical translation can be explicitly selected and learned. ADIS-GAN successfully disentangles these features on the MNIST, CelebA, and dSprites datasets.

1. INTRODUCTION

In a disentangled representation, observations are interpreted in terms of a few explanatory factors. Examples of such factors are the rotation angle, scale, or position of an object in an image. Disentangled variables are generally considered an abstraction of interpretable semantic information and a reflection of the separable factors of variation in the data. Many studies have explored the effectiveness of disentangled representations (Bengio et al., 2013; N et al., 2017; LeCun et al., 2015; Lake et al., 2017; Tschannen et al., 2018). The information present in observations is encoded in an interpretable and compact manner, e.g., the texture style and the orientation of the objects (Bengio et al., 2013; LeCun et al., 2015; Lake et al., 2017; Tschannen et al., 2018). The learned representations are more generalizable and can be useful for downstream tasks, such as classification and visualization (Bengio et al., 2013; N et al., 2017; Chen et al., 2016). The concept of disentangled representation has been defined in several ways in the literature (Locatello et al., 2019; Higgins et al., 2018; Eastwood & Williams, 2018). The necessity of explicit inductive biases, both in the learning approach and in the dataset, is discussed in Locatello et al. (2019). An inductive bias is the set of assumptions a learner uses to predict outputs for inputs it has not encountered. For instance, in the dSprites dataset objects are displayed at different angles and positions; such prior knowledge helps to detect and classify the objects. However, the inductive biases in existing deep learning models are mostly implicit. The proposed ADIS-GAN uses relative affine transformations (see Section 4) as the explicit inductive bias, leading to axis-aligned and scalable disentangled representations.
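To make the decomposition concrete, the following sketch composes a 2-D affine matrix from the individual factors ADIS-GAN disentangles (rotation, horizontal/vertical zoom, horizontal/vertical skew, horizontal/vertical translation). This is an illustrative construction only, not the paper's implementation; the function name and the composition order are our assumptions.

```python
import numpy as np

def affine_matrix(theta=0.0, zx=1.0, zy=1.0, kx=0.0, ky=0.0, tx=0.0, ty=0.0):
    """Compose a 3x3 homogeneous affine matrix from separate factor matrices.

    theta: rotation angle; zx, zy: horizontal/vertical zoom;
    kx, ky: horizontal/vertical skew; tx, ty: horizontal/vertical translation.
    (Hypothetical helper for illustration; the composition order is an assumption.)
    """
    rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                         [np.sin(theta),  np.cos(theta), 0.0],
                         [0.0,            0.0,           1.0]])
    zoom = np.diag([zx, zy, 1.0])
    skew = np.array([[1.0, kx, 0.0],
                     [ky, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
    translation = np.array([[1.0, 0.0, tx],
                            [0.0, 1.0, ty],
                            [0.0, 0.0, 1.0]])
    return translation @ rotation @ zoom @ skew

# Each factor changes one property while leaving the others invariant:
identity = affine_matrix()                # no transformation at all
rotated = affine_matrix(theta=np.pi / 6)  # rotation only; area is preserved
```

Because each factor lives in its own matrix, varying one parameter perturbs exactly one transformation while the others stay at identity, which is the invariance property the affine regularizer exploits.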

Axis-alignment:

The issue of axis-alignment is addressed in Higgins et al. (2018), where each latent dimension should have a pre-defined, unique axis-alignment. Without axis-alignment, the features learned in a disentangled representation must be identified with expert knowledge after training, which can be a cumbersome process when dealing with a large number of features. The axis-alignment property also helps to discover desired but non-dominant attributes (e.g., the roll angle of a face in the CelebA dataset).

Scalability:

The scalability property allows us to trade off the compactness and expressivity of the disentangled representation. For example, the zoom attribute can be decomposed into horizontal and vertical zoom. A more compact representation encodes the zoom attribute with a single latent dimension, while a more expressive representation decomposes the zoom attribute into horizontal and vertical zoom, using two latent dimensions.
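The compactness/expressivity trade-off for zoom can be sketched as follows. The helper names are hypothetical; the point is only that the compact code is a special case of the expressive one.

```python
import numpy as np

def compact_zoom(z):
    """One latent dimension: a single code z drives uniform zoom."""
    return np.diag([z, z, 1.0])

def expressive_zoom(zx, zy):
    """Two latent dimensions: independent horizontal and vertical zoom."""
    return np.diag([zx, zy, 1.0])

# The compact representation is the expressive one restricted to zx == zy:
assert np.allclose(compact_zoom(1.5), expressive_zoom(1.5, 1.5))
```

Choosing between the two is a modeling decision: one latent dimension suffices when aspect ratio never changes in the data, while two dimensions are needed to capture horizontal and vertical zoom independently.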

