ADIS-GAN

Abstract

This paper proposes Affine Disentangled GAN (ADIS-GAN), a Generative Adversarial Network that can explicitly disentangle affine transformations in a self-supervised and rigorous manner. The objective is inspired by InfoGAN, with an additional affine regularizer acting as the inductive bias. The affine regularizer is rooted in the affine transformation properties of images: each transformation changes certain properties of the underlying image while leaving all others invariant. We derive the affine regularizer by decomposing the affine matrix into separate transformation matrices and inferring the transformation parameters by maximum likelihood estimation. Unlike the disentangled representations learned by existing approaches, the features learned by ADIS-GAN are axis-aligned and scalable: transformations such as rotation, horizontal and vertical zoom, horizontal and vertical skew, and horizontal and vertical translation can be explicitly selected and learned. ADIS-GAN successfully disentangles these features on the MNIST, CelebA, and dSprites datasets.

1. INTRODUCTION

In a disentangled representation, observations are interpreted in terms of a few explanatory factors. Examples of such factors are the rotation angle, scale, or position of an object in an image. Disentangled variables are generally considered to abstract interpretable semantic information and to reflect separable factors of variation in the data. Many studies have explored the effectiveness of disentangled representations (Bengio et al., 2013; N et al., 2017; LeCun et al., 2015; Lake et al., 2017; Tschannen et al., 2018). The information present in observations is encoded in an interpretable and compact manner, e.g., the texture style and the orientation of the objects (Bengio et al., 2013; LeCun et al., 2015; Lake et al., 2017; Tschannen et al., 2018). The learned representations are more generalizable and can be useful for downstream tasks, such as classification and visualization (Bengio et al., 2013; N et al., 2017; Chen et al., 2016). The concept of disentangled representation has been defined in several ways in the literature (Locatello et al., 2019; Higgins et al., 2018; Eastwood & Williams, 2018). The necessity of explicit inductive biases, both in the learning approach and in the dataset, is discussed in Locatello et al. (2019). An inductive bias is the set of assumptions that the learner uses to predict outputs for inputs it has not encountered. For instance, in the dSprites dataset, objects are displayed at different angles and positions; such prior knowledge helps to detect and classify the objects. However, the inductive biases in existing deep learning models are mostly implicit. The proposed ADIS-GAN utilizes relative affine transformations (see Section 4) as the explicit inductive bias, leading to axis-aligned and scalable disentangled representations.

Axis-alignment:

The issue of axis-alignment is addressed in Higgins et al. (2018), where each latent dimension should have a pre-defined, unique axis-alignment. Without axis-alignment, the features learned by a disentangled representation need to be identified with expert knowledge after training, which can be a cumbersome process when dealing with a large number of features. The axis-alignment property also helps to discover desired but non-dominant attributes (e.g., the roll angle of a face in the CelebA dataset).

Scalability:

The scalability property allows us to trade off the compactness of the disentangled representation against its expressivity. For example, the zoom attribute can be decomposed into horizontal and vertical zoom. A more compact representation encodes the zoom attribute in a single latent dimension, while a more expressive representation uses two dimensions, one for horizontal and one for vertical zoom.

We motivate the importance of axis-alignment and scalability in particular for affine transformations (see Figure 1), where disentangling object pose from texture and shape is an attractive property of an algorithm in the imaging domain (Jaderberg et al., 2015; Bepler et al., 2019; Engstrom et al., 2019). In supervised learning tasks, the Spatial Transformer Network (Jaderberg et al., 2015) can actively spatially transform an image, given a proper affine transformation matrix. In unsupervised learning tasks, few algorithms have successfully disentangled the affine transformation. Bepler et al. (2019) introduce an algorithm that disentangles rotation and translation, but not an entire affine transformation. We propose ADIS-GAN, a Generative Adversarial Network that utilizes the affine regularizer (see Section 4) as an inductive bias to explicitly disentangle the affine transformation. The affine regularizer is rooted in the affine transformation properties of images: each transformation affects certain properties of the underlying image while leaving all others invariant. We derive the affine regularizer by decomposing the affine matrix into separate transformation matrices and inferring the transformation parameters by maximum likelihood estimation. Unlike the disentangled representations learned by existing approaches, the features learned by ADIS-GAN are axis-aligned and scalable: transformations such as rotation, horizontal and vertical zoom, horizontal and vertical skew, and horizontal and vertical translation can be explicitly selected and learned (see Figure 1).
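To make the decomposition concrete, the following sketch composes a 2D affine matrix (in homogeneous coordinates) from separate rotation, zoom, skew, and translation matrices, with one parameter per factor as in the disentangled codes c1 through c7. The function names, parameter names, and composition order are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def rotation(theta):
    # Rotation by angle theta (radians), in homogeneous coordinates.
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def zoom(sx, sy):
    # Separate horizontal (sx) and vertical (sy) zoom factors.
    return np.array([[sx, 0.0, 0.0],
                     [0.0, sy, 0.0],
                     [0.0, 0.0, 1.0]])

def skew(kx, ky):
    # Horizontal (kx) and vertical (ky) skew (shear) factors.
    return np.array([[1.0, kx, 0.0],
                     [ky, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

def translation(tx, ty):
    # Horizontal (tx) and vertical (ty) translation.
    return np.array([[1.0, 0.0, tx],
                     [0.0, 1.0, ty],
                     [0.0, 0.0, 1.0]])

def affine(theta, sx, sy, kx, ky, tx, ty):
    # Compose the factors in a fixed order; each of the seven
    # parameters plays the role of one axis-aligned latent dimension.
    return translation(tx, ty) @ rotation(theta) @ zoom(sx, sy) @ skew(kx, ky)

# Identity parameters yield the identity transform.
A = affine(0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0)
```

The scalability trade-off described above corresponds to either tying parameters together (e.g., setting sy = sx for a single zoom dimension) or keeping them separate for a more expressive representation.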
In the remainder of the paper, we review related work in Section 2. We then compare GAN, InfoGAN, and the proposed method in Section 3. We introduce ADIS-GAN in Section 4, and in Section 5 we present numerical results demonstrating the axis-aligned and scalable disentangled representations learned by ADIS-GAN. We offer concluding remarks in Section 6.

Our contributions:

1. To the best of our knowledge, ADIS-GAN is the first algorithm that can disentangle an entire affine transformation, including rotation, horizontal and vertical zoom, horizontal and vertical skew, and horizontal and vertical translation, in a self-supervised manner.

2. The disentangled representations obtained by ADIS-GAN are axis-aligned. The advantages are twofold: (a) the attributes to be learned can be pre-defined, which saves the effort of identifying the attributes after training; (b) desired but non-dominant attributes can be learned in parallel with the dominant attributes.

3. The disentangled representations obtained by ADIS-GAN are scalable. The scalability property makes it possible to trade off the compactness of the learned representation against its expressivity.



Figure 1: Disentangled representation of affine transformations on the MNIST dataset. c1: rotation, c2: horizontal zoom, c3: vertical zoom, c4: horizontal skew, c5: vertical skew, c6: horizontal translation, c7: vertical translation. To the best of our knowledge, ADIS-GAN is the first algorithm that can disentangle an entire affine transformation in a self-supervised manner.

