POWERS OF LAYERS FOR IMAGE-TO-IMAGE TRANSLATION

Abstract

We propose a simple architecture to address unpaired image-to-image translation tasks: style or class transfer, denoising, deblurring, deblocking, etc. We start from an image auto-encoder architecture with fixed weights. For each task we learn a residual block operating in the latent space, which is applied iteratively until the target domain is reached. A specific training schedule is required to alleviate the exponentiation effect of the iterations. At test time, this offers several advantages: the number of weight parameters is limited, and the strength of the transformation can be modulated simply through the number of iterations. This is useful, for instance, when the type or amount of noise to suppress is not known in advance. Experimentally, we show that the performance of our model is comparable to or better than that of CycleGAN and NICE-GAN, with fewer parameters.

1. INTRODUCTION

Neural networks define arbitrarily complex functions for discriminative or generative tasks by stacking layers, as supported by the universal approximation theorem (Hornik et al., 1989; Montúfar, 2014). More precisely, the theorem states that stacking basic blocks can approximate any function with arbitrary precision, provided there are enough hidden units, under mild conditions on the non-linear basic blocks. Studies of the non-linear complex holomorphic functions involved in escape-time fractals showed that iterating simple non-linear functions can also construct arbitrarily complex landscapes (Barnsley et al., 1988). These functions are complex in the sense that their iso-surfaces can be made arbitrarily large by increasing the number of iterations. Yet there is no control over the actual shape of the resulting function, which is why generative fractals remain mathematical curiosities, or at best tools for constructing intriguing landscapes. In this paper, our objective is to combine the expressive power of both constructions: we study the optimization of a function that iterates a single building block in the latent space of an auto-encoder. We focus on image translation tasks, which can be trained from either paired or unpaired data. In the paired case, pairs of corresponding input and output images are provided during training. This offers direct supervision, so the best results are usually obtained with such methods (Chen et al., 2017; Wang et al., 2018; Park et al., 2019). We focus on unpaired translation: only two corpora of images are provided, one for the input domain A and the other for the output domain B. We therefore do not have access to any parallel data (Conneau et al., 2017), which is a realistic scenario in many applications, e.g., image restoration. We train a function f_AB : A → B such that the output b* = f_AB(a) for a ∈ A is indistinguishable from images of B.
Our transformation is performed by a single residual block that is composed with itself a variable number of times. We obtain this compositional property thanks to a progressive learning scheme that ensures the output is valid over a large range of iterations. As a result, we can modulate the strength of the transformation by varying the number of times the block is composed. This is of particular interest in image translation tasks such as denoising, where the noise level is unknown at training time, and style transfer, where the user may want to select the best rendering. This "Powers of layers" (PoL) mechanism is illustrated in Figure 1 in the category transfer context (horse to zebra). Our architecture is very simple, and only the weights of the residual block differ across tasks, which makes it suitable for addressing a large number of tasks with a limited number of parameters. This proposal is in sharp contrast with the trend in current state-of-the-art works of specializing the architecture and increasing its complexity and number of parameters (Fu et al., 2019; Viazovetskyi et al., 2020; Choi et al., 2020). Despite its simplicity, our proof of concept exhibits similar or better performance than a vanilla CycleGAN architecture, all other things being equal, on the original set of image-to-image translation tasks proposed in their papers, as well as on denoising, deblurring and deblocking. With significantly fewer parameters and a versatile architecture, we report competitive results confirmed by objective and psycho-visual metrics and illustrated with visualizations. We will provide the implementation for the sake of reproducibility.
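The iteration mechanism can be sketched in a few lines. The following is a schematic NumPy stand-in, not the paper's implementation: `encode`/`decode` are toy identity maps standing in for the frozen auto-encoder, and a tanh map with weight matrix `W` stands in for the learned convolutional residual block.

```python
import numpy as np

def residual_block(z, W):
    """One residual update in latent space: z <- z + g(z).
    Here g is a toy tanh map parameterized by W, a stand-in
    for the paper's convolutional residual block."""
    return z + np.tanh(z @ W)

def powers_of_layers(encode, decode, W, x, n_iters):
    """Encode once, iterate the same residual block n_iters times,
    decode once. Encoder and decoder are fixed; only W is task-specific."""
    z = encode(x)
    for _ in range(n_iters):
        z = residual_block(z, W)
    return decode(z)

# Toy frozen auto-encoder: identity maps on a 4-dim latent space.
encode = lambda x: x
decode = lambda z: z

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((4, 4))   # small weights keep iteration stable
x = rng.standard_normal((2, 4))

# The strength of the transformation grows with the number of iterations.
y1 = powers_of_layers(encode, decode, W, x, n_iters=1)
y8 = powers_of_layers(encode, decode, W, x, n_iters=8)
```

Modulating the transformation then amounts to choosing `n_iters` at test time, with no retraining and no additional parameters.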

2. RELATED WORK

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are a framework in which two networks, a generator and a discriminator, are learned together in a zero-sum game. The generator learns to produce increasingly realistic images w.r.t. the training dataset, while the discriminator learns to separate real data from the increasingly realistic generated images. GANs are used in many tasks such as domain adaptation (Bousmalis et al., 2016), style transfer (Karras et al., 2019b), inpainting (Pathak et al., 2016) and talking head generation (Zakharov et al., 2019). Unpaired image-to-image translation considers the task of transforming an image from a domain A into an image in a domain B. The training set comprises a sample of images from each of the domains A and B, but no pairs of corresponding images. A classical approach is to train two generators (A → B and B → A) and two discriminators, one for each domain. When there is a shared latent space between the domains, a possible choice is to use a variational auto-encoder, as in CoGAN (Liu & Tuzel, 2016). CycleGAN (Zhu et al., 2017), DualGAN (Yi et al., 2017) and subsequent works (Liu et al., 2019; Fu et al., 2019; Choi et al., 2020) augment the adversarial loss induced by the discriminators with a cycle consistency constraint to preserve semantic information throughout the domain changes. All these variants have architectures roughly similar to CycleGAN: an encoder, a decoder and residual blocks operating on the latent space. They also incorporate elements of other networks such as StyleGAN (Karras et al., 2019a). In our work, we build upon a simplified form of the CycleGAN architecture that generalizes easily over tasks. We also applied our method to NICE-GAN (Chen et al., 2020). A concurrent work, GANHopper (Lira et al., 2020), proposes to iterate CycleGAN generators in order to perform the transformation.
However, their method differs from ours in several ways: they iterate full generators rather than a single residual block, their encoder and decoder are not fixed, their number of iterations is fixed, and they need additional discriminators acting on intermediate transformation states. Other works (Zhang et al., 2019; Li et al., 2020) use recurrent networks to perform transformations, but this is done in a paired context, resulting in methods that are very different from ours and from GANHopper. Transformation modulation is an interpolation between two image domains. It is a byproduct of some approaches: for instance, a linear interpolation in latent space (Brock et al., 2018; Radford et al., 2015) morphs between two images. Nevertheless, one important limitation is that the starting and ending points must both be known, which is not the case in unpaired learning. Other approaches such as Fader networks (Lample et al., 2017) or StyleGAN2 (Viazovetskyi et al., 2020) act on scalar or boolean attributes that are disentangled in the latent space (e.g., age for face images, wearing glasses or not, etc.). Nevertheless, this results in complex models, for which dataset size and the variability of
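For concreteness, the cycle-consistency constraint used by CycleGAN and its variants can be written as an L1 round-trip penalty. The sketch below is a minimal NumPy rendering with the conventional weight `lam`; the generators here are placeholder functions, not real networks.

```python
import numpy as np

def cycle_consistency_loss(a, b, G_ab, G_ba, lam=10.0):
    """L1 cycle loss: each image should be recovered after a round trip
    through both generators (a -> G_ab -> G_ba -> ~a, and symmetrically
    for b). Added to the adversarial losses of the two discriminators."""
    loss_a = np.mean(np.abs(G_ba(G_ab(a)) - a))
    loss_b = np.mean(np.abs(G_ab(G_ba(b)) - b))
    return lam * (loss_a + loss_b)

# Toy check: generators that invert each other give zero cycle loss,
# while a non-invertible pair is penalized.
a = np.ones((2, 3))
b = np.zeros((2, 3))
perfect = cycle_consistency_loss(a, b, lambda x: x + 1.0, lambda x: x - 1.0)
broken = cycle_consistency_loss(a, b, lambda x: x + 1.0, lambda x: x)
```

The constraint ties the two translation directions together, which is what preserves semantic content across domain changes in the absence of paired supervision.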



Figure 1: Illustration of Powers of layers for a category transfer task. The encoder and decoder are directly borrowed from a vanilla auto-encoder and are not learnable. At inference time, we apply a variable number of compositions, producing different images depending on how many times we compose the residual block in the embedding space. Depending on the task, we either modulate the transformation and choose the result, or let a discriminator determine when to stop iterating.
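When a discriminator decides when to stop, one plausible selection rule (an illustrative assumption on our part, since the caption does not spell out the exact criterion) is to decode every iterate and keep the one the domain-B discriminator scores as most realistic:

```python
def select_n_iters(encode, block, decode, disc, x, max_iters=20):
    """Iterate the residual block up to max_iters times and return the
    iteration count and decoded output that the discriminator rates
    highest. All arguments are placeholder callables."""
    z = encode(x)
    best_score, best_n, best_out = float("-inf"), 0, decode(z)
    for n in range(1, max_iters + 1):
        z = block(z)
        out = decode(z)
        score = disc(out)          # higher = more realistic for domain B
        if score > best_score:
            best_score, best_n, best_out = score, n, out
    return best_n, best_out

# Toy 1-D example: each iteration moves the latent by +1, and the
# "discriminator" prefers outputs close to the target value 5.
n, out = select_n_iters(
    encode=lambda x: x,
    block=lambda z: z + 1.0,
    decode=lambda z: z,
    disc=lambda y: -abs(y - 5.0),
    x=0.0,
)
```

Because only the latent code is iterated, the encoder runs once per image regardless of how many candidate iteration counts are evaluated.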

