DINO: A CONDITIONAL ENERGY-BASED GAN FOR DOMAIN TRANSLATION

Abstract

Domain translation is the process of transforming data from one domain to another while preserving the common semantics. Some of the most popular domain translation systems are based on conditional generative adversarial networks, which use source-domain data both to drive the generator and as an input to the discriminator. However, this approach does not enforce the preservation of shared semantics, since the conditional input can often be ignored by the discriminator. We propose an alternative method for conditioning and present a new framework in which two networks are simultaneously trained, in a supervised manner, to perform domain translation in opposite directions. Our method is not only better at capturing the shared information between two domains but is also more generic and can be applied to a broader range of problems. The proposed framework performs well even in challenging cross-modal translations, such as video-driven speech reconstruction, for which other systems struggle to maintain correspondence.

1. INTRODUCTION

Domain translation methods exploit the information redundancy often found in data from different domains in order to find a mapping between them. Successful applications of domain translation include image style transfer (Zhu et al., 2017a) and speech enhancement (Pascual et al., 2017). Furthermore, these systems are increasingly being used to translate across modalities in applications such as speech-driven animation (Chung et al., 2017) and caption-based image generation (Reed et al., 2016). Some of the most popular methods for domain translation are based on conditional Generative Adversarial Networks (cGANs) (Mirza & Osindero, 2014). The conditional information in cGANs is used to drive the generation and to enforce the correspondence between condition and sample. Various alternatives have been proposed for how the condition should be included in the discriminator (Miyato & Koyama, 2018; Reed et al., 2016), but the majority of frameworks provide it as an input, hoping that the sample's correlation with the condition will play a role in distinguishing between synthesized and genuine samples. The main drawback of this approach is that it does not encourage the use of the conditional information, and therefore its contribution can be diminished or even ignored. This may lead to samples that are not semantically consistent with the condition. In this paper, we propose the Dual Inverse Network Optimisation (DINO) framework, which is based on energy-based GANs (Zhao et al., 2017) and consists of two networks that perform translation in opposite directions, as shown in Figure 1. In this framework, one network (the Forward network) translates data from the source domain to the target domain while the other (the Reverse network) performs the inverse translation. The Reverse network's goal is to minimize the reconstruction error for genuine data and to maximize it for generated data.
The Forward network aims to produce samples that can be accurately reconstructed back to the source domain by the Reverse network. During training, the Forward network therefore acts as a generator and the Reverse network as a discriminator. Since discrimination is based on the ability to recover source-domain samples, the Forward network is driven to produce samples that are not only realistic but also preserve the shared semantics. We show that this approach is effective across a broad range of supervised translation problems, capturing the correspondence even when the domains are of different modalities (e.g., video and audio). In detail, the contributions of this paper are:

2. RELATED WORK

Domain translation covers a wide range of problems including image-to-image translation (Isola et al., 2017), caption-based image synthesis (Qiao et al., 2019), and text-to-speech synthesis (Arik et al., 2017). Unsupervised translation methods attempt to find a relationship between domains using unpaired training data. However, finding correspondence without supervision is an ill-posed problem, which is why these methods often impose additional constraints on their networks or objectives. The majority of unsupervised methods are applied to image-to-image translation problems. The CoGAN model (Liu & Tuzel, 2016) imposes a weight-sharing constraint on specific layers of two GANs, which are trained to produce samples from different domains. The motivation is that sharing weights in layers associated with high-level features should help preserve the overall structure of the images. This approach is extended in the UNIT framework (Liu et al., 2017), where the generative networks are Variational Autoencoders (VAEs) with a shared latent space. The weight sharing used in the CoGAN and UNIT frameworks restricts them to problems where both domains are of the same modality. A more generic method of achieving domain correspondence is presented in the CycleGAN model proposed by Zhu et al. (2017a). The CycleGAN objective includes a cycle-consistency loss to ensure that image translation between two domains is invertible. Recently, Chen et al. (2020) showed that reusing parts of the CycleGAN discriminators as encoders for the generators reduces the parameter count and improves results. Although it is possible to apply the cycle-consistency loss to cross-modal translation, it has not been widely used in such scenarios. Unlike unsupervised methods, supervised approaches rely on having a one-to-one correspondence between the data from different domains.
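For reference, the cycle-consistency objective mentioned above can be sketched as follows. The networks here are illustrative linear stand-ins for the two CycleGAN generators, which in practice are convolutional; the L1 penalty follows Zhu et al. (2017a).

```python
import torch

# Hypothetical stand-ins for the two CycleGAN generators G: X -> Y and F: Y -> X.
G = torch.nn.Linear(8, 8)
F = torch.nn.Linear(8, 8)

x = torch.randn(4, 8)  # batch from domain X
y = torch.randn(4, 8)  # batch from domain Y

# Cycle-consistency: translating a sample to the other domain and back
# should recover the original sample (L1 reconstruction penalty).
cycle_loss = (F(G(x)) - x).abs().mean() + (G(F(y)) - y).abs().mean()
```

Minimizing this term alongside the adversarial losses constrains the two mappings to be (approximately) inverses of each other, which is what establishes correspondence without paired data.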
The Pix2Pix model (Isola et al., 2017) uses cGANs to perform image-to-image translation and has inspired many subsequent works (Zhu et al., 2017a; Wang et al., 2018; Park et al., 2019). Compared to unsupervised methods, supervised approaches have had more success in translating across different modalities. Notable applications include speech-driven facial animation (Vougioukas et al., 2020) and text-to-image synthesis (Reed et al., 2016; Qiao et al., 2019). It is important to note that the adversarial loss in cGANs alone is often not capable of establishing domain correspondence, which is why these approaches also rely on additional reconstruction or perceptual losses (Johnson et al., 2016) in order to accurately capture semantics.
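The conditioning-as-input scheme and auxiliary reconstruction term described above can be sketched as follows. The networks are toy linear modules, and while the adversarial-plus-weighted-L1 objective mirrors Pix2Pix (Isola et al., 2017), all names and the weight value here are illustrative assumptions rather than the exact published setup.

```python
import torch
import torch.nn as nn

# Toy conditional discriminator: the condition x is simply concatenated
# to the sample y, the standard cGAN conditioning criticized in Section 1.
D = nn.Linear(16, 1)   # takes [sample, condition] concatenated
G = nn.Linear(8, 8)    # generator: condition -> sample

x = torch.randn(4, 8)  # condition (source domain)
y = torch.randn(4, 8)  # real paired sample (target domain)
y_fake = G(x)

def d_out(sample, cond):
    # Probability that the (sample, condition) pair is real.
    return torch.sigmoid(D(torch.cat([sample, cond], dim=1)))

# Non-saturating adversarial loss plus a weighted L1 reconstruction term;
# lambda_l1 = 100 follows the Pix2Pix default but is illustrative here.
lambda_l1 = 100.0
loss_G = (-torch.log(d_out(y_fake, x) + 1e-8).mean()
          + lambda_l1 * (y_fake - y).abs().mean())
```

The L1 term is what ties the output to the condition in practice: nothing in the adversarial term alone prevents the discriminator from scoring the pair while ignoring `cond`.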



Source code: https://github.com/DinoMan/DINO



Figure 1: Architecture of the DINO framework. The Forward network performs a translation from the source to the target domain. The Reverse network performs the opposite translation and assigns an energy based on its ability to recover source-domain samples from real and generated samples.
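As a concrete illustration, the training signals described above can be sketched with toy linear networks. The margin-based objective follows the energy-based GAN formulation of Zhao et al. (2017) on which DINO builds; the architectures, margin value, and reconstruction metric here are illustrative assumptions, not the paper's actual choices.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for the two translators.
forward_net = nn.Linear(8, 8)  # Forward network F: source -> target
reverse_net = nn.Linear(8, 8)  # Reverse network R: target -> source

def energy(y, x):
    # Energy = error of reconstructing the source sample x from a target sample y.
    return ((reverse_net(y) - x) ** 2).mean()

margin = 1.0
x = torch.randn(4, 8)  # source batch
y = torch.randn(4, 8)  # paired target batch

# Reverse network (discriminator role): low energy for genuine pairs,
# energy pushed above the margin for generated samples.
loss_R = energy(y, x) + torch.relu(margin - energy(forward_net(x).detach(), x))

# Forward network (generator role): produce targets that the Reverse
# network can invert back to the source, enforcing shared semantics.
loss_F = energy(forward_net(x), x)
```

Because the Forward network is scored on how well its outputs can be mapped back to the source, realism alone is not enough: an output that discards the conditional information cannot be inverted and incurs a high energy.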

