RETHINKING THE TRULY UNSUPERVISED IMAGE-TO-IMAGE TRANSLATION

Anonymous authors
Paper under double-blind review

Abstract

Every recent image-to-image translation model uses either image-level (i.e., input-output pairs) or set-level (i.e., domain labels) supervision at a minimum. However, even set-level supervision can be a serious bottleneck for data collection in practice. In this paper, we tackle image-to-image translation in a fully unsupervised setting, i.e., with neither paired images nor domain labels. To this end, we propose a truly unsupervised image-to-image translation model (TUNIT) that simultaneously learns to separate image domains and to translate input images into the estimated domains. Experimental results show that our model achieves comparable or even better performance than the set-level supervised model trained with full labels, generalizes well to various datasets, and is robust to the choice of hyperparameters (e.g., the preset number of pseudo domains). In addition, TUNIT extends well to the semi-supervised scenario with various amounts of labels provided.

1. INTRODUCTION

Given an image of one domain, image-to-image translation is the task of generating plausible images in other domains. Building on the success of conditional generative models (Mirza & Osindero, 2014; Sohn et al., 2015), many image translation methods have been proposed, using either image-level supervision (e.g., paired data) (Isola et al., 2017; Hoffman et al., 2018; Zhu et al., 2017b; Wang et al., 2018; Park et al., 2019) or set-level supervision (e.g., domain labels) (Zhu et al., 2017a; Kim et al., 2017; Liu et al., 2017; Huang et al., 2018; Liu et al., 2019; Lee et al., 2020). Though the latter approach is commonly called unsupervised as a counterpart of the former, it actually assumes that domain labels are given a priori. This assumption can be a serious bottleneck in practice as the number of domains and samples increases. For example, labeling individual samples of a large dataset, such as FFHQ, is expensive, and the distinction across domains can be ambiguous.

Here, we first clarify that unsupervised image-to-image translation should strictly denote the task without any supervision: neither paired images nor domain labels. Under this definition, our goal is to develop an unsupervised translation model given a mixed set of images from many domains (Figure 1). We tackle this problem by formulating three sub-problems: 1) clustering the images by approximating their set-level characteristics (i.e., domains), 2) encoding the individual content and style of an input image, and 3) learning a mapping function among the estimated domains. To this end, we introduce a guiding network that simultaneously solves 1) unsupervised domain classification and 2) style encoding. It has two branches, one providing pseudo domain labels and the other encoding style features, which are later used in training the discriminator and the generator, respectively. We employ a differentiable clustering method based on mutual information maximization for estimating domain labels.
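The mutual-information objective for differentiable clustering can be sketched as follows. This is a minimal NumPy illustration in the style of IIC-like objectives (maximizing the mutual information between soft cluster assignments of two augmented views of the same image); the function name and exact formulation here are illustrative and not taken from the paper's actual implementation.

```python
import numpy as np

def mutual_info_loss(p1, p2, eps=1e-8):
    """Negative mutual information between two soft cluster assignments.

    p1, p2: (N, K) arrays of per-sample cluster probabilities for two
    augmented views of the same N images (rows sum to 1). Maximizing the
    MI of the (K, K) joint assignment distribution encourages confident,
    consistent predictions while keeping clusters evenly populated.
    """
    joint = p1.T @ p2 / p1.shape[0]        # (K, K) empirical joint distribution
    joint = (joint + joint.T) / 2          # symmetrize (views are exchangeable)
    m1 = joint.sum(axis=1, keepdims=True)  # marginal over the first view
    m2 = joint.sum(axis=0, keepdims=True)  # marginal over the second view
    mi = np.sum(joint * (np.log(joint + eps)
                         - np.log(m1 + eps)
                         - np.log(m2 + eps)))
    return -mi  # minimize the negative MI


# Perfectly consistent, balanced assignments over K=2 clusters
# attain the maximum MI of log K, i.e., a loss near -log 2.
p = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])
print(mutual_info_loss(p, p))  # ~ -0.6931
```

The entropy of the marginals (the `-log(m1) - log(m2)` terms) is what pushes the assignments toward evenly sized clusters, matching the degenerate-solution avoidance described above.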
This helps the guiding network group similar images together while evenly separating samples across categories. For embedding style codes, we adopt a contrastive loss (Hadsell et al., 2006; He et al., 2020; Chen et al., 2020a), which encourages the model to capture the dissimilarity between images, resulting in better representation learning. Finally, conditioned on the style features and domain labels from the guiding network, we use generative adversarial networks (GANs) to learn the image translation functions across the various domains. Although the GAN and the guiding network play different roles, we do not separate their training processes: our guiding network participates in the translation process. By doing so, the guiding network can exploit gradients from GAN training. The guiding network now understands the recipes of

