RETHINKING THE TRULY UNSUPERVISED IMAGE-TO-IMAGE TRANSLATION

Anonymous authors
Paper under double-blind review

Abstract

Every recent image-to-image translation model uses either image-level (i.e., input-output pairs) or set-level (i.e., domain labels) supervision at a minimum. However, even set-level supervision can be a serious bottleneck for data collection in practice. In this paper, we tackle image-to-image translation in a fully unsupervised setting, i.e., with neither paired images nor domain labels. To this end, we propose a truly unsupervised image-to-image translation model (TUNIT) that simultaneously learns to separate image domains and to translate input images into the estimated domains. Experimental results show that our model achieves comparable or even better performance than the set-level supervised model trained with full labels, generalizes well across various datasets, and is robust to the choice of hyperparameters (e.g., the preset number of pseudo domains). In addition, TUNIT extends well to the semi-supervised scenario with various amounts of labels provided.

1. INTRODUCTION

Given an image of one domain, image-to-image translation is the task of generating plausible images of the other domains. Building on the success of conditional generative models (Mirza & Osindero, 2014; Sohn et al., 2015), many image translation methods have been proposed, using either image-level supervision (e.g., paired data) (Isola et al., 2017; Hoffman et al., 2018; Zhu et al., 2017b; Wang et al., 2018; Park et al., 2019) or set-level supervision (e.g., domain labels) (Zhu et al., 2017a; Kim et al., 2017; Liu et al., 2017; Huang et al., 2018; Liu et al., 2019; Lee et al., 2020). Though the latter approach is generally called unsupervised as a counterpart of the former, it actually assumes that domain labels are given a priori. This assumption can be a serious bottleneck in practice as the number of domains and samples increases. For example, labeling individual samples of a large dataset, such as FFHQ, is expensive, and the distinction across domains can be ambiguous.

Here, we first clarify that unsupervised image-to-image translation should strictly denote the task without any supervision, i.e., neither paired images nor domain labels. Under this definition, our goal is to develop an unsupervised translation model given a mixed set of images from many domains (Figure 1). We tackle this problem by formulating three sub-problems: 1) clustering the images by approximating the set-level characteristics (i.e., domains), 2) encoding the individual content and style of an input image, and 3) learning a mapping function among the estimated domains.

To this end, we introduce a guiding network that simultaneously solves 1) unsupervised domain classification and 2) style encoding. It has two branches that provide pseudo domain labels and encode style features, which are later used in discriminator and generator training, respectively. We employ a differentiable clustering method based on mutual information maximization for estimating domain labels.
This helps the guiding network group similar images together while evenly separating the clusters. For embedding style codes, we adopt a contrastive loss (Hadsell et al., 2006; He et al., 2020; Chen et al., 2020a), which leads the model to further understand the dissimilarity between images, resulting in better representation learning. Finally, conditioned on the style features and domain labels from the guiding network, we use generative adversarial networks (GANs) to learn the image translation functions across the estimated domains.

Although the GAN and the guiding network play different roles, we do not separate their training processes: our guiding network participates in the translation process, and can therefore exploit gradients from GAN training. The guiding network learns which attributes separate the domains because the generator wants the style code to contain sufficient information to fool the domain-specific discriminator, and vice versa. Thanks to this interaction between the guiding network and the GAN, our model successfully separates domains and translates images.

We quantitatively and qualitatively compare our model with the existing set-level supervised method under unsupervised and semi-supervised settings. Experiments on various datasets show that the proposed model outperforms the previous method at all levels of supervision. Our results show that, by exploiting the synergy between the two tasks, the guiding network helps the image translation model to largely improve its generation performance.

Our contributions are summarized as follows:

• We clarify the definition of unsupervised image-to-image translation and, to the best of our knowledge, our model is the first to succeed in this task in an end-to-end manner.

• We propose a guiding network to handle the unsupervised translation task and show that the interaction between translation and clustering is helpful for the task.
• We show the effectiveness of our model through extensive experiments on various datasets.

• We confirm that our model works across various numbers of clusters and in the practical case where ground-truth labels are available for some samples.
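The contrastive objective used for style encoding can be sketched as a minimal InfoNCE-style loss over two augmented views of the same images, in the spirit of He et al. (2020) and Chen et al. (2020a). The function name, shapes, and temperature value below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def info_nce(z, z_pos, temperature=0.07):
    """InfoNCE-style contrastive loss.

    z, z_pos: (batch, dim) embeddings of two augmented views of the same
    images; row i of z and row i of z_pos form a positive pair, while all
    other rows serve as negatives.
    """
    # L2-normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    logits = z @ z_pos.T / temperature           # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    # Log-softmax over each row; positives sit on the diagonal.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss pulls each embedding toward its positive view and pushes it away from all other images in the batch, which is the "understanding dissimilarity between images" effect described above.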

2. TRULY UNSUPERVISED IMAGE-TO-IMAGE TRANSLATION (TUNIT)

We consider the unsupervised image-to-image translation problem, where we have a set of images X from K domains (K ≥ 2) without domain labels y. Here, K is an unknown property of the dataset. Throughout the paper, we denote K as the actual number of domains in a dataset and K̂ as the arbitrarily chosen number of domains used to train a model. We design a module that integrates a domain classifier and a style encoder, which we call the guiding network. It guides the translation by feeding the style code of a reference image to the generator and its pseudo domain label to the discriminator. Using the feedback from the discriminator on the pseudo labels, the generator synthesizes images of the target domains (e.g., breeds) while respecting the styles (e.g., fur patterns) of the reference images and maintaining the content (e.g., pose) of the source images (Figure 2).
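The reference-guided translation step described above can be sketched as follows. Here `E`, `G`, and the method names `style`/`domain` are illustrative stand-ins for the guiding network and generator, not the authors' actual API.

```python
def translate(x_src, x_ref, E, G):
    """One translation step: content from x_src, domain and style from x_ref.

    E.style(x)  -> style code of x          (consumed by the generator)
    E.domain(x) -> pseudo domain label of x (consumed by the discriminator)
    """
    s_ref = E.style(x_ref)    # style of the reference (e.g., fur pattern)
    y_ref = E.domain(x_ref)   # pseudo domain label (e.g., estimated breed)
    x_fake = G(x_src, s_ref)  # keep source content (e.g., pose), apply style
    return x_fake, y_ref      # y_ref selects the discriminator branch to score
```

During training, the domain-specific discriminator evaluates `x_fake` under the branch indexed by `y_ref`, and the resulting gradients flow back through `G` into the guiding network.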

2.1. LEARNING TO PRODUCE DOMAIN LABELS AND ENCODE STYLE FEATURES

In our framework, the guiding network E plays a central role as an unsupervised domain classifier and a style encoder. E consists of two branches, E_C and E_S, which learn to provide domain labels and style codes, respectively. In experiments, we compare our guiding network against straightforward alternatives, i.e., k-means on the image or feature space.

Unsupervised domain classification. The discriminator requires target domain labels to provide useful gradients for translating images into the target domain. E_C adopts a differentiable clustering method based on mutual information maximization to estimate these labels.
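The mutual-information objective behind this kind of differentiable clustering can be sketched as below (in the spirit of Invariant Information Clustering; the exact form used here is an assumption based on the description in the text). Given the classification branch's softmax outputs for a batch of images and their augmented views, we maximize the mutual information between the two assignment distributions.

```python
import numpy as np

def mi_clustering_loss(p, p_aug, eps=1e-8):
    """Negative mutual information between cluster assignments of a batch
    and its augmented views.

    p, p_aug: (batch, K) softmax outputs of the classification branch.
    """
    joint = p.T @ p_aug                 # unnormalized joint over cluster pairs
    joint = (joint + joint.T) / 2.0     # symmetrize
    joint = joint / joint.sum()         # normalize to a distribution
    pi = joint.sum(axis=1, keepdims=True)   # marginal of the first view
    pj = joint.sum(axis=0, keepdims=True)   # marginal of the second view
    # I(z, z') = sum_ij P_ij * log(P_ij / (P_i * P_j))
    mi = (joint * (np.log(joint + eps)
                   - np.log(pi + eps)
                   - np.log(pj + eps))).sum()
    return -mi  # minimize the negative to maximize mutual information
```

Maximizing this quantity rewards assignments that are consistent across augmentations (grouping similar images) while the entropy of the marginals pushes toward evenly sized clusters, matching the behavior described earlier.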



Figure 1: Levels of supervision. To perform image-to-image translation, existing methods need either (a) a dataset with input-output pairs or (b) a dataset with domain information. Our method is capable of learning mappings among multiple domains using (c) a dataset without any supervision.

