CCGAN: CONTINUOUS CONDITIONAL GENERATIVE ADVERSARIAL NETWORKS FOR IMAGE GENERATION

Abstract

This work proposes the continuous conditional generative adversarial network (CcGAN), the first generative model for image generation conditional on continuous, scalar conditions (termed regression labels). Existing conditional GANs (cGANs) are mainly designed for categorical conditions (e.g., class labels); conditioning on regression labels is mathematically distinct and raises two fundamental problems: (P1) Since there may be very few (even zero) real images for some regression labels, minimizing existing empirical versions of cGAN losses (a.k.a. empirical cGAN losses) often fails in practice; (P2) Since regression labels are scalar and infinitely many, conventional label input methods (e.g., combining a hidden map of the generator/discriminator with a one-hot encoded label) are not applicable. The proposed CcGAN solves the above problems, respectively, by (S1) reformulating existing empirical cGAN losses to be appropriate for the continuous scenario; and (S2) proposing a novel method to incorporate regression labels into the generator and the discriminator. The reformulation in (S1) leads to two novel empirical discriminator losses, termed the hard vicinal discriminator loss (HVDL) and the soft vicinal discriminator loss (SVDL) respectively, and a novel empirical generator loss. The error bounds of a discriminator trained with HVDL and SVDL are derived under mild assumptions in this work. A new benchmark dataset, RC-49, is also proposed for generative image modeling conditional on regression labels. Our experiments on the Circular 2-D Gaussians, RC-49, and UTKFace datasets show that CcGAN is able to generate diverse, high-quality samples from the image distribution conditional on a given regression label. Moreover, in these experiments, CcGAN substantially outperforms cGAN both visually and quantitatively.

1. INTRODUCTION

Conditional generative adversarial networks (cGANs), first proposed in (Mirza & Osindero, 2014), aim to estimate the distribution of images conditional on some auxiliary information, especially class labels. Subsequent studies (Odena et al., 2017; Miyato & Koyama, 2018; Brock et al., 2019; Zhang et al., 2019) confirm the feasibility of generating diverse, high-quality (even photo-realistic), and class-label-consistent fake images from class-conditional GANs. Unfortunately, these cGANs do not work well for image generation with continuous, scalar conditions, termed regression labels, due to two problems: (P1) cGANs are often trained to minimize the empirical versions of their losses (a.k.a. the empirical cGAN losses) on some training data, a principle also known as empirical risk minimization (ERM) (Vapnik, 2000). The success of ERM relies on a large sample size for each distinct condition. Unfortunately, we usually have only a few real images for some regression labels. Moreover, since regression labels are continuous, some values may not even appear in the training set; consequently, a cGAN cannot accurately estimate the image distribution conditional on such missing labels. (P2) In class-conditional image generation, class labels are often encoded by one-hot vectors or label embedding and then fed into the generator and discriminator by hidden concatenation (Mirza & Osindero, 2014), an auxiliary classifier (Odena et al., 2017), or label projection (Miyato & Koyama, 2018). A precondition for such label encoding is that the number of distinct labels (e.g., the number of classes) is finite and known. Unfortunately, in the continuous scenario, we may have infinitely many distinct regression labels. A naive approach to solving (P1)-(P2) is to "bin" the regression labels into a series of disjoint intervals and still train a cGAN in the class-conditional manner, treating these intervals as independent classes (Olmschenk, 2019).
However, this approach has four shortcomings: (1) our experiments in Section 4 show that this approach often makes cGANs collapse; (2) we can only estimate the image distribution conditional on membership in an interval, not on the target label; (3) a large interval width leads to high label inconsistency; and (4) inter-class correlation is not considered (images in successive intervals have similar distributions). In machine learning, vicinal risk minimization (VRM) (Vapnik, 2000; Chapelle et al., 2001) is an alternative principle to ERM. VRM assumes that a sample point shares the same label with other samples in its vicinity. Motivated by VRM, in generative modeling conditional on regression labels, where we estimate a conditional distribution p(x|y) (x is an image and y is a regression label), it is natural to assume that a small perturbation to y results in a negligible change to p(x|y). This assumption is consistent with our perception of the world. For example, the image distribution of facial features for a population of 15-year-old teenagers should be close to that of 16-year-olds. We therefore introduce the continuous conditional GAN (CcGAN) to tackle (P1) and (P2). To the best of our knowledge, this is the first generative model for image generation conditional on regression labels. It is noted that Rezagholizadeh et al. (2018) and Rezagholiradeh & Haidar (2018) train GANs in an unsupervised manner and synthesize unlabeled fake images for a subsequent image regression task, and Olmschenk et al. (2019) propose a semi-supervised GAN for dense crowd counting. CcGAN is fundamentally different from these works since they do not estimate the conditional image distribution.
Our contributions can be summarized as follows: • In Section 2, we propose the CcGAN to address (P1) and (P2); it consists of two novel empirical discriminator losses, termed the hard vicinal discriminator loss (HVDL) and the soft vicinal discriminator loss (SVDL), a novel empirical generator loss, and a novel label input method. We take the vanilla cGAN loss as an example to show how to derive HVDL, SVDL, and the novel empirical generator loss by reformulating existing empirical cGAN losses. • In Section 3, we derive the error bounds of a discriminator trained with HVDL and SVDL. • In Section 4, we propose a new benchmark dataset, RC-49, for generative image modeling conditional on regression labels, since very few benchmark datasets are suitable for the studied continuous scenario. We conduct experiments on several datasets, which show that CcGAN not only generates diverse, high-quality, and label-consistent images, but also substantially outperforms cGAN both visually and quantitatively.

2. FROM CGAN TO CCGAN

In this section, we provide the solutions (S1)-(S2) to (P1)-(P2) in a one-to-one manner by introducing the continuous conditional GAN (CcGAN). Please note that, theoretically, cGAN losses (e.g., the vanilla cGAN loss (Mirza & Osindero, 2014), the Wasserstein loss (Arjovsky et al., 2017), and the hinge loss (Miyato et al., 2018)) are suitable for both class labels and regression labels; however, their empirical versions fail in the continuous scenario (i.e., (P1)). Our first solution (S1) focuses on reformulating these empirical cGAN losses to fit the continuous scenario. Without loss of generality, we take only the vanilla cGAN loss as an example to show this reformulation (the empirical versions of the Wasserstein loss and the hinge loss can be reformulated similarly). The vanilla discriminator loss and generator loss (Mirza & Osindero, 2014) are

$$\mathcal{L}(D) = -\mathbb{E}_{y\sim p_r(y)}\mathbb{E}_{x\sim p_r(x|y)}[\log D(x,y)] - \mathbb{E}_{y\sim p_g(y)}\mathbb{E}_{x\sim p_g(x|y)}[\log(1-D(x,y))]$$
$$= -\iint \log(D(x,y))\,p_r(x,y)\,dx\,dy - \iint \log(1-D(x,y))\,p_g(x,y)\,dx\,dy, \quad (1)$$

$$\mathcal{L}(G) = -\mathbb{E}_{y\sim p_g(y)}\mathbb{E}_{z\sim q(z)}[\log(D(G(z,y),y))] = -\iint \log(D(G(z,y),y))\,q(z)\,p_g(y)\,dz\,dy, \quad (2)$$

where $x \in \mathcal{X}$ is an image of size $d \times d$, $y \in \mathcal{Y}$ is a label, $p_r(y)$ and $p_g(y)$ are respectively the true and fake label marginal distributions, $p_r(x|y)$ and $p_g(x|y)$ are respectively the true and fake image distributions conditional on $y$, $p_r(x,y)$ and $p_g(x,y)$ are respectively the true and fake joint distributions of $x$ and $y$, and $q(z)$ is the probability density function of $\mathcal{N}(0, I)$.

Since the distributions in the losses of Eqs. (1) and (2) are unknown, for class-conditional image generation, Mirza & Osindero (2014) follow ERM and minimize the empirical losses:

$$\widehat{\mathcal{L}}^\delta(D) = -\frac{1}{N^r}\sum_{c=1}^{C}\sum_{j=1}^{N^r_c}\log(D(x^r_{c,j}, c)) - \frac{1}{N^g}\sum_{c=1}^{C}\sum_{j=1}^{N^g_c}\log(1-D(x^g_{c,j}, c)), \quad (3)$$
$$\widehat{\mathcal{L}}^\delta(G) = -\frac{1}{N^g}\sum_{c=1}^{C}\sum_{j=1}^{N^g_c}\log(D(G(z_{c,j}, c), c)), \quad (4)$$

where $C$ is the number of classes, $N^r$ and $N^g$ are respectively the numbers of real and fake images, $N^r_c$ and $N^g_c$ are respectively the numbers of real and fake images with label $c$, $x^r_{c,j}$ and $x^g_{c,j}$ are respectively the $j$-th real image and the $j$-th fake image with label $c$, and the $z_{c,j}$ are independently and identically sampled from $q(z)$. Eqs. (3) and (4) imply that we estimate $p_r(x,y)$ and $p_g(x,y)$ by their empirical probability density functions:

$$\hat{p}^\delta_r(x,y) = \frac{1}{N^r}\sum_{c=1}^{C}\sum_{j=1}^{N^r_c}\delta(x - x^r_{c,j})\,\delta(y - c), \quad \hat{p}^\delta_g(x,y) = \frac{1}{N^g}\sum_{c=1}^{C}\sum_{j=1}^{N^g_c}\delta(x - x^g_{c,j})\,\delta(y - c), \quad (5)$$

where $\delta(\cdot)$ is a Dirac delta mass centered at 0. However, $\hat{p}^\delta_r(x,y)$ and $\hat{p}^\delta_g(x,y)$ in Eq. (5) are not good estimates in the continuous scenario because of (P1). To overcome (P1), we propose a novel estimate for each of $p_r(x,y)$ and $p_g(x,y)$, termed the hard vicinal estimate (HVE). We also provide an intuitive alternative to HVE, named the soft vicinal estimate (SVE).
The HVEs of $p_r(x,y)$ and $p_g(x,y)$ are:

$$\hat{p}^{HVE}_r(x,y) = C_1\left[\frac{1}{N^r}\sum_{j=1}^{N^r}\exp\!\left(-\frac{(y-y^r_j)^2}{2\sigma^2}\right)\right]\left[\frac{1}{N^r_{y,\kappa}}\sum_{i=1}^{N^r}\mathbb{1}_{\{|y-y^r_i|\leq\kappa\}}\,\delta(x-x^r_i)\right],$$
$$\hat{p}^{HVE}_g(x,y) = C_2\left[\frac{1}{N^g}\sum_{j=1}^{N^g}\exp\!\left(-\frac{(y-y^g_j)^2}{2\sigma^2}\right)\right]\left[\frac{1}{N^g_{y,\kappa}}\sum_{i=1}^{N^g}\mathbb{1}_{\{|y-y^g_i|\leq\kappa\}}\,\delta(x-x^g_i)\right], \quad (6)$$

where $x^r_i$ and $x^g_i$ are respectively real image $i$ and fake image $i$, $y^r_i$ and $y^g_i$ are respectively the labels of $x^r_i$ and $x^g_i$, $\kappa$ and $\sigma$ are two positive hyper-parameters, $C_1$ and $C_2$ are two constants making these two estimates valid probability density functions, $N^r_{y,\kappa}$ is the number of the $y^r_i$ satisfying $|y-y^r_i|\leq\kappa$, $N^g_{y,\kappa}$ is the number of the $y^g_i$ satisfying $|y-y^g_i|\leq\kappa$, and $\mathbb{1}$ is an indicator function with support in the subscript. The terms in the first square brackets of $\hat{p}^{HVE}_r$ and $\hat{p}^{HVE}_g$ imply that we estimate the marginal label distributions $p_r(y)$ and $p_g(y)$ by kernel density estimates (KDEs) (Silverman, 1986). The terms in the second square brackets are designed based on the assumption that a small perturbation to $y$ results in negligible changes to $p_r(x|y)$ and $p_g(x|y)$. If this assumption holds, we can use images with labels in a small vicinity of $y$ to estimate $p_r(x|y)$ and $p_g(x|y)$.

The SVEs of $p_r(x,y)$ and $p_g(x,y)$ are:

$$\hat{p}^{SVE}_r(x,y) = C_3\left[\frac{1}{N^r}\sum_{j=1}^{N^r}\exp\!\left(-\frac{(y-y^r_j)^2}{2\sigma^2}\right)\right]\left[\frac{\sum_{i=1}^{N^r} w^r(y^r_i,y)\,\delta(x-x^r_i)}{\sum_{i=1}^{N^r} w^r(y^r_i,y)}\right],$$
$$\hat{p}^{SVE}_g(x,y) = C_4\left[\frac{1}{N^g}\sum_{j=1}^{N^g}\exp\!\left(-\frac{(y-y^g_j)^2}{2\sigma^2}\right)\right]\left[\frac{\sum_{i=1}^{N^g} w^g(y^g_i,y)\,\delta(x-x^g_i)}{\sum_{i=1}^{N^g} w^g(y^g_i,y)}\right], \quad (7)$$

where $C_3$ and $C_4$ are two constants making these two estimates valid probability density functions,

$$w^r(y^r_i,y) = e^{-\nu(y^r_i-y)^2} \quad \text{and} \quad w^g(y^g_i,y) = e^{-\nu(y^g_i-y)^2}, \quad (8)$$

and the hyper-parameter $\nu > 0$. In Eq. (7), similar to the HVEs, we estimate $p_r(y)$ and $p_g(y)$ by KDEs.
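To make the two vicinity notions concrete, here is a minimal NumPy sketch (not the authors' code; the label values and the settings of `kappa` and `nu` are illustrative) of the hard-vicinity indicator weights used in Eq. (6) and the soft weights of Eq. (8):

```python
import numpy as np

def hard_vicinal_weights(labels, y, kappa):
    """Indicator weights of Eq. (6): 1 if |y - y_i| <= kappa, else 0."""
    return (np.abs(labels - y) <= kappa).astype(float)

def soft_vicinal_weights(labels, y, nu):
    """Soft weights of Eq. (8): w(y_i, y) = exp(-nu * (y_i - y)^2)."""
    return np.exp(-nu * (labels - y) ** 2)

labels = np.array([0.10, 0.12, 0.15, 0.30, 0.90])   # normalized labels in [0, 1]
y = 0.13                                            # target label
hard = hard_vicinal_weights(labels, y, kappa=0.05)  # only labels within y +/- kappa count
soft = soft_vicinal_weights(labels, y, nu=50.0)     # all labels count, decaying with distance
print(hard)  # [1. 1. 1. 0. 0.]
print(soft)
```

The hard weights select a finite window of neighbors, while the soft weights never discard a sample outright; they only downweight it, which is the distinction Fig. 1 illustrates.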
Instead of using samples in a hard vicinity, the SVEs use all respective samples to estimate $p_r(x|y)$ and $p_g(x|y)$, but each sample is assigned a weight based on the distance of its label from $y$. Two diagrams in Fig. 1 visualize the process of using hard/soft vicinal samples to estimate $p(x|y)$, i.e., a univariate Gaussian distribution conditional on its mean $y$.

Figure 1: How HVE (Eq. (6)) and SVE (Eq. (7)) estimate $p(x|y)$ (a univariate Gaussian conditional on $y$) using samples in hard and soft vicinities, respectively, of $y$. To estimate $p(x|y)$ (the red Gaussian curve) only from samples drawn from $p(x|y_1)$ and $p(x|y_2)$ (the blue Gaussian curves), estimation is based on the samples (red dots) in a hard vicinity (defined by $y \pm \kappa$) or a soft vicinity (defined by the weight decay curve) around $y$. The histograms in blue are samples in the hard or soft vicinity. The labels $y_1$, $y$, and $y_2$ on the x-axis denote the means of $x$ conditional on $y_1$, $y$, and $y_2$, respectively.

By plugging Eqs. (6) and (7) into Eq. (1), we derive the hard vicinal discriminator loss (HVDL) and the soft vicinal discriminator loss (SVDL) as follows:

$$\mathcal{L}^{HVDL}(D) = -\frac{C_5}{N^r}\sum_{j=1}^{N^r}\mathbb{E}_{\epsilon^r\sim\mathcal{N}(0,\sigma^2)}\left[\sum_{i=1}^{N^r}\frac{\mathbb{1}_{\{|y^r_j+\epsilon^r-y^r_i|\leq\kappa\}}}{N^r_{y^r_j+\epsilon^r,\kappa}}\log(D(x^r_i, y^r_j+\epsilon^r))\right]$$
$$-\frac{C_6}{N^g}\sum_{j=1}^{N^g}\mathbb{E}_{\epsilon^g\sim\mathcal{N}(0,\sigma^2)}\left[\sum_{i=1}^{N^g}\frac{\mathbb{1}_{\{|y^g_j+\epsilon^g-y^g_i|\leq\kappa\}}}{N^g_{y^g_j+\epsilon^g,\kappa}}\log(1-D(x^g_i, y^g_j+\epsilon^g))\right], \quad (9)$$

$$\mathcal{L}^{SVDL}(D) = -\frac{C_7}{N^r}\sum_{j=1}^{N^r}\mathbb{E}_{\epsilon^r\sim\mathcal{N}(0,\sigma^2)}\left[\sum_{i=1}^{N^r}\frac{w^r(y^r_i, y^r_j+\epsilon^r)}{\sum_{i'=1}^{N^r}w^r(y^r_{i'}, y^r_j+\epsilon^r)}\log(D(x^r_i, y^r_j+\epsilon^r))\right]$$
$$-\frac{C_8}{N^g}\sum_{j=1}^{N^g}\mathbb{E}_{\epsilon^g\sim\mathcal{N}(0,\sigma^2)}\left[\sum_{i=1}^{N^g}\frac{w^g(y^g_i, y^g_j+\epsilon^g)}{\sum_{i'=1}^{N^g}w^g(y^g_{i'}, y^g_j+\epsilon^g)}\log(1-D(x^g_i, y^g_j+\epsilon^g))\right], \quad (10)$$

where $\epsilon^r \triangleq y - y^r_j$, $\epsilon^g \triangleq y - y^g_j$, and $C_5$, $C_6$, $C_7$, and $C_8$ are some constants.
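As a rough sketch of how the real-image term of SVDL might be evaluated for one noisy target label (illustrative NumPy code, not the paper's implementation; the constant $C_7$ is dropped, and `d_out` stands in for a hypothetical discriminator's outputs $D(x_i, y)$):

```python
import numpy as np

rng = np.random.default_rng(0)

def svdl_real_term(d_out, real_labels, target, nu):
    """Soft-vicinal weighted real-sample term of Eq. (10) for one target label.
    Each -log D(x_i, target) is weighted by how close label y_i is to target."""
    w = np.exp(-nu * (real_labels - target) ** 2)   # soft weights, Eq. (8)
    return -np.sum(w / w.sum() * np.log(d_out))     # weighted cross-entropy term

# Toy example: 8 real labels; the target is a label plus Gaussian noise,
# mirroring the epsilon ~ N(0, sigma^2) perturbation in Eqs. (9)-(10).
real_labels = rng.uniform(0, 1, size=8)
target = real_labels[0] + rng.normal(0, 0.05)
d_out = np.clip(rng.uniform(0.3, 0.9, size=8), 1e-6, 1 - 1e-6)
loss = svdl_real_term(d_out, real_labels, target, nu=50.0)
print(loss)  # a positive scalar; samples labeled near `target` dominate
```

Because the weights are normalized inside the sum, the term reduces to the ordinary real-sample cross-entropy when all labels coincide with the target.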

Generator training:

The generator of CcGAN is trained by minimizing Eq. (11):

$$\widehat{\mathcal{L}}(G) = -\frac{1}{N^g}\sum_{i=1}^{N^g}\mathbb{E}_{\epsilon^g\sim\mathcal{N}(0,\sigma^2)}\left[\log(D(G(z_i, y^g_i+\epsilon^g), y^g_i+\epsilon^g))\right]. \quad (11)$$

How do HVDL, SVDL, and Eq. (11) overcome (P1)? The solution (S1) includes: (i) Given a label $y$ as the condition, we use images in a hard/soft vicinity of $y$ to train the discriminator instead of just using images with label $y$. This enables us to estimate $p_r(x|y)$ when there are not enough real images with label $y$. (ii) From Eqs. (9) and (10), we can see that the KDEs in Eqs. (6) and (7) are adjusted by adding Gaussian noise to the labels. Moreover, in Eq. (11), we add Gaussian noise to seen labels (assume the $y^g_i$'s are seen) to train the generator to generate images at unseen labels. This enables estimation of $p_r(x|y')$ when $y'$ is not in the training set.

How is (P2) solved? We propose a novel label input method. For $G$, we add the label $y$ element-wise to the output of its first linear layer. For $D$, an extra linear layer is trained together with $D$ to embed $y$ in a latent space; we then incorporate the embedded label into $D$ by label projection (Miyato & Koyama, 2018). Please refer to Supp. S.3 for more details.

Remark 1. An algorithm is proposed in Supp. S.2 for training CcGAN in practice. Moreover, CcGAN does not require any specific network architecture; therefore, it can also use state-of-the-art architectures in practice such as SNGAN (Miyato et al., 2018) and BigGAN (Brock et al., 2019).
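The label input method can be sketched in plain NumPy as follows (layer sizes and weight matrices are hypothetical placeholders, not the paper's architecture; in practice these would be trainable layers in a deep-learning framework):

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Generator side: add the scalar label element-wise to the output of the
# first linear layer. W_g, b_g are hypothetical first-layer parameters.
W_g, b_g = rng.normal(size=(128, 64)), np.zeros(128)

def g_first_layer(z, y):
    h = W_g @ z + b_g   # first linear layer applied to the latent noise z
    return h + y        # broadcast: add the scalar label y to every unit

# --- Discriminator side: embed y with an extra linear layer, then use label
# projection (Miyato & Koyama, 2018): add <embed(y), phi(x)> to the logit.
W_e = rng.normal(size=(128, 1))  # hypothetical label-embedding layer

def d_logit(phi_x, y, psi_phi_x):
    y_emb = (W_e @ np.array([[y]])).ravel()  # embed scalar label into latent space
    return psi_phi_x + float(y_emb @ phi_x)  # unconditional logit + projection term

z = rng.normal(size=64)
phi_x = rng.normal(size=128)     # hypothetical image feature from D's backbone
print(g_first_layer(z, 0.3).shape)  # (128,)
print(d_logit(phi_x, 0.3, psi_phi_x=0.1))
```

The point of the sketch is that neither operation requires a finite label set: a scalar $y$ enters $G$ by element-wise addition and enters $D$ through a learned linear embedding, so infinitely many distinct labels pose no encoding problem.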

3. ERROR BOUNDS

In this section, we derive the error bounds of a discriminator trained with $\mathcal{L}^{HVDL}$ and $\mathcal{L}^{SVDL}$ under the theoretical loss $\mathcal{L}$. Let

$$p^r_w(y'|y) \triangleq \frac{w^r(y',y)\,p_r(y')}{W^r(y)}, \quad p^g_w(y'|y) \triangleq \frac{w^g(y',y)\,p_g(y')}{W^g(y)}, \quad W^r(y) \triangleq \int w^r(y',y)\,p_r(y')\,dy', \quad W^g(y) \triangleq \int w^g(y',y)\,p_g(y')\,dy'.$$

Denote by $D^*$ the optimal discriminator (Goodfellow et al., 2014), and by $\widehat{D}^{HVDL}$ and $\widehat{D}^{SVDL}$ the discriminators trained with HVDL and SVDL, respectively. Define

$$\Sigma(L) \triangleq \left\{p : \forall t_1, t_2 \in \mathcal{Y},\ \exists L > 0 \text{ s.t. } |p(t_1) - p(t_2)| \leq L|t_1 - t_2|\right\}.$$

Please see Supp. S.5.1 for more details of these notations. Moreover, we will also work with assumptions (A1)-(A4) (stated in Supp. S.5.1); roughly, they require that $-\log D(x,y)$ and $-\log(1-D(x,y))$ are bounded by a constant $U$, that $p_r(x|y)$ and $p_g(x|y)$ vary smoothly in $y$ (with constants $M^r$ and $M^g$), and that $p_r(y)$ and $p_g(y)$ belong to $\Sigma(L^r)$ and $\Sigma(L^g)$, respectively.

Theorem 1. Assume that (A1)-(A4) hold. Then $\forall\delta\in(0,1)$, with probability at least $1-\delta$,

$$\mathcal{L}(\widehat{D}^{HVDL}) - \mathcal{L}(D^*) \leq 2U\left[C^{KDE}_{1,\delta}\sqrt{\frac{\log N^r}{N^r\sigma}} + L^r\sigma^2\right] + 2U\left[C^{KDE}_{2,\delta}\sqrt{\frac{\log N^g}{N^g\sigma}} + L^g\sigma^2\right] + \kappa U(M^r + M^g)$$
$$+ 2U\sqrt{\frac{1}{2}\log\frac{8}{\delta}}\left[\mathbb{E}_{y\sim\hat{p}^{KDE}_r(y)}\sqrt{\frac{1}{N^r_{y,\kappa}}} + \mathbb{E}_{y\sim\hat{p}^{KDE}_g(y)}\sqrt{\frac{1}{N^g_{y,\kappa}}}\right] + \left[\mathcal{L}(\widetilde{D}) - \mathcal{L}(D^*)\right], \quad (13)$$

for some constants $C^{KDE}_{1,\delta}$, $C^{KDE}_{2,\delta}$ depending on $\delta$, where $\widetilde{D} \triangleq \arg\min_{D\in\mathcal{D}}\mathcal{L}(D)$.

Theorem 2. Assume that (A1)-(A4) hold. Then $\forall\delta\in(0,1)$, with probability at least $1-\delta$,

$$\mathcal{L}(\widehat{D}^{SVDL}) - \mathcal{L}(D^*) \leq 2U\left[C^{KDE}_{1,\delta}\sqrt{\frac{\log N^r}{N^r\sigma}} + L^r\sigma^2\right] + 2U\left[C^{KDE}_{2,\delta}\sqrt{\frac{\log N^g}{N^g\sigma}} + L^g\sigma^2\right]$$
$$+ 2U\sqrt{\frac{1}{2}\log\frac{16}{\delta}}\left[\frac{1}{\sqrt{N^r}}\,\mathbb{E}_{y\sim\hat{p}^{KDE}_r(y)}\frac{1}{W^r(y)} + \frac{1}{\sqrt{N^g}}\,\mathbb{E}_{y\sim\hat{p}^{KDE}_g(y)}\frac{1}{W^g(y)}\right]$$
$$+ U\left[M^r\,\mathbb{E}_{y\sim\hat{p}^{KDE}_r(y)}\mathbb{E}_{y'\sim p^r_w(y'|y)}[|y'-y|] + M^g\,\mathbb{E}_{y\sim\hat{p}^{KDE}_g(y)}\mathbb{E}_{y'\sim p^g_w(y'|y)}[|y'-y|]\right] + \left[\mathcal{L}(\widetilde{D}) - \mathcal{L}(D^*)\right], \quad (14)$$

for some constants $C^{KDE}_{1,\delta}$, $C^{KDE}_{2,\delta}$ depending on $\delta$.

Remark 2. The error bounds in both theorems reflect the distance of $\widehat{D}^{HVDL}$ and $\widehat{D}^{SVDL}$ from $D^*$. Enlightened by the two upper bounds, when implementing CcGAN, we should (1) avoid letting $D$ output extreme values (close to 0 or 1) so that $U$ is kept at a moderate level; and (2) avoid using a too small or too large $\kappa$ or $\nu$ to keep the third and fourth terms in Eqs. (13) and (14) moderate. Please see Supp. S.5.2.5 for a more detailed interpretation and Supp. S.5.2 for the proofs.

4. EXPERIMENT

In this section, we study the effectiveness of CcGAN on three datasets, with cGAN (Mirza & Osindero, 2014) as the baseline.

4.1. CIRCULAR 2-D GAUSSIANS

For testing, we choose 360 points evenly distributed on the unit circle as the means of 360 Gaussians. For each Gaussian, we generate 100 samples, yielding a test set with 36,000 samples. It should be noted that, among these 360 Gaussians, at least 240 are not used during training; in other words, at least 240 labels in the test set do not appear in the training set. For each test angle, we generate 100 fake samples from each trained GAN, yielding 36,000 fake samples from each GAN in total. The quality of these fake samples is evaluated. We repeat the whole experiment three times and report in Table 1 the average quality over the three repetitions.

Evaluation metrics and quantitative results: In the label-conditional scenario, each fake sample $x$ with label $y$ is compared with the mean $(\sin(y), \cos(y))$ of the Gaussian on the unit circle with label $y$. A fake sample $x$ is defined as "high-quality" if the Euclidean distance from $x$ to $(\sin(y), \cos(y))$ is smaller than $4\sigma = 0.08$. A mode (i.e., a Gaussian) is said to be recovered if at least one high-quality sample is assigned to it. We also measure the quality of fake samples with label $y$ by computing the 2-Wasserstein distance ($W_2$) (Peyré et al., 2019) between $p_r(x|y) = \mathcal{N}([\sin(y), \cos(y)]^\top, \sigma I)$ and $p_g(x|y) = \mathcal{N}(\mu^g_y, \Sigma^g_y)$, where we assume $p_g(x|y)$ is Gaussian and estimate its mean and covariance by the sample mean and sample covariance of the 100 fake samples with label $y$. In Table 1, we report the average percentage of high-quality fake samples and the average percentage of recovered modes over 3 repetitions, as well as the average $W_2$ over the 360 test angles. We can see that CcGAN substantially outperforms cGAN.

Visual results: We select 12 angles which do not appear in the training set and use cGAN and CcGAN to generate 100 samples for each unobserved angle. Fig. 2 visually confirms the observation from the numerical metrics: the fake samples from the two CcGAN methods are more realistic.
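The high-quality and mode-recovery criteria above can be sketched as follows (illustrative NumPy code, not the paper's evaluation script; the toy samples are placed exactly at the mode means):

```python
import numpy as np

def eval_fake_samples(fake, angles, sigma=0.02):
    """Fraction of "high-quality" samples and fraction of recovered modes.
    A sample with label y is high-quality if its Euclidean distance to the
    mode mean (sin y, cos y) is below 4 * sigma (= 0.08 for sigma = 0.02)."""
    means = np.stack([np.sin(angles), np.cos(angles)], axis=1)
    dist = np.linalg.norm(fake - means, axis=1)
    high_quality = dist < 4 * sigma
    recovered = {a for a, ok in zip(angles, high_quality) if ok}
    return high_quality.mean(), len(recovered) / len(set(angles))

# Toy check: samples drawn exactly at the mode means are all high-quality,
# so every mode is recovered.
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
fake = np.stack([np.sin(angles), np.cos(angles)], axis=1)
hq_frac, mode_frac = eval_fake_samples(fake, angles)
print(hq_frac, mode_frac)  # 1.0 1.0
```

A real evaluation would pass 100 fake samples per angle; the per-sample distance test and the "at least one high-quality sample per mode" rule stay the same.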

4.2. RC-49

Since most benchmark datasets in the GAN literature do not have continuous, scalar regression labels, we propose a new benchmark dataset, RC-49, a synthetic dataset created by rendering 49 3-D chair models at different yaw angles.

Table 1: Average quality of 36,000 fake samples from cGAN and CcGAN over three repetitions, with standard deviations after the "±" symbol. "↓" ("↑") indicates lower (higher) values are preferred.

When training cGAN, we divide [0.1, 89.9] into 150 equal intervals, where each interval is treated as a class. When training CcGAN, we use the rule-of-thumb formulae in Supp. S.4 to select the three hyper-parameters of HVDL and SVDL, i.e., σ ≈ 0.047, κ ≈ 0.004, and ν = 50625. Both cGAN and CcGAN are trained for 30,000 iterations with batch size 256. Afterwards, we evaluate the trained GANs on all 899 angles by generating 200 fake images for each angle. Please see Supp. S.7 for the network architectures and more details about the training/testing setup.

Quantitative and visual results:

To evaluate (1) the visual quality, (2) the intra-label diversity, and (3) the label consistency (whether assigned labels of fake images are consistent with their true labels) of fake images, we study an overall metric and three separate metrics here. (i) Intra-FID (Miyato & Koyama, 2018) is utilized as the overall metric. It computes the Fréchet inception distance (FID) (Heusel et al., 2017) separately at each of the 899 evaluation angles and reports the average FID score. (ii) Naturalness Image Quality Evaluator (NIQE) (Mittal et al., 2012) measures the visual quality only. (iii) Diversity is the average entropy of predicted chair types of fake images over evaluation angles. (iv) Label Score is the average absolute error between assigned labels and predicted labels. Please see Supp. S.7.5 for details of these metrics. We report in Table 2 the performance of each GAN. The example fake images in Fig. 3 and the line graphs in Fig. 5 support the quantitative results. cGAN often generates unrealistic, identical images for a target angle (i.e., low visual quality and low intra-label diversity). "Binning" [0.1, 89.9] into other numbers of classes (e.g., 90 classes and 210 classes) was also tried but does not improve cGAN's performance. In contrast, the strikingly better visual quality and higher intra-label diversity of both CcGAN methods are visually evident. Please note that CcGAN is designed to sacrifice some (not too much) label consistency for better visual quality and higher diversity, which explains why CcGAN does not outperform cGAN in terms of the label score in Table 2.

4.3. UTKFACE

Quantitative and visual results: Similar to the RC-49 experiment, we evaluate the quality of fake images by Intra-FID, NIQE, Diversity (entropy of predicted races), and Label Score. We report in Table 3 the average quality of 60,000 fake images. We also show in Fig. 4 some example fake images from cGAN and CcGAN, and in Fig. 5 line graphs of FID/NIQE versus age.
Consistent with the quantitative comparisons, the visual results show that CcGAN performs much better than cGAN. Table 3: Average quality of 60,000 fake UTKFace images from cGAN and CcGAN with standard deviations after the "±" symbol. "↓" ("↑") indicates lower (higher) values are preferred.

5. CONCLUSION

In this paper, we propose CcGAN, the first generative model for image generation conditional on regression labels. In CcGAN, two novel empirical discriminator losses (HVDL and SVDL), a novel empirical generator loss, and a novel label input method are proposed to overcome the two problems of existing cGANs. The error bounds of a discriminator trained with HVDL and SVDL are studied in this work. A new benchmark dataset, RC-49, is also proposed for the continuous scenario. Finally, we demonstrate the superiority of the proposed CcGAN over cGAN on the Circular 2-D Gaussians, RC-49, and UTKFace datasets.

Remark S.3. It should be noted that, for computational efficiency, the normalizing constants $N^r_{y^r_j+\epsilon^r,\kappa}$, $N^g_{y^g_j+\epsilon^g,\kappa}$, $\sum_{i=1}^{N^r} w^r(y^r_i, y^r_j+\epsilon^r)$, and $\sum_{i=1}^{N^g} w^g(y^g_i, y^g_j+\epsilon^g)$ in Eqs. (9) and (10) are excluded from training and only used for theoretical analysis.

S.3 MORE DETAILS OF THE PROPOSED LABEL INPUT METHOD IN SECTION 2

We propose a novel way to input labels into conditional generative adversarial networks. For the generator, we add a regression label element-wise to the feature map of the first linear layer. For the discriminator, labels are first projected into a latent space learned by an extra linear layer; then, we incorporate the embedded labels into the discriminator by label projection (Miyato & Koyama, 2018). Figs. S.3.6 and S.3.7 visualize our proposed label input method. Please refer to our code for more details.

S.4 A RULE OF THUMB FOR HYPER-PARAMETER SELECTION

In our experiments, we normalize labels to real numbers in [0, 1], and hyper-parameter selection is conducted on the normalized labels. To be more specific, the hyper-parameter $\sigma$ is computed by the rule-of-thumb formula for bandwidth selection in KDE (Silverman, 1986), i.e.,

$$\sigma = \left(\frac{4\hat{\sigma}_y^5}{3N^r}\right)^{1/5} \approx 1.06\,\hat{\sigma}_y\,(N^r)^{-1/5},$$

where $\hat{\sigma}_y$ is the sample standard deviation of the normalized training labels.

For reference, one iteration of the training algorithm in Supp. S.2 (SVDL version) proceeds as follows:

    Create a set of target labels $Y^{d,\epsilon} = \{y_i + \epsilon \mid y_i \in Y^d,\ \epsilon \sim \mathcal{N}(0, \sigma^2),\ i = 1, \ldots, m_d\}$ (D training is conditional on these labels);
    Initialize $\Omega^r_d = \emptyset$, $\Omega^f_d = \emptyset$;
    for $i = 1$ to $m_d$ do
        Randomly choose an image-label pair $(x, y) \in \Omega^r$ satisfying $e^{-\nu(y - y_i - \epsilon)^2} > 10^{-3}$, where $y_i + \epsilon \in Y^{d,\epsilon}$, and let $\Omega^r_d = \Omega^r_d \cup \{(x, y_i + \epsilon)\}$ (this step excludes real images with too-small weights);
        Compute $w^r_i(y, y_i + \epsilon) = e^{-\nu(y_i + \epsilon - y)^2}$;
        Randomly draw a label $y'$ from $U\!\left(y_i + \epsilon - \sqrt{-\log 10^{-3}/\nu},\; y_i + \epsilon + \sqrt{-\log 10^{-3}/\nu}\right)$ and generate a fake image $x' = G(z, y')$ with $z \sim \mathcal{N}(0, I)$; let $\Omega^f_d = \Omega^f_d \cup \{(x', y_i + \epsilon)\}$;
        Compute $w^g_i(y', y_i + \epsilon) = e^{-\nu(y_i + \epsilon - y')^2}$;
    end
    Update D with the samples in $\Omega^r_d$ and $\Omega^f_d$ via gradient-based optimizers based on Eq. (10);
    Draw $m_g$ labels $Y^g$ with replacement from $\Upsilon$;
    Create another set of target labels $Y^{g,\epsilon} = \{y_i + \epsilon \mid y_i \in Y^g,\ \epsilon \sim \mathcal{N}(0, \sigma^2),\ i = 1, \ldots, m_g\}$ (G training is conditional on these labels);
    Generate $m_g$ fake images conditional on $Y^{g,\epsilon}$ and put these image-label pairs in $\Omega^f_g$;
    Update G with the samples in $\Omega^f_g$ via gradient-based optimizers based on Eq. (11);

The hyper-parameter $\kappa$ is selected via

$$\kappa_{base} = \max\left\{y^r_{[2]} - y^r_{[1]},\; y^r_{[3]} - y^r_{[2]},\; \ldots,\; y^r_{[N^r_{uy}]} - y^r_{[N^r_{uy}-1]}\right\},$$

where $y^r_{[l]}$ is the $l$-th smallest normalized distinct real label and $N^r_{uy}$ is the number of normalized distinct labels in the training set. Then $\kappa$ is set as a multiple of $\kappa_{base}$ (i.e., $\kappa = m_\kappa \kappa_{base}$), where the multiplier $m_\kappa$ stands for 50% of the minimum number of neighboring labels used for estimating $p_r(x|y)$ given a label $y$. For example, $m_\kappa = 1$ implies using 2 neighboring labels (one on the left and one on the right). In our experiments, $m_\kappa$ is generally set to 1 or 2.
In some extreme cases, when many distinct labels have too few real samples, we may consider increasing $m_\kappa$. We also found that $\nu = 1/\kappa^2$ works well in our experiments.

S.5 MORE DETAILS OF THEOREMS S.4 AND S.5
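The rule of thumb above can be sketched in a few lines (illustrative NumPy code; the label grid is made up and $m_\kappa = 1$ is assumed):

```python
import numpy as np

def rule_of_thumb(labels, m_kappa=1):
    """Sketch of the Supp. S.4 rule of thumb on normalized labels in [0, 1].
    sigma follows Silverman's KDE bandwidth rule; kappa_base is the largest
    gap between consecutive distinct labels; nu = 1 / kappa^2."""
    labels = np.asarray(labels, dtype=float)
    sigma = 1.06 * labels.std() * len(labels) ** (-1 / 5)  # Silverman's rule
    uy = np.unique(labels)                  # sorted distinct labels
    kappa_base = np.max(np.diff(uy))        # max gap between neighboring labels
    kappa = m_kappa * kappa_base
    nu = 1 / kappa ** 2
    return sigma, kappa, nu

# Hypothetical example: 899 distinct angle labels, normalized, 2 images each.
labels = np.repeat(np.arange(1, 900) / 900.0, 2)
sigma, kappa, nu = rule_of_thumb(labels)
print(sigma, kappa, nu)
```

Because the labels are normalized first, the same formulas apply unchanged to angles, ages, or any other scalar regression label.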

S.5.1 SOME NECESSARY DEFINITIONS AND NOTATIONS

• The hypothesis space $\mathcal{D}$ is the set of functions that can be represented by $D$ (a neural network with a determined architecture).
• In the HVDL case, denote by $p^{y,\kappa}_r(x) \triangleq \int_{y-\kappa}^{y+\kappa} p_r(x|y')\,p_r(y')\,dy'$ the marginal distribution of real images with labels in $[y-\kappa, y+\kappa]$, and similarly $p^{y,\kappa}_g(x)$ for fake images.
• In the SVDL case, given $y$ and the weight functions (Eq. (8)), if the numbers of real and fake images are infinite, the empirical densities converge to $p^{y,w^r}_r(x) \triangleq \int p_r(x|y')\,\frac{w^r(y',y)\,p_r(y')}{W^r(y)}\,dy'$ and $p^{y,w^g}_g(x) \triangleq \int p_g(x|y')\,\frac{w^g(y',y)\,p_g(y')}{W^g(y)}\,dy'$, respectively, where $W^r(y) \triangleq \int w^r(y',y)\,p_r(y')\,dy'$ and $W^g(y) \triangleq \int w^g(y',y)\,p_g(y')\,dy'$.
• Let $p^r_w(y'|y) \triangleq \frac{w^r(y',y)\,p_r(y')}{W^r(y)}$ and $p^g_w(y'|y) \triangleq \frac{w^g(y',y)\,p_g(y')}{W^g(y)}$.
• The Hölder class defined in Definition 1 is a set of functions with bounded second derivatives, which controls the variation of a function as its argument changes. (A4) assumes that the two probability density functions $p_r(y)$ and $p_g(y)$ are in the Hölder class.
• Given $G$, the optimal discriminator minimizing $\mathcal{L}$ takes the form (Goodfellow et al., 2014)
$$D^*(x,y) = \frac{p_r(x,y)}{p_r(x,y) + p_g(x,y)}.$$

Recalling the notations and assumptions in Sections 3 and S.5.1, we derive the following lemmas.

Lemma S.1. Suppose that (A1), (A2) and (A4) hold. Then $\forall\delta\in(0,1)$, with probability at least $1-\delta$,
$$\sup_{D\in\mathcal{D}}\left|\frac{1}{N^r_{y,\kappa}}\sum_{i=1}^{N^r}\mathbb{1}_{\{|y-y^r_i|\leq\kappa\}}[-\log D(x^r_i,y)] - \mathbb{E}_{x\sim p_r(x|y)}[-\log D(x,y)]\right| \leq U\sqrt{\frac{1}{2N^r_{y,\kappa}}\log\frac{2}{\delta}} + \frac{\kappa U M^r}{2}, \quad (S.16)$$
for a given $y$.

Proof. The triangle inequality yields
$$\sup_{D\in\mathcal{D}}\left|\frac{1}{N^r_{y,\kappa}}\sum_{i=1}^{N^r}\mathbb{1}_{\{|y-y^r_i|\leq\kappa\}}[-\log D(x^r_i,y)] - \mathbb{E}_{x\sim p_r(x|y)}[-\log D(x,y)]\right|$$
$$\leq \sup_{D\in\mathcal{D}}\left|\frac{1}{N^r_{y,\kappa}}\sum_{i=1}^{N^r}\mathbb{1}_{\{|y-y^r_i|\leq\kappa\}}[-\log D(x^r_i,y)] - \mathbb{E}_{x\sim p^{y,\kappa}_r(x)}[-\log D(x,y)]\right|$$
$$+ \sup_{D\in\mathcal{D}}\left|\mathbb{E}_{x\sim p^{y,\kappa}_r(x)}[-\log D(x,y)] - \mathbb{E}_{x\sim p_r(x|y)}[-\log D(x,y)]\right|.$$
We then bound the two terms on the RHS separately:
1. Real images with labels in $[y-\kappa, y+\kappa]$ can be seen as independent samples from $p^{y,\kappa}_r(x)$.
Then the first term can be bounded by applying Hoeffding's inequality to the bounded functions $-\log D(\cdot, y)/U$: $\forall\delta\in(0,1)$, with probability at least $1-\delta$,
$$\sup_{D\in\mathcal{D}}\left|\frac{1}{N^r_{y,\kappa}}\sum_{i=1}^{N^r}\mathbb{1}_{\{|y-y^r_i|\leq\kappa\}}[-\log D(x^r_i,y)] - \mathbb{E}_{x\sim p^{y,\kappa}_r(x)}[-\log D(x,y)]\right| \leq U\sqrt{\frac{1}{2N^r_{y,\kappa}}\log\frac{2}{\delta}}. \quad (S.17)$$
2. The second term is bounded by $\kappa U M^r/2$ in Eq. (S.19), using the smoothness assumption on $p_r(x|y)$.
Combining Eqs. (S.17) and (S.19) gives Eq. (S.16), which finishes the proof.

Applying the same proof strategy to the fake images $x^g$ and the generator distribution $p_g(x|y)$ yields the following lemma.

Lemma S.2. Suppose that (A1), (A3) and (A4) hold. Then $\forall\delta\in(0,1)$, with probability at least $1-\delta$,
$$\sup_{D\in\mathcal{D}}\left|\frac{1}{N^g_{y,\kappa}}\sum_{i=1}^{N^g}\mathbb{1}_{\{|y-y^g_i|\leq\kappa\}}[-\log(1-D(x^g_i,y))] - \mathbb{E}_{x\sim p_g(x|y)}[-\log(1-D(x,y))]\right| \leq U\sqrt{\frac{1}{2N^g_{y,\kappa}}\log\frac{2}{\delta}} + \frac{\kappa U M^g}{2}, \quad (S.20)$$
for a given $y$.

Proof. Omitted because it is almost identical to the proof of Lemma S.1.

The following two lemmas provide the corresponding bounds for SVDL.

Lemma S.3. Suppose that (A1), (A2) and (A4) hold. Then $\forall\delta\in(0,1)$, with probability at least $1-\delta$,
$$\sup_{D\in\mathcal{D}}\left|\frac{\frac{1}{N^r}\sum_{i=1}^{N^r} w^r(y^r_i,y)[-\log D(x^r_i,y)]}{\frac{1}{N^r}\sum_{i=1}^{N^r} w^r(y^r_i,y)} - \mathbb{E}_{x\sim p_r(x|y)}[-\log D(x,y)]\right| \leq \frac{U}{W^r(y)}\sqrt{\frac{1}{2N^r}\log\frac{4}{\delta}} + \frac{U M^r}{2}\,\mathbb{E}_{y'\sim p^r_w(y'|y)}[|y'-y|], \quad (S.21)$$
for a given $y$.

Proof. For brevity, denote $f(x,y) = -\log D(x,y)$ and $\mathcal{F} = \{-\log D : D\in\mathcal{D}\}$. Then,
$$\sup_{D\in\mathcal{D}}\left|\frac{\frac{1}{N^r}\sum_{i=1}^{N^r} w^r(y^r_i,y)[-\log D(x^r_i,y)]}{\frac{1}{N^r}\sum_{i=1}^{N^r} w^r(y^r_i,y)} - \mathbb{E}_{x\sim p_r(x|y)}[-\log D(x,y)]\right| = \sup_{f\in\mathcal{F}}\left|\frac{\frac{1}{N^r}\sum_{i=1}^{N^r} w^r(y^r_i,y)f(x^r_i,y)}{\frac{1}{N^r}\sum_{i=1}^{N^r} w^r(y^r_i,y)} - \mathbb{E}_{x\sim p_r(x|y)}[f(x,y)]\right|$$
$$\leq \sup_{f\in\mathcal{F}}\left|\frac{\frac{1}{N^r}\sum_{i=1}^{N^r} w^r(y^r_i,y)f(x^r_i,y)}{\frac{1}{N^r}\sum_{i=1}^{N^r} w^r(y^r_i,y)} - \mathbb{E}_{x\sim p^{y,w^r}_r(x)}[f(x,y)]\right| + \sup_{f\in\mathcal{F}}\left|\mathbb{E}_{x\sim p^{y,w^r}_r(x)}[f(x,y)] - \mathbb{E}_{x\sim p_r(x|y)}[f(x,y)]\right|, \quad (S.22)$$
where the inequality is the triangle inequality. We then derive bounds for the two terms of the last line.
1. For the first term, we can further split it into two parts:
$$\left|\frac{\frac{1}{N^r}\sum_i w^r(y^r_i,y)f(x^r_i,y)}{\frac{1}{N^r}\sum_i w^r(y^r_i,y)} - \mathbb{E}_{x\sim p^{y,w^r}_r(x)}[f(x,y)]\right| \leq \left|\frac{\frac{1}{N^r}\sum_i w^r(y^r_i,y)f(x^r_i,y)}{\frac{1}{N^r}\sum_i w^r(y^r_i,y)} - \frac{\frac{1}{N^r}\sum_i w^r(y^r_i,y)f(x^r_i,y)}{W^r(y)}\right|$$
$$+ \left|\frac{\frac{1}{N^r}\sum_i w^r(y^r_i,y)f(x^r_i,y)}{W^r(y)} - \mathbb{E}_{x\sim p^{y,w^r}_r(x)}[f(x,y)]\right|. \quad (S.23)$$
Focusing on the first part of the RHS of Eq. (S.23): by (A1),
$$\left|\frac{\frac{1}{N^r}\sum_i w^r(y^r_i,y)f(x^r_i,y)}{\frac{1}{N^r}\sum_i w^r(y^r_i,y)} - \frac{\frac{1}{N^r}\sum_i w^r(y^r_i,y)f(x^r_i,y)}{W^r(y)}\right| \leq U\,\frac{\left|\frac{1}{N^r}\sum_i w^r(y^r_i,y) - W^r(y)\right|}{W^r(y)}.$$
Note that $\forall y, y'$, $w^r(y',y) = e^{-\nu|y-y'|^2} \leq 1$; hence, given $y$, $w^r(y',y)$ is a random variable bounded by 1. Applying Hoeffding's inequality to the numerator above yields that, with probability at least $1-\delta'$,
$$\left|\frac{\frac{1}{N^r}\sum_i w^r(y^r_i,y)f(x^r_i,y)}{\frac{1}{N^r}\sum_i w^r(y^r_i,y)} - \frac{\frac{1}{N^r}\sum_i w^r(y^r_i,y)f(x^r_i,y)}{W^r(y)}\right| \leq \frac{U}{W^r(y)}\sqrt{\frac{1}{2N^r}\log\frac{2}{\delta'}}. \quad (S.24)$$
Then consider the second part of the RHS of Eq. (S.23). Recall that $p^{y,w^r}_r(x) \triangleq \int p_r(x|y')\frac{w^r(y',y)\,p_r(y')}{W^r(y)}\,dy'$. Thus,
$$\left|\frac{\frac{1}{N^r}\sum_i w^r(y^r_i,y)f(x^r_i,y)}{W^r(y)} - \mathbb{E}_{x\sim p^{y,w^r}_r(x)}[f(x,y)]\right| = \frac{1}{W^r(y)}\left|\frac{1}{N^r}\sum_i w^r(y^r_i,y)f(x^r_i,y) - \mathbb{E}_{(x,y')\sim p_r(x,y')}[w^r(y',y)f(x,y)]\right|,$$
where $p_r(x,y') = p_r(x|y')\,p_r(y')$ denotes the joint distribution of a real image and its label. Again, since $w^r(y',y)f(x,y)$ is uniformly bounded by $U$ under (A1), we can apply Hoeffding's inequality, which implies that with probability at least $1-\delta'$ the above is upper bounded by
$$\frac{U}{W^r(y)}\sqrt{\frac{1}{2N^r}\log\frac{2}{\delta'}}. \quad (S.25)$$
Combining Eqs. (S.24) and (S.25) and setting $\delta' = \delta/2$, we have with probability at least $1-\delta$,
$$\left|\frac{\frac{1}{N^r}\sum_i w^r(y^r_i,y)f(x^r_i,y)}{\frac{1}{N^r}\sum_i w^r(y^r_i,y)} - \mathbb{E}_{x\sim p^{y,w^r}_r(x)}[f(x,y)]\right| \leq \frac{U}{W^r(y)}\sqrt{\frac{1}{2N^r}\log\frac{4}{\delta}}. \quad (S.26)$$
Since this holds $\forall f\in\mathcal{F}$, taking the supremum over $f$ yields
$$\sup_{f\in\mathcal{F}}\left|\frac{\frac{1}{N^r}\sum_{i=1}^{N^r} w^r(y^r_i,y)f(x^r_i,y)}{\frac{1}{N^r}\sum_{i=1}^{N^r} w^r(y^r_i,y)} - \mathbb{E}_{x\sim p^{y,w^r}_r(x)}[f(x,y)]\right| \leq \frac{U}{W^r(y)}\sqrt{\frac{1}{2N^r}\log\frac{4}{\delta}}. \quad (S.27)$$
2. For the second term of Eq. (S.22), the smoothness assumption (A2) gives
$$\sup_{f\in\mathcal{F}}\left|\mathbb{E}_{x\sim p^{y,w^r}_r(x)}[f(x,y)] - \mathbb{E}_{x\sim p_r(x|y)}[f(x,y)]\right| \leq \frac{U M^r}{2}\,\mathbb{E}_{y'\sim p^r_w(y'|y)}[|y'-y|].$$
Therefore, combining Eq. (S.27) with the bound above, with probability at least $1-\delta$,
$$\sup_{D\in\mathcal{D}}\left|\frac{\frac{1}{N^r}\sum_{i=1}^{N^r} w^r(y^r_i,y)[-\log D(x^r_i,y)]}{\frac{1}{N^r}\sum_{i=1}^{N^r} w^r(y^r_i,y)} - \mathbb{E}_{x\sim p_r(x|y)}[-\log D(x,y)]\right| \leq \frac{U}{W^r(y)}\sqrt{\frac{1}{2N^r}\log\frac{4}{\delta}} + \frac{U M^r}{2}\,\mathbb{E}_{y'\sim p^r_w(y'|y)}[|y'-y|].$$
This finishes the proof.

Lemma S.4. Suppose that (A1), (A3) and (A4) hold. Then $\forall\delta\in(0,1)$, with probability at least $1-\delta$,
$$\sup_{D\in\mathcal{D}}\left|\frac{\frac{1}{N^g}\sum_{i=1}^{N^g} w^g(y^g_i,y)[-\log(1-D(x^g_i,y))]}{\frac{1}{N^g}\sum_{i=1}^{N^g} w^g(y^g_i,y)} - \mathbb{E}_{x\sim p_g(x|y)}[-\log(1-D(x,y))]\right| \leq \frac{U}{W^g(y)}\sqrt{\frac{1}{2N^g}\log\frac{4}{\delta}} + \frac{U M^g}{2}\,\mathbb{E}_{y'\sim p^g_w(y'|y)}[|y'-y|], \quad (S.28)$$
for a given $y$.

Proof. Omitted because it is almost identical to the proof of Lemma S.3.

As introduced in Section 2, we use KDEs with a Gaussian kernel for the marginal label distributions. The next theorem characterizes the difference between $p_r(y)$, $p_g(y)$ and their KDEs using $n$ i.i.d. samples.

S.5.2.2 ERROR BOUNDS FOR HVDL AND SVDL

Based on the lemmas and theorems in Supp. S.5.2.1, we derive the error bounds of HVDL and SVDL, which will be used in the proofs of Theorems 1 and 2.

Theorem S.4. Assume that (A1)-(A4) hold. Then, for all $\delta \in (0,1)$, with probability at least $1-\delta$,
$$\sup_{D\in\mathcal D} \left| \widehat{\mathcal L}^{HVDL}(D) - \mathcal L(D) \right| \le U \left( C^{KDE}_{1,\delta} \sqrt{\frac{\log N^r}{N^r \sigma}} + L_r \sigma^2 \right) + U \left( C^{KDE}_{2,\delta} \sqrt{\frac{\log N^g}{N^g \sigma}} + L_g \sigma^2 \right) + \frac{\kappa U (M_r + M_g)}{2} + U \sqrt{\frac{1}{2}\log\frac{8}{\delta}} \left( \mathbb{E}_{y\sim \hat p^{KDE}_r(y)} \sqrt{\frac{1}{N^r_{y,\kappa}}} + \mathbb{E}_{y\sim \hat p^{KDE}_g(y)} \sqrt{\frac{1}{N^g_{y,\kappa}}} \right), \quad \text{(S.32)}$$
for some constants $C^{KDE}_{1,\delta}$, $C^{KDE}_{2,\delta}$ depending on $\delta$.

Proof. We first decompose $\sup_{D\in\mathcal D} | \widehat{\mathcal L}^{HVDL}(D) - \mathcal L(D) |$ into four terms, which can be bounded separately as follows.

1. The first term can be bounded by using Theorem S.3 and the boundedness of $D$ and $y \in [0,1]$: for all $\delta_1 \in (0,1)$, with probability at least $1-\delta_1$, it is at most $U \big( C^{KDE}_{1,\delta_1} \sqrt{\log N^r / (N^r\sigma)} + L_r\sigma^2 \big)$, for some constant $C^{KDE}_{1,\delta_1}$ depending on $\delta_1$. (S.33)

2. The second term can likewise be bounded by using Theorem S.3 and the boundedness of $D$ and $y \in [0,1]$: for all $\delta_2 \in (0,1)$, with probability at least $1-\delta_2$,
$$\sup_{D\in\mathcal D} \left| \iint \left[ -\log(1-D(x,y)) \right] p_g(x|y)\, dx\, \big( p_g(y) - \hat p^{KDE}_g(y) \big)\, dy \right| \le U \left( C^{KDE}_{2,\delta_2} \sqrt{\frac{\log N^g}{N^g \sigma}} + L_g \sigma^2 \right), \quad \text{(S.34)}$$
for some constant $C^{KDE}_{2,\delta_2}$ depending on $\delta_2$.

3. The third term can be bounded by using Lemmas S.1 and S.2: for all $\delta_3 \in (0,1)$, with probability at least $1-\delta_3$,
$$\sup_{D\in\mathcal D} \left| \int \left( \frac{1}{N^r_{y,\kappa}} \sum_{i=1}^{N^r} \mathbb{1}_{\{|y-y^r_i|\le\kappa\}} \left[ -\log D(x^r_i,y) \right] - \mathbb{E}_{x\sim p_r(x|y)}\left[ -\log D(x,y) \right] \right) \hat p^{KDE}_r(y)\, dy \right| \le \int \left( U\sqrt{\frac{1}{2N^r_{y,\kappa}} \log\frac{2}{\delta_3}} + \frac{\kappa U M_r}{2} \right) \hat p^{KDE}_r(y)\, dy.$$
Note that $N^r_{y,\kappa} = \sum_{i=1}^{N^r} \mathbb{1}_{\{|y-y^r_i|\le\kappa\}}$, which is a random variable of the $y_i$'s. The above can therefore be expressed as
$$\sup_{D\in\mathcal D} \left| \int \left( \frac{1}{N^r_{y,\kappa}} \sum_{i=1}^{N^r} \mathbb{1}_{\{|y-y^r_i|\le\kappa\}} \left[ -\log D(x^r_i,y) \right] - \mathbb{E}_{x\sim p_r(x|y)}\left[ -\log D(x,y) \right] \right) \hat p^{KDE}_r(y)\, dy \right| \le \frac{\kappa U M_r}{2} + U \sqrt{\frac{1}{2}\log\frac{2}{\delta_3}}\, \mathbb{E}_{y\sim \hat p^{KDE}_r(y)} \sqrt{\frac{1}{N^r_{y,\kappa}}}. \quad \text{(S.35)}$$

4. Similarly, for the fourth term, for all $\delta_4 \in (0,1)$, with probability at least $1-\delta_4$,
$$\sup_{D\in\mathcal D} \left| \int \left( \frac{1}{N^g_{y,\kappa}} \sum_{i=1}^{N^g} \mathbb{1}_{\{|y-y^g_i|\le\kappa\}} \left[ -\log(1-D(x^g_i,y)) \right] - \mathbb{E}_{x\sim p_g(x|y)}\left[ -\log(1-D(x,y)) \right] \right) \hat p^{KDE}_g(y)\, dy \right| \le \frac{\kappa U M_g}{2} + U \sqrt{\frac{1}{2}\log\frac{2}{\delta_4}}\, \mathbb{E}_{y\sim \hat p^{KDE}_g(y)} \sqrt{\frac{1}{N^g_{y,\kappa}}}. \quad \text{(S.36)}$$

With $\delta_1 = \delta_2 = \delta_3 = \delta_4 = \delta/4$, combining Eq. (S.33)-(S.36) leads to the upper bound in Theorem S.4.
Theorem S.5. Assume that (A1)-(A4) hold. Then, for all $\delta \in (0,1)$, with probability at least $1-\delta$,
$$\sup_{D\in\mathcal D} \left| \widehat{\mathcal L}^{SVDL}(D) - \mathcal L(D) \right| \le U \left( C^{KDE}_{1,\delta} \sqrt{\frac{\log N^r}{N^r \sigma}} + L_r \sigma^2 \right) + U \left( C^{KDE}_{2,\delta} \sqrt{\frac{\log N^g}{N^g \sigma}} + L_g \sigma^2 \right) + U \sqrt{\frac{1}{2}\log\frac{16}{\delta}} \left( \frac{1}{\sqrt{N^r}}\, \mathbb{E}_{y\sim \hat p^{KDE}_r(y)} \frac{1}{W_r(y)} + \frac{1}{\sqrt{N^g}}\, \mathbb{E}_{y\sim \hat p^{KDE}_g(y)} \frac{1}{W_g(y)} \right) + \frac{U}{2} \left( M_r\, \mathbb{E}_{y\sim \hat p^{KDE}_r(y)} \mathbb{E}_{y'\sim p^r_w(y'|y)} |y'-y| + M_g\, \mathbb{E}_{y\sim \hat p^{KDE}_g(y)} \mathbb{E}_{y'\sim p^g_w(y'|y)} |y'-y| \right), \quad \text{(S.37)}$$
for some constants $C^{KDE}_{1,\delta}$, $C^{KDE}_{2,\delta}$ depending on $\delta$.

Proof. Similar to the decomposition for Theorem S.4, we can decompose $\sup_{D\in\mathcal D} | \widehat{\mathcal L}^{SVDL}(D) - \mathcal L(D) |$ into four terms, which can be bounded by using Theorem S.3, the boundedness of $D$, Lemma S.3, and Lemma S.4. The details are omitted because they are almost identical to those of Theorem S.4.

S.5.2.3 PROOF OF THEOREM 1

Based on Theorem S.4, we derive Theorem 1.

Proof. We first decompose $\mathcal L(\widehat D_{HVDL}) - \mathcal L(D^*)$ as follows:
$$\begin{aligned} \mathcal L(\widehat D_{HVDL}) - \mathcal L(D^*) =\ & \mathcal L(\widehat D_{HVDL}) - \widehat{\mathcal L}^{HVDL}(\widehat D_{HVDL}) + \widehat{\mathcal L}^{HVDL}(\widehat D_{HVDL}) - \widehat{\mathcal L}^{HVDL}(\bar D) + \widehat{\mathcal L}^{HVDL}(\bar D) - \mathcal L(\bar D) + \mathcal L(\bar D) - \mathcal L(D^*) \\ \le\ & 2 \sup_{D\in\mathcal D} \left| \widehat{\mathcal L}^{HVDL}(D) - \mathcal L(D) \right| + \mathcal L(\bar D) - \mathcal L(D^*) \quad \left(\text{by } \widehat{\mathcal L}^{HVDL}(\widehat D_{HVDL}) - \widehat{\mathcal L}^{HVDL}(\bar D) \le 0\right) \\ \le\ & 2U \left( C^{KDE}_{1,\delta} \sqrt{\frac{\log N^r}{N^r \sigma}} + L_r \sigma^2 \right) + 2U \left( C^{KDE}_{2,\delta} \sqrt{\frac{\log N^g}{N^g \sigma}} + L_g \sigma^2 \right) + \kappa U (M_r + M_g) \\ & + 2U \sqrt{\frac{1}{2}\log\frac{8}{\delta}} \left( \mathbb{E}_{y\sim \hat p^{KDE}_r(y)} \sqrt{\frac{1}{N^r_{y,\kappa}}} + \mathbb{E}_{y\sim \hat p^{KDE}_g(y)} \sqrt{\frac{1}{N^g_{y,\kappa}}} \right) + \mathcal L(\bar D) - \mathcal L(D^*) \quad \text{(by Theorem S.4)}. \end{aligned} \quad \text{(S.38)}$$

S.5.2.4 PROOF OF THEOREM 2

Based on Theorem S.5, we derive Theorem 2.

Proof. The details are omitted because the proof is almost identical to that of Theorem 1 in Supp. S.5.2.3.

S.5.2.5 INTERPRETATION OF THEOREMS 1 AND 2

Both theorems imply that HVDL and SVDL perform well if the output of $D$ is not too close to 0 or 1 (i.e., they favor a small $U$). The first two terms in both upper bounds control the quality of the KDE: the KDE works better if we have larger $N^r$ and $N^g$ and a smaller $\sigma$. The remaining terms of the two bounds differ. In the HVDL case, we favor smaller $\kappa$, $M_r$, and $M_g$. However, we should avoid setting $\kappa$ too small because we prefer larger $N^r_{y,\kappa}$ and $N^g_{y,\kappa}$. In the SVDL case, we prefer small $M_r$ and $M_g$ but large $W_r(y)$ and $W_g(y)$. Large $W_r(y)$ and $W_g(y)$ imply that the weight function decays slowly (i.e., a small $\nu$; similar to large $N^r_{y,\kappa}$ and $N^g_{y,\kappa}$ in Eq. (S.32)). However, we should avoid setting $\nu$ too small because a small $\nu$ leads to large $\mathbb{E}_{y'\sim p^r_w(y'|y)}|y'-y|$ and $\mathbb{E}_{y'\sim p^g_w(y'|y)}|y'-y|$ (i.e., $y'$'s that are far away from $y$ receive large weights). In our experiments, we use some rule-of-thumb formulae to select $\kappa$ and $\nu$. A refined hyper-parameter selection method is left as future work.
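To make the $\kappa$/$\nu$ trade-off above concrete, the sketch below (our own illustration, not the paper's code; `hard_weight` and `soft_weight` are hypothetical helper names) computes the hard-vicinity indicator and the soft Gaussian weight $w(y', y) = e^{-\nu(y-y')^2}$ that underlie HVDL and SVDL:

```python
import math

def hard_weight(y_prime, y, kappa):
    # HVDL: a training label y' contributes iff it lies in the hard vicinity |y - y'| <= kappa.
    return 1.0 if abs(y - y_prime) <= kappa else 0.0

def soft_weight(y_prime, y, nu):
    # SVDL: every label contributes, with weight decaying as exp(-nu * (y - y')^2).
    return math.exp(-nu * (y - y_prime) ** 2)

# Hyper-parameter values borrowed from the simulation setup (Supp. S.6.2).
kappa, nu = 0.017, 3600.0
y = 0.50
for y_prime in (0.50, 0.51, 0.55):
    print(y_prime, hard_weight(y_prime, y, kappa), round(soft_weight(y_prime, y, nu), 4))
```

A smaller $\kappa$ (or a larger $\nu$) shrinks the effective vicinity, so fewer samples receive non-negligible weight; this is exactly the bias/variance trade-off discussed above.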

S.6.2 TRAINING SETUPS

The cGAN and CcGAN are trained for 6,000 iterations on the training set with the Adam (Kingma & Ba, 2015) optimizer (with β1 = 0.5 and β2 = 0.999), a constant learning rate of 5 × 10^-5, and a batch size of 128. The rule-of-thumb formulae in Supp. S.4 are used to select the hyper-parameters for HVDL and SVDL, where we let m_κ = 2. Thus, the three hyper-parameters in this experiment are set as follows: σ = 0.074, κ = 0.017, ν = 3600.

Table S.6.1: Network architectures for the generator and discriminator of cGAN in the simulation. "fc" denotes a fully-connected layer. "BN" stands for batch normalization. The label y is treated as a class label and encoded by label embedding (Akata et al., 2015), so its dimension equals the number of distinct angles in the training set (i.e., y ∈ R^120).
(a) Generator

In CcGAN, the label y is treated as a real scalar, so its dimension is 1. We do not directly input y into the generator and discriminator; instead, we first convert each y into the coordinates of the mean represented by this y, i.e., (sin(y), cos(y)), and then insert these coordinates into the networks.

As the number of Gaussians decreases, the continuous scenario gradually degenerates to the categorical scenario, so the assumption that a small perturbation to y results in a negligible change to p(x|y) is no longer satisfied. Consequently, the 2-Wasserstein distances of the two proposed CcGAN methods gradually increase and eventually surpass the 2-Wasserstein distance of cGAN when the number of Gaussians is small (e.g., fewer than 40). Note that reducing the number of Gaussians in the training data generation will not improve the testing performance of cGAN, because many angles seen at the testing stage (we evaluate each method on 360 angles) do not appear in the training set.

To generate RC-49, we first randomly select 49 3-D chair models from the "Chair" category provided by ShapeNet (Chang et al., 2015). Then we use Blender v2.79 to render these 3-D models.
Specifically, during rendering, we rotate each chair model about the yaw axis to a degree between 0.1° and 89.9° (with an angle resolution of 0.1°), where we use the scene image mode to compose our dataset. The rendered images are converted from the RGBA to the RGB color model. In total, the RC-49 dataset consists of 44,051 images of size 64 × 64 in the PNG format.

(Table S.6.1(a), generator input: z ∈ R^2, z ∼ N(0, I); y ∈ R^120; concat(z, y) ∈ R^122)

S.7.2 NETWORK ARCHITECTURES

The RC-49 dataset is more sophisticated than the simulation, so it requires deeper networks. We employ the SNGAN architecture (Miyato et al., 2018), consisting of residual blocks for the generator and the discriminator, in both cGAN and CcGAN. Moreover, for the generator in cGAN, the regression labels are input into the network by label embedding (Akata et al., 2015) and conditional batch normalization (De Vries et al., 2017). For the discriminator in cGAN, the regression labels are fed into the network by label embedding and label projection (Miyato & Koyama, 2018). For CcGAN, the regression labels are fed into the networks by our proposed label input method in Section 2. Please refer to our code for more details about the network specifications of cGAN and CcGAN.

S.7.3 TRAINING SETUPS

The cGAN and CcGAN are trained for 30,000 iterations on the training set with the Adam (Kingma & Ba, 2015) optimizer (with β1 = 0.5 and β2 = 0.999), a constant learning rate of 10^-4, and a batch size of 256. The rule-of-thumb formulae in Supp. S.4 are used to select the hyper-parameters for HVDL and SVDL, where we let m_κ = 2. Thus, the three hyper-parameters in this experiment are set as follows: σ = 0.0473, κ = 0.004, ν = 50625.

S.7.4 TESTING SETUPS

The RC-49 dataset consists of 899 distinct yaw angles, and at each angle there are 49 images (corresponding to 49 types of chairs). At the test stage, we ask the trained cGAN or CcGAN to generate 200 fake images at each of these 899 yaw angles. Please note that only 450 of these 899 yaw angles are seen at the training stage, so real images at the remaining 449 angles are not used in training. We evaluate the quality of the fake images from three perspectives: visual quality, intra-label diversity, and label consistency. One overall metric (Intra-FID) and three separate metrics (NIQE, Diversity, and Label Score) are used. Their details are given in Supp. S.7.5.

S.7.5 PERFORMANCE MEASURES

Before conducting the evaluation in terms of the four metrics, we first train an autoencoder (AE), a regression CNN, and a classification CNN on all real images in RC-49. The bottleneck dimension of the AE is 512, and the AE is trained to reconstruct the real images in RC-49 with the MSE loss. The regression CNN is trained to predict the yaw angle of a given image. The classification CNN is trained to predict the chair type of a given image. The AE and both CNNs are trained for 200 epochs with a batch size of 256.

• Intra-FID (Miyato & Koyama, 2018): We take Intra-FID as the overall score to evaluate the quality of fake images; smaller is better. At each evaluation angle, we compute the FID (Heusel et al., 2017) between the 49 real images and 200 fake images in terms of the bottleneck features of the pre-trained AE. The Intra-FID score is the average FID over all 899 evaluation angles. Please note that we also tried using the classification CNN to compute Intra-FID, but the resulting scores vary in a very wide range and sometimes clearly contradict the three separate metrics.

• NIQE (Mittal et al., 2012): NIQE evaluates the visual quality of fake images with the real images as reference; smaller is better. We train one NIQE model with the 49 real images at each of the 899 angles, so we have 899 NIQE models. During evaluation, a NIQE score is computed for each evaluation angle based on the NIQE model at that angle. Finally, we report the average and standard deviation of the 899 NIQE scores over the 899 yaw angles. Note that NIQE is implemented by the NIQE module in MATLAB.

• Diversity: Diversity evaluates the intra-label diversity; larger is better. In RC-49, there are 49 chair types. At each evaluation angle, we ask the pre-trained classification CNN to predict the chair types of the 200 fake images, and an entropy is computed based on these predicted chair types.
The Diversity reported in Table 2 is the average of the 899 entropies over all evaluation angles.

• Label Score: Label Score evaluates the label consistency; smaller is better. We ask the pre-trained regression CNN to predict the yaw angles of all fake images, and the predicted angles are then compared with the assigned angles. The Label Score is defined as the average absolute distance between the predicted and assigned angles over all fake images, which is equivalent to the mean absolute error (MAE).
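The two separate metrics just described can be sketched as follows (a minimal illustration with our own naming and toy inputs; the real evaluation uses CNN predictions rather than hand-written lists). Diversity is the entropy of the predicted chair types at one angle, and Label Score is the MAE between predicted and assigned angles:

```python
import math
from collections import Counter

def diversity(predicted_types):
    """Entropy (in nats) of the predicted chair types among fake images at one angle."""
    counts = Counter(predicted_types)
    n = len(predicted_types)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def label_score(predicted_angles, assigned_angles):
    """Mean absolute error between predicted and assigned angles."""
    return sum(abs(p - a) for p, a in zip(predicted_angles, assigned_angles)) / len(predicted_angles)

# Toy example: 4 fake images at one evaluation angle.
print(diversity(["chair_1", "chair_2", "chair_1", "chair_3"]))  # higher = more diverse
print(label_score([44.9, 45.3, 46.0, 44.0], [45.0] * 4))        # lower = more label-consistent
```

In the actual protocol, `diversity` would be averaged over all 899 evaluation angles and `label_score` computed over all fake images.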

LABEL

To study how the strength of the correlation between the image x and its label y (i.e., the label power) influences the performance of cGAN and CcGAN, in this study we randomly add Gaussian noise with a preset standard deviation to the raw regression labels in the training set. The strength of the correlation is controlled by the standard deviation of the Gaussian noise, which varies from

First, (P1) is still unsolved, because DiffAugment does not provide a solution better than binning the regression labels into a series of disjoint intervals to tackle the problem that some regression labels do not exist in the training set. Second, since DiffAugment is designed for the unconditional and class-conditional scenarios, where the number of distinct conditions is always finite and known, DiffAugment does not provide a solution to (P2). Besides these two unsolved problems, another concern with DiffAugment in the continuous scenario is that the ordinal information in the regression labels is not utilized, while our CcGAN implicitly uses this ordinal information to construct the soft/hard vicinity.

S.9 POTENTIAL APPLICATIONS AND IMPACTS OF CCGANS

Generally, there are three label scenarios where we can apply CcGANs: Scenario I, mathematically continuous labels (e.g., angles); Scenario II, discrete but ordinal labels (e.g., ages); and Scenario III, discrete, categorical labels that share close relationships among different label categories (e.g., fine-grained bird image generation). CcGANs can have potential applications in all three scenarios. For example, in Scenario I, CcGANs could have potential impacts on autonomous driving, which involves predicting the steering angle (a continuous scalar) to gain better controllability over autonomous cars. In Scenario II, the proposed methods are potentially meaningful in some medical applications.
E.g., in medical experiments, an important task is cell counting, where a cell-counting regression model needs to predict the number of cells (i.e., ordinal integers) from a microscopic image. Even with limited microscopic cell images, the proposed CcGAN can generate realistic and diverse synthetic microscopic images for training the regression model. In this way, CcGAN may help save the tedious efforts of medical researchers in gathering microscopic images. In Scenario III, as suggested by AnonReviewer 5 (Q3), CcGAN could be used on some fine-grained image classification datasets, e.g., a bird dataset where birds of different categories may share close similarities. The generated bird images can be used to enhance fine-grained bird image classifiers, and potentially help us better recognize birds and protect the environment. More generally, CcGANs can potentially be used for image generation on regression datasets (associated with scalar labels y). In summary, CcGANs cover a wide range of tasks and applications that could potentially benefit society.



https://www.blender.org/download/releases/2-79/ https://github.com/mit-han-lab/data-efficient-gans



are defined as:
$$\mathcal L(D) = -\mathbb{E}_{y\sim p_r(y)} \mathbb{E}_{x\sim p_r(x|y)}\big[ \log D(x,y) \big] - \mathbb{E}_{y\sim p_g(y)} \mathbb{E}_{x\sim p_g(x|y)}\big[ \log\big(1 - D(x,y)\big) \big]$$

Figure 1: HVE (Eq. (6)) and SVE (Eq. (7)) estimate p(x|y) (a univariate Gaussian conditional on y) using two samples in hard and soft vicinities, respectively, of y. To estimate p(x|y) (the red Gaussian curve) only from samples drawn from p(x|y 1 ) and p(x|y 2 ) (the blue Gaussian curves), estimation is based on the samples (red dots) in a hard vicinity (defined by y ± κ) or a soft vicinity (defined by the weight decay curve) around y. The histograms in blue are samples in the hard or soft vicinity. The labels y 1 , y, and y 2 on the x-axis denote the means of x conditional on y 1 , y, and y 2 , respectively.

which minimizes $\mathcal L$ but may not be in $\mathcal D$. Let $\bar D \triangleq \arg\min_{D\in\mathcal D} \mathcal L(D)$. Let $\widehat D_{HVDL} \triangleq \arg\min_{D\in\mathcal D} \widehat{\mathcal L}^{HVDL}(D)$; similarly, we define $\widehat D_{SVDL}$.

Definition 1. (Hölder Class) Define the Hölder class of functions

(A1) All $D$'s in $\mathcal D$ are measurable and uniformly bounded by $U$. Let $U \triangleq \max\{\sup_{D\in\mathcal D}\sup_{x,y}[-\log D(x,y)],\ \sup_{D\in\mathcal D}\sup_{x,y}[-\log(1-D(x,y))]\}$ and $U < \infty$;
(A2) For all $x \in \mathcal X$ and $y, y' \in \mathcal Y$, there exist $g_r(x) > 0$ and $M_r > 0$ such that $|p_r(x|y') - p_r(x|y)| \le g_r(x)|y'-y|$ with $\int g_r(x)\,dx = M_r$;
(A3) For all $x \in \mathcal X$ and $y, y' \in \mathcal Y$, there exist $g_g(x) > 0$ and $M_g > 0$ such that $|p_g(x|y') - p_g(x|y)| \le g_g(x)|y'-y|$ with $\int g_g(x)\,dx = M_g$;
(A4) $p_r(y) \in \Sigma(L_r)$ and $p_g(y) \in \Sigma(L_g)$.

Figure 2: Visual results for the Circular 2-D Gaussians simulation. (a) shows 1,200 training samples from 120 Gaussians, with 10 samples per Gaussian. In (b) to (d), each GAN generates 100 fake samples at each of 12 means not appearing in the training set, where green and blue dots stand for fake and real samples respectively.

Figure 3: Three RC-49 example images for each of 10 angles: real images and example fake images from cGAN and the two proposed CcGANs, respectively. CcGANs produce chair images with higher visual quality and more diversity.

Figure 4: Three UTKFace example images for each of 10 ages: real images and example fake images from cGAN and the two proposed CcGANs, respectively. CcGANs produce face images with higher visual quality and more diversity.

Figure 5: Line graphs of FID/NIQE versus regression labels on RC-49 and UTKFace. Figs. 5(a) to 5(d) show that two CcGANs consistently outperform cGAN across all regression labels. The graphs of CcGANs also appear smoother than those of cGAN because of HVDL and SVDL.

Please find the code for this paper on GitHub: https://github.com/UBCDingXin/improved_CcGAN

S.2 ALGORITHMS FOR CCGAN TRAINING

Algorithm 1: An algorithm for CcGAN training with the proposed HVDL.
Data: $N^r$ real image-label pairs $\Omega^r = \{(x^r_1, y^r_1), \ldots, (x^r_{N^r}, y^r_{N^r})\}$; $N^r_{uy}$ ordered distinct labels $\Upsilon = \{y^r_{[1]}, \ldots, y^r_{[N^r_{uy}]}\}$ in the dataset; preset $\sigma$ and $\kappa$; number of iterations $K$; discriminator batch size $m_d$; generator batch size $m_g$.
Result: Trained generator $G$.
for $k = 1$ to $K$ do
  Train D;
    Draw $m_d$ labels $\mathcal Y^d$ with replacement from $\Upsilon$;
    Create a set of target labels $\mathcal Y^{d,\epsilon} = \{y_i + \epsilon_i \mid y_i \in \mathcal Y^d,\ \epsilon_i \sim N(0, \sigma^2),\ i = 1, \ldots, m_d\}$ (D training is conditional on these labels);
    Initialize $\Omega^r_d = \emptyset$, $\Omega^f_d = \emptyset$;
    for $i = 1$ to $m_d$ do
      Randomly choose an image-label pair $(x, y) \in \Omega^r$ satisfying $|y - (y_i + \epsilon_i)| \le \kappa$, where $y_i + \epsilon_i \in \mathcal Y^{d,\epsilon}$, and let $\Omega^r_d = \Omega^r_d \cup \{(x, y_i + \epsilon_i)\}$;
      Randomly draw a label $y'$ from $U(y_i + \epsilon_i - \kappa,\ y_i + \epsilon_i + \kappa)$ and generate a fake image $x'$ by evaluating $G(z, y')$, where $z \sim N(0, I)$. Let $\Omega^f_d = \Omega^f_d \cup \{(x', y_i + \epsilon_i)\}$;
    end
    Update D with the samples in $\Omega^r_d$ and $\Omega^f_d$ via gradient-based optimizers based on Eq. (6);
  Train G;
    Draw $m_g$ labels $\mathcal Y^g$ with replacement from $\Upsilon$;
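The discriminator-side sampling in Algorithm 1 can be sketched as below (a simplified illustration with made-up data; the function name `sample_hvdl_batch` and the placeholder images are our own, and the real implementation draws fake images via G(z, y') rather than returning labels):

```python
import random

def sample_hvdl_batch(pairs, distinct_labels, sigma, kappa, m_d):
    """One HVDL discriminator step: draw noisy target labels and vicinal real/fake pairs.

    pairs: list of (image, label) tuples; images are placeholders here.
    Returns (real_batch, fake_labels), index-aligned on the same target labels.
    """
    real_batch, fake_labels = [], []
    # Target labels: distinct training labels perturbed by Gaussian noise.
    targets = [random.choice(distinct_labels) + random.gauss(0.0, sigma) for _ in range(m_d)]
    for t in targets:
        # Real sample: any training pair whose label lies in the hard vicinity of t.
        vicinity = [(x, y) for (x, y) in pairs if abs(y - t) <= kappa]
        if not vicinity:
            continue  # no real sample in the vicinity; skip this target label
        x, _ = random.choice(vicinity)
        real_batch.append((x, t))
        # Fake-sample label: drawn uniformly from the vicinity, to be fed to G(z, y').
        fake_labels.append(random.uniform(t - kappa, t + kappa))
    return real_batch, fake_labels

random.seed(0)
pairs = [(f"img_{i}", i / 100.0) for i in range(100)]  # toy image-label pairs
distinct = sorted({y for _, y in pairs})
reals, fakes = sample_hvdl_batch(pairs, distinct, sigma=0.047, kappa=0.02, m_d=8)
print(len(reals), len(fakes))
```

Note how every fake label lies within κ of its target label, mirroring the `U(y_i + ε_i − κ, y_i + ε_i + κ)` draw in Algorithm 1.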

An algorithm for CcGAN training with the proposed SVDL.
Data: $N^r$ real image-label pairs $\Omega^r = \{(x^r_1, y^r_1), \ldots, (x^r_{N^r}, y^r_{N^r})\}$; $N^r_{uy}$ ordered distinct labels $\Upsilon = \{y^r_{[1]}, \ldots, y^r_{[N^r_{uy}]}\}$ in the dataset; preset $\sigma$ and $\nu$; number of iterations $K$; discriminator batch size $m_d$; generator batch size $m_g$.
Result: Trained generator $G$.
for $k = 1$ to $K$ do
  Train D;
    Draw $m_d$ labels $\mathcal Y^d$ with replacement from $\Upsilon$;
    Create a set of target labels $\mathcal Y$

Figure S.3.6: The label input method for the generator in CcGAN.

$$D^*(x,y) = \frac{p_r(x,y)}{p_r(x,y) + p_g(x,y)}. \quad \text{(S.15)}$$
However, $D^*$ may not be covered by the hypothesis space $\mathcal D$. $\bar D$ is the minimizer of $\mathcal L$ in the hypothesis space $\mathcal D$; thus, $\mathcal L(\bar D) - \mathcal L(D^*)$ should be a non-negative constant. In CcGAN, we minimize $\widehat{\mathcal L}^{HVDL}(D)$ or $\widehat{\mathcal L}^{SVDL}(D)$ with respect to $D \in \mathcal D$, so we are more interested in the distances of $\widehat D_{HVDL}$ and $\widehat D_{SVDL}$ from $D^*$, i.e., $\mathcal L(\widehat D_{HVDL}) - \mathcal L(D^*)$ and $\mathcal L(\widehat D_{SVDL}) - \mathcal L(D^*)$.

S.5.2 PROOFS OF THEOREMS 1 AND 2

S.5.2.1 TECHNICAL LEMMAS

Before we move to the proofs of Theorems 1 and 2, we provide several technical lemmas used in the later proofs.

For the second term, by the definition of $p^{y,\kappa}_r(x)$ and defining $p^\kappa(y') = \frac{\mathbb{1}_{\{|y'-y|\le\kappa\}}\, p(y')}{\int \mathbb{1}_{\{|y'-y|\le\kappa\}}\, p(y')\, dy'}$, we have
$$\sup_{D\in\mathcal D} \left| \mathbb{E}_{x\sim p^{y,\kappa}_r(x)}\big[ -\log D(x,y) \big] - \mathbb{E}_{x\sim p_r(x|y)}\big[ -\log D(x,y) \big] \right| \le \frac{U}{2} \int \left| p^{y,\kappa}_r(x) - p_r(x|y) \right| dx \quad \text{(S.18)}$$
(by the definition of total variation and the boundedness of $-\log D$). Then, focusing on $|p^{y,\kappa}_r(x) - p_r(x|y)|$,
$$\left| p^{y,\kappa}_r(x) - p_r(x|y) \right| = \left| \int p(x|y')\, p^\kappa(y')\, dy' - p(x|y) \right| \le \int \left| p(x|y') - p(x|y) \right| p^\kappa(y')\, dy' \le \int g_r(x)\, |y'-y|\, p^\kappa(y')\, dy' \le \kappa\, g_r(x) \quad \text{(by (A2))}.$$
Thus, Eq. (S.18) is upper bounded as follows:
$$\sup_{D\in\mathcal D} \left| \mathbb{E}_{x\sim p^{y,\kappa}_r(x)}\big[ -\log D(x,y) \big] - \mathbb{E}_{x\sim p_r(x|y)}\big[ -\log D(x,y) \big] \right| \le \frac{U\kappa}{2} \int g_r(x)\, dx = \frac{\kappa U M_r}{2} \quad \text{(by (A2))}. \quad \text{(S.19)}$$

Consider the second term on the RHS of Eq. (S.22). By (A1), $|f| < U$, so
$$\sup_{f\in\mathcal F} \left| \mathbb{E}_{x\sim p^{y,w_r}_r(x)}\big[ f(x,y) \big] - \mathbb{E}_{x\sim p_r(x|y)}\big[ f(x,y) \big] \right| \le \frac{U}{2} \int \left| p^{y,w_r}_r(x) - p_r(x|y) \right| dx.$$
Note that, by the definitions $p^{y,w_r}_r(x) \triangleq \int p_r(x|y')\, \frac{w_r(y',y)\, p_r(y')}{W_r(y)}\, dy'$ and $p^r_w(y'|y) \triangleq \frac{w_r(y',y)\, p_r(y')}{W_r(y)}$,
$$\left| p^{y,w_r}_r(x) - p_r(x|y) \right| = \left| \int p_r(x|y')\, p^r_w(y'|y)\, dy' - p_r(x|y) \right| \le \int \left| p_r(x|y') - p_r(x|y) \right| p^r_w(y'|y)\, dy'.$$
By (A2) and $y \in [0,1]$, the above is upper bounded by $g_r(x)\, \mathbb{E}_{y'\sim p^r_w(y'|y)}\big[ |y'-y| \big]$. Thus,
$$\sup_{f\in\mathcal F} \left| \mathbb{E}_{x\sim p^{y,w_r}_r(x)}\big[ f(x,y) \big] - \mathbb{E}_{x\sim p_r(x|y)}\big[ f(x,y) \big] \right| \le \frac{U}{2} \int g_r(x)\, \mathbb{E}_{y'\sim p^r_w(y'|y)}\big[ |y'-y| \big]\, dx = \frac{U M_r}{2}\, \mathbb{E}_{y'\sim p^r_w(y'|y)}\big[ |y'-y| \big].$$

and $c$, where $C$ depends on $\delta$ and $c = L \int K(s)\, |s|^2\, ds$. Since, in this work, $K$ is chosen as the Gaussian kernel, $c = L \int K(s)\, |s|^2\, ds = L$.
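For the standard Gaussian kernel, the second-moment integral indeed equals one, which is why $c$ reduces to $L$ here; a short check:

```latex
\int K(s)\,|s|^2\,ds
  = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-s^2/2}\, s^2\, ds
  = \operatorname{Var}(Z) = 1, \qquad Z \sim \mathcal N(0,1),
\quad\Longrightarrow\quad
c = L \int K(s)\,|s|^2\,ds = L .
```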

$$\begin{aligned} \sup_{D\in\mathcal D} \left| \widehat{\mathcal L}^{HVDL}(D) - \mathcal L(D) \right| \le\ & \sup_{D\in\mathcal D} \left| \iint \left[ -\log D(x,y) \right] p_r(x|y)\, dx\, \big( p_r(y) - \hat p^{KDE}_r(y) \big)\, dy \right| \\ & + \sup_{D\in\mathcal D} \left| \iint \left[ -\log(1-D(x,y)) \right] p_g(x|y)\, dx\, \big( p_g(y) - \hat p^{KDE}_g(y) \big)\, dy \right| \\ & + \sup_{D\in\mathcal D} \left| \int \left( \frac{1}{N^r_{y,\kappa}} \sum_{i=1}^{N^r} \mathbb{1}_{\{|y-y^r_i|\le\kappa\}} \left[ -\log D(x^r_i,y) \right] - \mathbb{E}_{x\sim p_r(x|y)}\left[ -\log D(x,y) \right] \right) \hat p^{KDE}_r(y)\, dy \right| \\ & + \sup_{D\in\mathcal D} \left| \int \left( \frac{1}{N^g_{y,\kappa}} \sum_{i=1}^{N^g} \mathbb{1}_{\{|y-y^g_i|\le\kappa\}} \left[ -\log(1-D(x^g_i,y)) \right] - \mathbb{E}_{x\sim p_g(x|y)}\left[ -\log(1-D(x,y)) \right] \right) \hat p^{KDE}_g(y)\, dy \right|. \end{aligned}$$
These four terms on the RHS can be bounded separately as follows.
1. The first term can be bounded by using Theorem S.3 and the boundedness of $D$ and $y \in [0,1]$: for all $\delta_1 \in (0,1)$, with probability at least $1-\delta_1$,
$$\sup_{D\in\mathcal D} \left| \iint \left[ -\log D(x,y) \right] p_r(x|y)\, dx\, \big( p_r(y) - \hat p^{KDE}_r(y) \big)\, dy \right| \le U \left( C^{KDE}_{1,\delta_1} \sqrt{\frac{\log N^r}{N^r \sigma}} + L_r \sigma^2 \right). \quad \text{(S.33)}$$

(a) Generator
z ∈ R^2, z ∼ N(0, I); y ∈ R
concat(z, sin(y), cos(y)) ∈ R^4
fc → 100; BN; ReLU (×6)
fc → 2

(b) Discriminator
A sample x ∈ R^2 with label y ∈ R
concat(x, sin(y), cos(y)) ∈ R^4
fc → 100; ReLU (×5)
fc → 1; Sigmoid

S.6.3 TESTING SETUPS

When evaluating the trained cGAN, if a test label y' is unseen in the training set, we first find its closest seen label y. Then, we generate samples from the trained cGAN at y instead of at y'. In contrast, generating samples from CcGAN at unseen labels is well-defined.

S.6.4 EXTRA EXPERIMENTS

S.6.4.1 VARYING NUMBER OF GAUSSIANS FOR TRAINING DATA GENERATION

In this section, we study the influence of the number of Gaussians used for training data generation on the performance of cGAN and CcGAN. We vary the number of Gaussians from 120 to 10 with a step size of 10, keep the other settings in Section 4.1 unchanged, and plot the line graphs of 2-Wasserstein Distance (log scale) versus the number of Gaussians in Fig. S.6.1. Reducing the number of Gaussians for training implies a larger gap between any two consecutive distinct angles in the training set.
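The nearest-seen-label rule used when evaluating cGAN (Supp. S.6.3) can be sketched as follows (our own illustration; the name `nearest_seen_label` and the toy label list are assumptions):

```python
import bisect

def nearest_seen_label(y_test, seen_labels):
    """Map a test label to the closest label seen during training.

    seen_labels must be sorted; binary search keeps the lookup O(log n).
    """
    i = bisect.bisect_left(seen_labels, y_test)
    # Only the immediate left and right neighbors can be the closest label.
    candidates = seen_labels[max(0, i - 1):i + 1]
    return min(candidates, key=lambda y: abs(y - y_test))

seen = [0.1, 0.2, 0.4, 0.8]            # distinct training labels (sorted)
print(nearest_seen_label(0.33, seen))  # 0.4 (closest seen label to 0.33)
```

CcGAN needs no such lookup, since its generator accepts any scalar label directly.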

Figure S.6.1: Line graphs of 2-Wasserstein Distance (log scale) versus the number of Gaussians for training data generation.As the number of Gaussians decreases, the continuous scenario gradually degenerates to the categorical scenario, therefore the assumption that a small perturbation to y results in a negligible change to p(x|y) is no longer satisfied. Consequently, the 2-Wasserstein distances of two CcGAN methods gradually increase and eventually surpass the 2-Wasserstein distance of cGAN when the number of Gaussians is small (e.g., less than 40).

We present some interpolation results of the two CcGAN methods (i.e., HVDL and SVDL). For an input pair (z, y), we fix the noise z but perform label-wise interpolations, i.e., we vary the label y from 4.5 to 85.5. Clearly, all generated images are visually realistic, and we can see the chair distribution change smoothly over continuous angles. Please note that Fig. S.7.3 is meant to show the smooth change of the chair distribution rather than of one single chair, so the chair type may change over angles. This confirms that CcGAN is capable of capturing the underlying conditional image distribution rather than simply memorizing the training data.

Figure S.7.2: Line graphs of FID/NIQE/Diversity versus yaw angles on RC-49. Figs. S.7.2(a) to S.7.2(c) show that the two CcGANs consistently outperform cGAN across all angles. The graphs of CcGANs also appear smoother than those of cGAN because of HVDL and SVDL.

Figure S.7.4: Some example RC-49 fake images from a degenerated CcGAN.

Figure S.7.5: Example RC-49 fake images from cGAN when we bin the yaw angle range into different numbers of classes.

Figure S.7.6: Line graphs of Intra-FID versus the sample size for each distinct training angle. The grey vertical dashed line stands for the sample size used in the main study of the RC-49 experiment in Section 4.2. The two CcGAN methods substantially outperform cGAN regardless of the sample size for each distinct angle in the training set. The overall trend in this figure shows that a smaller sample size deteriorates the performance of both cGAN and CcGAN.

Figure S.7.7: Line graphs of Intra-FID versus the standard deviation of Gaussian noise. The overall trend in the figure shows that the performance of the two CcGAN methods deteriorates as the standard deviation increases.

Figure S.8.8: The histogram of the UTKFace dataset with ages ranging from 1 to 60.

Figure S.8.9: Line graphs of FID/NIQE/Diversity versus ages on UTKFace. Figs. S.8.9(a) to S.8.9(c) show that two CcGANs consistently outperform cGAN across almost all ages. The graphs of CcGANs also appear smoother than those of cGAN because of HVDL and SVDL.

Figure S.8.10: Some examples of generated UTKFace images from CcGAN when the discriminator is trained with HVDL and SVDL. We fix the noise z but vary the label y from 3 to 57.

Figure S.8.12: Example UTKFace fake images from cGAN when we bin the age range into different numbers of classes.

The histogram in Fig. S.8.8 shows that the UTKFace dataset is highly imbalanced. To balance the training data and also test the performance of cGAN and CcGAN under smaller sample sizes, we vary the maximum sample size for each distinct age in the training set from 200 to 50. Note that, in the main study in Section 4.3, we do not restrict the maximum sample size. Since we have a much smaller sample size, we reduce the number of iterations for GAN training from 40,000 to 20,000 and slightly increase m_κ in Supp. S.4 from 1 to 2 (we therefore use a wider hard/soft vicinity). We visualize the line graphs of Intra-FID versus the maximum sample size for each age of cGAN and CcGAN in Fig. S.8.13. From the figure, we can clearly see that a smaller sample size worsens the performance of both cGAN and CcGAN. Moreover, the Intra-FID scores of cGAN always stay at a high level and are much larger than those of the two CcGAN methods.

Figure S.8.13: Line graphs of Intra-FID versus the maximum sample size for each distinct age in the training set.

Figure S.8.14: Some example UTKFace fake images from cGAN+DiffAugment. Even with the help of DiffAugment, cGAN still has poor visual quality in the continuous scenario.

First, without loss of generality, we assume y ∈ [0, 1]. Then, we introduce some notation. Let $\mathcal D$ stand for the hypothesis space of $D$. Let $\hat p^{KDE}$

Average quality of 179,800 fake RC-49 images from cGAN and CcGAN, with standard deviations after the "±" symbol. "↓" ("↑") indicates lower (higher) values are preferred.

In this section, we compare CcGAN and cGAN on UTKFace (Zhang et al., 2017), a dataset consisting of RGB images of human faces labeled by age.

Experimental setup: In this experiment, we only use images with ages in [1, 60]. Some images with bad visual quality or watermarks are also discarded. After preprocessing, 14,760 images are left. The number of images for each age ranges from 50 to 1,051. We resize all selected images to 64 × 64. Some example UTKFace images are shown in the first image array in Fig. 4.

Under condition (A4), if the KDEs are based on n i.i.d. samples from p r /p g and a bandwidth σ, for all δ ∈ (0, 1), with probability at least 1 -δ,

S.6 DETAILS OF THE SIMULATION IN SECTION 4.1

S.6.1 NETWORK ARCHITECTURES

Please refer to Table S.6.1 and Table S.6.2 for the network architectures adopted for cGAN and CcGAN in our simulation experiments.

Table S.6.2: Network architectures for the generator and discriminator of our proposed CcGAN in the simulation.

S.8.2 NETWORK ARCHITECTURES

The network architectures used in this experiment are similar to those in the RC-49 experiment. Please refer to our code for more details about the network specifications.

S.8.3 TRAINING SETUPS

The cGAN and CcGAN are trained for 40,000 iterations on the training set with the Adam (Kingma & Ba, 2015) optimizer (with β1 = 0.5 and β2 = 0.999), a constant learning rate of 10^-4, and a batch size of 512. The rule-of-thumb formulae in Supp. S.4 are used to select the hyper-parameters for HVDL and SVDL, where we let m_κ = 1.

S.8.4 PERFORMANCE MEASURES

Similar to the RC-49 experiment, we evaluate the quality of fake images by Intra-FID, NIQE, Diversity, and Label Score. We also train an AE (bottleneck dimension 512), a classification CNN, and a regression CNN on all images. Please note that the UTKFace dataset consists of face images from 5 races, based on which we train the classification CNN. The AE and both CNNs are trained for 200 epochs with a batch size of 256.

The results are summarized in Table S.8.3. From Table S.8.3, we can see that the two CcGAN methods are still much better than cGAN in terms of all metrics except Label Score, since CcGAN is designed to sacrifice some (not too much) label consistency for much better visual quality and diversity.

Table S.8.3: Training cGAN and CcGAN on images of odd ages only and testing them on even ages. Columns follow the metrics in Supp. S.8.4 (Intra-FID ↓, NIQE ↓, Diversity ↑, Label Score ↓).
CcGAN (HVDL): 0.724 ± 0.161, 1.795 ± 0.230, 1.133 ± 0.257, 10.341 ± 3.931
CcGAN (SVDL): 0.777 ± 0.248, 1.803 ± 0.214, 1.257 ± 0.112, 13.141 ± 5.862

S.8.5.6 TRAINING WITH SMALLER SAMPLE SIZES

To support our arguments, we incorporate DiffAugment into the cGAN training in the UTKFace experiment while keeping the other settings unchanged. When implementing DiffAugment, we use the official code from the GitHub repository of DiffAugment. The strongest transformation combination (Color + Translation + Cutout) is used in the cGAN training. Quantitative results for cGAN+DiffAugment are summarized in Table S.8.4, and some example images from cGAN+DiffAugment are shown in Fig. S.8.14. The quantitative results show that DiffAugment substantially improves the visual quality and diversity of the baseline cGAN; however, the performance of cGAN+DiffAugment is still much worse than that of the two proposed CcGAN methods. The visual results also support the quantitative evaluations. Therefore, cGAN+DiffAugment still does not solve the two fundamental problems in the continuous scenario, since it is not designed for this purpose.

Table S.8.4: Average quality of 60,000 fake UTKFace images from cGAN and CcGAN, with standard deviations after the "±" symbol. "↓" ("↑") indicates lower (higher) values are preferred. Columns: Intra-FID ↓, NIQE ↓, Diversity ↑, Label Score ↓.
cGAN (60 classes): 4.516 ± 0.965, 2.315 ± 0.306, 0.254 ± 0.353, 11.087 ± 8.119
cGAN (60 classes) + DiffAugment: 1.328 ± 0.156, 2.077 ± 0.245, 1.102 ± 0.183, 11.212 ± 8.329
CcGAN (HVDL): 0.572 ± 0.167, 1.739 ± 0.145, 1.338 ± 0.178, 9.782 ± 7.166
CcGAN (SVDL): 0.547 ± 0.181, 1.753 ± 0.196, 1.326 ± 0.198, 10.739 ± 8.340

ACKNOWLEDGMENTS

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grants CRDPJ 476594-14, RGPIN-2019-05019, and RGPAS2017-507965. 

