DEEP GENERATIVE MODELING ON LIMITED DATA WITH REGULARIZATION BY NONTRANSFERABLE PRE-TRAINED MODELS

Abstract

Deep generative models (DGMs) are data-hungry: learning a complex model on limited data suffers from a large variance and easily overfits. Inspired by the classical perspective of the bias-variance tradeoff, we propose the regularized deep generative model (Reg-DGM), which leverages a nontransferable pre-trained model to reduce the variance of generative modeling with limited data. Formally, Reg-DGM optimizes a weighted sum of a certain divergence and the expectation of an energy function, where the divergence is between the data and model distributions, and the energy function is defined by the pre-trained model w.r.t. the model distribution. We analyze a simple yet representative Gaussian-fitting case to demonstrate how the weighting hyperparameter trades off the bias and the variance. Theoretically, we characterize the existence and uniqueness of the global minimum of Reg-DGM in a non-parametric setting and prove its convergence with neural networks trained by gradient-based methods. Empirically, with various pre-trained feature extractors and a data-dependent energy function, Reg-DGM consistently improves the generation performance of strong DGMs with limited data and achieves results competitive with state-of-the-art methods. Our implementation is available at https://github.com/ML-GSAI/Reg-ADA-APA.

1. INTRODUCTION

Deep generative models (DGMs) (Kingma & Welling, 2013; Goodfellow et al., 2014; Sohl-Dickstein et al., 2015; Van den Oord et al., 2016; Dinh et al., 2016; Hinton & Salakhutdinov, 2006) employ neural networks to capture the underlying distribution of high-dimensional data and find applications in various learning tasks (Kingma et al., 2014; Zhu et al., 2017; Razavi et al., 2019; Ramesh et al., 2021; 2022; Ho et al., 2022). Such models are often data-hungry (Li et al., 2021; Wang et al., 2018) due to the presence of complex function classes. Recent work (Karras et al., 2020a) found that classical variants of generative adversarial networks (GANs) (Goodfellow et al., 2014; Karras et al., 2020b) produce poor samples with limited data, a problem shared by other DGMs in principle. Thus, improving sample efficiency is a common challenge for DGMs. The root cause of the problem is that learning a model in a complex class on limited data suffers from a large variance and easily overfits the training data (Mohri et al., 2018). To relieve the problem, previous work either employed sophisticated data augmentation strategies (Zhao et al., 2020a; Karras et al., 2020a; Jiang et al., 2021), designed new losses for the discriminator in GANs (Cui et al., 2021; Yang et al., 2021), or transferred a pre-trained DGM (Wang et al., 2018; Noguchi & Harada, 2019; Mo et al., 2020). Although, to our knowledge, this has not been pointed out in the literature, prior work can be understood as implicitly reducing the variance of the estimate (Mohri et al., 2018). In this perspective, we propose a complementary framework, named regularized deep generative model (Reg-DGM), which employs a pre-trained model as regularization to achieve a better bias-variance tradeoff when training a DGM with limited data. In Sec. 2, we formulate the objective function of Reg-DGM as the sum of a certain divergence and a regularization term weighted by a hyperparameter.
The divergence is between the data distribution and the model distribution, and the regularization term can be understood as the negative expected log-likelihood of an energy-based model, whose energy function is defined by a pre-trained model, w.r.t. the model distribution. Intuitively, with an appropriate weighting hyperparameter, Reg-DGM balances between the data distribution and the pre-trained model to achieve a better bias-variance tradeoff with limited data, as validated by a simple yet prototypical Gaussian-fitting example. In Sec. 3, we characterize the optimization behavior of Reg-DGM in both non-parametric and parametric settings under mild regularity conditions. On one hand, we prove the existence and uniqueness of the global minimum of the regularized optimization problem with the Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences in the non-parametric setting. On the other hand, we prove that, parameterized by a standard neural network architecture, Reg-DGM converges with high probability to a global (or local) optimum when trained by gradient-based methods. In Sec. 4, we specify the components of Reg-DGM. We employ strong variants of GANs (Karras et al., 2020b; a; Jiang et al., 2021) as the base DGM for broader interest. We consider a nontransferable setting where the pre-trained model does not necessarily have to be a generative model. Indeed, we employ several feature extractors, which are trained for non-generative tasks such as image classification, representation learning, and face recognition. Notably, such models cannot be directly used in the fine-tuning approaches (Wang et al., 2018; Mo et al., 2020). With a comprehensive ablation study, we define our default energy function as the expected mean squared error between the features of the generated samples and the training data. Such an energy not only fits our theoretical analyses but also results in consistent and significant improvements over baselines in all settings. In Sec. 5, we present experiments on several benchmarks, including the FFHQ (Karras et al., 2019), LSUN CAT (Yu et al., 2015), and CIFAR-10 (Krizhevsky et al., 2009) datasets. We compare Reg-DGM with a large family of methods, including the base DGMs (Karras et al., 2020b; a; Jiang et al., 2021), the transfer-based approaches (Wang et al., 2018; Mo et al., 2020), the augmentation-based methods (Zhao et al., 2020a), and others (Cui et al., 2021; Yang et al., 2021). With a classifier pre-trained on ImageNet (Deng et al., 2009), an image encoder pre-trained on the CLIP dataset (Radford et al., 2021), or a face recognizer pre-trained on VGGFace2 (Cao et al., 2018), Reg-DGM consistently improves strong DGMs under commonly used performance measures with limited data and achieves results competitive with state-of-the-art methods. Our results demonstrate that Reg-DGM can achieve a good bias-variance tradeoff in practice, which supports our motivation. In Sec. 6, we present related work. In Sec. 7, we conclude the paper and discuss limitations.

2. METHOD

The goal of generative modeling is to learn a model distribution p_g (implicitly or explicitly) from a training set S = {x_i}_{i=1}^m of size m on a sample space X. The elements in S are assumed to be drawn i.i.d. according to an unknown data distribution p_d ∈ P_X, where P_X is the set of all valid distributions over X. A general formulation for learning generative models is minimizing a certain statistical divergence D(·||·) between the two distributions as follows:

min_{p_g ∈ H} D(p_d || p_g),   (1)

where H ⊂ P_X is the hypothesis class, for instance, a set of distributions defined by neural networks in a deep generative model (DGM). Notably, the divergence in Eq. (1) is estimated by the Monte Carlo method over the training set S, and its solution has a small bias if not zero. However, learning a DGM with limited data is challenging because solving the problem in Eq. (1) with a small sample size m essentially suffers from a large variance and probably overfits (Mohri et al., 2018). Inspired by the classical perspective of the bias-variance tradeoff, we propose to leverage an external model f pre-trained on a related and large dataset (e.g., ImageNet) as a data-dependent regularization to reduce the variance of training a DGM on limited data (e.g., CIFAR-10), which is complementary to prior work (see more details in Sec. 6). In particular, given a pre-trained f, we first introduce an energy function (LeCun et al., 2006) E_f : X → R that satisfies mild regularity conditions (see Assumption A.1) and then define a probabilistic energy-based model (EBM) p_f as p_f(x) ∝ exp(−E_f(x)). We can treat p_f as a special estimate of p_d if f is pre-trained on a dataset closely related to p_d. In this perspective, p_f has a potentially large bias yet zero variance. To trade off the bias and the variance, we simply optimize a weighted sum of the statistical divergence as in Eq. (1) and the expected log-likelihood of the EBM as follows:

min_{p_g ∈ H} D(p_d || p_g) − λ E_{x∼p_g}[log p_f(x)]  ⇔  min_{p_g ∈ H} D(p_d || p_g) + λ E_{x∼p_g}[E_f(x)],   (2)

where λ > 0 is the weighting hyperparameter balancing the two terms. The first term encourages p_g to fit the data as in the original DGM, and the second term encourages p_g to produce samples with a high likelihood evaluated by the EBM p_f. Naturally, a properly selected λ can hopefully achieve a better bias-variance tradeoff. We refer to our approach as the regularized deep generative model (Reg-DGM) when H consists of distributions defined by neural networks. In the following, we analyze a simple yet prototypical Gaussian-fitting example to demonstrate how a regularization term defined by a pre-trained model can relieve the bias-variance dilemma in generative modeling. Such an example is helpful to illustrate the motivation of Reg-DGM precisely and provides valuable insights into its practical performance. In the same spirit, representative prior work (Arjovsky et al., 2017; Mescheder et al., 2018) in the DGM literature has investigated similar examples with several parameters and closed-form solutions.
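As a minimal illustration of the objective in Eq. (2), the sketch below combines an already-computed divergence estimate with a Monte Carlo estimate of the energy term. The function name `reg_dgm_loss` and the toy quadratic energy are our own assumptions for exposition, not part of the paper's implementation:

```python
import numpy as np

def reg_dgm_loss(divergence, energy_fn, samples_from_pg, lam):
    """Eq. (2): D(p_d || p_g) + lam * E_{x ~ p_g}[E_f(x)].

    `divergence` is any Monte Carlo estimate of D(p_d || p_g); the
    expectation of the energy is estimated over generator samples.
    """
    reg = np.mean([energy_fn(x) for x in samples_from_pg])
    return divergence + lam * reg

# toy usage: quadratic energy, pretending the divergence estimate is 1.0
samples = np.array([0.5, -0.5, 1.0])
loss = reg_dgm_loss(1.0, lambda x: x**2, samples, lam=0.1)  # 1.0 + 0.1 * 0.5
```

In practice the divergence term is produced by the base DGM (e.g., a GAN discriminator), and only the weighted energy term is added on top.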

2.1. A PROTOTYPICAL GAUSSIAN-FITTING EXAMPLE

Example 2.1 (Gaussian-fitting example). The data distribution is a (univariate) Gaussian p_d(x) = N(x|μ*, σ²), where σ² is known and μ* is the parameter to be estimated. A training sample S = {x_i}_{i=1}^m is drawn i.i.d. according to p_d(x). The hypothesis class for p_g is H = {N(x|μ, σ²) | μ ∈ R}. The regularization term in Eq. (2) is defined by E_f(x) := −log N(x|μ̂_PRE, σ²), i.e., p_f(x) = N(x|μ̂_PRE, σ²). Note that we let μ̂_PRE ≠ μ*, and their gap |μ̂_PRE − μ*| can be large in general. For simplicity, we consider classical maximum likelihood estimation (MLE) (i.e., using the KL divergence in Eq. (1) as the performance measure), whose solution for Example 2.1 is given by the sample mean (Bishop & Nasrabadi, 2006):

μ̂_MLE = (1/m) Σ_{i=1}^m x_i,  μ̂_MLE ∼ N(μ*, σ²/m).

The pre-trained model μ̂_PRE is another meaningful baseline for our approach. Clearly, it has a bias of μ̂_PRE − μ* and zero variance. Formally, the solution of our approach based on MLE is

μ̂_REG = 1/(1+λ) μ̂_MLE + λ/(1+λ) μ̂_PRE,  μ̂_REG ∼ N( 1/(1+λ) μ* + λ/(1+λ) μ̂_PRE, σ²/(m(1+λ)²) ).

We compare all estimators under the mean squared error (MSE), which is a common measure in statistics and machine learning. Formally, the MSE of an estimator θ̂ w.r.t. an unknown parameter θ is defined as MSE[θ̂] := E[(θ̂ − θ)²] (Bishop & Nasrabadi, 2006). As summarized in Proposition 2.2, our method achieves a better bias-variance tradeoff than the baselines if λ is in an appropriate range.

Proposition 2.2. Let β = λ/(λ+1) be the normalized weight of the regularization term. In the Gaussian-fitting Example 2.1, if

max{ (σ² − m(μ̂_PRE − μ*)²) / (σ² + m(μ̂_PRE − μ*)²), 0 } < β < min{ 2σ² / (σ² + m(μ̂_PRE − μ*)²), 1 },

then the following inequality holds:

MSE[μ̂_REG] < min{MSE[μ̂_MLE], MSE[μ̂_PRE]}.   (3)

Due to the space limit, the proof is deferred to Appendix A.1.1. We plot the MSE curves for all estimators w.r.t. the normalized weight β in Fig. 1.
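As a sanity check, Example 2.1 can be simulated numerically. The sketch below uses hypothetical values (μ* = 0, σ = 1, m = 10, μ̂_PRE = 0.5) and sets λ to the optimal value derived in Appendix A.1.1, then estimates the three MSEs by Monte Carlo and checks the conclusion of Proposition 2.2:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star, sigma, m = 0.0, 1.0, 10       # true mean, known std, sample size
mu_pre = 0.5                            # biased "pre-trained" estimate
lam = sigma**2 / (m * (mu_pre - mu_star)**2)  # optimal lambda (Appendix A.1.1)

trials = 200_000
X = rng.normal(mu_star, sigma, size=(trials, m))
mu_mle = X.mean(axis=1)                               # MLE per trial
mu_reg = mu_mle / (1 + lam) + lam / (1 + lam) * mu_pre  # regularized estimator

mse = lambda est: float(np.mean((est - mu_star) ** 2))
mse_mle, mse_reg = mse(mu_mle), mse(mu_reg)
mse_pre = (mu_pre - mu_star) ** 2                     # zero-variance baseline
```

With these values, the empirical MSE of μ̂_REG falls strictly below both baselines, matching the proposition.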
Since generalization analysis in deep learning is still largely open (Belkin, 2021; Bartlett et al., 2021), it is difficult to extend our Proposition 2.2 to cases with deep models due to the lack of appropriate tools. However, the intuition behind the Gaussian example also holds in deep learning. Namely, training a model on limited data suffers from a large variance (overfitting), and using a pre-trained model suffers from a large bias (underfitting). Thus, Reg-DGM with a proper hyperparameter can balance between them and potentially achieve better performance. Further, according to Fig. 1 (c), Reg-DGM benefits if the EBM defined by the pre-trained model is close to the target distribution, which inspires the data-dependent energy function presented in Sec. 4.

3. ANALYSES

3.1. ANALYSES IN THE NON-PARAMETRIC SETTING

We assume that our hypothesis class contains all valid distributions (i.e., H = P_X) and that the data distribution p_d is accessible. Although the setting is impractical, such analyses characterize the existence and uniqueness of the global minimum in an ideal case and have been widely considered in deep generative models (Goodfellow et al., 2014; Arjovsky et al., 2017). Further, it is insightful to see how the regularization affects the solution of Reg-DGM in the ideal case. Built upon the classical recipe of the calculus of variations and properties of the KL divergence in the topology of weak convergence, we establish our theory on the existence and uniqueness of the global minimum of Eq. (2) with the KL and JS divergences. The results are formally characterized in Theorem 3.1 and Theorem 3.2, respectively. We refer the readers to Appendix A.2.1 for the proofs. To establish our theory, we assume that (1) X is a nonempty compact set; (2) E_f : X → R is continuous and bounded; (3) ∫_X e^{−E_f(x)} dx < ∞. These conditions are common and mild.

Theorem 3.1. Under the mild regularity conditions in Assumption A.1, for any λ > 0, there exists a unique global minimum of the problem in Eq. (2) with the KL divergence. Furthermore, the global minimum is of the form

p*_g(x) = p_d(x) / (α* + λ E_f(x)),

where α* ∈ R.

Theorem 3.2. Under the mild regularity conditions in Assumption A.1, for any λ > 0, there exists a unique global minimum of the problem in Eq. (2) with the JS divergence. Furthermore, the global minimum is of the form

p*_g(x) = p_d(x) / (e^{α* + λ E_f(x)} − 1),

where α* ∈ R.

As shown in Theorem 3.1 and Theorem 3.2, the global minimum is a reweighted data distribution, where the weights are negatively correlated with the energy function defined by the pre-trained model. Qualitatively, the global minimum assigns high density to a sample x if it has high density under the data distribution (i.e., p_d(x)) and a low value of the energy function (i.e., E_f(x)).
Notably, the weights in Theorem 3.1 and Theorem 3.2 differ because of the different divergences. In particular, the effect of the pre-trained model is enlarged by the exponential term in Theorem 3.2 using the JS divergence (JSD). Fig. 2 (c) and (d) show that, with the same value of λ, the weighting coefficients of JSD are distributed over a larger range than those of the KL divergence (KLD). Naturally, in both theorems, as λ → 0, the denominator of p*_g(x) tends to a constant and p*_g(x) → p_d(x), which recovers the solution of pure divergence minimization in Eq. (1). Therefore, Reg-DGM is consistent if the weighting hyperparameter λ is a function of m with lim_{m→∞} λ(m) = 0.
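The form of the global minimum in Theorem 3.1 can be made concrete on a discrete toy problem. The sketch below uses hypothetical probabilities and energies, and solves for α* by bisection, exploiting the monotonicity of the normalizer used in the proof in Appendix A.2.1:

```python
import numpy as np

# discrete toy problem: 4 states with hypothetical probs and energies
p_d = np.array([0.4, 0.3, 0.2, 0.1])
E_f = np.array([0.0, 1.0, 2.0, 3.0])
lam = 0.5

def phi(alpha):  # normalizer of the KL solution p_d / (alpha + lam * E_f), minus 1
    return np.sum(p_d / (alpha + lam * E_f)) - 1.0

# phi is monotonically decreasing, so bisection finds the unique root alpha*
lo, hi = 1e-6, 10.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if phi(mid) > 0 else (lo, mid)
alpha_star = 0.5 * (lo + hi)

p_g = p_d / (alpha_star + lam * E_f)  # Theorem 3.1: reweighted data distribution
```

The resulting p_g sums to one and down-weights high-energy states relative to p_d, illustrating the qualitative behavior described above.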

3.2. ANALYSES IN THE PARAMETRIC SETTINGS

Although the non-parametric setting provides theoretical insights into Reg-DGM, it is far from practice. In fact, in our experiments, the hypothesis class is parameterized by neural networks and the training data is finite. In such a case, Reg-DGM can be formulated as a non-convex optimization problem, which is solved by gradient-based methods. Therefore, we also analyze the convergence of Reg-DGM trained by (stochastic) gradient descent in the presence of neural networks, building upon a general convergence framework (Allen-Zhu et al., 2019). In particular, as summarized in Theorem 3.3, we show that Reg-DGM with a standard neural network architecture converges with high probability under mild smoothness assumptions. The assumptions, result, and proof are formally presented in Appendix A.2.2.

Theorem 3.3 (Convergence of Reg-DGM (informal)). Under standard and verifiable smoothness assumptions, with high probability, Reg-DGM with a sufficiently wide ReLU CNN converges to a global optimum of Eq. (2) when trained by GD and to a local minimum when trained by SGD.

4. IMPLEMENTATION

In this section, we discuss the base DGM, the pre-trained model and the energy function in practice.

4.1. BASE MODEL

Although Reg-DGM applies to variational auto-encoders (VAEs) (Kingma & Welling, 2013) and many other DGMs, we focus on GANs (Goodfellow et al., 2014), which are the most representative and popular models in scenarios with limited data (Karras et al., 2020a; Mo et al., 2020; Cui et al., 2021). Formally, GANs optimize an estimate of the JS divergence via a minimax formulation as follows:

min_G max_D E_{x∼p_d(x)}[log D(x)] + E_{x∼p_g(x)}[log(1 − D(x))],   (4)

where G is a generator that defines p_g(x) and D is a discriminator that estimates the JS divergence by discriminating samples. Both G and D are parameterized by neural networks, and Eq. (4) is estimated by the Monte Carlo method over mini-batches sampled from the training set. For broader interest, we adopt three strong GAN variants, StyleGAN2 (Karras et al., 2020b), adaptive discriminator augmentation (ADA) (Karras et al., 2020a), and adaptive pseudo augmentation (APA) (Jiang et al., 2021), as the base DGMs. Please refer to Appendix B.2 for more details.
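In this GAN instantiation, the regularizer of Eq. (2) simply adds λ times a Monte Carlo estimate of E_{x∼p_g}[E_f(x)] to the generator's loss. The sketch below is a minimal numpy illustration of the loss composition; the function names and the use of raw discriminator logits are our own assumptions, not the paper's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gan_value(d_logits_real, d_logits_fake):
    """Monte Carlo estimate of the minimax objective in Eq. (4)."""
    return (np.mean(np.log(sigmoid(d_logits_real)))
            + np.mean(np.log(1.0 - sigmoid(d_logits_fake))))

def reg_generator_loss(d_logits_fake, energies, lam):
    """Generator loss of Reg-DGM: the GAN term plus lam * E_{x~p_g}[E_f(x)]."""
    return np.mean(np.log(1.0 - sigmoid(d_logits_fake))) + lam * np.mean(energies)

# toy usage: an undecided discriminator (logit 0) and unit energies
g_loss = reg_generator_loss(np.zeros(4), np.ones(4), lam=0.1)  # log(0.5) + 0.1
```

The discriminator's loss is unchanged; only the generator sees the extra energy term.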

4.2. PRE-TRAINED MODEL

As mentioned in Sec. 2, unlike the transfer-based methods (Wang et al., 2018; Mo et al., 2020), Reg-DGM applies to a nontransferable setting where the pre-trained model does not necessarily share the same architecture or formulation as p_g, and does not even have to be a generative model, enjoying flexibility in the choice of pre-trained models. In our implementation, the pre-trained model is a feature extractor f : X → R^d, which is trained for other tasks (e.g., classification or contrastive representation learning) instead of generation. We choose such models because they are easily available and achieve excellent performance in supervised learning. In particular, we investigate three prototypical pre-trained models: a ResNet (He et al., 2015) trained in a supervised manner on ImageNet, a CLIP image encoder (Radford et al., 2021) trained in a self-supervised manner on a large-scale image-text dataset, and a FaceNet (Schroff et al., 2015) trained on a face recognition dataset. Please refer to Appendix B.2 for more details. Note that such models are nontransferable in the fine-tuning manner (Hinton et al., 2006). Nevertheless, with such models and the data-dependent energy function presented later, Reg-DGM is still competitive with the transfer-based approaches (Wang et al., 2018; Mo et al., 2020), as shown in Tab. 1.

4.3. ENERGY FUNCTION

According to the results in Fig. 1 (c) and our intuition, we should define E_f such that p_f is as close to p_d as possible. In most cases, f is pre-trained on a dataset with richer semantics than p_d. Therefore, it is necessary to involve the training data (sampled from p_d) in the energy function to reduce the distance between p_f and p_d. As presented above, we specify f as a feature extractor, and it is natural to match the features of samples from p_g and p_d in the energy function. Formally, the energy function is defined as the expected mean squared error between the features of a generated sample and a training sample:

E_f(x) := E_{x′∼p_d}[ (1/d) ||f(x) − f(x′)||²₂ ].   (5)

Notably, our implementation with the data-dependent energy function in Eq. (5) is a valid instance of the general Reg-DGM framework as formulated in Eq. (2). Furthermore, the convergence results in both the non-parametric and parametric settings (see Sec. 3) hold in this case. The expectation is estimated by the Monte Carlo method with a single sample for efficiency by default, and increasing the number of samples does not affect the performance significantly (see results in Appendix C.4). We emphasize that our main contribution is not designing a specific energy function but the general framework of Reg-DGM. Many alternative energy functions can be employed in Reg-DGM. Indeed, we perform a systematic ablation study of the energy functions in Sec. 5.3 and find that Eq. (5) is the best among them considering the qualitative and quantitative results together. Moreover, we evaluate the effectiveness of Reg-DGM implemented by Eq. (5) with strong base DGMs (Karras et al., 2020b; a; Jiang et al., 2021), different datasets, different backbones of f, and different pre-training datasets in the experiments. We observe a consistent and significant improvement over SOTA baselines across various settings.
Based on such a comprehensive empirical study, we believe that our implementation of Reg-DGM based on Eq. ( 5) would be effective in new settings.
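A minimal sketch of the energy in Eq. (5), assuming `f` is any feature extractor mapping a sample to a d-dimensional vector; the single-sample Monte Carlo estimate of the expectation over p_d mirrors the default setting described above:

```python
import numpy as np

def energy(f, x, x_data, rng):
    """Eq. (5) with a single-sample Monte Carlo estimate of E_{x'~p_d}:
    (1/d) * ||f(x) - f(x')||_2^2 for one training sample x'."""
    x_prime = x_data[rng.integers(len(x_data))]  # draw one training sample
    diff = f(x) - f(x_prime)
    return float(np.mean(diff ** 2))

# toy usage: identity features (d = 2) and a single training point
rng = np.random.default_rng(0)
x_data = np.array([[3.0, 1.0]])
e = energy(lambda z: z, np.array([1.0, 1.0]), x_data, rng)  # ((1-3)^2 + 0) / 2
```

In the full method, `f` would be a frozen pre-trained network and `x` a generated image, with the energy back-propagated into the generator.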

5. EXPERIMENTS

For a fair comparison with a large family of prior work, we evaluate Reg-DGM on several widely adopted benchmarks with limited data (Karras et al., 2020a), including the FFHQ (Karras et al., 2019), 100-shot Obama (Zhao et al., 2020a), LSUN CAT (Yu et al., 2015), and CIFAR-10 (Krizhevsky et al., 2009) datasets; the data processing and metric computation are the same as those of ADA (Karras et al., 2020a). We present the main results and analyses in this section and refer the readers to Appendix B.2 for details and Appendix C for additional results. Throughout the section, we refer to our approach by the name of the base DGM with the prefix "Reg-". For instance, "Reg-ADA" denotes our approach with ADA (Karras et al., 2020a) as the base DGM.

5.1. MAIN RESULTS

Quantitatively, we compare Reg-DGM with a large family of existing methods, including the base DGMs, the transfer-based approaches, the augmentation-based methods, and many others. Following the direct and strong competitor (Karras et al., 2020a), we report the median Fréchet inception distance (FID) (Heusel et al., 2017) on FFHQ and LSUN CAT, and the mean FID on CIFAR-10, out of 3 runs for a fair comparison in Tab. 1. For completeness, we also report the mean FID with the standard deviation on FFHQ and LSUN CAT in Appendix C.1. As shown in Tab. 1, Reg-StyleGAN2, Reg-ADA, and Reg-ADA-APA consistently outperform the corresponding base DGMs in five settings, demonstrating that Reg-DGM can achieve a good bias-variance tradeoff in practice. Besides, the superior performance of Reg-ADA over ADA (and Reg-ADA-APA over ADA-APA) shows that our contribution is orthogonal to the augmentation-based approaches. Notably, the improvement of Reg-DGM over the base DGM is larger when the sample size m is smaller. This is as expected because the variance that Reg-DGM aims to reduce grows as the sample size shrinks.

We mention that the methods based on fine-tuning (Wang et al., 2018; Mo et al., 2020) in Tab. 1 employ a GAN pre-trained on CelebA-HQ (Karras et al., 2017), which is also a face dataset of the same resolution as FFHQ. In comparison, our approach is built upon a classifier pre-trained on ImageNet. As highlighted in (Wang et al., 2018), the density of the pre-training dataset is more important than its diversity. Nevertheless, Reg-DGM is competitive with these strong baselines while enjoying the flexibility of choosing the pre-trained model and dataset. Further, we directly adopt the same pre-trained model across all datasets including CIFAR-10, where it takes additional effort to obtain a suitable pre-trained generative model to fine-tune. Based on the results, it is safe to emphasize the complementary role of Reg-DGM to the approaches based on fine-tuning. Qualitatively, we show samples generated from Reg-DGM on FFHQ-5k and 100-shot Obama in Fig. 3. It can be seen that, with the regularization, our approach can produce faces of a normal shape with limited data. We present the results from the base DGMs and more samples in other settings in Appendix C.2. For a comprehensive comparison of the visual quality of samples, we perform a human evaluation on Amazon Mechanical Turk (AMT) as in prior work (Choi et al., 2020). Consistent with the FID results, Reg-StyleGAN2 trained on FFHQ-5k is chosen over the baseline StyleGAN2 in 63.5% of the 3,000 image quality comparison tasks. See details in Appendix C.4.

5.2. ABLATION OF PRE-TRAINED MODELS AND PRE-TRAINING DATASETS

For simplicity, the main results of Reg-DGM presented in Sec. 5.1 are based on a ResNet-18 model pre-trained on ImageNet. Although its architecture is significantly different from the Inception-v3 (Szegedy et al., 2016) used in FID and KID calculation, it is worth performing an ablation on the pre-trained models to eliminate the potential bias caused by the pre-training dataset. In particular, we test Reg-DGM with the image encoder of the CLIP model (Radford et al., 2021) (an architecture very similar to ResNet-50), which is pre-trained on large-scale noisy text-image pairs instead of ImageNet, and with the face recognizer Inception-ResNet-v1 (Szegedy et al., 2017) of FaceNet (Schroff et al., 2015), which is pre-trained on the large-scale face dataset VGGFace2 (Cao et al., 2018).

5.3. ABLATION OF ENERGY FUNCTIONS

As emphasized in Sec. 4, our main contribution is the general framework rather than a specific energy function. Nevertheless, we still investigate two alternative regularization terms for completeness. We first consider the well-known entropy-minimization regularization (Grandvalet & Bengio, 2004), which is data-independent. Intuitively, the regularization encourages p_g to produce samples with clear semantics. We find that the entropy regularization achieves an FID of 50.87 on FFHQ-5k, similar to the baseline's 52.71, showing the importance of data dependency in the energy function. We then investigate another data-dependent regularization term, i.e., feature matching (Salimans et al., 2016), which matches the averaged features between the model and data distributions. We find that feature matching achieves an FID of 32.65 on FFHQ-5k, greatly reducing the FID of StyleGAN2, while it cannot improve the visual quality of the samples. Please refer to Appendix C.4 for details of both energy functions. Therefore, Eq. (5) is the best among them considering the qualitative and quantitative results together, and we believe it can be transferred to new settings based on our results in Sec. 5.1 and Sec. 5.2.

6. RELATED WORK

Fine-tuning approaches. A milestone of deep learning is that a deep generative model fine-tuned for classification outperformed the classical SVM in recognizing handwritten digits (Hinton et al., 2006). Since then, the idea of fine-tuning has had a significant impact (Devlin et al., 2018; He et al., 2020), including on generative models with limited data (Wang et al., 2018; Mo et al., 2020; Wang et al., 2020; Li et al., 2020; Ojha et al., 2021). However, an inherent restriction of fine-tuning is that the pre-trained model and the target model should partially share a common structure. Thus, it may take additional effort to find a suitable pre-trained model to fine-tune. In comparison, Reg-DGM provides an alternative way that makes it possible to exploit a pre-trained classifier to help generative modeling; notably, generation is often thought of as much harder than classification. Other generative adversarial networks with limited data. To relieve the overfitting problem of the discriminator, DA (Zhao et al., 2020a), ADA (Karras et al., 2020a), and APA (Jiang et al., 2021) design sophisticated data augmentation strategies. GenCo (Cui et al., 2021) designs a co-training framework that introduces multiple complementary discriminators. InsGen (Yang et al., 2021) improves the data efficiency of GANs via an instance discrimination loss (Wu et al., 2018). We believe that Reg-DGM is orthogonal to these methods based on our results in Tab. 1. Regularization in probabilistic models. Extensive regularization approaches have been developed in traditional Bayesian inference (Zhu et al., 2014) and probabilistic modeling (Chang et al., 2007; Liang et al., 2009; Ganchev et al., 2010). Among them, posterior regularization (PR) (Ganchev et al., 2010) encodes human knowledge about the task as linear constraints on the latent representations in generative models for better inference performance.
Such methods have been extended to deep generative models (Hu et al., 2018; Du et al., 2018; Shu et al., 2018; Xu et al., 2019) for a similar reason. Technically, PR-based methods regularize the latent space via handcrafted or jointly trained constraints. In comparison, our approach regularizes the data space via a pre-trained model. Besides, PR-based methods are suitable for structured prediction tasks instead of generative modeling with limited data, which is the main focus of this paper.

7. CONCLUSIONS

In 

A PROOFS A.1 THE GAUSSIAN-FITTING EXAMPLE

We first derive the solution of Reg-DGM in the main text. In the Gaussian-fitting example, the regularized optimization problem can be written as

μ̂_REG = argmin_μ −(1/m) Σ_{i=1}^m log N(x_i|μ, σ²) + λ E_{x∼N(μ,σ²)}[−log p_f(x)]   (6)
      = argmin_μ (1/m) Σ_{i=1}^m (μ − x_i)²/(2σ²) + λ [ (μ − μ̂_PRE)²/(2σ²) + (1/2) log(2πσ²) + 1/2 ]   (7)
      = argmin_μ (1/m) Σ_{i=1}^m (μ − x_i)²/(2σ²) + λ (μ − μ̂_PRE)²/(2σ²),   (8)

where the first equality holds by the definition and properties of the Gaussian (Bishop & Nasrabadi, 2006) and the second equality holds by omitting constants irrelevant to the optimization. It is easy to solve the above quadratic programming problem analytically:

μ̂_REG = 1/(m(1+λ)) Σ_{i=1}^m x_i + λ/(1+λ) μ̂_PRE = 1/(1+λ) μ̂_MLE + λ/(1+λ) μ̂_PRE,   (9)

where μ̂_MLE ∼ N(μ*, σ²/m). μ̂_REG is obtained by an affine transformation of a Gaussian random variable and thus is also Gaussian distributed:

μ̂_REG ∼ N( 1/(1+λ) μ* + λ/(1+λ) μ̂_PRE, σ²/(m(1+λ)²) ).   (10)

A.1.1 PROOF OF PROPOSITION 2.2

Proof. In the Gaussian-fitting example, we have

μ̂_MLE ∼ N(μ*, σ²/m)  and  μ̂_REG ∼ N( 1/(1+λ) μ* + λ/(1+λ) μ̂_PRE, σ²/(m(1+λ)²) ).   (11)

According to the bias-variance decomposition of the MSE, we have

MSE[μ̂_MLE] = σ²/m,  and  MSE[μ̂_REG] = λ²/(1+λ)² (μ̂_PRE − μ*)² + σ²/(m(1+λ)²).   (12)

Let β = λ/(1+λ) ∈ (0, 1) be the normalized weight of the regularization term. Then we can rewrite MSE[μ̂_REG] as

MSE[μ̂_REG] = β² (μ̂_PRE − μ*)² + (1 − β)² σ²/m.   (13)

To satisfy MSE[μ̂_REG] < MSE[μ̂_MLE], we need

β² (μ̂_PRE − μ*)² + (1 − β)² σ²/m < σ²/m  ⇒  β < 2σ² / (σ² + m(μ̂_PRE − μ*)²).   (14)

The pre-trained model μ̂_PRE is another meaningful baseline, which has a bias of μ̂_PRE − μ* and zero variance. Its MSE is given by

MSE[μ̂_PRE] = (μ̂_PRE − μ*)².   (15)

To satisfy MSE[μ̂_REG] < MSE[μ̂_PRE], we need

β² (μ̂_PRE − μ*)² + (1 − β)² σ²/m < (μ̂_PRE − μ*)²  ⇒  β > (σ² − m(μ̂_PRE − μ*)²) / (σ² + m(μ̂_PRE − μ*)²),   (16)

which completes the proof. We now compute the optimal MSE[μ̂_REG].
It is easy to see that the quadratic function of β in MSE[μ̂_REG] achieves its minimum at β* = σ² / (σ² + m(μ̂_PRE − μ*)²), with the corresponding λ* = σ² / (m(μ̂_PRE − μ*)²). The minimum value is

σ² (μ̂_PRE − μ*)² / (σ² + m(μ̂_PRE − μ*)²) = MSE[μ̂_MLE] · MSE[μ̂_PRE] / (MSE[μ̂_MLE] + MSE[μ̂_PRE]),

i.e., the harmonic-mean-type combination of the two baseline MSEs.
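The claimed minimizer and minimum value can be checked numerically. The sketch below uses hypothetical values (σ² = 1, m = 10, (μ̂_PRE − μ*)² = 0.25), compares a grid search over β against the closed-form β*, and verifies the harmonic-mean-type identity:

```python
import numpy as np

sigma2, m, delta2 = 1.0, 10, 0.25   # sigma^2, sample size, (mu_pre - mu*)^2
mse_reg = lambda b: b**2 * delta2 + (1 - b)**2 * sigma2 / m  # MSE as a function of beta

beta_star = sigma2 / (sigma2 + m * delta2)        # claimed minimizer
grid = np.linspace(0.0, 1.0, 100_001)
beta_grid = grid[np.argmin(mse_reg(grid))]        # brute-force check

mse_mle, mse_pre = sigma2 / m, delta2
hmean = mse_mle * mse_pre / (mse_mle + mse_pre)   # claimed minimum value
```

Both checks agree to numerical precision with the closed forms derived above.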

A.1.2 COMPARISON UNDER THE EXPECTED RISK

We also evaluate all estimators in terms of the expected risk, which is a commonly used measure in statistical learning theory. In a density estimation task for p_d, the expected risk of a hypothesis μ̂, which depends on the training sample S, is

R(μ̂) := E_{x∼p_d}[−log p(x; μ̂)].

We can show that the expectation of the expected risk w.r.t. S is an increasing affine function of the corresponding MSE in the Gaussian-fitting example, and we directly obtain the following Corollary A.0.1 from Proposition 2.2.

Corollary A.0.1. In the Gaussian-fitting Example 2.1, if λ satisfies the same condition as in Proposition 2.2, then the following inequality holds:

E_S[R(μ̂_REG)] < min{E_S[R(μ̂_MLE)], E_S[R(μ̂_PRE)]}.

Proof. In the Gaussian-fitting example, the expected risk of a hypothesis μ̂ satisfies

E_S[R(μ̂)] = E_{S∼p_d^m} E_{x∼N(μ*,σ²)}[−log N(x|μ̂, σ²)]   (20)
           = E_{S∼p_d^m}[ (μ̂ − μ*)²/(2σ²) + (1/2) log(2πσ²) + 1/2 ]   (21)
           = MSE[μ̂]/(2σ²) + (1/2) log(2πσ²) + 1/2,   (22)

which completes the proof together with Proposition 2.2.
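The Gaussian identity behind Eq. (21) is easy to verify numerically. The sketch below uses hypothetical values (μ* = 0, σ = 1) and an arbitrary fixed hypothesis μ̂ = 0.7, comparing a Monte Carlo estimate of the expected risk with the closed form:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_star, sigma = 0.0, 1.0
mu_hat = 0.7                          # an arbitrary fixed hypothesis

x = rng.normal(mu_star, sigma, size=2_000_000)
nll = 0.5 * (x - mu_hat)**2 / sigma**2 + 0.5 * np.log(2 * np.pi * sigma**2)
risk_mc = float(nll.mean())           # Monte Carlo estimate of R(mu_hat)

risk_closed = ((mu_hat - mu_star)**2 / (2 * sigma**2)
               + 0.5 * np.log(2 * np.pi * sigma**2) + 0.5)
```

The two values agree up to Monte Carlo error, confirming that the expected risk is an affine function of the squared error of the hypothesis.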

A.2 OPTIMIZATION ANALYSES

A.2.1 PROOF OF THEOREM 3.1 AND THEOREM 3.2

Our results rely on the regularity conditions in Assumption A.1. We define A = {α ∈ R | p_d(x)/(α + λE_f(x)) ≥ 0, ∀x ∈ X}. According to Assumption 3.1 that E_f is bounded on X, we have A ≠ ∅. We define a function ϕ : A → R as

ϕ(α) = ∫_X p_d(x)/(α + λE_f(x)) dx.

It is easy to see that ϕ(α) is monotonically decreasing, so there is at most one α* ∈ A such that ϕ(α*) = 1, which finishes the proof together with the existence of the global minimum.

We now prove Theorem 3.2 as follows.

Proof. The proof shares the same spirit as that of Theorem 3.1. Ignoring constants irrelevant to the optimization, we rewrite the optimization problem of our approach with the JS divergence as follows:

min_{p_g} ∫_X [−ln(p_g(x) + p_d(x)) p_d(x) − ln(p_g(x) + p_d(x)) p_g(x) + ln(p_g(x)) p_g(x) + λE_f(x) p_g(x)] dx,   (28)
subject to ∫_X p_g(x) dx = 1 and p_g(x) ≥ 0 for all x ∈ X.

For clarity, we denote the functional to be optimized in problem Eq. (28) as

J(p_g) := ∫_X [−ln(p_g(x) + p_d(x)) p_d(x) − ln(p_g(x) + p_d(x)) p_g(x) + ln(p_g(x)) p_g(x) + λE_f(x) p_g(x)] dx.   (29)

Similarly to the proof of Theorem 3.1, the global minimum of J(p_g) exists in the feasible region due to Assumption A.1 and the fact that the JS divergence is lower semi-continuous in the topology of weak convergence. Notably, the optimization problem Eq. (28) is convex due to the convexity of the JS divergence (Billingsley, 2013). To obtain a necessary condition for the global minima, we form the Lagrangian with the equality constraint:

L(p_g) := J(p_g) + α (∫_X p_g(x) dx − 1),

where α ∈ R. Similarly to the proof of Theorem 3.1, a necessary condition for a global minimum of problem Eq. (28) is

δL/δp_g = −ln(p_g(x) + p_d(x)) + ln(p_g(x)) + λE_f(x) + α = 0,

which implies that

p*_g(x) = p_d(x)/(e^{α + λE_f(x)} − 1).

Similarly to the proof of Theorem 3.1, there is at most one α* ∈ R such that ∫_X p_d(x)/(e^{α* + λE_f(x)} − 1) dx = 1, which finishes the proof together with the existence of the global minimum.
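For the KL case, the fixed point p*_g(x) = p_d(x)/(α* + λE_f(x)) can be computed numerically. The snippet below is an illustrative sketch of our own using the toy setting of Appendix B.1 (uniform data on [0, 1] and E_f(x) = 0.7x + 0.9): since ϕ is monotonically decreasing, bisection recovers the unique multiplier α*.

```python
def solve_alpha(lam, E_f, lo, hi, n_grid=20_000, tol=1e-8):
    """Bisection for alpha* with phi(alpha*) = 1, where
    phi(alpha) = integral over [0, 1] of p_d(x) / (alpha + lam * E_f(x)) dx
    with p_d = Uniform[0, 1]; phi is monotonically decreasing on the bracket."""
    def phi(alpha):
        h = 1.0 / n_grid  # midpoint rule
        return sum(h / (alpha + lam * E_f((i + 0.5) * h)) for i in range(n_grid))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) > 1.0:
            lo = mid   # phi too large -> raise alpha
        else:
            hi = mid
    return 0.5 * (lo + hi)

E_f = lambda x: 0.7 * x + 0.9      # toy energy from Appendix B.1
lam = 1.0
# positivity of p_g* requires alpha + lam * E_f(x) > 0 on [0, 1], i.e. alpha > -0.9 * lam
alpha = solve_alpha(lam, E_f, lo=-0.9 * lam + 1e-6, hi=10.0)
p_g = lambda x: 1.0 / (alpha + lam * E_f(x))  # p_d(x) / (alpha + lam * E_f(x)) with p_d = 1
```

Here the integral has the closed form (1/(0.7λ)) ln((α + 1.6λ)/(α + 0.9λ)), which gives α* ≈ −0.2095 for λ = 1; the resulting density places more mass where the energy is low, consistent with Fig. 2 (c).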

A.2.2 CONVERGENCE WITH NEURAL NETWORKS TRAINED BY (STOCHASTIC) GRADIENT DESCENT

We establish the convergence of Reg-DGM with over-parameterized neural networks trained by (stochastic) gradient descent upon the general framework of Allen-Zhu et al. (2019).

Theorem A.2 (General convergence guarantee (Allen-Zhu et al., 2019)). For an arbitrary Lipschitz-smooth loss function L, with probability at least 1 − e^{−Ω(log m)}, a ReLU convolutional neural network with width m and depth l trained by gradient descent with an appropriate learning rate satisfies the following.
• If L is non-convex and σ-gradient dominant, then GD finds an ϵ-error minimizer in Õ(poly(n, l, log(1/ϵ), 1/σ)) iterations as long as m ≥ Ω(poly(n, l, 1/σ)).
• If L is non-convex, then SGD finds a point with ||∇f|| ≤ ϵ in Õ(poly(m, l, log(1/ϵ))) iterations as long as m ≥ Ω(poly(n, l, 1/ϵ)).

We assume the following standard smoothness conditions, which can be verified in practice with bounded data and weights.

Assumption A.3 (Smoothness conditions).
1. There exists b > 0 such that p(θ; x) ≥ b for all θ ∈ Θ and x ∈ X.
2. There exists L > 0 such that |p(θ; x) − p(θ′; x)| ≤ L||θ − θ′|| for all x ∈ X and θ, θ′ ∈ Θ.
3. There exists K > 0 such that ∫ |p(θ; x) − p(θ′; x)| dx ≤ K||θ − θ′|| for all θ, θ′ ∈ Θ.
4. sup_{x∈X} sup_{y∈X} ||f(x) − f(y)||²₂ ≤ B.

We consider the general density estimation setting where L_MLE(θ; x_i) := −log p_θ(x_i). Note that the first two conditions in Assumption A.3 directly imply that L_MLE(·; x) is (L/b)-Lipschitz. Formally, given a set of samples S = {x_i}_{i=1}^n, Reg-DGM optimizes

R[θ] := (1/n) Σ_{i=1}^n L_MLE(θ; x_i) + λ E_{x∼p_g}[E_f(x)].

If E_f is independent of the training data, then the overall loss function is also Lipschitz and Theorem A.2 directly applies. Otherwise, we can show that the data-dependent regularization used in our experiments is also Lipschitz-smooth. By the linearity of expectation, we have

θ̂_REG := argmin_{θ∈Θ} (1/n) Σ_{i=1}^n L_MLE(θ; x_i) + (λ/d) E_{y∼p_θ}[(1/n) Σ_{i=1}^n ||f(y) − f(x_i)||²₂]   (34)
= argmin_{θ∈Θ} (1/n) Σ_{i=1}^n [L_MLE(θ; x_i) + (λ/d) E_{y∼p_θ} ||f(y) − f(x_i)||²₂].   (35)

We define L_REG(θ; x_i) := (λ/d) E_{y∼p_θ} ||f(y) − f(x_i)||²₂, which is Lipschitz:

|L_REG(θ; x) − L_REG(θ′; x)| = (λ/d) |E_{y∼p_θ} ||f(y) − f(x)||²₂ − E_{y∼p_θ′} ||f(y) − f(x)||²₂|
≤ (λ/d) ∫ |p_θ(y) − p_θ′(y)| ||f(x) − f(y)||²₂ dy
≤ (λ/d) (sup_{y∈X} ||f(x) − f(y)||²₂) ∫ |p_θ(y) − p_θ′(y)| dy
≤ (λBK/d) ||θ − θ′||.

Therefore, Theorem A.2 applies to Reg-DGM with the data-dependent energy defined in Sec. 4 of the main text.
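As a concrete illustration of the data-dependent regularizer above, the sketch below (our own example, with a fixed random linear map standing in for the frozen pre-trained extractor f) computes the minibatch Monte Carlo estimate (λ/d) · (1/(nm)) Σ_i Σ_j ||f(y_j) − f(x_i)||²₂.

```python
import numpy as np

def data_dependent_energy(fake, real, f, lam):
    """Minibatch estimate of (lam/d) * E_{y~p_g}[(1/n) * sum_i ||f(y) - f(x_i)||_2^2].

    fake: (m, p) generated samples, real: (n, p) training samples,
    f: frozen feature extractor mapping (k, p) arrays to (k, d) features.
    """
    fy, fx = f(fake), f(real)                                    # (m, d) and (n, d)
    d = fy.shape[1]
    sq_dists = ((fy[:, None, :] - fx[None, :, :]) ** 2).sum(-1)  # (m, n) pairwise
    return lam / d * sq_dists.mean()

# hypothetical stand-in for a pre-trained extractor: a fixed linear feature map
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))        # input dim p = 8, feature dim d = 4
f = lambda x: x @ W

real = rng.standard_normal((16, 8))
fake = rng.standard_normal((32, 8))
loss = data_dependent_energy(fake, real, f, lam=0.5)
```

In the experiments the extractor is a frozen ImageNet classifier and the estimate is backpropagated through the generated samples; the sketch only shows the arithmetic of the energy.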

B EXPERIMENTAL DETAILS

Our implementation is built upon publicly available code. Below, we include the links and refer the reader to the licenses therein.

B.1 TOY DATA

In the toy example for the optimization analyses, the data follows a uniform distribution over [0, 1] and the energy function is defined as E_f(x) = 0.7x + 0.9. The optimal multiplier α* is estimated by numerical integration. In the experiments for the Gaussian-fitting example, we set σ² = 1, m = 150, and μ̂_PRE − µ* = 0.1 by default.

B.2 GANS WITH LIMITED DATA

Datasets. In our experiments, we use FFHQ (Karras et al., 2019), which consists of 70,000 human face images at 256 × 256 resolution, LSUN CAT (Yu et al., 2015), which consists of 200,000 cat images at 256 × 256 resolution, and CIFAR-10 (Krizhevsky et al., 2009), which consists of 50,000 natural images at 32 × 32 resolution. Specifically, we randomly split training subsets of size 1k and 5k from the full FFHQ and LSUN CAT in the same way as ADA (Karras et al., 2020a). For reproducibility, we directly use the random seed provided by the official implementation of ADA. In addition, we also experiment on the smaller dataset 100-shot Obama (Zhao et al., 2020a), with only 100 face images at 256 × 256 resolution. In all experiments, we do not use x-flips to amplify the training data, except when combining with APA (Jiang et al., 2021).

Metrics. To quantitatively evaluate the experimental results, we choose the Fréchet inception distance (FID) (Heusel et al., 2017) and the kernel inception distance (KID) (Bińkowski et al., 2018) as our metrics. We compute the FID and KID between 50,000 generated images and all real images instead of the training subsets (Heusel et al., 2017). Following ADA (Karras et al., 2020a), we report the median FID and the corresponding KID on FFHQ and LSUN CAT, and the mean FID with standard deviation on CIFAR-10, over 3 runs. We record the best FID during training in each run, as in ADA (Karras et al., 2020a).

Base DGM. In particular, the lighter-weight StyleGAN2 is the backbone for FFHQ and LSUN CAT, and the tuned StyleGAN2 is the backbone for CIFAR-10, following ADA (Karras et al., 2020a). Compared to the official StyleGAN2, the lighter-weight StyleGAN2 has the same performance at a lower computational cost on FFHQ and LSUN CAT, and the tuned StyleGAN2 is more suitable for CIFAR-10 (Karras et al., 2020a).
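As a reminder of what FID measures, the snippet below computes the Fréchet distance between two Gaussians in the special case of diagonal covariances, where the matrix square root reduces to elementwise square roots. This is our own illustration of the metric's definition only, not the Inception-feature pipeline used in the experiments.

```python
import numpy as np

def frechet_distance_diag(mu1, var1, mu2, var2):
    """Frechet distance between N(mu1, diag(var1)) and N(mu2, diag(var2)):
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2}),
    which for diagonal covariances equals
    ||mu1 - mu2||^2 + sum_i (sqrt(var1_i) - sqrt(var2_i))^2."""
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    return float(((mu1 - mu2) ** 2).sum() + ((np.sqrt(var1) - np.sqrt(var2)) ** 2).sum())

d0 = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])  # identical -> 0
d1 = frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [4.0, 1.0])  # 1 + (2 - 1)^2 = 2
```

In practice both Gaussians are fitted to Inception features of the generated and real images, and the full-covariance version requires a genuine matrix square root.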
Adaptive discriminator augmentation (ADA) (Karras et al., 2020a) is a representative data augmentation method for GANs under limited data, and its combination with adaptive pseudo augmentation (APA) (Jiang et al., 2021) is the state-of-the-art method for few-shot image generation (Li et al., 2022). Our implementation is based on the official code of ADA and APA.

Pre-trained model. We choose the ResNet-18 trained on ImageNet as the pre-trained model by default, due to its excellent performance on ImageNet (Deng et al., 2009) and little computational overhead. We normalize both the real and fake images based on the mean and standard deviation of the training data and then feed them into the classifier. We extract the features of the last fully connected layer of the pre-trained model, which is frozen during the training of the generative models. On CIFAR-10, we interpolate both the fake and real images to a resolution of 256 × 256 after normalization. Other alternative pre-trained models are the image encoder ResNet-50 of the CLIP model (Radford et al., 2021), which is pre-trained on large-scale noisy text-image pairs instead of ImageNet, and the face recognizer Inception-ResNet-v1 (Szegedy et al., 2017) of FaceNet (Schroff et al., 2015), which is pre-trained on the large-scale face dataset VGGFace2 (Cao et al., 2018). Notably, we replace the last attention pooling layer in the image encoder of CLIP with a global average pooling layer, and we pass images directly to the face recognizer without applying the face detector in FaceNet to extract cropped faces.

C.3 LEARNING CURVES

We show the learning curves of GANs on FFHQ, LSUN CAT, and CIFAR-10 in Fig. 6, Fig. 7, and Fig. 8, respectively. Reg-DGM consistently improves both baselines in all settings, and the improvements grow as the amount of training data decreases. Moreover, the curves of Reg-DGM generally fluctuate less, which is consistent with our theory that the regularization reduces the variance of the baselines. One exception is Fig. 8 (a), which shows that Reg-DGM is more unstable than the baseline; this is caused by a bad random initialization. We note that the instability of Reg-ADA in Fig. 6 (b) is inherited from ADA.

C.4 ABLATION OF CLASSIFIER AND RESULTS UNDER MORE EVALUATION METRICS

In this section, to better understand the influence of classifiers on our method, we use pre-trained classifiers with different regularization terms, layers, and backbones. We retain the same experimental setting as in Tab. 3.

Regularization form. We first investigate the feature matching objective (Salimans et al., 2016) as an alternative regularization term. Formally, it computes the squared l2-norm between the expected features of real and fake samples from one layer of a feature extractor f:

||E_{x′∼p_d}[f(x′)] − E_{x∼p_g}[f(x)]||²₂.   (36)

Note that the feature matching objective cannot be rewritten as an expectation over p_g and thus cannot be understood as an energy function. As before, we adopt the last fully-connected layer of ResNet-18, and the results of StyleGAN2 regularized by Eq. (36) are shown in Tab. 6. Feature matching greatly reduces the FID of StyleGAN2 yet fails to improve the visual quality of the samples, as shown in Fig. 9. We then evaluate the entropy-minimization regularization (Grandvalet & Bengio, 2004), −H(softmax(f(x))), where f(x) outputs the logits of the prediction distribution. As shown in Tab. 7, the entropy regularization achieves FID results similar to the baseline within a small search space of λ, showing the importance of the data dependency in the energy function.

Layers in f. To explore different layers of a pre-trained model, we retrain GANs separately using the first convolution layer of ResNet-18 and the last layers of its four modules. As shown in Tab. 8, our method improves the baseline StyleGAN2 with features from each of these layers, and the last layer of ResNet-18 is the most beneficial for our regularization strategy.

Backbone of f. We employ ResNet-50 and ResNet-101 as feature extractors to explore the effect of different backbones on Reg-DGM. Tab. 9 shows the results on FFHQ-5k.
Even using the default λ without tuning, Reg-DGM with ResNet-50 and ResNet-101 achieves an FID similar to that of our default setting and outperforms the baseline. We believe that Reg-DGM with ResNet-50 and ResNet-101 can obtain better results with a finer search over the hyperparameter λ.

Monte Carlo estimate of the energy. We evaluate Reg-StyleGAN2 with 8, 16, 32, and 64 (the default batch size) samples to estimate the energy function in Eq. (5) of the main text, as shown in Tab. 10. We do not observe a significant improvement from increasing the number of samples. For instance, when λ = 1, the estimate with 64 samples achieves an FID of 38.10, similar to the 37.77 of the single-sample estimate. Intuitively, the features of faces are likely concentrated in a small area of the feature space of f that is discriminative against other classes of natural images such as cars, making the variance negligible to the training process of the generative model.

Human evaluation. We compare the quality of generated images from the human perspective via Amazon Mechanical Turk (AMT), as in prior work (Choi et al., 2020). In particular, we randomly generate 1,000 pairs of images with StyleGAN2 and Reg-StyleGAN2 trained on FFHQ-5k using the truncation trick (ψ = 0.7). For each image pair, three unique AMT workers choose the more realistic and higher-quality face image. In total, 3,000 selection tasks were completed by 34 workers within 14 hours. Consistent with the comparison of image quality between Fig. 4 (a) and Fig. 4 (b), Reg-StyleGAN2 is chosen in 63.5% of the 3,000 tasks; other statistical information is shown in Tab. 12.

Results on more training sets of different sizes. To explore the influence of our method on training data of different sizes, we test Reg-StyleGAN2 with the pre-trained ResNet-18 on different subsets of FFHQ. The results are shown in Tab. 13; the hyperparameter λ is simply fixed as 1 for the new data settings 100, 2k, 7k, 10k, and 15k.
Without heavily tuning the hyperparameter λ, Reg-StyleGAN2 shows consistent improvements over the baseline under the FID metric. There is a trend that our method improves more on smaller datasets, which agrees with our Gaussian-fitting case illustrated in Fig. 1 (b).
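The feature matching regularizer of Eq. (36) discussed above is easy to state in code. The sketch below is our own illustration with a fixed random linear map standing in for the ResNet-18 features; note that, unlike the energy-based regularizer, it compares only the mean features of the two minibatches.

```python
import numpy as np

def feature_matching_loss(real, fake, f):
    """|| E_{x'~p_d}[f(x')] - E_{x~p_g}[f(x)] ||_2^2, estimated on minibatches.
    Matching only the first moment of the features is why this objective cannot
    be written as an expectation of a per-sample energy under p_g."""
    return float(((f(real).mean(axis=0) - f(fake).mean(axis=0)) ** 2).sum())

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))          # hypothetical frozen feature map (p = 8 -> d = 4)
f = lambda x: x @ W

real = rng.standard_normal((64, 8))
loss_same = feature_matching_loss(real, real.copy(), f)   # identical batches -> 0.0
loss_diff = feature_matching_loss(real, real + 1.0, f)    # shifted batch -> positive
```

Because only mean features are matched, a generator can drive this loss to zero while producing samples whose per-sample features are far from any real image, consistent with the observation that it reduces FID without improving visual quality.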

C.5 SENSITIVITY ANALYSIS OF THE WEIGHTING HYPERPARAMETER

We empirically analyze the sensitivity of the weighting hyperparameter λ on FFHQ-5k with StyleGAN2 as the base model in Tab. 14. It can be seen that λ affects the performance significantly. Notably, although it is nearly impossible to obtain the optimal λ via grid search, there is a range of λ (e.g., from 0.01 to 1 in Tab. 14) such that Reg-DGM outperforms the base DGM.

Table 14: Sensitivity analysis of λ on FFHQ-5k with StyleGAN2 as the base DGM. Columns: λ = 0 (base DGM), λ = 0.01, λ = 0.1, λ = 1, λ = 10, λ = 100.



Footnotes:
- The expected risk coincides with the MSE in the example.
- See Appendix A.2.2 for generalization analyses.
- Note that max{(σ² − m(μ̂_PRE − µ*)²)/(σ² + m(μ̂_PRE − µ*)²), 0} < min{2σ²/(σ² + m(μ̂_PRE − µ*)²), 1}.
- See additional assumptions on the regularity of the data in (Allen-Zhu et al., 2019).
- https://github.com/NVlabs/stylegan2-ada-pytorch
- https://github.com/EndlessSora/DeceiveD
- https://pytorch.org/vision/stable/models.html
- https://pytorch.org/docs/stable/generated/torch.nn.functional.interpolate.html
- https://github.com/openai/CLIP
- https://github.com/timesler/facenet-pytorch



Figure 1: MSE in the Gaussian-fitting example to validate Proposition 2.2. (a) The effect of λ for fixed m and μ̂_PRE − µ*, where the red circle indicates β = max{(σ² − m(μ̂_PRE − µ*)²)/(σ² + m(μ̂_PRE − µ*)²), 0} and the black circle indicates β = min{2σ²/(σ² + m(μ̂_PRE − µ*)²), 1}. (b) The effect of the sample size m. (c) The effect of the bias of the pre-trained model. In both (b) and (c), the hyperparameter λ is optimal.

Figure 2: Illustration of the optimization analyses. (a) The density of the data distribution. (b) The energy function defined by the pre-trained model. (c) The global minimum of the problem in Eq. (2) with the KL divergence (KLD) and different values of λ. (d) The global minimum of the problem in Eq. (2) with the JS divergence (JSD) and different values of λ. See more details in Appendix B.1.

We plot the MSE w.r.t. the value of λ, m, and |μ̂_PRE − µ*| in Fig. 1 for a clearer illustration. Fig. 1 (a) directly validates the results of Proposition 2.2. Fig. 1 (b) and (c) show that, with the optimal λ, the regularization gains more over the MLE as the sample size m decreases and |μ̂_PRE − µ*| increases. See more details of the plot in Appendix B.1.

Figure 3: Samples from Reg-ADA, truncated (ψ = 0.7) as in prior work (Karras et al., 2020a).

Figure 4: Samples generated for FFHQ-5k, truncated (ψ = 0.7).

Figure 5: Samples generated for CIFAR-10, truncated (ψ = 0.7).

Figure 6: Learning Curves on FFHQ.

Figure 7: Learning Curves on LSUN CAT.

Figure 8: Learning Curves on CIFAR-10.

Figure 9: Samples generated for FFHQ-5k using feature matching (FID 32.65), truncated (ψ = 0.7).

Median FID ↓ on FFHQ and LSUN CAT and mean FID ↓ on CIFAR-10. † and ‡ indicate that the results are taken from the references and from Karras et al. (2020a), respectively. Otherwise, the results are reproduced by us upon the official implementations (Karras et al., 2020a; Jiang et al., 2021).

Median FID ↓ and the corresponding KID×10³ ↓ using a pre-trained CLIP or FaceNet.

decreases given a fixed pre-trained model and the optimal λ (see Fig. 1 (b)). Further, we evaluate the base DGMs and Reg-DGMs under the kernel inception distance (KID) (Bińkowski et al., 2018) metric in Appendix C.4; the conclusion remains the same. As suggested by Proposition 2.2, the value of the weighting hyperparameter λ is crucial, and there is an appropriate range of λ such that Reg-DGM outperforms the base DGM. We empirically validate this argument on FFHQ-5k with StyleGAN2 as the base model in Appendix C.5.

In this paper, we propose the regularized deep generative model (Reg-DGM), which leverages a pre-trained model for regularization to reduce the variance of DGMs with limited data. Theoretically, we analyze the convergence properties of Reg-DGM. Empirically, with various pre-trained feature extractors and a data-dependent energy function, Reg-DGM consistently improves the generation performance of strong DGMs with limited data.

ETHICS STATEMENT

This work presents a framework to train deep generative models on small data. By improving data efficiency, it can potentially benefit real-world applications like medical image analysis and autonomous driving. However, this work can have negative consequences in the form of "DeepFakes", as with existing GANs, and it may exacerbate such issues by improving the data efficiency of GANs. How to detect "DeepFakes" is an active research area in machine learning that aims to relieve this problem.

REPRODUCIBILITY STATEMENT

We submit the source code in the supplementary material and have released it. The datasets used in the experiments are open and publicly available, and the experimental details are provided in Appendix B. In addition, complete proofs of the propositions and theorems are contained in Appendix A.

Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient GAN training. Advances in Neural Information Processing Systems, 33:7559-7570, 2020a.
Zhengli Zhao, Sameer Singh, Honglak Lee, Zizhao Zhang, Augustus Odena, and Han Zhang. Improved consistency regularization for GANs. arXiv preprint arXiv:2002.04724, 2020b.

The mean FID ↓ and standard deviation on FFHQ and LSUN CAT, as a supplement to the reported median FID.

StyleGAN2 (Karras et al., 2020b): 102.62 ± 5.67 | 53.37 ± 1.92 | 189.57 ± 8.13 | 110.83 ± 6.85
Reg-StyleGAN2 (ours): 77.80 ± 3.65 | 38.14 ± 0.97 | 112.15 ± 8.48 | 64.11 ± 2.51
ADA (Karras et al., 2020a): 22.10 ± 0.50 | 12.72 ± 0.13 | 41.59 ± 1.71 | 16.77 ± 0.74
Reg-ADA (ours): 20.16 ± 0.22 | 11.88 ± 0.13 | 36.85 ± 1.09 | 15.85 ± 0.10

The median FID ↓ on FFHQ and LSUN CAT with feature matching as the regularization. λ is simply set to the same values as in Tab. 3.

Median FID on FFHQ-5k for the entropy-minimization regularization.

This agrees with our theoretical analyses. Besides, Reg-DGM is not too sensitive when the value of λ is around the optimum. For instance, the gap between the results of λ = 0.1 and λ = 1 in Tab. 14 is much smaller than their gain over the baseline. The performance of Reg-DGM deteriorates with a large λ in Tab. 14, as expected: Proposition 2.2 shows that Reg-DGM is preferable if and only if λ lies in a proper interval, and intuitively a very large value means that we almost ignore the training data, which should lead to inferior performance.

Results with different layers on FFHQ-5k. "-1 layer" denotes the last layer (i.e., our default setting). Note that layers are indexed by the named_modules function in PyTorch. λ is simply set to the same values as in Tab. 3.

Results with different backbones on FFHQ-5k. λ is simply set to the same values as in Tab. 3.

FID ↓ on FFHQ-5k for different numbers of samples in the Monte Carlo estimate.

KID×10 3 ↓ on FFHQ, LSUN-CAT, and CIFAR-10.

Other statistical information about the human evaluation. Number-i indicates the number of image pairs for which at least i workers judged that StyleGAN2 or Reg-StyleGAN2 generates the more realistic face image.

The FID on different subsets of FFHQ.


Assumption A.1.
1. X is a nonempty compact set.
2. E_f : X → R is continuous and bounded.
3. ∫_X e^{−E_f(x)} dx < ∞.

The regularity conditions in Assumption A.1 are mild in the following sense:
1. The sample space X is often a subset of an n-dimensional Euclidean space. Then X is compact if and only if X is bounded and closed, which holds for extensive datasets of various types, including images, videos, audio, and text.
2. In this paper, E_f is defined by composing a neural network with a continuous real-valued function. Then E_f is continuous and bounded on X if the neural network has bounded weights and uses continuous activation functions, including ReLU (Nair & Hinton, 2010), Tanh, and Sigmoid.
3. ∫_X e^{−E_f(x)} dx < ∞ then holds by the choice of the sample space and the energy function.

We first prove Theorem 3.1 as follows.

Proof. Ignoring constants irrelevant to the optimization, we rewrite the optimization problem of our approach with the KL divergence as follows:

min_{p_g} ∫_X [−ln(p_g(x)) p_d(x) + λE_f(x) p_g(x)] dx,   (23)
subject to ∫_X p_g(x) dx = 1 and p_g(x) ≥ 0 for all x ∈ X.

For clarity, we denote the functional to be optimized in problem Eq. (23) as J(p_g). According to Assumption 3.1 that X is a nonempty compact set, the feasible region characterized by the constraints (i.e., P_X) is compact in the topology of weak convergence by Prokhorov's theorem (see Corollary 6.8 in (Van Gaans, 2003)). By Theorem 3.6 in (Ajjanagadde et al., 2017), the KL divergence is lower semi-continuous in the topology of weak convergence. According to Assumption 3.2 that E_f is continuous and bounded on X, the regularization term is continuous in the topology of weak convergence. Therefore, by the extreme value theorem, the global minimum of J(p_g) exists in the feasible region.

Note that the optimization problem Eq. (23) is convex. To obtain a necessary condition for the global minima, we form the Lagrangian with the equality constraint:

L(p_g) := J(p_g) + α (∫_X p_g(x) dx − 1),

where α ∈ R. Note that for simplicity we do not include the inequality constraint, which will be verified shortly. It is easy to check the constraint qualifications for problem Eq.
(23). By the calculus of variations, a necessary condition for a global minimum of problem Eq. (23) is

δL/δp_g = −p_d(x)/p_g(x) + λE_f(x) + α = 0,

which implies that

p*_g(x) = p_d(x)/(α + λE_f(x)).

For pre-trained CLIP and FaceNet, we also employ their last layers as feature extractors.

Hyperparameters. Some parameters are shown in Tab. 3. The weighting hyperparameter λ controls the strength of our regularization term. We choose λ by grid search over [50, 20, 10, 5, 4, 2, 1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005] for FFHQ and LSUN CAT, and over [1, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001, 0.000005] for CIFAR-10, according to FID, following prior work (Karras et al., 2020a). The other parameters remain the same as in ADA (Karras et al., 2020a) and APA (Jiang et al., 2021). For pre-trained CLIP and FaceNet, the adopted λ is shown in Tab. 4. In addition, we simply set λ to 0.1 for Reg-ADA trained on 100-shot Obama.

Computing amount. A single experiment can be completed on 8 NVIDIA 2080Ti GPUs. One run of our method with ADA takes 1 day 6 hours 19 minutes on FFHQ or LSUN CAT and 2 days 17 hours 10 minutes on CIFAR-10.

C.1 STANDARD DEVIATION ON FFHQ AND LSUN CAT

As shown in Tab. 5, we also provide the mean FID and standard deviation on FFHQ and LSUN CAT. Reg-DGM can reduce the mean FID significantly (compared to the measurement variance) and achieve a similar if not smaller standard deviation.

