DEEP GENERATIVE MODELING ON LIMITED DATA WITH REGULARIZATION BY NONTRANSFERABLE PRE-TRAINED MODELS

Abstract

Deep generative models (DGMs) are data-hungry: learning a complex model on limited data suffers from large variance and easily overfits. Inspired by the classical perspective of the bias-variance tradeoff, we propose the regularized deep generative model (Reg-DGM), which leverages a nontransferable pre-trained model to reduce the variance of generative modeling with limited data. Formally, Reg-DGM optimizes a weighted sum of a certain divergence and the expectation of an energy function, where the divergence is between the data and the model distributions, and the energy function is defined by the pre-trained model w.r.t. the model distribution. We analyze a simple yet representative Gaussian-fitting case to demonstrate how the weighting hyperparameter trades off the bias and the variance. Theoretically, we characterize the existence and the uniqueness of the global minimum of Reg-DGM in a non-parametric setting and prove its convergence with neural networks trained by gradient-based methods. Empirically, with various pre-trained feature extractors and a data-dependent energy function, Reg-DGM consistently improves the generation performance of strong DGMs with limited data and achieves results competitive with the state-of-the-art methods. Our implementation is available at https://github.com/ML

1. INTRODUCTION

Deep generative models (DGMs) (Kingma & Welling, 2013; Goodfellow et al., 2014; Sohl-Dickstein et al., 2015; Van den Oord et al., 2016; Dinh et al., 2016; Hinton & Salakhutdinov, 2006) employ neural networks to capture the underlying distribution of high-dimensional data and find applications in various learning tasks (Kingma et al., 2014; Zhu et al., 2017; Razavi et al., 2019; Ramesh et al., 2021; 2022; Ho et al., 2022). Such models are often data-hungry (Li et al., 2021; Wang et al., 2018) due to their complex function classes. Recent work (Karras et al., 2020a) found that classical variants of generative adversarial networks (GANs) (Goodfellow et al., 2014; Karras et al., 2020b) produce poor samples with limited data, a problem shared by other DGMs in principle. Thus, improving sample efficiency is a common challenge for DGMs. The root cause of the problem is that learning a model from a complex class on limited data suffers from large variance and easily overfits the training data (Mohri et al., 2018). To mitigate the problem, previous work either employed sophisticated data augmentation strategies (Zhao et al., 2020a; Karras et al., 2020a; Jiang et al., 2021), designed new losses for the discriminator in GANs (Cui et al., 2021; Yang et al., 2021), or transferred a pre-trained DGM (Wang et al., 2018; Noguchi & Harada, 2019; Mo et al., 2020). Although, to our knowledge, this has not been pointed out in the literature, prior work can be understood as implicitly reducing the variance of the estimate (Mohri et al., 2018).

In this perspective, we propose a complementary framework, named regularized deep generative model (Reg-DGM), which employs a pre-trained model as a regularizer to achieve a better bias-variance tradeoff when training a DGM with limited data. In Sec. 2, we formulate the objective function of Reg-DGM as the sum of a certain divergence and a regularization term weighted by a hyperparameter.
The divergence is between the data distribution and the model distribution, and the regularization term can be understood as the negative expected log-likelihood of an energy-based model, whose energy function is defined by a pre-trained model, w.r.t. the model distribution. Intuitively, with an appropriate weighting hyperparameter, Reg-DGM balances between the data distribution and the pre-trained model to achieve a better bias-variance tradeoff with limited data, as validated by a simple yet prototypical Gaussian-fitting example.

In Sec. 3, we characterize the optimization behavior of Reg-DGM in both non-parametric and parametric settings under mild regularity conditions. On the one hand, we prove the existence and the uniqueness of the global minimum of the regularized optimization problem with the Kullback-Leibler (KL) and Jensen-Shannon (JS) divergences in the non-parametric setting. On the other hand, we prove that, parameterized by a standard neural network architecture, Reg-DGM trained by gradient-based methods converges with high probability to a global (or local) optimum.

In Sec. 4, we specify the components of Reg-DGM. We employ strong variants of GANs (Karras et al., 2020b;a; Jiang et al., 2021) as the base DGM for broader interest. We consider a nontransferable setting where the pre-trained model does not have to be a generative model. Indeed, we employ several feature extractors trained for non-generative tasks such as image classification, representation learning, and face recognition. Notably, such models cannot be used directly in fine-tuning approaches (Wang et al., 2018; Mo et al., 2020). Following a comprehensive ablation study, we define our default energy function as the expected mean squared error between the features of the generated samples and those of the training data. Such an energy not only fits our theoretical analyses but also yields consistent and significant improvements over the baselines in all settings.

In Sec. 5, we present experiments on several benchmarks, including the FFHQ (Karras et al., 2019), LSUN CAT (Yu et al., 2015), and CIFAR-10 (Krizhevsky et al., 2009) datasets. We compare Reg-DGM with a large family of methods, including the base DGMs (Karras et al., 2020b;a; Jiang et al., 2021), transfer-based approaches (Wang et al., 2018; Mo et al., 2020), augmentation-based methods (Zhao et al., 2020a), and others (Cui et al., 2021; Yang et al., 2021). With a classifier pre-trained on ImageNet (Deng et al., 2009), an image encoder pre-trained on the CLIP dataset (Radford et al., 2021), or a face recognizer pre-trained on VGGFace2 (Cao et al., 2018), Reg-DGM consistently improves strong DGMs under commonly used performance measures with limited data and achieves results competitive with the state-of-the-art methods. These results demonstrate that Reg-DGM can achieve a good bias-variance tradeoff in practice, which supports our motivation. In Sec. 6, we present related work. In Sec. 7, we conclude the paper and discuss limitations.
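The bias-variance intuition behind Reg-DGM can be illustrated without neural networks. The following minimal NumPy sketch (our own illustration in the spirit of the Gaussian-fitting example mentioned above; the estimator, constants, and function name `regularized_mse` are not from the paper) fits a Gaussian mean from a few samples while shrinking toward a fixed reference value, playing the role of the pre-trained model. A moderate weight adds a small bias but cuts the variance, lowering the overall mean squared error:

```python
import numpy as np

def regularized_mse(lam, mu_true=0.0, sigma=1.0, mu_pre=0.5, m=5,
                    trials=20000, seed=0):
    """Monte Carlo MSE of a regularized mean estimator on m samples.

    mu_hat minimizes (mu - mean(x))^2 + lam * (mu - mu_pre)^2, i.e. it
    shrinks the sample mean toward the (biased) reference mu_pre.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(mu_true, sigma, size=(trials, m))
    mu_hat = (x.mean(axis=1) + lam * mu_pre) / (1.0 + lam)
    return float(np.mean((mu_hat - mu_true) ** 2))

if __name__ == "__main__":
    for lam in (0.0, 0.5, 2.0):
        print(f"lambda={lam}: MSE={regularized_mse(lam):.4f}")
```

With `lam = 0` the MSE is the plain variance of the sample mean (sigma^2 / m = 0.2 here); a moderate `lam` trades a squared bias of (lam * (mu_pre - mu_true) / (1 + lam))^2 for a variance shrunk by (1 + lam)^2, and too large a `lam` lets the bias dominate again.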

2. METHOD

The goal of generative modeling is to learn a model distribution $p_g$ (implicitly or explicitly) from a training set $S = \{x_i\}_{i=1}^m$ of size $m$ on a sample space $\mathcal{X}$. The elements of $S$ are assumed to be drawn i.i.d. from an unknown data distribution $p_d \in \mathcal{P}_\mathcal{X}$, where $\mathcal{P}_\mathcal{X}$ is the set of all valid distributions over $\mathcal{X}$. A general formulation for learning generative models is to minimize a certain statistical divergence $D(\cdot \| \cdot)$ between the two distributions:

$$\min_{p_g \in \mathcal{H}} D(p_d \| p_g), \quad (1)$$

where $\mathcal{H} \subset \mathcal{P}_\mathcal{X}$ is the hypothesis class, for instance, a set of distributions defined by neural networks in a deep generative model (DGM). Notably, the divergence in Eq. (1) is estimated by the Monte Carlo method over the training set $S$, and its solution has a small bias, if not zero. However, learning a DGM with limited data is challenging because solving the problem in Eq. (1) with a small sample size $m$ essentially suffers from large variance and likely overfits (Mohri et al., 2018). Inspired by the classical perspective of the bias-variance tradeoff, we propose to leverage an external model $f$ pre-trained on a related, large dataset (e.g., ImageNet) as a data-dependent regularization to reduce the variance of training a DGM on limited data (e.g., CIFAR-10), which is
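To make the regularized objective concrete, the sketch below (a hypothetical illustration; `feature_mse_energy`, `reg_dgm_loss`, and the array shapes are our own names, not the paper's implementation) assembles the weighted sum described above: a divergence estimate plus a weight `lam` times the expected energy over generated samples, with the energy instantiated as the mean squared error between the pre-trained features of a generated sample and those of the training data:

```python
import numpy as np

def feature_mse_energy(feat_gen, feats_train):
    """Energy of one generated sample: mean squared distance between its
    pre-trained features (shape (d,)) and the training-set features
    (shape (m, d)). Lower energy = closer to the training data in
    feature space."""
    return float(np.mean((feats_train - feat_gen) ** 2))

def reg_dgm_loss(div_value, feats_gen, feats_train, lam):
    """Regularized objective: divergence estimate plus lam times the
    expected energy over a batch of generated-sample features
    (shape (n, d))."""
    energies = [feature_mse_energy(g, feats_train) for g in feats_gen]
    return div_value + lam * float(np.mean(energies))
```

Setting `lam = 0` recovers the unregularized objective of Eq. (1); increasing `lam` pulls the generator toward samples whose pre-trained features resemble those of the training set, which is where the pre-trained model's inductive bias enters.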
