TRANSFERRING PRETRAINED DIFFUSION PROBABILISTIC MODELS

Abstract

Diffusion Probabilistic Models (DPMs) have recently achieved impressive performance in visual generative tasks. However, the success of DPMs heavily relies on large amounts of data and optimization steps, which limits their application to small datasets and limited computational resources. In this paper, we investigate transfer learning in DPMs to leverage DPMs pretrained on large-scale datasets for generation with limited data. First, we show that previous strategies, such as training from scratch or determining the transferable parts, are not suitable for DPMs due to their U-Net based denoising architecture with an external denoising-timestep input. To address this, we present a condition-based tuning approach that takes full advantage of existing pretrained models. Concretely, we obtain semantic embeddings of condition images from the pretrained CLIP model, and then inject this semantic information into the pretrained DPM via an "Attention-NonLinear" (ANL) module. Adaptation to a new task can be achieved by tuning only the ANL modules inserted hierarchically into the pretrained DPM. To further enhance the diversity of generated images, we introduce a masked sampling strategy based on the condition mechanism. Extensive experiments validate the effectiveness and efficiency of the proposed tuning approach in generative task transfer and in data augmentation for semi-supervised learning.

1. INTRODUCTION

Recently, diffusion probabilistic models (DPMs) (Ho et al., 2020; Song et al., 2020b) have demonstrated increasing power in generating complex and high-quality images by introducing a hierarchy of denoising autoencoders. DPMs adopt a diffusion process that gradually injects noise into the data distribution, and then learn the reverse process to generate images with a Markov model. However, the success of diffusion models heavily depends on huge data and computation costs. For instance, training DPMs often requires large-scale datasets (e.g., ImageNet (Krizhevsky et al., 2012)), and can take hundreds of GPU days in extreme cases (e.g., 150-1000 V100 GPU days in (Dhariwal & Nichol, 2021)). Considering that it is usually hard to collect adequate training data for a specific domain (Long et al., 2015), we propose to leverage released powerful DPMs pretrained on large-scale datasets to facilitate downstream generative tasks. This transfer learning paradigm has been thoroughly investigated for discriminative tasks in many fields, such as computer vision (Bengio, 2012; Yosinski et al., 2014; Zamir et al., 2018) and natural language processing (Devlin et al., 2018; Mozafari et al., 2019; Peng et al., 2019). In computer vision, Yosinski et al. (2014) observed that the representations learned by deep convolutional neural networks transition from general (e.g., Gabor filters and color blobs) to specific, which motivates the "pretraining and fine-tuning" learning paradigm. In natural language processing, reusing the pretrained BERT model achieves impressive performance on multiple downstream tasks (Houlsby et al., 2019). However, little effort has been devoted to investigating transfer learning in generative tasks, especially for DPMs. A recent work (Zhao et al., 2020) explores the transferability of generative adversarial networks (GANs).
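As background, a DPM defines a fixed forward process that gradually corrupts the data with Gaussian noise, and learns a reverse Markov process that denoises step by step; in the standard DDPM formulation (Ho et al., 2020), these read

```latex
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\; \sqrt{1-\beta_t}\, x_{t-1},\; \beta_t \mathbf{I}\right),
\qquad
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\right),
```

where $\beta_t$ is the noise schedule. Note that the denoising network is conditioned on the timestep $t$ at every layer; it is precisely this coupling that makes it hard to localize "general" versus "task-specific" layers for transfer.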
Motivated by the insights from (Yosinski et al., 2014), they propose to preserve the low-level layers that capture properties of generic patterns, and to fine-tune the high-level layers that are associated with the semantic aspects of the data. However, the insight of determining the transferable parts is not suitable for DPMs, since DPMs often adopt a U-Net based denoising architecture with an external denoising-timestep input. DPMs generate images by progressively predicting the statistics of the Gaussian noise corresponding to a specific timestep, so it is hard to say which part learns general or task-specific representations. We have also considered two other baselines: training from scratch and tuning the whole pretrained DPM ("Tuning-All"). Based on empirical observations, we find that training from scratch is infeasible when learning with limited data and optimization steps, and that Tuning-All is likely to overfit small training sets, resulting in slow convergence. In this work, we present a condition-based tuning approach to achieve fast adaptation from pretrained DPMs to new datasets. To make the adaptation procedure efficient, we propose to take maximum advantage of pretrained models. First, we leverage the vision-text pretrained model CLIP (Radford et al., 2021) to obtain embeddings of input images. It is worth noting that the CLIP embedding is highly associated with the semantic information of images. To make use of CLIP embeddings, we design an "Attention-NonLinear" (ANL) module, which injects the external condition (i.e., the CLIP embedding) into each block of the pretrained DPM. The ANL module consists of a cross-attention and a non-linear mapping module. Note that the ANL module is inserted into each block of the DPM, fusing the CLIP embedding and adjusting the DPM hierarchically. Therefore, transferring a pretrained DPM to a new dataset can be achieved by fine-tuning only the ANL modules while freezing the pretrained CLIP model and DPM.
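The ANL module described above can be sketched as follows. This is a minimal illustrative NumPy implementation, not the exact design used in the paper: the dimensions, weight initialization, and the choice of residual connections and ReLU non-linearity are assumptions for illustration; only the overall structure (cross-attention from U-Net features to CLIP condition tokens, followed by a non-linear mapping) follows the description in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ANLModule:
    """Hypothetical sketch of the "Attention-NonLinear" (ANL) block:
    cross-attention from U-Net block features to CLIP condition tokens,
    followed by a non-linear (MLP) mapping, each with a residual
    connection. Only these weights would be tuned; the pretrained DPM
    and CLIP backbones stay frozen."""

    def __init__(self, feat_dim, clip_dim, rng):
        s = 1.0 / np.sqrt(feat_dim)
        # Cross-attention projections: queries from DPM features,
        # keys/values from the CLIP condition embedding.
        self.Wq = rng.normal(0.0, s, (feat_dim, feat_dim))
        self.Wk = rng.normal(0.0, s, (clip_dim, feat_dim))
        self.Wv = rng.normal(0.0, s, (clip_dim, feat_dim))
        # Two-layer non-linear mapping (MLP with 4x expansion).
        self.W1 = rng.normal(0.0, s, (feat_dim, 4 * feat_dim))
        self.W2 = rng.normal(0.0, 1.0 / np.sqrt(4 * feat_dim), (4 * feat_dim, feat_dim))

    def __call__(self, h, c):
        # h: (N, feat_dim) flattened spatial features of one U-Net block
        # c: (M, clip_dim) CLIP condition embedding tokens
        q, k, v = h @ self.Wq, c @ self.Wk, c @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, M) attention weights
        h = h + attn @ v                                  # cross-attention + residual
        h = h + np.maximum(h @ self.W1, 0.0) @ self.W2    # non-linear mapping + residual
        return h
```

Because the module maps features of shape (N, feat_dim) back to the same shape, it can be inserted after each block of the U-Net without altering the surrounding architecture, which is what allows the pretrained DPM weights to remain frozen.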
Introducing the trainable ANL modules into pretrained DPMs enjoys the following advantages: (1) This tuning approach preserves the main structure of the pretrained DPM, yet achieves effective transfer, since the ANL modules hierarchically adjust each block of the pretrained DPM. (2) Compared to training from scratch and tuning the whole DPM, our approach introduces fewer trainable parameters and enables faster adaptation, which is beneficial for learning with limited data and training resources. (3) Based on the CLIP semantic embeddings, our approach can be extended to the language modality, e.g., simple text-to-image synthesis. In summary, our contributions are as follows: 1. We investigate transfer learning in recent DPMs, and show that previous methods such as training from scratch or determining the transferable parts are not efficient. To the best of our knowledge, ours is among the first attempts to leverage pretrained DPMs for generation with limited data and training resources.



Figure 1: Generated images by tuning the latent diffusion model (Rombach et al., 2022) pretrained on ImageNet (Krizhevsky et al., 2012) with our proposed tuning approach (batch size=32). We illustrate the generated images from iteration 500 to 4000 at intervals of 500 on three datasets: CelebA (Liu et al., 2015), Flowers (Nilsback & Zisserman, 2008) and StanfordCars (Krause et al., 2013).

