IMPROVING ADVERSARIAL ROBUSTNESS BY CONTRASTIVE GUIDED DIFFUSION PROCESS

Anonymous

Abstract

Synthetic data generation has become an emerging tool for improving adversarial robustness in classification tasks, since robust learning requires a significantly larger amount of training samples than standard classification. Among various deep generative models, the diffusion model has been shown to produce high-quality synthetic images and has achieved good performance in improving adversarial robustness. However, diffusion-type methods are typically slower in data generation than other generative models. Although different acceleration techniques have been proposed recently, it is also of great importance to study how to improve the sample efficiency of generated data for the downstream task. In this paper, we first analyze the optimality condition of the synthetic distribution for achieving non-trivial robust accuracy. We show that enhancing the distinguishability among the generated data is critical for improving adversarial robustness. Thus, we propose the Contrastive-Guided Diffusion Process (Contrastive-DP), which adopts the contrastive loss to guide the diffusion model in data generation. We verify our theoretical results on simulations and demonstrate the good performance of Contrastive-DP on image datasets.

1. INTRODUCTION

The success of most deep learning methods relies heavily on a massive amount of training data, which can be expensive to acquire in practice. For example, in autonomous driving (O'Kelly et al., 2018) and medical diagnosis (Das et al., 2022) applications, the number of rare scenes is usually very limited in real data. Moreover, it may be expensive to label the data in supervised learning. These challenges call for methods that can produce additional training data satisfying two essential properties: (i) the additional data should help improve the downstream task performance; (ii) the additional data should be easy to generate. Synthetic data generation based on deep generative models has recently shown promising performance in tackling these challenges (Sehwag et al., 2022; Gowal et al., 2021; Das et al., 2022). In synthetic data generation, one aims to learn a synthetic distribution (from which we generate synthetic data) that is close to the true data-generating distribution based on the available training data and, most importantly, can help improve the downstream task performance. Synthetic data generation is thus closely related to generative models. Among the various kinds of generative models, score-based and diffusion-type models have achieved great success in image generation recently (Song & Ermon, 2019; Song et al., 2021b; 2020; Song & Ermon, 2020; Sohl-Dickstein et al., 2015; Nichol & Dhariwal, 2021; Bao et al., 2022; Rombach et al., 2022). As validated on image datasets, the prototype of diffusion models, the Denoising Diffusion Probabilistic Model (DDPM) (Ho et al., 2020), and its many variants can generate high-quality image data compared with classical generative models such as GANs (Dhariwal & Nichol, 2021). This paper mainly focuses on adversarially robust classification of image data, which typically requires more training data than standard classification tasks. In Gowal et al.
(2021), 100M high-quality synthetic images are generated by DDPM and achieve state-of-the-art adversarial robustness on the CIFAR-10 dataset, which demonstrates the effectiveness of diffusion models in improving adversarial robustness. However, a major drawback of diffusion-type methods is their slow sampling speed. More specifically, DDPM is usually 1000 times slower than GANs (Song et al., 2021a), and this drawback becomes more serious when generating a large number of samples; e.g., it takes more than 99 GPU days to generate 100M image data according to Gowal et al. (2021). Moreover, the computational cost also increases dramatically with the image resolution, which has inspired a number of works studying how to accelerate diffusion models (Song et al., 2021a; Watson et al., 2022; Ma et al., 2022; Salimans & Ho, 2022; Bao et al., 2022; Cao et al., 2022; Yang et al., 2022). In this paper, we aim to study the aforementioned problem from a different perspective: "how to generate effective synthetic data that are most helpful for the downstream task?" We analyze the optimal synthetic distribution for the downstream task to improve the sample efficiency of the generative model. We first study the theoretical insights for finding the optimal synthetic distributions for achieving adversarial robustness. Following the setting considered in Carmon et al. (2019), we introduce a family of synthetic distributions controlled by the distinguishability of the representations from different classes. Our theoretical results show that the more distinguishable the representations of the synthetic data are, the higher the classification accuracy we obtain when training a model on such synthetic datasets.
Motivated by the theoretical insights, we propose the Contrastive-Guided Diffusion Process (Contrastive-DP) for efficient synthetic data generation, incorporating the contrastive learning loss (van den Oord et al., 2018; Chuang et al., 2020; Robinson et al., 2021) into the diffusion process. We conduct comprehensive simulations and experiments on real image datasets to demonstrate the effectiveness of the proposed Contrastive-DP. The remainder of the paper is organized as follows. Section 2 presents the problem formulation and preliminaries on diffusion models. Section 3 contains the theoretical insights of optimal synthetic distribution under the Gaussian setting. Section 4 proposes a new type of data generation procedure that combines contrastive learning with diffusion models, as motivated by the theoretical insights obtained in Section 3. Finally, Section 5 conducts extensive numerical experiments to validate the good performance of the proposed generation method on simulation and image datasets.

2. PROBLEM FORMULATION AND PRELIMINARIES

We first give a brief overview of adversarial robust classification, which is our main focus; the framework is broadly applicable to other downstream tasks. Denote the feature space as X, the corresponding label space as Y, and the true (joint) data distribution as D = D_{X×Y}. Assume we have labeled training data D_train := {(x_i, y_i)}_{i=1}^n. We aim to learn a robust classifier f_θ : X → Y, parameterized by a learnable θ, that achieves the minimum adversarial loss:

min_θ L_adv(θ) := E_{(x,y)∼D} [ max_{δ∈Δ} ℓ(x + δ, y, θ) ],   (1)

where ℓ(x, y, θ) = 1{y ≠ f_θ(x)} is the 0-1 loss function, 1{·} is the indicator function, and Δ = {δ : ‖δ‖_∞ ≤ ε} is the adversarial set defined by the ℓ_∞-norm. Intuitively, the solution to (1) is a robust classifier that minimizes the worst-case loss within an ε-neighborhood of the input features. In the canonical form of adversarial training, we train the robust classifier f_θ on the training set D_train := {(x_i, y_i)}_{i=1}^n by solving the following sample average approximation of (1):

min_θ L̂_adv(θ) := (1/n) Σ_{i=1}^n max_{δ_i∈Δ} ℓ(x_i + δ_i, y_i, θ).   (2)
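For intuition, the inner maximization in (1)-(2) has a closed form for the linear classifiers analyzed later in Section 3: an ℓ_∞-bounded adversary can reduce the margin y·θᵀx by at most ε‖θ‖₁. A minimal numpy sketch under that assumption (the function name `robust_01_loss` is ours):

```python
import numpy as np

def robust_01_loss(theta, X, y, eps):
    """Worst-case 0-1 loss of sign(theta^T x) over an l_inf ball of radius eps.

    For a linear classifier, the inner max in the adversarial objective has a
    closed form: the adversary shifts the margin y * theta^T x down by
    eps * ||theta||_1, so a sample is robustly correct iff the shifted
    margin stays positive.
    """
    margins = y * (X @ theta) - eps * np.sum(np.abs(theta))
    return float(np.mean(margins <= 0))
```

With eps = 0 this recovers the standard 0-1 error, so the same helper evaluates both clean and robust accuracy.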

2.1. ADVERSARIAL TRAINING USING SYNTHETIC DATA

Synthetic data generation is one way to artificially increase the size of the training set by generating a sufficient amount of additional data, thus helping improve the learning algorithm's performance (Gowal et al., 2021). Mainstream generation procedures can be categorized into two types: (i) generate the features x first and then assign pseudo labels to the generated features; or (ii) perform conditional generation conditioned on the desired label. Our analysis is mainly based on the former paradigm, which can be easily generalized to the conditional generation procedure, and our proposed algorithm is flexible enough for both pipelines. Denote the distribution of the generated features as D̃_X and the generated synthetic data as D_syn := {(x̃_i, ỹ_i)}_{i=1}^ñ. Here the feature values x̃_i are generated from the synthetic distribution D̃_X, and the ỹ_i are pseudo labels assigned by a classifier learned on the training data D_train. Combining the synthetic and real data, we learn the robust classifier using the larger training set D_all := D_train ∪ D_syn, which now contains n + ñ samples:

min_θ  η · (1/n) Σ_{i=1}^n max_{δ_i∈Δ} ℓ(x_i + δ_i, y_i, θ) + (1 − η) · (1/ñ) Σ_{i=1}^ñ max_{δ_i∈Δ} ℓ(x̃_i + δ_i, ỹ_i, θ),

where η ∈ (0, 1) is a parameter balancing the weights of the real and synthetic data.
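The two-stage recipe above (pseudo-label the generated features, then reweight the two empirical losses) can be sketched as follows; `pseudo_label` and `mixed_objective` are our illustrative names, and the per-sample adversarial losses are assumed to be computed elsewhere:

```python
import numpy as np

def pseudo_label(theta, X_syn):
    """Assign pseudo labels to generated features using a (here linear)
    classifier learned on the real training data, as in paradigm (i)."""
    return np.sign(X_syn @ theta)

def mixed_objective(real_losses, syn_losses, eta):
    """Weighted combination of per-sample adversarial losses on D_train
    and D_syn; eta in (0, 1) weights the real-data term."""
    return eta * float(np.mean(real_losses)) + (1.0 - eta) * float(np.mean(syn_losses))
```

This is only the objective bookkeeping; in practice the per-sample losses come from an inner adversarial attack step.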

2.2. DIFFUSION MODEL FOR SYNTHETIC DATA GENERATION

We build our proposed generation procedure on the Denoising Diffusion Probabilistic Model (DDPM) (Ho et al., 2020) and its accelerated variant, the Denoising Diffusion Implicit Model (DDIM) (Song et al., 2021a). In the following, we briefly review the key components of DDPM. The core of DDPM is a forward Markov chain with Gaussian transitions q(x_t | x_{t−1}) that inject noise into the original data distribution q(x_0). More specifically, Ho et al. (2020) model the forward Gaussian transition as

q(x_t | x_{t−1}) := N(√α_t x_{t−1}, (1 − α_t) I),

where α_t, t = 1, 2, ..., T, is a decreasing sequence controlling the variance of the injected noise, and I is the identity covariance matrix. The joint likelihood of the above Markov chain can be written as q(x_{0:T}) = q(x_0) ∏_{t=1}^T q(x_t | x_{t−1}). DDPM then posits a reverse process p_θ(x_{0:T}) = p_θ(x_T) ∏_{t=1}^T p_θ(x_{t−1} | x_t), where p_θ(x_{t−1} | x_t) is parameterized by a neural network. The training objective is to minimize the Kullback-Leibler (KL) divergence between the forward and reverse processes, D_KL(q(x_{0:T}), p_θ(x_{0:T})), which can be simplified to

min_θ E_{t, x_0, ε} ‖ε − ε_θ(√ᾱ_t x_0 + √(1 − ᾱ_t) ε, t)‖²,

where x_0 ∼ q(x_0), ᾱ_t = ∏_{s=1}^t α_s for t = 1, ..., T, ε ∼ N(0, I), and ε_θ(x, t) denotes the neural network parameterized by θ to be learned. We refer to Ho et al. (2020) for the detailed algorithms. After learning the time-reversed process parameterized by θ, the original generation process in Ho et al. (2020) is a time-reversed Markov chain:

x_{t−1} = (1/√α_t) ( x_t − ((1 − α_t)/√(1 − ᾱ_t)) ε_θ(x_t, t) ) + σ_t z_t,   t = T, T−1, ..., 1,

where z_t ∼ N(0, I) if t > 1 and z_t = 0 if t = 1. DDIM (Song et al., 2021a) speeds up the above procedure by generalizing the diffusion process to a non-Markovian process, leading to a sampling trajectory much shorter than T.
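The forward noising step and the simplified DDPM objective above can be sketched in a few lines of numpy; `forward_noise` and `ddpm_loss` are our names, and the noise prediction ε_θ is assumed to come from a trained network:

```python
import numpy as np

def forward_noise(x0, alpha_bar_t, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) I),
    returning x_t together with the injected noise eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return xt, eps

def ddpm_loss(eps_pred, eps):
    """Simplified DDPM training objective: MSE between the injected noise
    and the network's prediction eps_theta(x_t, t)."""
    return float(np.mean((eps_pred - eps) ** 2))
```

At ᾱ_t = 1 no noise is mixed in and x_t = x_0; as ᾱ_t → 0 the sample approaches pure Gaussian noise.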
DDIM carefully designs the forward transition q(x_{t−1} | x_t, x_0) such that q(x_t | x_0) = N(√α_t x_0, (1 − α_t) I) for all t = 1, ..., T (with a slight abuse of notation, α_t here corresponds to ᾱ_t in DDPM). A great advantage of DDIM is that it admits the same training objective as DDPM, which means we can reuse a pre-trained DDPM model and accelerate the sampling process without additional cost. The key sample-generating step in DDIM is

x_{t−1} = √α_{t−1} · ( (x_t − √(1 − α_t) ε_θ(x_t, t)) / √α_t )  [predicted x_0]  +  √(1 − α_{t−1}) · ε_θ(x_t, t)  [direction pointing to x_t],

in which we generate x_{t−1} using x_t and the predicted x_0; the generating process also becomes deterministic.
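A numpy sketch of this update (`ddim_step` is our name; following the DDIM convention above, a_t denotes the cumulative product written ᾱ_t in DDPM):

```python
import numpy as np

def ddim_step(xt, eps_pred, a_t, a_prev):
    """One deterministic DDIM update: form the predicted x_0, then add back
    the component pointing toward x_t at the previous noise level a_prev."""
    x0_pred = (xt - np.sqrt(1.0 - a_t) * eps_pred) / np.sqrt(a_t)   # predicted x0
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps_pred
```

If ε_θ recovers the exact injected noise, stepping to a_prev = 1 returns the clean sample, which matches the "predicted x_0" interpretation.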

3. THEORETICAL INSIGHTS: OPTIMAL SYNTHETIC DISTRIBUTION

In this section, we consider a concrete distributional model as used in Carmon et al. (2019); Schmidt et al. (2018), and demonstrate the advantage of refining the synthetic data generation process: using the optimal distribution for synthetic data generation can help reduce the sample complexity needed for robust classification. This provides theoretical insights and motivates the proposed generation method introduced in Section 4.

3.1. PROBLEM SETUP

We consider a binary classification task where X = R^d and Y = {−1, 1}. The true data distribution D = D_{X×Y} is specified as follows: the marginal distribution of the label y is uniform on Y, and the conditional distribution of the features is x | y ∼ N(yµ, σ²I_d), where µ ∈ R^d is non-zero and I_d is the d-dimensional identity covariance matrix. Assume we generate a set of synthetic data from another synthetic distribution D̃. We focus on learning a robust linear classifier under this setting. The family of linear classifiers is represented as f_θ(x) = sign(θᵀx). Recall that we first generate features and then assign pseudo labels to the features; therefore, a self-learning paradigm is adopted here (Wei et al., 2020): we first learn an intermediate classifier θ_intermediate = (1/n) Σ_{i=1}^n y_i x_i from the real data and use it to assign pseudo labels ỹ_i = sign(θ_intermediateᵀ x̃_i) to the generated features. We then learn on the combined data D_all := D_train ∪ D_syn = {{(x_i, y_i)}_{i=1}^n, {(x̃_i, ỹ_i)}_{i=1}^ñ} to obtain an approximate optimal solution θ̃_final as:

θ̃_final = (1/(n + ñ)) ( Σ_{i=1}^n y_i x_i + Σ_{j=1}^ñ ỹ_j x̃_j ).

Note that the final linear classifier θ̃_final depends on the synthetic data generated from D̃. We aim to study which synthetic distribution D̃ can help reduce the adversarial classification error (also called the robust error)

err_robust(f_θ̃_final) := P_{(x,y)∼D}(∃δ ∈ Δ, f_θ̃_final(x + δ) ≠ y),

where Δ = {δ : ‖δ‖_∞ ≤ ε}. We similarly define the standard error as err_standard(f_θ̃_final) := P_{(x,y)∼D}(f_θ̃_final(x) ≠ y), which will be used later. Remark 1 (Comparison with existing literature). In Carmon et al. (2019); Deng et al. (2021), sample complexity results are established under the same Gaussian mixture setting. The major difference is that they assume the final linear classifier θ̃_final is learned only from the synthetic data D_syn rather than from the combination of the real and synthetic data D_all. In this sense, our theoretical setup matches the practical algorithms more closely.
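The self-learning pipeline above can be simulated directly; the dimensions, sample sizes, and the choice of synthetic mean c·µ with c = 3 below are arbitrary illustration values, not the paper's experimental settings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_syn, sigma, c = 10, 200, 2000, 1.0, 3.0
mu = np.ones(d) / np.sqrt(d)          # true class mean, ||mu|| = 1

# Real labeled data: y uniform on {-1, 1}, x | y ~ N(y * mu, sigma^2 I)
y = rng.choice([-1, 1], size=n)
X = y[:, None] * mu + sigma * rng.standard_normal((n, d))

# Intermediate classifier learned from real data, used for pseudo-labeling
theta_int = (y[:, None] * X).mean(axis=0)

# Synthetic features from 0.5 N(c*mu, sigma^2 I) + 0.5 N(-c*mu, sigma^2 I)
s = rng.choice([-1, 1], size=n_syn)
X_syn = s[:, None] * (c * mu) + sigma * rng.standard_normal((n_syn, d))
y_syn = np.sign(X_syn @ theta_int)    # pseudo labels

# Final classifier pools real and pseudo-labeled synthetic data
labels = np.concatenate([y, y_syn])
feats = np.concatenate([X, X_syn])
theta_final = (labels[:, None] * feats).mean(axis=0)

cos = theta_final @ mu / np.linalg.norm(theta_final)
```

With a large, well-aligned synthetic mean, the direction of theta_final aligns closely with µ, previewing Case 2 of Proposition 1.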

3.2. THEORETICAL INSIGHTS FOR OPTIMAL SYNTHETIC DISTRIBUTION

We first study the desired properties of the synthetic distribution D̃ that can lead to better adversarial classification accuracy when the additional synthetic sample D_syn is used in the training stage. Carmon et al. (2019) study the case D̃ = D, i.e., additional unlabeled data from the true distribution D is available, and characterize the usefulness of those additional training data. Compared with Carmon et al. (2019), we consider general distributions D̃ that do not necessarily equal D. First note that by the Bayes rule, the optimal decision boundary for the true data distribution is given by µᵀx = 0. Therefore, we restrict our attention to synthetic data distributions that satisfy: (i) the marginal distribution of the label ỹ is also uniform on Y, the same as D; (ii) the conditional probability densities p(x̃ | ỹ = 1) and p(x̃ | ỹ = -1) of the synthetic data distribution are symmetric around the true optimal decision boundary µᵀx = 0. More specifically, we start with a special case of the synthetic data distribution D̃_X = 0.5N(μ̃, σ²I) + 0.5N(-μ̃, σ²I) (note that when μ̃ = cµ for some constant c, the above two conditions are both satisfied). In the following proposition, we present several representative scenarios of synthetic distributions in terms of how they may contribute to the downstream classification task. Figure 1 gives a pictorial demonstration of the different cases. Proposition 1. Consider a special form of synthetic distributions D̃_X = 0.5N(μ̃, σ²I) + 0.5N(-μ̃, σ²I) and assume {x̃_1, ..., x̃_ñ} are samples from D̃_X. Following the self-learning paradigm described in Section 3.1 to learn the classifier f_θ̃_final, when ñ is sufficiently large we have: Case 1: Inefficient D̃_X. When ⟨μ̃, µ⟩ = 0, the standard error err_standard(f_θ̃_final) achieves its maximum, and when ⟨μ̃, µ - ε1_d⟩ = 0, the robust error err_robust(f_θ̃_final) achieves its maximum. Case 2: When μ̃ = cµ, where c is a positive scalar, the standard error err_standard(f_θ̃_final) achieves its minimum; moreover, the larger c is, the smaller err_standard(f_θ̃_final) is. Case 3: When μ̃ = c(µ - ε1_d), where c is a positive scalar, the robust error err_robust(f_θ̃_final) achieves its minimum; moreover, the larger c is, the smaller err_robust(f_θ̃_final) is.


Simulation results. To verify the findings in Proposition 1, we conduct extensive simulation experiments for the Gaussian example with varying data dimensions, sample sizes, and positions of μ̃. In Table 1, we report the clean and robust accuracy learned on the synthetic distribution when the angle between µ and ε1_d equals 0°. More experimental results (with 30°, 60°, and 90°) can be found in Tables 7, 8, 9, and 10 in Appendix B. In most cases, the classifier learned from the synthetic distribution with μ̃ = cµ and c > 1 achieves better performance than even that learned from i.i.d. samples.
Proposition 1 and the corresponding simulation results in Table 1 show that synthetic data can help improve the classification task especially when the representations of different classes are more distinguishable under the synthetic distribution. The contrastive loss (van den Oord et al., 2018) can therefore be adopted to explicitly control the distances between the representations of different classes. Motivated by this, we propose a variant of the classical diffusion model, named the Contrastive-Guided Diffusion Process (Contrastive-DP), to enhance the sample efficiency of the generative model.

4. CONTRASTIVE-GUIDED DIFFUSION PROCESS FOR DATA GENERATION

In this section, we first present the overall algorithm of the proposed Contrastive-DP procedure in Section 4.1, and then describe the detailed design of the contrastive loss in Section 4.2.

4.1. CONTRASTIVE-GUIDED DIFFUSION PROCESS

The detailed generation procedure of Contrastive-DP is given in Algorithm 1. We highlight below the major differences between the proposed Contrastive-DP and the vanilla DDIM algorithm. In each time step t of the generation procedure, given the current value x_t^(i), we add the gradient of the contrastive loss ℓ_contra(x_t^(i), x_p^(i); τ) with respect to x_t^(i) to the original diffusion generative process, where x_p^(i) is the positive pair of x_t^(i) (explained in detail later), τ is the temperature of the softmax, and λ is the hyperparameter balancing the contrastive loss within the diffusion process. This modification encourages the generated data to be distinguishable from other data in the same batch. The construction of the contrastive loss ℓ_contra(·) is very flexible: we can adopt multiple forms of contrastive loss together with different selection strategies for positive and negative pairs, which are discussed in detail in the following.
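Concretely, the modification amounts to shifting the predicted noise by the contrastive gradient before the usual DDIM update. A numpy sketch (`guided_ddim_step` is our name; in practice the gradient of ℓ_contra would be computed by automatic differentiation, and a_t denotes the cumulative ᾱ_t as in DDIM):

```python
import numpy as np

def guided_ddim_step(xt, eps_pred, grad_contra, lam, a_t, a_prev):
    """One Contrastive-DP update: the contrastive-loss gradient is added to
    the predicted noise, and the shifted noise is plugged into the usual
    DDIM step (predicted x0 plus direction pointing to x_t)."""
    delta = lam * grad_contra + eps_pred                      # shifted noise
    x0_pred = (xt - np.sqrt(1.0 - a_t) * delta) / np.sqrt(a_t)
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * delta
```

Setting λ = 0 recovers the vanilla DDIM update, which reflects the plug-in nature of the guidance.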

4.2. CONTRASTIVE LOSS FOR DIFFUSION PROCESS

Algorithm 1 Generation in Contrastive-guided Diffusion Process (Contrastive-DP)
1: X_T = {x_T^(i)}_{i=1}^m
...
6: ∆x_t^(i) = λ · ∇_{x_t^(i)} ℓ_contra(x_t^(i), x_p^(i); τ) + ε_θ(x_t^(i), t)
7: x_{t−1}^(i) = √α_{t−1} · (x_t^(i) − √(1 − α_t) ∆x_t^(i)) / √α_t + √(1 − α_{t−1}) · ∆x_t^(i)
8: t = t − 1
9: end for
10: end while
11: return X_0 = {x_0^(i)}_{i=1}^m

Let X = {x_1, ..., x_m} be a minibatch of training data. We apply the contrastive loss in the embedding space; f(·) denotes the feature extractor that maps the input data in X onto the embedding space. In general, we adopt two forms of the contrastive loss ℓ_contra(x_t^(i), x_p^(i); τ) used in Algorithm 1. The first is the InfoNCE loss:

ℓ_InfoNCE(x_a, x_p; τ) = −log( g_τ(x_a, x_p) / Σ_{k=1}^m 1{k ≠ a} g_τ(x_a, x_k) ),

where m is the batch size, τ is the temperature of the softmax, x_a and x_p denote the anchor and the positive pair, respectively, g_τ(x, x′) = exp(f(x)ᵀf(x′)/τ), and all images in the minibatch X except the anchor x_a are treated as negative pairs. The InfoNCE loss is an unsupervised learning objective and does not explicitly distinguish representations from different classes; it implicitly regards representations from the same class as negative pairs. The second is the hard negative mining loss:

ℓ_HNM(x_a, x_p; τ) = −log( g_τ(x_a, x_p) / ( g_τ(x_a, x_p) + (m/τ⁻) ( E_{x_n∼q_β}[g_τ(x_a, x_n)] − τ⁺ E_{v∼q_β⁺}[g_τ(x_a, v)] ) ) ),

where m denotes the batch size, τ⁻ = 1 − τ⁺ denotes the probability of observing any class different from that of x_a, and q_β is an unnormalized von Mises-Fisher distribution (Jammalamadaka, 2011) with mean direction f(x) and "concentration parameter" β controlling the hardness of the negative mining; q_β and q_β⁺ can be easily approximated by Monte Carlo importance sampling techniques. We refer to Chuang et al. (2020); Robinson et al. (2021) for detailed descriptions of the hard negative mining contrastive loss.
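A minimal numpy version of the InfoNCE term for a single anchor (the function name `info_nce` and the convention that embeddings f(x) are precomputed row vectors are ours):

```python
import numpy as np

def info_nce(emb, a, p, tau):
    """InfoNCE loss for anchor index a with positive index p.

    emb: (m, d) array of embeddings f(x) for a minibatch; every sample
    except the anchor itself appears in the normalizer, matching
    -log( g(a, p) / sum_{k != a} g(a, k) ).
    """
    sims = emb @ emb[a] / tau                # f(x_a)^T f(x_k) / tau for all k
    logits = np.delete(sims, a)              # exclude the anchor from the sum
    return float(np.log(np.sum(np.exp(logits))) - sims[p])
```

With orthonormal embeddings and one positive among two negatives, the loss reduces to log of the normalizer size, which is a quick sanity check on the formula.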
Compared with the InfoNCE loss, which does not consider class/label information, the hard negative mining (HNM) loss enhances the discriminative ability across different classes in the feature space. It is worth mentioning that Contrastive-DP enjoys a plug-in property: it does not modify the original training procedure of diffusion processes and can be easily adapted to various kinds of diffusion models. Numerical validations. We first demonstrate the effectiveness of Contrastive-DP in Figure 2.

5. REAL-WORLD IMAGE DATASETS

In this section, we demonstrate the effectiveness of the proposed contrastive-guided diffusion process for synthetic data generation in adversarial classification tasks. We first compare the performance of Contrastive-DP with the vanilla DDIM method in Section 5.1. Then, we present a comprehensive ablation study on the performance of Contrastive-DP in Section 5.2 to shed light on how to adopt contrastive loss functions in the diffusion model, especially on which contrastive loss gives the best performance in the diffusion process and how to choose the hyperparameter λ that controls the strength of the guidance of the contrastive loss.

5.1. EXPERIMENTAL RESULTS

We test the Contrastive-DP algorithm on two image datasets: the CIFAR-10 dataset (Krizhevsky, 2009) and the Traffic Signs dataset (Houben et al., 2013). The CIFAR-10 dataset contains 50K training images in 10 classes and 10K images for testing, while the Traffic Signs dataset contains 39,252 training images in 43 classes and 12,629 images for testing. For the CIFAR-10 dataset, we generate 50K, 200K, and 1M additional images used together with the original training images for adversarial training, while for the Traffic Signs dataset, we synthesize 50K images. To demonstrate that our Contrastive-DP algorithm is flexible enough to be adapted to various kinds of diffusion models and to make use of existing pretrained models, we build Contrastive-DP on the unconditional DDIM for the CIFAR-10 dataset and on the conditional DDPM for the Traffic Signs dataset. A detailed description of the data generation pipeline and the corresponding hyperparameters can be found in Appendix D.2. Table 2 demonstrates the effectiveness of our Contrastive-DP algorithm on the CIFAR-10 dataset, achieving better robust accuracy than the vanilla DDIM in all data regimes. All of the results are higher than the baseline without synthetic data by a large margin (+4.37% in the 50K setting and +7.3% in the 1M setting). Table 3 demonstrates the effectiveness of our Contrastive-DP algorithm on the Traffic Signs dataset. Our Contrastive-DP achieves better clean and robust accuracy than the vanilla DDPM model and is also higher than the baseline without synthetic data by a large margin (+9.96%). Table 2: The clean and adversarial accuracy on the CIFAR-10 dataset. The robust accuracy is reported as the worst accuracy obtained by either AutoAttack (Croce & Hein, 2020) or AA+MT (Gowal et al., 2020).

5.2. ABLATION STUDIES

Sensitivity of λ. Table 4 shows the influence of the strength of the contrastive loss. λ = 100k gives consistently better robust accuracy than a smaller λ = 50k or a larger λ = 200k in all settings. Moreover, we find that a larger λ yields better clean accuracy when the amount of additional data is small (the 50K case), while a smaller λ yields better clean accuracy when the amount of additional data is large (the 1M case). The effectiveness of different contrastive losses. Table 5 compares different designs of the contrastive loss. We find that applying hard negative mining together with the embedding network achieves better clean and robust accuracy when the additional data is small (the 50K and 200K settings), while the InfoNCE loss achieves better clean and robust accuracy when the additional data is large (the 1M setting). This result shows that we can improve the sample efficiency of the generative model by carefully designing the contrastive loss. Data selection for synthetic data. Data selection methods are worth studying since, in practice, we would like to know whether we can achieve better performance by generating a large number of samples and applying some selection criterion to filter out part of them. We therefore propose several data selection criteria and evaluate their effectiveness in Table 6. All of the selection methods applied to Contrastive-DP outperform vanilla DDIM plus the same selection methods, which demonstrates the superiority of using the contrastive learning loss as guidance rather than applying selection methods to images generated by the vanilla diffusion model.

6. RELATED WORK

In a recent work (2022), it was claimed that the transferability of adversarial robustness between two data distributions is measured by the conditional Wasserstein distance, which inspires us to use it as a criterion for selecting samples.
Our work follows the same line, but we investigate how to generate samples with high information content rather than applying selection to data generated by the vanilla diffusion model. Below we also summarize some closely related work in different lines. Sample-efficient generation. We can view the sample-efficient generation problem as a bi-level optimization problem: how to synthesize data is the meta objective, and the performance of the model trained on the synthetic data is the inner objective. Theoretical analysis of adversarial robustness. In Schmidt et al. (2018), the sample complexity of adversarial robustness has been shown to be substantially larger than that of standard classification tasks in the Gaussian setting. Carmon et al. (2019) bridges this gap by using the self-training paradigm and corresponding unlabeled data. Deng et al. (2021) further extends this conclusion by leveraging out-of-domain unlabeled data. None of the works mentioned above investigates the optimal distribution for unlabeled synthetic data. Contrastive learning. Contrastive learning algorithms have been widely used for representation learning (Chen et al., 2020; He et al., 2020; Grill et al., 2020). The vanilla contrastive learning loss, InfoNCE (van den Oord et al., 2018), aims to pull positive pairs closer and push negative pairs away. To mitigate the problem that not all negative pairs may be true negatives, the hard negative mining criterion was proposed in (Chuang et al., 2020; Robinson et al., 2021).

7. CONCLUSION

In this paper, we delve into which kind of synthetic distribution is optimal for the downstream task, especially for achieving adversarial robustness in image data classification. We derive the optimality condition under the Gaussian setting and propose the Contrastive-guided Diffusion Process (Contrastive-DP), a plug-in algorithm suitable for various types of diffusion models. We verify our theorem on the Gaussian simulation and demonstrate the superiority of the Contrastive-DP algorithm on image datasets.

A PROOF OF PROPOSITION 1 IN SECTION 3

Proof of Proposition 1. We follow the proof strategy in Carmon et al. (2019). Let b_i be the indicator that the i-th pseudo-label ỹ_i assigned to x̃_i is incorrect, so that x̃_i ∼ N((1 − 2b_i) ỹ_i μ̃, σ²I). Let γ := (1/ñ) Σ_{i=1}^ñ (1 − 2b_i) ∈ [−1, 1] and α := ñ/(ñ + n). Note that the true data samples satisfy x_i ∼ N(y_i µ, σ²I); thus we may write the final estimator as

θ̃_final = (1/(n + ñ)) ( Σ_{j=1}^ñ ỹ_j x̃_j + Σ_{i=1}^n y_i x_i )
        = αγμ̃ + (1 − α)µ + (1/(n + ñ)) ( Σ_{i=1}^n y_i ε_i + Σ_{i=1}^ñ ỹ_i ε̃_i ),

where ε_i, ε̃_i ∼ N(0, σ²I) are independent of each other, and the marginal probability density p(ỹ) matches p(y). Define δ̃ := θ̃_final − αγμ̃ − (1 − α)µ; note that δ̃ ∼ N(0, σ²/(n + ñ) I). By (6), the standard error of f_θ̃_final is a non-increasing function of µᵀθ̃_final / (σ‖θ̃_final‖). Note that when ñ is large enough, the direction of θ̃_final approaches the direction of μ̃. Therefore, the statement in Case 1 holds as a consequence, and similarly for the robust error according to (7). The remaining proof of Cases 2 and 3 is based on a detailed discussion of the squared inverse of the term µᵀθ̃_final / (σ‖θ̃_final‖):

‖θ̃_final‖² / (µᵀθ̃_final)² = ‖δ̃ + αγμ̃ + (1 − α)µ‖² / ( αγ⟨µ, μ̃⟩ + µᵀδ̃ + (1 − α)‖µ‖² )².   (8)

Note that the larger the quantity in (8) is, the larger the standard error of f_θ̃_final. Case 2. Assume μ̃ = cµ.
Then (8) reduces to:

‖θ̃_final‖² / (µᵀθ̃_final)²
 = ‖δ̃ + (1 − α + cγα)µ‖² / ( (1 − α + cγα)‖µ‖² + µᵀδ̃ )²   (9)
 = 1/‖µ‖² + ( ‖δ̃ + (1 − α + cγα)µ‖² − (1/‖µ‖²) ( (1 − α + cγα)‖µ‖² + µᵀδ̃ )² ) / ( (1 − α + cγα)‖µ‖² + µᵀδ̃ )²
 = 1/‖µ‖² + ( ‖δ̃‖² − (µᵀδ̃)²/‖µ‖² ) / ( (1 − α + cγα)‖µ‖² + µᵀδ̃ )²,

which demonstrates that the larger c is, the smaller the standard error err_standard(f_θ̃_final) is, verifying the second part of Case 2. Case 3. Assume μ̃ = c(µ − ε1_d). Similar to Case 2, we rewrite the term inside the robust error function (7) as:

‖θ̃_final‖² / ( (µ − ε1_d)ᵀθ̃_final )²
 = ‖δ̃ + (1 − α + cγα)(µ − ε1_d)‖² / ( (1 − α + cγα)‖µ − ε1_d‖² + (µ − ε1_d)ᵀδ̃ )²
 = 1/‖µ − ε1_d‖² + ( ‖δ̃‖² − ( (µ − ε1_d)ᵀδ̃ )²/‖µ − ε1_d‖² ) / ( (1 − α + cγα)‖µ − ε1_d‖² + (µ − ε1_d)ᵀδ̃ )²,

which demonstrates that the larger c is, the smaller the robust error err_robust(f_θ̃_final) is, proving the second part of Case 3.
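The decomposition in (9) is a purely algebraic identity and can be checked numerically; below, `k` stands in for the scalar (1 − α + cγα), and the vectors µ and δ̃ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = rng.standard_normal(5)       # plays the role of the true mean mu
delta = rng.standard_normal(5)    # plays the role of the noise term delta~
k = 0.7                           # plays the role of (1 - alpha + c * gamma * alpha)

denom = (k * (mu @ mu) + mu @ delta) ** 2
# Left side: ||delta + k mu||^2 / (k ||mu||^2 + mu^T delta)^2
lhs = ((delta + k * mu) @ (delta + k * mu)) / denom
# Right side: 1/||mu||^2 plus the residual term from the decomposition
rhs = 1.0 / (mu @ mu) + (delta @ delta - (mu @ delta) ** 2 / (mu @ mu)) / denom
```

The residual numerator ‖δ̃‖² − (µᵀδ̃)²/‖µ‖² is nonnegative by the Cauchy-Schwarz inequality, which is what makes the quantity decrease as k (i.e., c) grows.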

B MORE SIMULATION RESULTS UNDER GAUSSIAN SETTING IN SECTION 3

In this section, we present more detailed simulation results under the Gaussian setting of Section 3 to demonstrate the different scenarios in Proposition 1. Table 7 and Table 8 show the clean and robust accuracy learned on the synthetic distribution μ̃ = cµ with different angles between µ and ε1_d. Table 10 shows the clean and robust accuracy learned on the synthetic distribution μ̃ = c(µ − ε1_d) with different angles between µ and ε1_d. Recall that µ is (one of) the optimal linear classifiers that maximize the clean accuracy under the true distribution considered in Section 3; similarly, µ − ε1_d is the optimal solution for the robust accuracy. Therefore, different angles between µ and ε1_d represent different trade-offs between clean and robust accuracy. For example, when the angle between µ and ε1_d is 0 degrees, i.e., µ = c1_d, the optimal solutions for clean accuracy and robust accuracy coincide. In most cases, the classifier learned from the most separable synthetic distribution achieves better performance than even that learned from i.i.d. samples, which verifies Proposition 1.



Footnotes. (1) Running on a cluster of 4 RTX 2080Ti GPUs. (2) Since the PyTorch implementation of Gowal et al. (2021) is not open source, we use the best unofficial implementation (https://github.com/VSehwag/minimal-diffusion) to re-run all the experiments for a fair comparison. We refer to Appendix D.2 for a detailed explanation of the quadratic version. For the tables in the ablation studies subsection, we use a batch size of 256.



classification task where X = R^d and Y = {−1, 1}. The true data distribution D = D_{X×Y} is specified as follows. The marginal distribution of the label y is uniform on Y, and the conditional distribution of the features is x|y ∼ N(yµ, σ²I_d), where µ ∈ R^d is non-zero and I_d is the d-dimensional identity matrix. Assume we generate a set of synthetic data from another, synthetic distribution D̃.

Published as a conference paper at ICLR 2022

Lemma 2. For any D̃ in the class of optimal distributions D*, the smaller the variance of D̃ is, the smaller the sample complexity of D̃ is.

Proof of Lemma 2. It is a straightforward conclusion derived from equation 6 and Section 3.1. We can further extend beyond the Gaussian setting, e.g., to other heavy-tailed distributions.

Based on the previous lemmas, below we summarize the performance guarantees for the classification errors under several representative scenarios of synthetic distributions.

Lemma 3. Consider a special form of synthetic distributions D̃_X = 0.5N(μ̃, σ²I) + 0.5N(−μ̃, σ²I) and suppose {x̃_1, x̃_2, ..., x̃_ñ} are samples from D̃_X. We generate pseudo-labels ỹ_i = sign(θ̂⊤_intermediate x̃_i), i = 1, ..., ñ, using the intermediate classifier θ̂_intermediate = (1/n) Σ_{i=1}^n y_i x_i learned from the real data. Then, we learn θ̂_final on D̃_ñ = {(x̃_1, ỹ_1), ..., (x̃_ñ, ỹ_ñ)} by θ̂_final = (1/ñ) Σ_{i=1}^ñ ỹ_i x̃_i. We have:

1. When ⟨μ̃, µ⟩ = 0, the standard error err_standard(f_θ̂final) achieves the maximum, and when ⟨μ̃, µ − ε1_d⟩ = 0, the robust error err_robust(f_θ̂final) achieves the maximum.
2. When μ̃ = cµ, where c is a positive scalar, the standard error err_standard(f_θ̂final) achieves the minimum, and the bigger c is, the smaller err_standard(f_θ̂final) is.
3. When μ̃ = c(µ − ε1_d), where c is a positive scalar, the robust error err^{∞,ε}_robust(f_θ̂final) achieves the minimum, and the bigger c is, the smaller err^{∞,ε}_robust(f_θ̂final) is.



Figure 1: Demonstration of Proposition 1.

Case 2: Optimal D̃_X for clean accuracy. When μ̃ = cµ for c > 0, err_standard(f_θ̂final) achieves the minimum, and the bigger c is, the smaller err_standard(f_θ̂final) is.

Case 3: Optimal D̃_X for robust accuracy. When μ̃ = c(µ − ε1_d) for c > 0, the robust error err_robust(f_θ̂final) achieves the minimum, and the bigger c is, the smaller err_robust(f_θ̂final) is.

Remark 2 (Comparison with the existing characterization of the synthetic distribution). We briefly comment on the main differences and similarities with Deng et al. (2021), in which a similar result was presented in Theorem 4. In Deng et al. (2021), the optimal solution θ* was given for minimizing the robust error err_robust(f_θ̂final), and they provide a specific unlabeled distribution μ̃ = µ − ε1_d that achieves asymptotic optimality under certain conditions. In this paper, we propose a general family of optimal distributions controlled by a scalar c, which represents the distinguishability of the features. The optimal θ* proposed in Deng et al. (2021) is recovered when c → ∞. Therefore, our conclusion points out the optimality condition for the unlabeled distribution and inspires a line of work that improves the performance of θ̂final by making the features of the unlabeled distribution more distinguishable.

Table 1: Simulation results validating the findings in Proposition 1. d = 2 and d = 100 denote the dimension of x, representing the low-dimensional and high-dimensional cases, respectively. For d = 2, we set ∥µ∥² = 2 and ε = 0.5; for d = 100, we set ∥µ∥² = 4 and ε = 0.1. "Real" denotes the real data distribution and n the number of samples from the real distribution, while "c" denotes different synthetic distributions with μ̃ = cµ and ñ the number of synthetic samples. The results and the standard deviations in brackets are averaged over 50 independent trials.

using a simulation example. Consider the binary classification problem of Section 3.1, where the real data for each class are generated from a Gaussian distribution. Figure 2(a) shows the synthetic data generated by the vanilla diffusion model, which recovers the ground-truth Gaussian distribution well. When using the Contrastive-DP procedure with the HNM loss, we obtain the synthetic data shown in Figure 2(b), which is more distinguishable, with a much smaller variance.

Figure 2: An illustration of the effectiveness of the synthetic distribution guided by contrastive loss.

In addition, Figure 3 and Figure 4 in Appendix C.2 show the synthetic data distributions guided by the different kinds of contrastive loss mentioned above. It can be seen that the InfoNCE loss and the hard negative mining method cannot explicitly distinguish data within the same class and thus form a circle within each class to maximize the distance between samples, while the conditional version of the contrastive loss (given the oracle class information) makes the two classes more separable.

For data-augmentation-based methods, Ruiz et al. (2019) adopt a reinforcement-learning-based method to optimize the generator so as to maximize the training accuracy. For active-learning-based methods, Tran et al. (2019) use an auto-encoder to generate new samples based on the informative training data selected by the acquisition function. Besides, Kim et al. (2020) combine the active learning criterion with data augmentation methods: they use the gradient of the acquisition function after one-step augmentation as guidance for training the augmentation policy network.

Figure 3: A comparison of the synthetic distributions guided by different contrastive losses with initialization N(0, I). "Real data as positive pair" means using the mixture of the oracle distributions N(±1, I) as positive pairs and the data in the same batch as negative pairs, while "real data as negative pair" means using the data in the same batch as positive pairs and the mixture of the oracle distributions as negative pairs.

Figure 4: A comparison of the synthetic distributions guided by different contrastive losses with initialization N(0, 4I).

Figure 5: Comparison of Contrastive-DP with the vanilla DDIM. Images in the same position in subfigures (a) and (b) share the same initialization. With the guidance of the contrastive loss, the category of a synthetic image changes, or the synthetic image becomes more realistic (more colorful), which demonstrates the effectiveness of our Contrastive-DP algorithm.

, learned from the real data D_train, to assign pseudo-labels. The synthetic data is then D_syn = {(x̃_1, ỹ_1), ..., (x̃_ñ, ỹ_ñ)}, where ỹ_i = sign(θ̂⊤_inter x̃_i), i = 1, ..., ñ. We combine the real data and the synthetic data into D_all

with ε_∞ = 8/255 and WRN-28-10. 50k, 200k, and 1M denote the number of synthetic samples used for adversarial training.

The clean and adversarial accuracy on the Traffic Signs dataset. The results and the standard deviations in brackets are averaged over 3 independent trials.

The performance of Contrastive-DP under different λ values.

The performance of Contrastive-DP under different contrastive losses: the InfoNCE and HNM losses; w/ and w/o embedding denote with and without an embedding network.

Comparison of different data selection criteria. A detailed explanation of each selection method can be found in Appendix D.3.

Using generative models to improve adversarial robustness has attracted increasing attention recently. Gowal et al. (2021) use 100M high-quality images generated by DDPM together with the original training set to achieve state-of-the-art performance on the CIFAR-10 dataset. They propose complementarity as an important metric for measuring the efficacy of the synthetic data. In Sehwag et al. (

The clean and robust accuracy learned on the synthetic distribution μ̃ = cµ when d = 2 and the angle between µ and ε1_d is 0 degrees and 90 degrees. "Real" denotes the real data distribution and n the number of samples from the real distribution, while "c" denotes different synthetic distributions and ñ the number of synthetic samples. The results and the standard deviations in brackets are averaged over 50 independent trials.

The clean and robust accuracy learned on the synthetic distribution μ̃ = cµ when d = 2 and the angle between µ and ε1_d is 30 degrees and 60 degrees. "Real" denotes the real data distribution and n the number of samples from the real distribution, while "c" denotes different synthetic distributions and ñ the number of synthetic samples. The results and the standard deviations in brackets are averaged over 50 independent trials.

The

A APPENDIX

Here, we briefly recapitulate the closed-form formulations of the standard and robust error probabilities, as detailed in Carmon et al. (2019); Deng et al. (2021).

The standard error probability can be written as

err_standard(f_θ) = Q( µ⊤θ / (σ∥θ∥₂) ),

where Q(t) = P(N(0, 1) > t) is the Gaussian error function and is non-increasing. Clearly, the standard error probability is minimized when θ/∥θ∥ = µ/∥µ∥, i.e., θ = cµ for some scalar c > 0. We may impose ∥θ∥₂ = 1 to ensure the unique solution θ = µ/∥µ∥.

The robust error probability under the ℓ∞ adversarial set ∆ = {δ : ∥δ∥∞ ≤ ε} is

err^{∞,ε}_robust(f_θ) = Q( (µ⊤θ − ε∥θ∥₁) / (σ∥θ∥₂) ).

In the following, we use the simpler notation err_robust(f_θ) for the robust error err^{∞,ε}_robust(f_θ) without ambiguity. The closed form of the optimal θ* that minimizes the above robust error err_robust can be shown to be (Deng et al., 2021):

θ* = T_ε(µ) / ∥T_ε(µ)∥,

where T_ε(µ) is the hard-thresholding operator with (T_ε(µ))_j = sign(µ_j) · max{|µ_j| − ε, 0}. Under the mild assumption µ_j > ε for all j ∈ {1, 2, ..., d}, the optimal solution simplifies to

θ* = (µ − ε1) / ∥µ − ε1∥.

Remark 3. Note that when µ = c1 for some constant c > ε, the optimal solution θ* = (µ − ε1)/∥µ − ε1∥ for minimizing the robust error coincides with the optimal solution µ/∥µ∥ for minimizing the standard error. Otherwise, these two solutions are different, representing a trade-off between robustness and accuracy.
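These closed-form expressions translate directly into code. The sketch below, with a hypothetical µ and ε satisfying µ_j > ε, checks that µ/∥µ∥ gives the smaller standard error while (µ − ε1)/∥µ − ε1∥ gives the smaller robust error:

```python
import math
import numpy as np

def Q(t):  # Gaussian tail probability P(N(0,1) > t), non-increasing in t
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def err_standard(theta, mu, sigma=1.0):
    return Q(mu @ theta / (sigma * np.linalg.norm(theta)))

def err_robust(theta, mu, eps, sigma=1.0):
    # an l_inf adversary of radius eps shifts the margin by eps * ||theta||_1
    return Q((mu @ theta - eps * np.abs(theta).sum()) / (sigma * np.linalg.norm(theta)))

mu = np.array([2.0, 1.0, 1.5])   # hypothetical mean with mu_j > eps for all j
eps = 0.3
theta_std = mu / np.linalg.norm(mu)                 # minimizer of the standard error
theta_rob = (mu - eps) / np.linalg.norm(mu - eps)   # (mu - eps*1) / ||mu - eps*1||

assert err_standard(theta_std, mu) <= err_standard(theta_rob, mu)
assert err_robust(theta_rob, mu, eps) <= err_robust(theta_std, mu, eps)
```

Since µ here is not proportional to 1, the two minimizers differ, illustrating the robustness-accuracy trade-off of Remark 3.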

A.2 DETAILS FOR THE THEORETICAL ANALYSIS IN SECTION 3

Overall, we would like to design an appropriate synthetic distribution D̃ that helps optimize the adversarial classification accuracy in the downstream task. First note that, by Bayes' rule, the optimal decision boundary for the true distribution x|y ∼ N(yµ, σ²I) is given by µ⊤x = 0. Therefore, we restrict our attention to synthetic data distributions that satisfy the following two conditions:

1. The marginal probability density p(ỹ) of the synthetic distribution matches p(y) of the real data distribution well.
2. The conditional probability densities p(x̃|ỹ = 1) and p(x̃|ỹ = −1) of the synthetic data distribution are symmetric around the true optimal decision boundary µ⊤x = 0.

More specifically, we consider a special case of the synthetic data distribution D̃_X = 0.5N(μ̃, σ²I) + 0.5N(−μ̃, σ²I).

C THE DETAILED CONSTRUCTION OF THE CONTRASTIVE LOSS

In this section, we first give a detailed description of several possible ways to design the contrastive loss, especially in constructing positive and negative pairs. Then, we visualize the synthetic data distributions generated under different contrastive losses.

C.1 POSITIVE AND NEGATIVE PAIR SELECTION STRATEGY

In this subsection, we list several possible ways to construct positive and negative pairs.

1. Vanilla version: Using all the samples in the minibatch is the common strategy in contrastive learning. In the diffusion process, since at each time step t we want to distinguish each image from the other images in the minibatch at the same time step, a straightforward strategy is to use all the samples in the minibatch other than x^i_t at time step t as the negative pairs. For the positive pairs, we can simply adopt x^i_{t+1} rather than an augmentation of x^i_t.
2. Real data as positive pairs: A possible improvement over the vanilla version follows from the observation that we aim to generate images similar to real data. Therefore, we can directly adopt the real data as the positive pairs.
3. Real data as negative pairs: Another improvement over the vanilla version follows from the observation that the other images in the minibatch at time step t are not of as high quality as the real data. Therefore, we can directly adopt the real data as the negative pairs.
4. Class-conditional version: When we use a conditional diffusion model and the class label of x_t in the minibatch is available, a further improvement is to use all the samples with a different class label y in the minibatch at time step t as the negative pairs.
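To make the vanilla version concrete, the sketch below computes an InfoNCE-style loss over a minibatch of diffusion states, with the embedding of x_{t+1} as the positive for each sample and the remaining batch members at step t as negatives. The embedding shapes and temperature are illustrative assumptions; the paper's exact loss may differ.

```python
import numpy as np

def infonce_guidance(z_t, z_pos, tau=0.1):
    """Vanilla pair selection: the positive for image i is its state x_{t+1};
    negatives are the other images in the minibatch at the same step t.

    z_t   : (B, D) embeddings of the current states x_t
    z_pos : (B, D) embeddings of the positives
    """
    z_t = z_t / np.linalg.norm(z_t, axis=1, keepdims=True)
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    pos = np.sum(z_t * z_pos, axis=1) / tau          # (B,) positive logits
    neg = z_t @ z_t.T / tau                          # (B, B) in-batch logits
    np.fill_diagonal(neg, -np.inf)                   # a sample is not its own negative
    logits = np.concatenate([pos[:, None], neg], axis=1)
    # cross-entropy with the positive in column 0 (log-sum-exp for stability)
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -log_probs[:, 0].mean()
```

In Contrastive-DP, the gradient of such a loss with respect to x_t (through the encoder), scaled by λ, would serve as the guidance term added to the diffusion update.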

C.2 VISUALIZATION OF THE SYNTHETIC DATA DISTRIBUTION GENERATED BY DIFFERENT DESIGNS OF THE CONTRASTIVE LOSS

In this subsection, we demonstrate the synthetic distributions generated by the different designs of the contrastive loss described in Section C.1, in the Gaussian setting of Section 3.1. Figure 3 shows the synthetic distributions generated using N(0, I) as initialization, while Figure 4 shows those generated using N(0, 4I) as initialization. In all figures, all of the contrastive losses except for conditional hard negative mining form a circle within each class, meaning that these algorithms cannot explicitly distinguish the data within the same class and thus maximize the distance within each class, while the guidance from conditional hard negative mining generates samples that are more distinguishable.

D THE EXPERIMENTAL RESULTS IN THE REAL-WORLD SETTING

D.1 EXPERIMENTAL SETUP FOR THE CIFAR-10 DATASET

We describe the pipeline of synthetic data generation for adversarial robustness and the corresponding setting for the CIFAR-10 dataset in this subsection.

Dataset. The CIFAR-10 dataset (Krizhevsky, 2009) contains 50K 32x32 color training images in 10 classes and 10K images for testing.

Overall training pipeline. We follow the same training pipeline as Gowal et al. (2021), i.e., synthesizing data with the diffusion model, assigning pseudo-labels to the synthetic data, and aggregating the original data and the synthetic data for adversarial training. We explain these three components as follows.

Synthetic data generation by the diffusion model. Considering the advantage of DDIM in generation speed, we build on the official implementation of the DDIM model (Song et al., 2021a) and add the guidance of the contrastive loss. We generate images with 200 steps and batch size 512, and use the quadratic version of sub-sequence selection. For the guidance of the contrastive loss, we try the different designs of the contrastive loss described in Section 4.2. We set the temperature τ = 0.1 and the strength of the contrastive guidance λ = 20k for the InfoNCE loss, while for the hard negative mining loss we set τ = 10, the strength of the contrastive guidance λ = 100k, the probability of the same class in the minibatch τ+ = 0.1, and the hardness of negative mining β = 1. These hyperparameters are chosen based on preliminary experiments on image generation; detailed ablation studies can be found in Section 5.2. Moreover, we also examine the representation used by the contrastive loss. The default setting uses the pre-trained Wide ResNet-28-10 model of Gowal et al. (2021) to obtain the representation for applying the contrastive loss, referred to as "without embedding" in Section 5.2. A further improvement is to apply a 2-layer feed-forward neural network to encode the representation after the pre-trained model, referred to as "with embedding".
The advantage of the latter design is that we can use the contrastive loss to optimize the encoding network rather than a fixed encoder.

LaNet for assigning pseudo-labels. Since DDIM is an unconditional generator, we need to assign pseudo-labels to the generated samples. We follow the same choice as Sehwag et al. (2022), i.e., using the state-of-the-art LaNet (Wang et al., 2019) network to assign pseudo-labels to the synthetic data.

Adversarial training. We follow the same setting as Gowal et al. (2021).

D.2 EXPERIMENTAL SETUP FOR THE TRAFFIC SIGNS DATASET

Synthetic data generation by the diffusion model. To utilize a pre-trained diffusion model, we use a conditional DDPM to generate samples for the Traffic Signs dataset. We adopt the hard negative mining loss with τ = 10, the strength of the contrastive guidance λ = 5k, the probability of the same class in the minibatch τ+ = 0.1, and the hardness of negative mining β = 1. We also use the pre-trained Wide ResNet-28-10 model to obtain the representation for applying the contrastive loss and a 2-layer feed-forward neural network to encode the representation after the pre-trained model.

Adversarial training. We follow the same setting as for the CIFAR-10 dataset, except that the number of training epochs is reduced to 50. We also extended training to 400 epochs but did not find significant improvement.
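The contrastive guidance used in these setups can be sketched as a modified deterministic DDIM step (η = 0), where the gradient of the contrastive loss is folded into the predicted noise, analogous to classifier guidance. The function names, the √(1 − ᾱ_t) scaling, and λ below are illustrative assumptions, not the paper's exact update rule.

```python
import numpy as np

def guided_ddim_step(x_t, t, t_prev, eps_model, guide_grad, alphas_bar, lam=1.0):
    """One deterministic DDIM step with an additive guidance term.

    eps_model(x, t)  -> predicted noise (stand-in for the trained network)
    guide_grad(x, t) -> gradient of the contrastive loss w.r.t. x (guidance)
    alphas_bar       -> cumulative products of the noise schedule, indexed by t
    """
    a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
    # fold the guidance into the predicted noise, as in classifier guidance
    eps = eps_model(x_t, t) + lam * np.sqrt(1.0 - a_t) * guide_grad(x_t, t)
    x0_pred = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)  # predicted x_0
    return np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
```

With λ = 0 (or zero guidance gradient) this reduces to the standard DDIM update, so the guidance strength λ interpolates between vanilla sampling and strongly contrastive-guided sampling.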

D.3 THE DETAILED EXPLANATION OF THE DATA SELECTION METHODS

Below we summarize the different data selection methods:

• DDIM (Separability): We use the separability of the data as the criterion to select the data generated by the vanilla DDIM. We encode each sample into the embedding space with a pre-trained WRN-28-10 model. Then, we compute the L2 distance between each sample and the centroid of each class (easily computed as the mean of all samples in that class) and add these distances together. To select a subset of samples that are most distinguishable, we choose the top K samples with the smallest distance in each class.

• Contrastive-DP (Gradient norm): We use the gradient norm with respect to a pre-trained WRN-28-10 model as the criterion to select the data generated by Contrastive-DP. The larger the gradient norm, the more informative the sample is for learning a downstream model. Therefore, we select the top K samples with the largest gradient norm in each class.

• Contrastive-DP (Gradient norm-rob): Similar to Contrastive-DP (Gradient norm), but we use the gradient norm of the robust loss rather than the standard classification loss as the criterion. We again select the top K samples with the largest gradient norm in each class.

• Contrastive-DP (Entropy): We use the entropy of each sample under LaNet as the criterion to select the data generated by Contrastive-DP. The smaller the entropy, the more likely the image is of good quality. Therefore, we select the top K samples with the smallest entropy in each class.
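Two of these criteria can be sketched as follows. The embeddings, predicted probabilities, and per-class K are placeholders, and the separability rule is implemented here as distance to the sample's own class centroid, which is one reading of the description above:

```python
import numpy as np

def select_by_separability(emb, labels, k):
    """Per class, keep the k samples closest (L2) to the class centroid."""
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        dist = np.linalg.norm(emb[idx] - emb[idx].mean(axis=0), axis=1)
        keep.extend(idx[np.argsort(dist)[:k]])
    return np.sort(np.array(keep))

def select_by_entropy(probs, labels, k):
    """Per class, keep the k samples with the smallest predictive entropy."""
    ent = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        keep.extend(idx[np.argsort(ent[idx])[:k]])
    return np.sort(np.array(keep))
```

The gradient-norm criteria follow the same per-class top-K pattern, with the per-sample score replaced by the norm of the loss gradient under the pre-trained model.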

D.4 COMPARISON OF THE IMAGES GENERATED BY CONTRASTIVE-DP WITH THE VANILLA DDIM

In this subsection, we visualize the images generated by Contrastive-DP and the vanilla DDIM on the CIFAR-10 dataset. We find that the guidance of the contrastive loss changes the category of the synthetic images or makes them more realistic (more colorful).

In addition, we visualize the t-SNE embeddings of the final classifier learned on different synthetic data. We find that, with the guidance of the contrastive loss, the final classifier learns a better representation, making the features of images from different classes more separable than those of the final classifier learned on images generated by the vanilla DDIM.

