DIFFERENTIALLY PRIVATE DATASET CONDENSATION

Abstract

Recent work in ICML'22 builds a theoretical connection between dataset condensation (DC) and differential privacy (DP) and claims that DC can provide privacy protection for free. However, the connection is problematic because of two controversial assumptions. In this paper, we revisit the ICML'22 work and elucidate the issues in the two controversial assumptions. To correctly connect DC and DP, we propose two differentially private dataset condensation (DPDC) algorithms: LDPDC and NDPDC. Through extensive evaluations, we demonstrate that LDPDC has comparable performance to recent DP generative methods despite its simplicity, and that NDPDC provides acceptable DP guarantees with a mild utility loss compared to distribution matching (DM). Additionally, NDPDC allows a flexible trade-off between synthetic data utility and DP budget. We note that, in this paper, performance specifically refers to the performance in producing high-utility synthetic data for data-efficient model training.

1. INTRODUCTION

Although deep learning has pushed forward the frontiers of many applications in the past decade, it still needs to overcome some challenges for broader academic and commercial use (Dilmegani, 2022). One challenge is the costly process of algorithm design and practical implementation in deep learning, which typically requires inspection and evaluation by training many models on ample data. The growing privacy concern is another challenge: an increasing number of customers are reluctant to provide their data for academia or industry to train deep learning models, and regulations have been created to further restrict access to sensitive data. Recently, dataset condensation (DC) has emerged as a potential technique to address both challenges in one shot (Zhao et al., 2021; Zhao & Bilen, 2021b). The main objective of dataset condensation is to condense the original dataset into a small synthetic dataset while maintaining the synthetic data utility to the greatest extent for training deep learning models. For the first challenge, both academia and industry can save computation and storage costs by using DC-generated small synthetic datasets to develop their algorithms and debug their implementations. For the second challenge, since the DC-generated synthetic data may not exist in the real world, sharing DC-generated data seems less risky than sharing the original data. Nevertheless, DC-generated data may memorize a fair amount of sensitive information during the optimization process on the original data. In other words, sharing DC-generated data still exposes the data owners to privacy risk. Moreover, this privacy risk is unknown, since the prior literature on DC has not proved any rigorous connection between DC and privacy.
Although an ICML'22 outstanding paper (Dong et al., 2022) proved a proposition to support the claim that DC can provide a certain differential privacy (DP) guarantee for free, the proof is problematic because of two controversial assumptions, as discussed in Section 3.1. To correctly connect DC and DP for bounding the privacy risk of DC, we propose two differentially private dataset condensation (DPDC) algorithms: LDPDC and NDPDC. LDPDC (Algorithm 1) adds random noise to the sum of randomly sampled original data and then divides the randomized sum by the fixed expected sample size to construct the synthetic data. Based on the framework of Rényi Differential Privacy (RDP) (Mironov, 2017; Mironov et al., 2019), we prove Theorem 3.1 to bound the privacy risk of LDPDC. NDPDC (Algorithm 2) optimizes randomly initialized synthetic data by matching the norm-clipped representations of the synthetic data and the randomized norm-clipped representations of the original data. For NDPDC, we prove Theorem 3.2 to bound its privacy risk. The potential benefits brought by DPDC algorithms include (i) reducing the cost of data storage and model training; (ii) mitigating the privacy concerns of data owners; and (iii) providing DP guarantees for the hyperparameter fine-tuning process and the models trained on the DPDC-generated synthetic datasets, due to the post-processing property. We conduct extensive experiments to evaluate our DPDC algorithms on multiple datasets, including MNIST, FMNIST, CIFAR10, and CelebA. We mainly compare our DPDC algorithms with a non-private dataset condensation method, i.e., distribution matching (Zhao & Bilen, 2021a), and two recent differentially private generative methods, i.e., DP-MERF (Harder et al., 2021) and DP-Sinkhorn (Cao et al., 2021).
We demonstrate that (i) LDPDC shows comparable performance to DP-MERF and DP-Sinkhorn despite its simplicity; (ii) NDPDC can provide DP protection with a mild utility loss, compared to distribution matching; (iii) NDPDC allows a flexible trade-off between privacy and utility and can use low DP budgets to achieve better performance than DP-MERF and DP-Sinkhorn.

2. BACKGROUND AND RELATED WORK

2.1. DATASET CONDENSATION

We denote a data sample by x and its label by y. In this paper, we mainly study classification problems, where f_θ(·) refers to the model with parameters θ, and ℓ(f_θ(x), y) refers to the cross-entropy between the model output f_θ(x) and the label y. Let T and S denote the original dataset and the synthetic dataset, respectively. We can then formulate the dataset condensation problem as

arg min_S E_{(x,y)∼T} [ℓ(f_{θ(S)}(x), y)], where θ(S) = arg min_θ E_{(x,y)∼S} [ℓ(f_θ(x), y)], |S| ≪ |T|. (1)

An intuitive method to solve the above objective is meta-learning (Wang et al., 2018), with an inner optimization step to update θ and an outer optimization step to update S. However, this intuitive method is costly because it implicitly uses second-order terms in the outer optimization step. Nguyen et al. (2020) considered the classification task as a ridge regression problem and derived an algorithm called kernel inducing points (KIP) to simplify the meta-learning process. Gradient matching is an alternative method (Zhao et al., 2021) for dataset condensation, which minimizes a matching loss between the model gradients on the original and synthetic data, i.e.,

min_S E_{θ_0∼P_{θ_0}} [Σ_{i=0}^{I−1} Π_M(∇_θ L_S(θ_i), ∇_θ L_T(θ_i))]. (2)

Π_M refers to the matching loss in (Zhao et al., 2021); ∇_θ L_S(θ_i) and ∇_θ L_T(θ_i) refer to the model gradients on the synthetic and original data, respectively; θ_i is updated on the synthetic data to obtain θ_{i+1}. To boost the performance, Zhao & Bilen (2021b) further applied differentiable Siamese augmentation A_w(·) with parameters w to the original and synthetic data samples. Recently, Zhao & Bilen (2021a) proposed to match the feature distributions of the original and synthetic data for dataset condensation.
Zhao & Bilen (2021a) used an empirical estimate of the maximum mean discrepancy (MMD) as the matching loss, i.e.,

E_{θ∼P_θ} ‖ (1/|T|) Σ_{i=1}^{|T|} Φ_θ(A_w(x_i)) − (1/|S|) Σ_{j=1}^{|S|} Φ_θ(A_w(s_j)) ‖²_2, (3)

where Φ_θ(·) is the feature extractor, and P_θ is a parameter distribution. With the help of differentiable data augmentation (Zhao & Bilen, 2021b), the distribution matching (DM) method (Zhao & Bilen, 2021a) achieves state-of-the-art performance on dataset condensation.
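To make the matching loss concrete, here is a minimal plain-Python sketch of the empirical DM loss for a single draw of the feature extractor. For illustration, an identity feature map stands in for Φ_θ and the augmentation A_w is skipped (both are simplifying assumptions of this sketch; the actual method averages over random extractor parameters and augmentations).

```python
# Minimal sketch of the empirical distribution matching (DM) loss for one
# draw of the feature extractor. An identity map stands in for Phi_theta,
# and the augmentation A_w is omitted for simplicity.

def feature_map(x):
    # Stand-in for Phi_theta(A_w(x)); identity for this sketch.
    return x

def mean_feature(samples):
    # Component-wise mean of the feature vectors of a set of samples.
    feats = [feature_map(x) for x in samples]
    d = len(feats[0])
    return [sum(f[i] for f in feats) / len(feats) for i in range(d)]

def dm_loss(original, synthetic):
    # Squared L2 distance between the mean features of the two sets (Eq. 3).
    mu_t = mean_feature(original)
    mu_s = mean_feature(synthetic)
    return sum((a - b) ** 2 for a, b in zip(mu_t, mu_s))

T = [[0.0, 0.0], [2.0, 2.0], [1.0, 1.0]]  # "original" data (toy 2-D points)
S = [[1.0, 1.0]]                          # one synthetic sample
print(dm_loss(T, S))  # 0.0: the synthetic sample already matches the mean
```

Under this identity feature map, minimizing the loss simply drives the synthetic mean toward the original mean; the real method repeats this matching under many randomly sampled nonlinear extractors.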

2.2. DIFFERENTIAL PRIVACY

Differential Privacy (DP) (Dwork et al., 2006) is the most widely used mathematical definition of privacy, so we first introduce the definition of DP.

Definition 2.1 ((ϵ, δ)-Differential Privacy (Dwork et al., 2006)) A randomized mechanism M obeys (ϵ, δ)-DP if, for any two adjacent datasets D and D′ and any measurable set O of outputs, Pr[M(D) ∈ O] ≤ e^ϵ Pr[M(D′) ∈ O] + δ.

Based on the framework of DP, Abadi et al. (2016) developed DP-SGD with a moments accountant for learning differentially private models. We also introduce the concept of Rényi Differential Privacy (RDP), as RDP gives us a unified view of pure DP and (ϵ, δ)-DP, graceful composition bounds, and tighter bounds for the (sub)sampled Gaussian mechanism (Definition A.1) (Wang et al., 2019; Mironov et al., 2019). Due to these benefits, Meta's Opacus library (Yousefpour et al., 2021) also relies on (Mironov et al., 2019) for DP analysis. We begin the brief introduction of RDP with two basic definitions.

Definition 2.2 (Rényi Divergence (Rényi et al., 1961)) Let P and Q be two distributions over the same probability space Z. The Rényi divergence between P and Q is D_α(P‖Q) ≜ (1/(α−1)) ln ∫_Z q(z) (p(z)/q(z))^α dz, where p(z) and q(z) are the respective probability density functions of P and Q. Without ambiguity, D_α(P‖Q) can also be written as D_α(p(z)‖q(z)).

Definition 2.3 (Rényi Differential Privacy (RDP) (Mironov, 2017)) For any two adjacent datasets D and D′, if a randomized mechanism M satisfies D_α(M(D)‖M(D′)) ≤ ϵ (α > 1), then M obeys (α, ϵ)-RDP, where D_α refers to Rényi divergence.

We can easily connect RDP and DP by Lemma 2.1 and Corollary 2.1.

Lemma 2.1 (RDP to DP Conversion (Balle et al., 2020)) If a randomized mechanism M guarantees (α, ϵ)-RDP, then it also obeys (ϵ + log((α−1)/α) − (log δ + log α)/(α−1), δ)-DP.

In Appendix A, we show that Lemma 2.1 is tighter than Mironov (2017)'s conversion law.

Corollary 2.1 According to Lemma 2.1, if a mechanism M obeys (α, ϵ(α))-RDP for all α > 1, then M obeys (min_{α>1}(ϵ(α) + log((α−1)/α) − (log δ + log α)/(α−1)), δ)-DP.
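As a quick illustration of Lemma 2.1 and Corollary 2.1, the conversion can be implemented in a few lines; the α grid below and the example RDP curve (the standard Gaussian-mechanism curve ϵ(α) = α/(2σ²)) are choices made for this sketch, not part of the paper's accountant.

```python
import math

def rdp_to_dp(alpha, eps_rdp, delta):
    # Lemma 2.1 (Balle et al., 2020): (alpha, eps)-RDP implies
    # (eps + log((alpha-1)/alpha) - (log(delta) + log(alpha))/(alpha-1), delta)-DP.
    return (eps_rdp + math.log((alpha - 1) / alpha)
            - (math.log(delta) + math.log(alpha)) / (alpha - 1))

def best_dp_epsilon(eps_of_alpha, delta, alphas):
    # Corollary 2.1: minimize the converted epsilon over a grid of alpha > 1.
    return min(rdp_to_dp(a, eps_of_alpha(a), delta) for a in alphas)

# Example: a plain Gaussian mechanism with sensitivity 1 and noise scale
# sigma, whose RDP curve is eps(alpha) = alpha / (2 * sigma^2).
sigma, delta = 2.0, 1e-5
alphas = [1 + x / 10.0 for x in range(1, 1000)]  # 1.1, 1.2, ..., 100.9
eps = best_dp_epsilon(lambda a: a / (2 * sigma ** 2), delta, alphas)
print(round(eps, 3))
```

In practice an accountant evaluates the RDP curve of the actual (subsampled) mechanism on a dense α grid and reports the minimum converted ϵ, exactly as Corollary 2.1 prescribes.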
One main advantage of RDP is that it allows a graceful composition of the privacy budgets spent by multiple randomized mechanisms, as illustrated in Lemma 2.2.

Lemma 2.2 (RDP Composition (Mironov, 2017)) If M_1 is (α, ϵ_1)-RDP and M_2 is (α, ϵ_2)-RDP, then their composition obeys (α, ϵ_1 + ϵ_2)-RDP.

Furthermore, RDP allows a graceful parallel composition, as shown in Lemma 2.3.

Lemma 2.3 (Parallel Composition (Chaudhuri et al., 2019)) If two datasets D_1 and D_2 are disjoint (D_1 ∩ D_2 = ∅), M_1 is (α, ϵ_1)-RDP, and M_2 is (α, ϵ_2)-RDP, then the combined release (M_1(D_1), M_2(D_2)) obeys (α, max(ϵ_1, ϵ_2))-RDP for D_1 ∪ D_2.

In Appendix E, we discuss more related work on differential privacy and generative methods.
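The two composition lemmas are straightforward to apply; the small helper below (a toy sketch, not the accountant used in the paper's experiments) composes per-step RDP costs sequentially and disjoint releases in parallel, mirroring the structure of the paper's later analysis.

```python
def compose_sequential(eps_list):
    # Lemma 2.2: at a fixed alpha, sequential RDP costs add up.
    return sum(eps_list)

def compose_parallel(eps_list):
    # Lemma 2.3: releases on disjoint datasets cost the max at a fixed alpha.
    return max(eps_list)

# Example mirroring the paper's analysis: per class we pay eps_step per
# iteration (sequential composition), and the per-class datasets are
# disjoint (parallel composition). eps_step is an illustrative value.
eps_step = 0.5
num_iterations = 10
per_class = [compose_sequential([eps_step] * num_iterations) for _ in range(10)]
total = compose_parallel(per_class)
print(total)  # 5.0 = 10 * 0.5: iterations add up, disjoint classes take the max
```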

3. DATASET CONDENSATION AND DIFFERENTIAL PRIVACY

3.1. TWO CONTROVERSIAL ASSUMPTIONS IN (DONG ET AL., 2022)

Assumption 3.1 (Assumption in (Dong et al., 2022)) Given the training dataset S and loss function ℓ, the distribution of the model parameters is

P(θ|S) = (1/Z_S) exp(−Σ_{i=1}^{|S|} ℓ(f_θ(s_i), y_i)),

where Z_S is a normalizing constant. Although (Dong et al., 2022) presents the assumption on T, it actually uses the assumption on the synthetic dataset S in the proof of its Proposition 4.10. The original version of Assumption 3.1 in (Sablayrolles et al., 2019), quoted by Dong et al. (2022), is given below.

Assumption 3.2 (Original Assumption in (Sablayrolles et al., 2019)) Given the training dataset S and loss function ℓ, the distribution of the model parameters is

P(θ|S) ∝ exp(−(1/T) Σ_{i=1}^{|S|} ℓ(f_θ(s_i), y_i)). (6)

Comparing the above two assumptions, we observe that Assumption 3.1 has a controversial issue: given the same loss function and model architecture, the posterior model parameter distribution assumed in Assumption 3.1 is fixed, regardless of the stochasticity and the learning method. Note that in an extreme case, if we use gradient descent with fixed seeds, the parameter distribution could collapse into a Dirac distribution. In contrast, Eq. 6 in Assumption 3.2 has a temperature hyperparameter T depending on the stochasticity and the learning method, i.e., T → 0 corresponds to MAP, and a small T corresponds to averaged SGD (Polyak & Juditsky, 1992; Sablayrolles et al., 2019). From this viewpoint, although it is intractable to verify the correctness of Assumption 3.2, it is intuitive that Assumption 3.1 is controversial. A natural question to ask here is: could we instead use Assumption 3.2 in the proof of Proposition 4.10 in (Dong et al., 2022)? The answer is no, because when T is small, we cannot approximate exp((1/T) Σ_{i=1}^{|S|} (ℓ(f_θ(s_i), y_i) − ℓ(f_θ(s′_i), y_i))) − 1 by (1/T) Σ_{i=1}^{|S|} (ℓ(f_θ(s_i), y_i) − ℓ(f_θ(s′_i), y_i)). Thus, we could not obtain Eq. 39 in the proof of Proposition 4.10 in (Dong et al., 2022) under Assumption 3.2. In the following, we introduce the other controversial assumption, which is implicitly used in the proof of Proposition 4.10 in (Dong et al., 2022) (Eq. 35 in (Dong et al., 2022)).
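The failure of the first-order approximation noted above can be checked numerically: the linearization exp(x) − 1 ≈ x only holds when x = (1/T)·(loss gap) is small, i.e., when T is large relative to the loss gap. The loss-gap value below is an arbitrary illustrative choice.

```python
import math

# The step at issue approximates exp(x) - 1 by x with x = (1/T) * (sum of
# per-sample loss gaps between adjacent datasets). For the small T that
# Assumption 3.2 requires, x blows up and the linearization fails.

loss_gap = 1.0  # illustrative total loss difference between adjacent datasets

for T in [100.0, 1.0, 0.05]:
    x = loss_gap / T
    exact = math.exp(x) - 1.0
    linear = x
    rel_err = abs(exact - linear) / exact
    print(f"T={T:>6}: x={x:g}, relative error of linearization = {rel_err:.2%}")
```

For T = 100 the relative error is well under 1%, while for T = 0.05 the linear term captures essentially none of exp(x) − 1, which is exactly why Eq. 39 in (Dong et al., 2022) cannot be derived under Assumption 3.2 with small T.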
Assumption 3.3 (Implicit Assumption in (Dong et al., 2022)) With random initialization s̃_i ∼ N(0, I_d), the synthetic data s̃*_i that minimizes the distribution matching loss (Eq. 3) is s̃*_i = s̃_i + (1/|T|) Σ_{j=1}^{|T|} x̃_j (under E), where the data is represented under an orthogonal basis E = {e_1, e_2, ..., e_d} with E_T = {e_1, e_2, ..., e_{dim(span(T))}} forming the orthogonal basis of span(T) (Dong et al., 2022).

Let Q be the transformation matrix from E to the standard basis. We then have

s*_i = Q s̃_i + (1/|T|) Σ_{j=1}^{|T|} x_j (under the standard basis), (8)

where s*_i and x_j represent s̃*_i and x̃_j under the standard basis. In Appendix C, we detail how to compute the transformation matrix and the synthetic data given in Eq. 8 for each class. In fact, we need to use T_c (the subset of T that contains all the data with label c) instead of T to compute the synthetic data for class c. Assumption 3.3 is only proved for linear feature extractors, but it is directly used in the proof of Proposition 4.10 in (Dong et al., 2022) (Eq. 35 in (Dong et al., 2022)), which attempts to prove a general privacy bound for dataset condensation. Unfortunately, Assumption 3.3 is not always true when we use nonlinear feature extractors to learn the synthetic data. We conduct an experiment to verify the above claim: we use the same data initialization to compute a set of 500 synthetic data samples with Eq. 8 (see Appendix C) and learn a set of 500 synthetic data samples by DM with nonlinear extractors on CIFAR10. We then compute the DM loss on the synthetic data by Eq. 3 with nonlinear extractors. The DM loss on the synthetic data computed with Eq. 8 is much higher than the DM loss on the DM-learned synthetic data, which conflicts with Assumption 3.3. We further train two ConvNet models on the two sets of synthetic data, respectively. The model trained on the synthetic data computed by Eq. 8 only achieves about 25% accuracy, while the model trained on the DM-generated synthetic data achieves approximately 60% accuracy.
Moreover, we test whether the distributions of the synthetic data computed by Eq. 8 and the synthetic data generated by DM with nonlinear extractors may have the same population mean for each class c. Specifically, we define the null hypothesis as

H_0: μ_c(synthetic data defined by Eq. 8) = μ_c(synthetic data generated by nonlinear DM).

We use the BS test (Bai & Saranadasa, 1996) since the conventional Hotelling's T² test (Hotelling, 1992) is only applicable to the case where d < N_1 + N_2 − 2, where d is the data dimension, and N_1 and N_2 refer to the numbers of samples from the two populations. We compute the test statistic of the BS test, denoted by Z_BS, for the 10 classes of CIFAR10. The smallest Z_BS is approximately −3.03 < −z_{0.005}, which gives us sufficient evidence to reject the null hypothesis. Thus, even with the same random initialization, the synthetic data defined by Eq. 8 and the synthetic data generated by DM with nonlinear feature extractors can have completely different distributions. Therefore, it is problematic to use Assumption 3.3 to prove a general privacy bound for dataset condensation.

3.2. DIFFERENTIALLY PRIVATE DATASET CONDENSATION (DPDC)

We have shown that the privacy bound proved by Dong et al. (2022) is problematic due to two controversial assumptions. To correctly connect DC and DP, we propose two differentially private dataset condensation (DPDC) algorithms: a linear algorithm (LDPDC) and a nonlinear algorithm (NDPDC).

Linear DPDC (LDPDC). We illustrate LDPDC in Algorithm 1. For each class c, we construct M synthetic data samples {s^c_j}_{j=1}^M. For each synthetic sample s^c_j, we randomly sample a subset {x^c_k}_{k=1}^{L^c_j} from T_c by Poisson sampling with probability L/N_c, where L is the group size (Abadi et al., 2016) and N_c = |T_c|. L^c_j follows a Poisson distribution with expectation L. Similar to q = L/N in (Abadi et al., 2016) and the Opacus library, the sampling probability q_c = L/N_c is fixed for each class in the execution of the algorithms: for the adjacent datasets of T_c, we still consider q_c as the sampling probability, so that we can exploit the prior theoretical results on subsampling for DP analysis (similar to Opacus). With {x^c_k}_{k=1}^{L^c_j}, we compute s^c_j by

s^c_j = (1/L) (N(0, σ² I_d) + Σ_{k=1}^{L^c_j} x^c_k),

where N(0, σ² I_d) refers to Gaussian random noise with standard deviation σ. We do not use a formula similar to Eq. 8 to construct synthetic data because Q leaks private information.

Algorithm 1 LDPDC
for c = 1, ..., C do
  for j = 1, ..., M do
    Sample a subset {x^c_k}_{k=1}^{L^c_j} from T_c by Poisson sampling with probability L/N_c (Abadi et al., 2016; Yousefpour et al., 2021)
    s^c_j = (1/L) (N(0, σ² I_d) + Σ_{k=1}^{L^c_j} x^c_k)
  end for
end for
Output the synthetic dataset S = {{s^c_j}_{j=1}^M}_{c=1}^C

Nonlinear DPDC (NDPDC). We illustrate NDPDC in Algorithm 2, which is designed upon the idea of matching the representations of original and synthetic data. We follow (Zhao & Bilen, 2021a) to use a differentiable augmentation function A_{w_c}(·) to boost the performance (Φ_θ(A_{w_c}(·)) is similar to a composite function). In each iteration of Algorithm 2, we first sample random parameters θ for the feature extractor Φ_θ (not pretrained) and initialize the loss as 0.
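Before detailing NDPDC further, the per-class LDPDC construction of Algorithm 1 can be sketched in plain Python (toy 2-D vectors stand in for images; this is an illustrative sketch, not the paper's implementation):

```python
import random

def ldpdc_synthesize_class(class_data, M, L, sigma, rng):
    # One class of LDPDC (Algorithm 1): for each of the M synthetic samples,
    # Poisson-sample the class data with probability q = L / N_c, add Gaussian
    # noise to the sum, and divide by the fixed expected sample size L.
    n_c = len(class_data)
    d = len(class_data[0])
    q = L / n_c  # per-example inclusion probability (fixed for the class)
    synthetic = []
    for _ in range(M):
        total = [0.0] * d
        for x in class_data:
            if rng.random() < q:  # Poisson sampling: independent coin flips
                for i in range(d):
                    total[i] += x[i]
        s = [(total[i] + rng.gauss(0.0, sigma)) / L for i in range(d)]
        synthetic.append(s)
    return synthetic

rng = random.Random(0)
data = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# With L = N_c and sigma = 0, every point is sampled and each synthetic
# sample is exactly the class mean; with noise, it is a randomized mean.
print(ldpdc_synthesize_class(data, M=1, L=len(data), sigma=0.0, rng=rng))  # [[0.5, 0.5]]
```

Note that the divisor is always the fixed expected size L, never the realized sample count, since the latter would itself depend on the private data.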
After that, for each class c, we sample the augmentation parameters w_c and randomly sample a subset D_c from T_c by Poisson sampling with sampling probability L/N_c. We then compute the representations of the original data and the synthetic data and clip the representations with a pre-defined threshold G. We remark that it is essential to clip the representations of both the original and synthetic data: we clip the representations of the original data to bound the ℓ_2 sensitivity, and we also clip the representations of the synthetic data to match the representations on a similar scale. Since G is pre-defined as the constant 1 (not computed from the original data), the operation of clipping the synthetic data representations (i.e., r(s^c_j) = min(1, G/‖r(s^c_j)‖_2) r(s^c_j) in Algorithm 2) does not leak private information regarding the original data. After clipping the representations, we add Gaussian noise to the sum of the clipped original data representations. We use the squared ℓ_2 distance between the randomized sum of the clipped original data representations (i.e., N(0, σ² I) + Σ_{i=1}^{|D_c|} r(x^c_i)) and the sum of the clipped synthetic data representations multiplied by a factor L/M (i.e., (L/M) Σ_{j=1}^M r(s^c_j)) as the loss. We use the factor L/M because Σ_{i=1}^{|D_c|} r(x^c_i) sums up |D_c| (E[|D_c|] = L) representations, while Σ_{j=1}^M r(s^c_j) sums up M representations. At the end of each iteration, we update the synthetic data S with the gradient of the loss ℓ with respect to S, similar to Algorithm 1 in (Zhao & Bilen, 2021a). In the practical implementation, following (Zhao et al., 2021; Zhao & Bilen, 2021b;a), we implement S as a tensor variable with size [N, data shape] (e.g., [N, 3, 32, 32] on CIFAR10), where N is the size of the synthetic dataset. A natural question to ask here is: why not combine distribution matching and DP-SGD for differentially private dataset condensation?
For many common deep learning tasks, we can compute a sample-wise loss so that DP-SGD can clip the sample-wise loss gradients to bound the sensitivity. However, distribution matching uses the squared ℓ_2 distance between the averaged original data representations and the averaged synthetic data representations as the loss, so we cannot directly compute a sample-wise loss and apply DP-SGD to distribution matching.

Algorithm 2 NDPDC (core steps)
for each iteration do
  Sample extractor parameters θ; set ℓ = 0
  for each class c do
    Sample w_c; sample D_c = {x^c_i, c}_{i=1}^{|D_c|} from T_c by Poisson sampling with probability L/N_c
    Compute representations: r(x^c_i) = Φ_θ(A_{w_c}(x^c_i)) for D_c; r(s^c_j) = Φ_θ(A_{w_c}(s^c_j)) for S_c = {s^c_j}_{j=1}^M
    Norm clipping: r(s^c_j) = min(1, G/‖r(s^c_j)‖_2) r(s^c_j); r(x^c_i) = min(1, G/‖r(x^c_i)‖_2) r(x^c_i)
    Compute loss: ℓ = ℓ + ‖(L/M) Σ_{j=1}^M r(s^c_j) − (N(0, σ² I) + Σ_{i=1}^{|D_c|} r(x^c_i))‖²_2
  end for
  S = S − η∇_S ℓ (s^c_j = s^c_j − η∇_{s^c_j} ℓ for all s^c_j ∈ S)
end for
Output the synthetic dataset S = {{s^c_j}_{j=1}^M}_{c=1}^C

Theorem 3.1 and Theorem 3.2 bound the privacy risk of Algorithm 1 and Algorithm 2, respectively.

Theorem 3.1 Suppose the original data x satisfies x ∈ [−b, b]^d, and let Ω_{q,σ_1}(α) ≜ D_α((1−q)N(0, σ_1²) + qN(1, σ_1²) ‖ N(0, σ_1²)) with σ_1 = σ/(b√d) and q = max_c(L/N_c). Then LDPDC obeys (α, M Ω_{q,σ_1}(α))-RDP and (min_{α>1}(M Ω_{q,σ_1}(α) + log((α−1)/α) − (log δ + log α)/(α−1)), δ)-DP.

Theorem 3.2 Let Ω_{q,σ_2}(α) ≜ D_α((1−q)N(0, σ_2²) + qN(1, σ_2²) ‖ N(0, σ_2²)) with σ_2 = σ/G and q = max_c(L/N_c). Then NDPDC obeys (α, I Ω_{q,σ_2}(α))-RDP and (min_{α>1}(I Ω_{q,σ_2}(α) + log((α−1)/α) − (log δ + log α)/(α−1)), δ)-DP, where I is the number of iterations.

We detail the proofs of Theorem 3.1 and Theorem 3.2 in Appendix A. The basic idea is to prove the RDP bound on T_c = {x_i, c}_{i=1}^{N_c}, where we only need to consider x_i due to the shared label c, and then generalize the theoretical result to T using Lemma 2.3. In the experiments, we follow Opacus to exploit Mironov et al. (2019)'s method for computing Ω_{q,σ}(α). We note that Mironov et al. (2019) did not provide a convergence analysis for their computation method when α is a fractional number. For completeness, we present our convergence analysis for the computation method in Appendix B.

[Table 1: per-dataset testing accuracy of LDPDC and NDPDC with the corresponding (ϵ, δ)-DP guarantees.] Table 1: We use the default settings for all the methods. We employ 50 synthetic samples per class to train ConvNet models and report the testing accuracy here. We follow (Zhao & Bilen, 2021a) to apply the augmentations in (Zhao & Bilen, 2021b) when training ConvNet models. According to Table 8 in Appendix D, even using low DP budgets (ϵ < 1), NDPDC still outperforms LDPDC, DP-MERF, and DP-Sinkhorn. Similar to DP-Sinkhorn, we can also fix a target DP budget and compute the corresponding σ (or I) to run LDPDC and NDPDC.
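Returning to Algorithm 2, its per-class clipping and noisy matching loss can be sketched in plain Python (toy representation vectors; this is an illustrative sketch of the loss term, not the paper's implementation, and the gradient step on S is omitted):

```python
import math

def clip_to_norm(v, G):
    # r <- min(1, G / ||r||_2) * r : scale down any vector longer than G.
    norm = math.sqrt(sum(x * x for x in v))
    scale = min(1.0, G / norm) if norm > 0 else 1.0
    return [x * scale for x in v]

def ndpdc_class_loss(orig_reps, syn_reps, L, G, noise):
    # One class term of the NDPDC loss (Algorithm 2): squared L2 distance
    # between the noisy sum of clipped original representations and the
    # clipped synthetic-representation sum rescaled by L / M.
    d = len(orig_reps[0])
    M = len(syn_reps)
    orig_sum = [sum(clip_to_norm(r, G)[i] for r in orig_reps) + noise[i]
                for i in range(d)]
    syn_sum = [(L / M) * sum(clip_to_norm(r, G)[i] for r in syn_reps)
               for i in range(d)]
    return sum((a - b) ** 2 for a, b in zip(orig_sum, syn_sum))

# Toy example with sigma = 0 (zero noise vector) so the value is deterministic.
orig = [[2.0, 0.0], [0.0, 2.0]]   # both get clipped down to norm G = 1
syn = [[1.0, 0.0], [0.0, 1.0]]    # already within the norm bound
print(ndpdc_class_loss(orig, syn, L=2, G=1.0, noise=[0.0, 0.0]))  # 0.0
```

Clipping the original representations is what bounds the per-sample ℓ_2 sensitivity to G, so adding N(0, σ²I) to their sum yields the sampled-Gaussian-mechanism structure analyzed in Theorem 3.2.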

4.1. EXPERIMENTAL SETUP

We follow (Zhao & Bilen, 2021b;a; Dong et al., 2022) to conduct experiments on four widely used datasets: MNIST (Deng, 2012), Fashion-MNIST (Xiao et al., 2017), CIFAR10 (Krizhevsky et al., 2009), and CelebA (gender classification) (Liu et al., 2015). In the following, we introduce the baselines, the DPDC settings, and the method for evaluating the synthetic data utility. We provide DPDC's code in the supplementary material, where readers can find more technical details.

Baselines. We compare DPDC with the state-of-the-art dataset condensation method, i.e., distribution matching (DM) (Zhao & Bilen, 2021a), and two recent DP generative methods, i.e., DP-Sinkhorn (Cao et al., 2021) and DP-MERF (Harder et al., 2021). For DP-Sinkhorn, we use Cao et al. (2021)'s code to run the experiments; we set m to 1 and the target ϵ to 10. For DP-MERF, we use Harder et al. (2021)'s code and follow (Harder et al., 2021) to set σ to 5.

DPDC Settings. For LDPDC, we set σ = √d, M = 50, and L = 50 by default. Given that b = 1, σ_1 = σ/(b√d) = 1. For NDPDC, the default settings are σ = 1, G = 1, M = 50, L = 50, I = 10000, and η = 1. Since we set G = 1, σ_2 = σ/G = 1. We follow (Zhao & Bilen, 2021a) to use three-layer convolutional neural networks as the feature extractors (also called ConvNet-based feature extractors) for NDPDC. Batch normalization (BN) (Ioffe & Szegedy, 2015) is not DP-friendly since a sample's normalized value depends on other samples (Yousefpour et al., 2021); therefore, we do not use BN in the extractors. Since data statistics like channel-wise means and channel-wise standard deviations may leak private information, we follow (Cao et al., 2021) to use a fixed value of 0.5 for normalizing the images, which makes no obvious difference in the performance of DPDC and the baselines. After normalization, the pixel values range from −1 to 1.
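The fixed normalization described above can be written as a one-liner; this sketch assumes input pixel values in [0, 1].

```python
def normalize(pixel):
    # Fixed mean and std of 0.5: no data-dependent statistics are used, so
    # nothing about the private dataset is leaked. Maps [0, 1] to [-1, 1].
    return (pixel - 0.5) / 0.5

print(normalize(0.0), normalize(0.5), normalize(1.0))  # -1.0 0.0 1.0
```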

Performance Evaluation

We employ the evaluation method in (Zhao & Bilen, 2021b;a; Dong et al., 2022) to compare the performance of DM, DP-Sinkhorn, DP-MERF, and our DPDC algorithms. The evaluation method is to train deep learning models on the synthetic data from scratch and compare their accuracy on the real testing data. Higher testing accuracy indicates better synthetic data utility for training deep learning models (i.e., better performance). We also follow (Zhao & Bilen, 2021b;a) to train a variety of model architectures, including MLP (Haykin, 1994), LeNet (LeCun et al., 1998), AlexNet (Krizhevsky et al., 2017), VGG11 (Simonyan & Zisserman, 2014), and ResNet18 (He et al., 2016), on DPDC-generated synthetic data to evaluate the data utility.

4.2. MAIN RESULTS

We report the main experimental results in Table 1: our LDPDC algorithm achieves comparable performance to DP-MERF and DP-Sinkhorn with low DP budgets, and our NDPDC algorithm provides acceptable DP guarantees (ϵ < 10) with a mild utility loss compared to the randomly initialized, non-private DM method (Zhao & Bilen, 2021a). Furthermore, NDPDC allows a flexible trade-off between synthetic data utility and DP budget: if we are not satisfied with NDPDC's DP budgets in Table 1, we can increase σ to reduce the budget. As shown in Table 2, even with low DP budgets (ϵ = 1), NDPDC still outperforms LDPDC, DP-Sinkhorn, and DP-MERF. For DP-MERF, even if we decrease σ to 0.5 (ϵ > 10), the accuracy increment is only about 7% on FMNIST and less than 5% on the other datasets, as shown in Table 9 in Appendix D. We conjecture that a small amount of NDPDC-condensed synthetic data is more useful than a small amount of synthetic data from a DP generator because DP generative methods optimize the generative model parameters, while NDPDC directly optimizes the small amount of synthetic data. It is worth noting that DPMix (Lee et al., 2019) is a recent linear DP data publishing method that seems similar to LDPDC. However, LDPDC and DPMix still have some differences, as discussed in Appendix F, indicating that LDPDC may be better than DPMix. Compared to LDPDC and DPMix, NDPDC is more suitable for solving nonlinear problems such as image recognition: according to the results in (Lee et al., 2019), NDPDC outperforms DPMix by 13%-24% in testing accuracy on MNIST and CIFAR10. We further train a variety of model architectures on the synthetic data generated by LDPDC and NDPDC and report the testing accuracy in Table 3. Since LDPDC does not rely on deep networks to learn the synthetic data, it is hard to predict which network architecture can make the best use of the LDPDC-generated synthetic data.
According to the results in Table 3 , on FMNIST, CIFAR10, and CelebA, MLP makes the best use of the simple LDPDC-generated synthetic data, while on MNIST, ResNet18 makes the best use of the synthetic data. For NDPDC, since the synthetic data is learned on ConvNet-based extractors, ConvNet makes better use of the synthetic data than the other architectures. Besides, we visualize the synthetic images in Appendix D: NDPDC-generated images look more diverse and more useful than the synthetic images generated by DP-Sinkhorn, DP-MERF, and DPMix (Fig. 1 in (Lee et al., 2019) provides synthetic images generated by DPMix.).
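The privacy-utility trade-off above rests on evaluating Ω_{q,σ}(α) = D_α((1−q)N(0, σ²) + qN(1, σ²) ‖ N(0, σ²)). The sketch below estimates Ω by direct numerical integration (a crude Riemann sum, not the series method of Mironov et al. (2019) used in the experiments), composes it over iterations via Lemma 2.2, and converts to (ϵ, δ)-DP via Lemma 2.1; the integration range and α grid are assumptions of this sketch.

```python
import math

def normal_pdf(z, mean, sigma):
    return math.exp(-((z - mean) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def omega(q, sigma, alpha, lo=-30.0, hi=30.0, step=1e-3):
    # Renyi divergence D_alpha((1-q)N(0,s^2) + qN(1,s^2) || N(0,s^2)) via a
    # Riemann sum; crude but adequate for moderate alpha and sigma around 1.
    total = 0.0
    z = lo
    while z < hi:
        p = (1 - q) * normal_pdf(z, 0, sigma) + q * normal_pdf(z, 1, sigma)
        u = normal_pdf(z, 0, sigma)
        if p > 0.0 and u > 0.0:
            # Work in log space to avoid overflow in the tails.
            total += math.exp(alpha * math.log(p) + (1 - alpha) * math.log(u)) * step
        z += step
    return math.log(total) / (alpha - 1)

def dp_epsilon(q, sigma, steps, delta, alphas):
    # Compose over `steps` mechanisms (Lemma 2.2), convert each candidate
    # alpha with Lemma 2.1, and take the best one (Corollary 2.1).
    return min(steps * omega(q, sigma, a)
               + math.log((a - 1) / a) - (math.log(delta) + math.log(a)) / (a - 1)
               for a in alphas)

# Sanity check: with q = 1 (no subsampling) the divergence reduces to the
# Gaussian mechanism's RDP curve, alpha / (2 * sigma^2).
print(round(omega(1.0, 1.0, 2.0), 3))  # ~1.0 (= 2 / (2 * 1^2))
print(round(dp_epsilon(q=0.01, sigma=1.0, steps=10000, delta=1e-5, alphas=[2.0, 4.0, 8.0]), 2))
```

Sweeping σ in such a routine is one way to realize the trade-off discussed above: a larger σ shrinks Ω and hence the reported ϵ, at the cost of noisier matching.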

4.3. ABLATION STUDY ON NDPDC

Although LDPDC is simple and comparable to recent DP generative methods, we still recommend using NDPDC in practice because of its outstanding performance. In this subsection, we conduct an ablation study for NDPDC on MNIST, FMNIST, and CIFAR10, with recommendations on how to select hyperparameters for executing NDPDC. When we study the effects of one hyperparameter, we fix the other hyperparameters at the default settings. Effect of M (Table 7): as M increases from 10 to 200, the testing accuracy also increases but almost stops its uptrend around a certain M. This is probably because more synthetic data has the potential to capture more information, but the DP guarantee also limits the information leaked from the original data. Since the DP budget is unchanged, after the amount of synthetic data is enough to capture all the limited information, more synthetic data cannot capture more useful information. Given the experimental results, M = 50 is a good setting here. We do not recommend setting a larger M (M > 50), which would result in marginal performance gains but much more cost in the downstream applications.

5. CONCLUSION

In this paper, we revisit a recent ICML'22 proposal on privacy-preserving dataset condensation and reveal that its proposed privacy bound is problematic because of two controversial assumptions. To correctly connect dataset condensation and differential privacy, we propose two differentially private dataset condensation (DPDC) algorithms: LDPDC and NDPDC. We demonstrate that LDPDC can use low DP budgets to achieve comparable performance to DP-Sinkhorn and DP-MERF. Moreover, we show that NDPDC can provide DP guarantees with a mild utility loss compared to the distribution matching method. We hope our work can inspire further research in this direction to alleviate the cost burden and privacy concerns in deep learning.

A OMITTED PROOF

We first prove Lemma A.1. Based on Lemma A.1, we can easily prove Corollary A.1, which will be used in the proofs of Theorems 3.1 and 3.2.

Lemma A.1 Let u(z) and ν(z) be two differentiable probability density functions on a domain Z (u, ν: Z → R). If u(z) ≠ ν(z) and u(z), ν(z) > 0 on Z, then D_α((1−q)u(z) + qν(z) ‖ u(z)) is an increasing function w.r.t. q when α > 1 and q ∈ [0, 1].

Lemma A.1 is easy to understand: as q increases, the weight of u(z) in the mixture (1−q)u(z) + qν(z) decreases, so the divergence between (1−q)u(z) + qν(z) and u(z) should increase. To the best of our knowledge, we are the first to present Lemma A.1, so we detail the proof in the following.

Proof [proof of Lemma A.1] The Rényi divergence D_α((1−q)u(z) + qν(z) ‖ u(z)) is defined as

(1/(α−1)) ln ∫_Z u(z) (((1−q)u(z) + qν(z)) / u(z))^α dz = (1/(α−1)) ln ∫_Z u(z) (1 − q + q ν(z)/u(z))^α dz. (9)

The derivative of Eq. 9 w.r.t. q is

[1 / ((α−1) ∫_Z u(z)(1 − q + q ν(z)/u(z))^α dz)] ∫_Z α(ν(z) − u(z))(1 − q + q ν(z)/u(z))^{α−1} dz. (10)

To prove Lemma A.1, we need to show that Eq. 10 is positive when q ∈ [0, 1]. Since α > 1 and u(z)(1 − q + q ν(z)/u(z))^α > 0 (if q ≠ 0, then 1 − q + q ν(z)/u(z) > 1 − q ≥ 0), we only need to prove that ∫_Z (ν(z) − u(z))(1 − q + q ν(z)/u(z))^{α−1} dz > 0. We divide Z into Z_1 and Z_2, where Z_1 = {z ∈ Z | ν(z) < u(z)} and Z_2 = {z ∈ Z | ν(z) ≥ u(z)}. Apparently, Z_1 and Z_2 are disjoint, and Z = Z_1 ∪ Z_2. Thus, we can rewrite ∫_Z (ν(z) − u(z))(1 − q + q ν(z)/u(z))^{α−1} dz as

∫_{Z_1} (ν(z) − u(z))(1 − q + q ν(z)/u(z))^{α−1} dz + ∫_{Z_2} (ν(z) − u(z))(1 − q + q ν(z)/u(z))^{α−1} dz. (11)

When z ∈ Z_1, we have (i) ν(z) − u(z) < 0; (ii) ν(z)/u(z) < 1 (0 < ν(z) < u(z)); (iii) 0 < (1 − q + q ν(z)/u(z))^{α−1} < 1 (since 1 = 1 − q + q > 1 − q + q ν(z)/u(z) > 1 − q ≥ 0). Therefore,

∫_{Z_1} (ν(z) − u(z))(1 − q + q ν(z)/u(z))^{α−1} dz > ∫_{Z_1} (ν(z) − u(z)) dz. (12)

When z ∈ Z_2, we have (i) ν(z) − u(z) ≥ 0; (ii) ν(z)/u(z) ≥ 1 (0 < u(z) ≤ ν(z)); (iii) (1 − q + q ν(z)/u(z))^{α−1} ≥ 1.
Therefore,

∫_{Z_2} (ν(z) − u(z))(1 − q + q ν(z)/u(z))^{α−1} dz ≥ ∫_{Z_2} (ν(z) − u(z)) dz. (13)

As a result,

∫_Z (ν(z) − u(z))(1 − q + q ν(z)/u(z))^{α−1} dz > ∫_{Z_1} (ν(z) − u(z)) dz + ∫_{Z_2} (ν(z) − u(z)) dz = ∫_Z (ν(z) − u(z)) dz = ∫_Z ν(z) dz − ∫_Z u(z) dz = 0.

Thus, the derivative of Eq. 9 w.r.t. q is positive. This concludes the proof of Lemma A.1.

Corollary A.1 Let Ω_{q_c,σ}(α) ≜ D_α((1−q_c)N(0, σ²) + q_c N(1, σ²) ‖ N(0, σ²)), where c = 1, 2, ..., C. We have max_c Ω_{q_c,σ}(α) = Ω_{max_c(q_c),σ}(α).

Proof [proof of Corollary A.1] Let u(z) ≜ N(0, σ²) and ν(z) ≜ N(1, σ²) (Z ≜ R). Based on Lemma A.1, we know that Ω_{q,σ}(α) is an increasing function w.r.t. q. Thus, the maximum of Ω_{q_c,σ}(α) is achieved at max_c(q_c). This concludes the proof of Corollary A.1.

In addition to Corollary A.1, we also need a definition and a lemma from (Mironov et al., 2019) in the proofs of Theorems 3.1 and 3.2.

Definition A.1 (Sampled Gaussian Mechanism (SGM) (Mironov et al., 2019)) Let f be a function mapping subsets of a dataset D to R^n. Mironov et al. (2019) define the Sampled Gaussian Mechanism (SGM) as

SGM_{q,σ}(D) = f(a subset sampled from D with probability q) + N(0, σ² I_n). (14)

Lemma A.2 ((Mironov et al., 2019)) Given the notations in Definition A.1, if for any two adjacent subsets D_1 and D_2 sampled from D, ‖f(D_1) − f(D_2)‖ ≤ 1, then SGM_{q,σ}(D) obeys (α, ϵ)-RDP, where ϵ = D_α((1−q)N(0, σ²) + qN(1, σ²) ‖ N(0, σ²)).

Proof [proof of Lemma A.2] Given the proof of Theorem 4 in (Mironov et al., 2019), we know that SGM_{q,σ}(D) obeys (α, ϵ)-RDP, where ϵ could be max{D_α((1−q)N(0, σ²) + qN(1, σ²) ‖ N(0, σ²)), D_α(N(0, σ²) ‖ (1−q)N(0, σ²) + qN(1, σ²))}. According to Theorem 5 in (Mironov et al., 2019), D_α((1−q)N(0, σ²) + qN(1, σ²) ‖ N(0, σ²)) ≥ D_α(N(0, σ²) ‖ (1−q)N(0, σ²) + qN(1, σ²)). Thus, ϵ = Ω_{q,σ}(α), where Ω_{q,σ}(α) ≜ D_α((1−q)N(0, σ²) + qN(1, σ²) ‖ N(0, σ²)).

Eventually, we can present the proofs of our theorems.
Proof [proof of Theorem 3.1] Define $g(D) = \sum_{i=1}^{|D|} x_i$, where $D = \{x_i, c\}_{i=1}^{|D|}$, and $g(\emptyset) = 0$. For any two adjacent datasets $D, D' \subset T_c$ ($D' = D \cup \{x'\}$), we have
$$\|g(D) - g(D')\|_2 = \|x'\|_2 \leq b\sqrt{d}. \quad (15)$$
Thus, $\|\frac{1}{b\sqrt{d}} g(D) - \frac{1}{b\sqrt{d}} g(D')\|_2 \leq 1$. In Algorithm 1, a synthetic data sample can be represented by
$$s_i^c = \frac{b\sqrt{d}}{L} \Big( \frac{1}{b\sqrt{d}} g(D_c) + \mathcal{N}(0, \tilde{\sigma}_1^2 I) \Big), \text{ where } \tilde{\sigma}_1 = \sigma/(b\sqrt{d}), \quad (16)$$
and $D_c$ is a subset sampled from $T_c$ by Poisson sampling with sampling probability $q_c = L/N_c$. Apparently, $\mathrm{SG}_{q_c,\tilde{\sigma}_1}(T_c) = \frac{1}{b\sqrt{d}} g(D_c) + \mathcal{N}(0, \tilde{\sigma}_1^2 I)$ is a sampled Gaussian mechanism (SGM), and $\|\frac{1}{b\sqrt{d}} g(D) - \frac{1}{b\sqrt{d}} g(D')\|_2 \leq 1$. Given Lemma A.2, $\mathrm{SG}_{q_c,\tilde{\sigma}_1}(T_c)$ obeys $(\alpha, \Omega_{q_c,\tilde{\sigma}_1}(\alpha))$-RDP with $\tilde{\sigma}_1 = \sigma/(b\sqrt{d})$ and $q_c = L/N_c$. According to the post-processing property, $s^c$ also guarantees $(\alpha, \Omega_{q_c,\tilde{\sigma}_1}(\alpha))$-RDP. Since Algorithm 1 outputs $M$ synthetic samples for each class $c$, according to Lemma 2.2, Algorithm 1 guarantees $(\alpha, M\Omega_{q_c,\tilde{\sigma}_1}(\alpha))$-RDP for $T_c$. Since $T_1, T_2, \ldots, T_C$ are disjoint, according to Lemma 2.3, Algorithm 1 guarantees $(\alpha, \max_c(M\Omega_{q_c,\tilde{\sigma}_1}(\alpha)))$-RDP for $T = T_1 \cup T_2 \cup \ldots \cup T_C$. Corollary A.1 indicates that $\max_c(M\Omega_{q_c,\tilde{\sigma}_1}(\alpha)) = M\Omega_{\max_c(q_c),\tilde{\sigma}_1}(\alpha)$. Therefore, Algorithm 1 obeys $(\alpha, M\Omega_{q,\tilde{\sigma}_1}(\alpha))$-RDP with $q = \max_c(L/N_c)$ and $\tilde{\sigma}_1 = \sigma/(b\sqrt{d})$.

Proof [proof of Theorem 3.2] For $D = \{x_i, y_i\}_{i=1}^{|D|}$, define $g(D) = \sum_{i=1}^{|D|} \min\big(1, \frac{G}{\|r(x_i)\|_2}\big) r(x_i)$, where $r(x_i) = \Phi_\theta(\mathcal{A}_{w_c}(x_i))$. Also, define $g(\emptyset) = 0$. For any two adjacent subsets $D, D' \subset T_c$ with $D' = D \cup \{x'\}$, we have
$$\|g(D) - g(D')\|_2 = \Big\| \min\Big(1, \frac{G}{\|r(x')\|_2}\Big) r(x') \Big\|_2 \leq G. \quad (17)$$
Thus, $\|\frac{1}{G} g(D) - \frac{1}{G} g(D')\|_2 \leq 1$. In each iteration of Algorithm 2, we rewrite $\mathcal{N}(0, \sigma^2 I) + \sum_{i=1}^{|D_c|} r(x_i^c)$ as
$$G \Big( \mathcal{N}(0, \tilde{\sigma}_2^2 I) + \frac{1}{G} g(D_c) \Big), \text{ where } \tilde{\sigma}_2 = \sigma/G. \quad (18)$$
Apparently, $\mathrm{SG}_{q_c,\tilde{\sigma}_2}(T_c) = \frac{1}{G} g(D_c) + \mathcal{N}(0, \tilde{\sigma}_2^2 I)$ with $\tilde{\sigma}_2 = \sigma/G$ and $q_c = L/N_c$ is also a sampled Gaussian mechanism (SGM).
Since $\|\frac{1}{G} g(D) - \frac{1}{G} g(D')\|_2 \leq 1$, $\mathrm{SG}_{q_c,\tilde{\sigma}_2}(T_c)$ obeys $(\alpha, \Omega_{q_c,\tilde{\sigma}_2}(\alpha))$-RDP, where $\Omega_{q_c,\tilde{\sigma}_2}(\alpha) \triangleq D_\alpha((1-q_c)\mathcal{N}(0,\tilde{\sigma}_2^2) + q_c\mathcal{N}(1,\tilde{\sigma}_2^2) \,\|\, \mathcal{N}(0,\tilde{\sigma}_2^2))$. Given the post-processing property, $\mathcal{N}(0,\sigma^2 I) + g(D_c)$ also guarantees $(\alpha, \Omega_{q_c,\tilde{\sigma}_2}(\alpha))$-RDP with $q_c = L/N_c$ and $\tilde{\sigma}_2 = \sigma/G$. According to Lemma 2.2, Algorithm 2 guarantees $(\alpha, I\Omega_{q_c,\tilde{\sigma}_2}(\alpha))$-RDP for $T_c$. Since $T_1, T_2, \ldots, T_C$ are disjoint, based on Lemma 2.3, we know that Algorithm 2 guarantees $(\alpha, \max_c(I\Omega_{q_c,\tilde{\sigma}_2}(\alpha)))$-RDP for $T = T_1 \cup T_2 \cup \ldots \cup T_C$. Corollary A.1 indicates that $\max_c(I\Omega_{q_c,\tilde{\sigma}_2}(\alpha)) = I\Omega_{\max_c(q_c),\tilde{\sigma}_2}(\alpha)$. Thus, Algorithm 2 obeys $(\alpha, I\Omega_{\max_c(q_c),\tilde{\sigma}_2}(\alpha))$-RDP with $q = \max_c(L/N_c)$ and $\tilde{\sigma}_2 = \sigma/G$.

Here we also prove that Lemma 2.1 is tighter than Mironov (2017)'s conversion law. We first state that law:

Lemma A.3 (RDP-to-DP Conversion (Mironov, 2017)) If a randomized mechanism $\mathcal{M}$ guarantees $(\alpha, \epsilon)$-RDP, then it also obeys $(\epsilon + \log(1/\delta)/(\alpha-1), \delta)$-DP.

Proof Since $(\alpha-1)/\alpha < 1$, $\log((\alpha-1)/\alpha) < 0$. Since $\alpha > 1$, $\log\alpha > 0$, and thus $-(\log\alpha)/(\alpha-1) < 0$. Combining the two inequalities, we have $\log((\alpha-1)/\alpha) - (\log\alpha)/(\alpha-1) < 0$ when $\alpha > 1$. We add $\epsilon + \log(1/\delta)/(\alpha-1)$ to both sides of this inequality and obtain
$$\epsilon + \log((\alpha-1)/\alpha) - (\log\delta + \log\alpha)/(\alpha-1) < \epsilon + \log(1/\delta)/(\alpha-1) \text{ when } \alpha > 1.$$
Therefore, Lemma 2.1 is a tighter conversion law than Lemma A.3.
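The gap between the two conversion laws can also be checked numerically. The sketch below is ours (function names are illustrative); it implements both conversions exactly as stated above and verifies that Lemma 2.1's bound is strictly smaller.

```python
import math

def dp_eps_mironov(eps_rdp, alpha, delta):
    """Mironov (2017): (alpha, eps)-RDP implies (eps + log(1/delta)/(alpha-1), delta)-DP."""
    return eps_rdp + math.log(1.0 / delta) / (alpha - 1)

def dp_eps_lemma21(eps_rdp, alpha, delta):
    """The tighter conversion of Lemma 2.1:
    eps + log((alpha-1)/alpha) - (log(delta) + log(alpha))/(alpha-1)."""
    return (eps_rdp + math.log((alpha - 1) / alpha)
            - (math.log(delta) + math.log(alpha)) / (alpha - 1))

# The tighter law yields a strictly smaller DP epsilon for every alpha > 1.
for alpha in (1.5, 2.0, 8.0, 64.0):
    for delta in (1e-5, 1e-8):
        assert dp_eps_lemma21(1.0, alpha, delta) < dp_eps_mironov(1.0, alpha, delta)
```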

B COMPUTATION METHOD FOR $\Omega_{q,\sigma}(\alpha)$

In this section, we briefly introduce the computation method in (Mironov et al., 2019), which is used in Meta's Opacus library (Yousefpour et al., 2021). Mironov et al. (2019) did not provide a convergence analysis of the infinite series for computing the integral when $\alpha$ is fractional, so for completeness, we provide a detailed convergence proof in this section.

Let $u(z) \triangleq \mathcal{N}(0,\sigma^2)$ and $\nu(z) \triangleq \mathcal{N}(1,\sigma^2)$. We can express $\Omega_{q,\sigma}(\alpha)$ as
$$\Omega_{q,\sigma}(\alpha) = \frac{1}{\alpha-1} \ln \int_{-\infty}^{+\infty} u(z)\Big(1-q+q\frac{\nu(z)}{u(z)}\Big)^{\alpha} dz = \frac{1}{\alpha-1} \ln \mathbb{E}_{z\sim u(z)}\Big[\Big(1-q+q\frac{\nu(z)}{u(z)}\Big)^{\alpha}\Big].$$
If $\alpha$ is an integer, the integrand can be expressed as the sum of a finite number of terms by the binomial expansion, i.e., $(1-q+q\frac{\nu(z)}{u(z)})^{\alpha} = \sum_{k=0}^{\alpha} \binom{\alpha}{k} (1-q)^{\alpha-k} q^k (\frac{\nu(z)}{u(z)})^k$. Thus, $\Omega_{q,\sigma}(\alpha)$ can be rewritten as
$$\Omega_{q,\sigma}(\alpha) = \frac{1}{\alpha-1} \ln \mathbb{E}_{z\sim u(z)}\Big[\sum_{k=0}^{\alpha} \binom{\alpha}{k} (1-q)^{\alpha-k} q^k \Big(\frac{\nu(z)}{u(z)}\Big)^k\Big] = \frac{1}{\alpha-1} \ln \sum_{k=0}^{\alpha} \binom{\alpha}{k} (1-q)^{\alpha-k} q^k\, \mathbb{E}_{z\sim u(z)}\Big[\Big(\frac{\nu(z)}{u(z)}\Big)^k\Big].$$
The remaining problem is to compute $\mathbb{E}_{z\sim u(z)}[(\frac{\nu(z)}{u(z)})^k]$. Given that $u(z) = \frac{1}{\sigma\sqrt{2\pi}}\exp(\frac{-z^2}{2\sigma^2})$ and $\nu(z) = \frac{1}{\sigma\sqrt{2\pi}}\exp(\frac{-(z-1)^2}{2\sigma^2})$, it is not hard to derive that $\mathbb{E}_{z\sim u(z)}[(\frac{\nu(z)}{u(z)})^k] = \exp(\frac{k^2-k}{2\sigma^2})$. Therefore, there is an analytical solution to $\Omega_{q,\sigma}(\alpha)$ when $\alpha$ is an integer, i.e.,
$$\Omega_{q,\sigma}(\alpha) = \frac{1}{\alpha-1} \ln \sum_{k=0}^{\alpha} \binom{\alpha}{k} (1-q)^{\alpha-k} q^k \exp\Big(\frac{k^2-k}{2\sigma^2}\Big).$$
If $\alpha$ is fractional ($\alpha \in \mathbb{R}$), we need to rely on the general binomial theorem to expand $(1-q+q\frac{\nu(z)}{u(z)})^{\alpha}$.

Lemma B.1 (General Binomial Theorem) If $\alpha, \gamma \in \mathbb{R}$ and $|\gamma| < 1$, then $(1+\gamma)^{\alpha}$ can be expressed as a convergent series, i.e., $(1+\gamma)^{\alpha} = \sum_{k=0}^{\infty} \binom{\alpha}{k} \gamma^k$, where $\binom{\alpha}{k} = \frac{\prod_{m=0}^{k-1}(\alpha-m)}{k!}$.

Based on Lemma B.1, we have the following corollary:

Corollary B.1 If $\alpha, \beta, \gamma \in \mathbb{R}$ and $|\gamma| < |\beta|$, then $(\beta+\gamma)^{\alpha}$ can also be expressed as a convergent series, i.e., $(\beta+\gamma)^{\alpha} = \beta^{\alpha}\big(1+\frac{\gamma}{\beta}\big)^{\alpha} = \beta^{\alpha} \sum_{k=0}^{\infty} \binom{\alpha}{k} \big(\frac{\gamma}{\beta}\big)^k = \sum_{k=0}^{\infty} \binom{\alpha}{k} \gamma^k \beta^{\alpha-k}$, where $\binom{\alpha}{k} = \frac{\prod_{m=0}^{k-1}(\alpha-m)}{k!}$.
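Returning to the integer case for a moment, the closed form above is straightforward to implement. The sketch below is ours (function name illustrative) and checks it against the hand-derivable value at $\alpha = 2$, $\sigma = 1$, where the sum collapses to $\ln(1 + q^2(e-1))$.

```python
import math

def omega_int(q, sigma, alpha):
    """Closed form of Omega_{q,sigma}(alpha) for integer alpha > 1:
    (1/(alpha-1)) * ln sum_k C(alpha,k) (1-q)^(alpha-k) q^k exp((k^2-k)/(2 sigma^2))."""
    assert isinstance(alpha, int) and alpha > 1
    s = sum(math.comb(alpha, k) * (1 - q) ** (alpha - k) * q ** k
            * math.exp((k * k - k) / (2 * sigma ** 2))
            for k in range(alpha + 1))
    return math.log(s) / (alpha - 1)

# At alpha = 2, sigma = 1, the sum collapses to 1 + q^2 (e - 1).
q = 0.01
assert math.isclose(omega_int(q, 1.0, 2), math.log(1 + q * q * (math.e - 1)))
```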
Given that $z_0 = \frac{1}{2} + \sigma^2\ln(q^{-1}-1)$ ($z_1$ in (Mironov et al., 2019)), $1-q > q\frac{\nu(z)}{u(z)}$ when $z < z_0$, and $1-q < q\frac{\nu(z)}{u(z)}$ when $z > z_0$. According to Corollary B.1, we have the following convergent series:
$$\Big(1-q+q\frac{\nu(z)}{u(z)}\Big)^{\alpha} = \begin{cases} \sum_{k=0}^{\infty} \binom{\alpha}{k} (1-q)^{\alpha-k} q^k \big(\frac{\nu(z)}{u(z)}\big)^k & \text{when } z < z_0 \\ \sum_{k=0}^{\infty} \binom{\alpha}{k} (1-q)^k q^{\alpha-k} \big(\frac{\nu(z)}{u(z)}\big)^{\alpha-k} & \text{when } z > z_0 \end{cases} \quad (26)$$
In the following, we prove that the two series in Eq. 26 are also convergent at $z = z_0$.

Proof [Eq. 26 is convergent at $z = z_0$] If we substitute $z$ in Eq. 26 with $z_0$ (where $1-q = q\frac{\nu(z_0)}{u(z_0)}$), we obtain
$$\sum_{k=0}^{\infty} \binom{\alpha}{k} (1-q)^{\alpha-k} q^k \Big(\frac{\nu(z_0)}{u(z_0)}\Big)^k = \sum_{k=0}^{\infty} \binom{\alpha}{k} (1-q)^k q^{\alpha-k} \Big(\frac{\nu(z_0)}{u(z_0)}\Big)^{\alpha-k} = (1-q)^{\alpha} \sum_{k=0}^{\infty} \binom{\alpha}{k}.$$
We first show that $\lim_{k\to+\infty} |\binom{\alpha}{k}| = 0$ by representing $|\binom{\alpha}{k}|$ as a product of three parts, i.e.,
$$\Big|\binom{\alpha}{k}\Big| = \Big|\frac{1}{k}\Big| \cdot \Big|\frac{\prod_{m=0}^{\lfloor\alpha\rfloor}(\alpha-m)}{\lfloor\alpha\rfloor!}\Big| \cdot \Big|\prod_{m=\lfloor\alpha\rfloor+1}^{k-1} \frac{\alpha-m}{m}\Big|. \quad (28)$$
When $m \geq \lfloor\alpha\rfloor+1$, $|\alpha-m| = m-\alpha < m$. Thus, $|\frac{\alpha-m}{m}| < 1$ when $m \geq \lfloor\alpha\rfloor+1$, so we have $|\prod_{m=\lfloor\alpha\rfloor+1}^{k-1} \frac{\alpha-m}{m}| < 1$. Since $|\frac{\prod_{m=0}^{\lfloor\alpha\rfloor}(\alpha-m)}{\lfloor\alpha\rfloor!}|$ is a finite number and $\lim_{k\to+\infty}\frac{1}{k} = 0$, we have $\lim_{k\to+\infty} |\binom{\alpha}{k}| = 0$. In summary, the series $(1-q)^{\alpha} \sum_{k=0}^{\infty} \binom{\alpha}{k}$ satisfies the following conditions: (i) $\lim_{k\to+\infty} |\binom{\alpha}{k}| = 0$; (ii) when $k \geq \lfloor\alpha\rfloor+1$ (i.e., $k > \alpha$), $\binom{\alpha}{k}$ and $\binom{\alpha}{k+1}$ have different signs; (iii) $|\binom{\alpha}{k+1}| < |\binom{\alpha}{k}|$ since $|\binom{\alpha}{k+1}|/|\binom{\alpha}{k}| = |\frac{\alpha-k}{k+1}| < 1$. Given these conditions, according to the alternating series test, we can conclude that $\sum_{k=0}^{\infty} \binom{\alpha}{k}$ is a convergent series with $\binom{\alpha}{k} = \frac{\prod_{m=0}^{k-1}(\alpha-m)}{k!}$.

Therefore, we can rewrite Eq. 26 as (Eq.
12 in (Mironov et al., 2019))
$$\Big(1-q+q\frac{\nu(z)}{u(z)}\Big)^{\alpha} = \begin{cases} \sum_{k=0}^{\infty} \binom{\alpha}{k} (1-q)^{\alpha-k} q^k \big(\frac{\nu(z)}{u(z)}\big)^k & \text{when } z \leq z_0 \\ \sum_{k=0}^{\infty} \binom{\alpha}{k} (1-q)^k q^{\alpha-k} \big(\frac{\nu(z)}{u(z)}\big)^{\alpha-k} & \text{when } z \geq z_0 \end{cases} \quad (29)$$
As a result, we can rewrite $\Omega_{q,\sigma}(\alpha)$ as
$$\Omega_{q,\sigma}(\alpha) = \frac{1}{\alpha-1}\ln\Big( \int_{-\infty}^{z_0} u(z) \sum_{k=0}^{\infty}\binom{\alpha}{k}(1-q)^{\alpha-k}q^k\Big(\frac{\nu(z)}{u(z)}\Big)^k dz + \int_{z_0}^{+\infty} u(z)\sum_{k=0}^{\infty}\binom{\alpha}{k}(1-q)^k q^{\alpha-k}\Big(\frac{\nu(z)}{u(z)}\Big)^{\alpha-k} dz \Big)$$
$$= \frac{1}{\alpha-1}\ln\Big( \sum_{k=0}^{\infty}\binom{\alpha}{k}(1-q)^{\alpha-k}q^k \int_{-\infty}^{z_0} u(z)\Big(\frac{\nu(z)}{u(z)}\Big)^k dz + \sum_{k=0}^{\infty}\binom{\alpha}{k}(1-q)^k q^{\alpha-k} \int_{z_0}^{+\infty} u(z)\Big(\frac{\nu(z)}{u(z)}\Big)^{\alpha-k} dz \Big). \quad (30)$$
According to (Mironov et al., 2019), we can compute the integrals in the above series by
$$\int_{-\infty}^{z_0} u(z)\Big(\frac{\nu(z)}{u(z)}\Big)^k dz = \frac{1}{2}\exp\Big(\frac{k^2-k}{2\sigma^2}\Big)\mathrm{erfc}\Big(\frac{k-z_0}{\sqrt{2}\sigma}\Big), \qquad \int_{z_0}^{+\infty} u(z)\Big(\frac{\nu(z)}{u(z)}\Big)^{\alpha-k} dz = \frac{1}{2}\exp\Big(\frac{(\alpha-k)^2-(\alpha-k)}{2\sigma^2}\Big)\mathrm{erfc}\Big(\frac{z_0-(\alpha-k)}{\sqrt{2}\sigma}\Big). \quad (31)$$
The remaining problem is to prove that Eq. 30 is convergent. Since Mironov et al. (2019) did not provide a convergence analysis, for completeness, we detail the convergence proof in the following.

Proof [Eq. 30 is convergent] We first prove that the first half inside $\ln(\cdot)$ in Eq. 30, i.e., $\sum_{k=0}^{\infty}\binom{\alpha}{k}(1-q)^{\alpha-k}q^k\int_{-\infty}^{z_0} u(z)(\frac{\nu(z)}{u(z)})^k dz$, is a convergent series. We rewrite the series as
$$(1-q)^{\alpha} \sum_{k=0}^{\infty}\binom{\alpha}{k}\Big(\frac{q}{1-q}\Big)^k \int_{-\infty}^{z_0} u(z)\Big(\frac{\nu(z)}{u(z)}\Big)^k dz. \quad (32)$$
Given that $z_0 = \frac{1}{2}+\sigma^2\ln(q^{-1}-1)$, we have
$$\exp\Big(\frac{2kz_0-k}{2\sigma^2}\Big) = \exp(k\ln(q^{-1}-1)) = \Big(\frac{1-q}{q}\Big)^k.$$
Thus,
$$\Big(\frac{q}{1-q}\Big)^k \int_{-\infty}^{z_0} u(z)\Big(\frac{\nu(z)}{u(z)}\Big)^k dz = \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{z_0} \exp\Big\{\frac{-z^2+2k(z-z_0)}{2\sigma^2}\Big\} dz.$$
Since $z - z_0 \leq 0$ for $z \in (-\infty, z_0]$, $\exp(\frac{2(z-z_0)}{2\sigma^2}) \leq 1$, and thus
$$\frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{z_0}\exp\Big\{\frac{-z^2+2(k+1)(z-z_0)}{2\sigma^2}\Big\}dz \leq \frac{1}{\sigma\sqrt{2\pi}}\int_{-\infty}^{z_0}\exp\Big\{\frac{-z^2+2k(z-z_0)}{2\sigma^2}\Big\}dz. \quad (35)$$
Therefore, according to the alternating series test (the term magnitudes are decreasing, and the terms eventually alternate in sign; see the annex for the details), the first half inside $\ln(\cdot)$ in Eq. 30 is a convergent series, and a symmetric argument (also detailed in the annex) shows that the second half inside $\ln(\cdot)$ in Eq. 30 is convergent as well. All in all, Eq. 30 is convergent.

The practical computation of $\Omega_{q,\sigma}(\alpha)$ when $\alpha$ is fractional proceeds by computing $\int_{-\infty}^{z_0} u(z)(\frac{\nu(z)}{u(z)})^k dz$ and $\int_{z_0}^{+\infty} u(z)(\frac{\nu(z)}{u(z)})^{\alpha-k} dz$ with Eq. 31 and then plugging the results into the series in Eq.
30 until convergence (Mironov et al., 2019).

C SYNTHETIC DATA DEFINED BY EQ. 8

According to (Dong et al., 2022), the synthetic data defined by Eq. 7 are represented under the orthogonal basis $E = \{e_1, e_2, \ldots, e_d\}$. Thus, we need to transform the synthetic data back to the standard basis with Eq. 8. Here we explain how to compute the synthetic dataset defined by Eq. 8.

D ADDITIONAL EXPERIMENTAL RESULTS

We visualize the synthetic images generated by DM, DP-Sinkhorn, DP-MERF, LDPDC, and NDPDC in Fig. 2 and Fig. 3. Compared to DM-generated images, NDPDC-generated images are noisier due to the DP guarantees. A surprising result is that LDPDC-generated images look like noise, yet models can still learn some information from them. We conjecture that this is because LDPDC-generated images still carry certain patterns, but the patterns are hardly perceptible to human beings because of the high noise (σ = √d). We remark that, although the synthetic images generated by DP-Sinkhorn on CelebA look like faces, they are not very colorful or diverse. Therefore, when tested on the colorful and diverse original CelebA images, the model trained on NDPDC-generated images achieves better accuracy than the model trained on DP-Sinkhorn-generated images.

Beyond the visualization results, we report the results of NDPDC with high σ and low DP budgets in Table 8. We set σ to 0.5 and report the results of DP-MERF (Harder et al., 2021) in Table 9. We also evaluate DP-HP using Vinaroz et al. (2022)'s publicly available code and report the results on MNIST and FMNIST in Table 10.

Dataset   Test Acc          DP budget
MNIST     86.70% ± 2.07%    (11.60, 10^-5)-DP
FMNIST    70.38% ± 0.79%    (11.60, 10^-5)-DP
CIFAR10   20.61% ± 0.87%    (11.60, 10^-5)-DP
CelebA    69.51% ± 1.69%    (11.60, 10^-5)-DP

Table 9: The performance of DP-MERF with σ = 0.5. We employ 50 synthetic samples per class to train the ConvNet models for evaluation.

Dataset   Test Acc          DP budget
MNIST     74.20% ± 1.66%    (1, 10^-5)-DP
FMNIST    28.05% ± 1.12%    (1, 10^-5)-DP

Table 10: The performance of DP-HP. We follow the instructions and run the code from https://github.com/ParkLabML/DP-HP to generate synthetic data. We employ 50 synthetic samples per class to train the ConvNet models for evaluation.

E EXTENDED RELATED WORK

Generative Methods The previous literature has proposed an array of generative methods for synthetic data generation (Kingma & Welling, 2013; Goodfellow et al., 2014; Mirza & Osindero, 2014; Higgins et al., 2016; Arjovsky et al., 2017; Brock et al., 2018). Due to the growing privacy concern, recent research also focuses on developing differentially private generative methods (Xie et al., 2018; Jordon et al., 2018; Torkzadehmahani et al., 2019; Long et al., 2021). Xie et al. (2018) first combined DP-SGD and GAN to generate private synthetic data. Torkzadehmahani et al. (2019) combined conditional GAN and DP-SGD to generate class-conditional private data. Jordon et al. (2018) applied PATE (Papernot et al., 2016) to GAN and developed a differentially private GAN framework called PATE-GAN. PATE-GAN trains a student discriminator on the labels output by the PATE mechanism and trains the generator on the generative loss computed over the student discriminator. Long et al. (2021) proposed a framework called G-PATE with a private gradient aggregation mechanism to enable a better combination of PATE and GAN. GS-WGAN (Chen et al., 2020) proposed to selectively apply the randomized mechanism in DP-SGD to maximally preserve the true gradient direction and to use the Wasserstein objective to improve the amount of gradient information flow during training the generative models. DP-MERF (Harder et al., 2021) proposed to train the generator by matching the mean embeddings of the real data and the generator-output synthetic data. The main differences between DP-MERF and NDPDC include (i) NDPDC uses neural-network-based feature extractors to compute the representations, while DP-MERF uses random Fourier features to compute the embeddings; (ii) NDPDC directly optimizes the synthetic data, while DP-MERF optimizes the generative model parameters; (iii) NDPDC and DP-MERF apply the Gaussian mechanism in different ways.
DP-Sinkhorn (Cao et al., 2021) framed the generative learning problem as minimizing an optimal transport distance and trained the generative models using a semi-debiased Sinkhorn loss. Cao et al. (2021) demonstrated that, under a (10, 10^-5)-DP budget, DP-Sinkhorn can generate synthetic data with better utility and quality than G-PATE and GS-WGAN on MNIST and FashionMNIST.

Differential Privacy In addition to DP and RDP, prior research has proposed some other differential privacy notions, such as Concentrated Differential Privacy (CDP) (Dwork & Rothblum, 2016), zero-CDP (Bun & Steinke, 2016), and truncated CDP (Bun et al., 2018). CDP (Dwork & Rothblum, 2016) is a relaxation of DP: an algorithm obeys (µ, τ)-CDP if its privacy loss random variable has mean µ and its deviation from µ is sub-Gaussian with standard τ. Zero-CDP (Bun & Steinke, 2016) is an alternative formulation of CDP, and truncated CDP is a relaxation of zero-CDP. Beyond proposing new notions for enhancing privacy analysis, the prior literature also studied privacy amplification by (sub)sampling, which was first proposed in (Kasiviswanathan et al., 2011; Beimel et al., 2013). The randomness introduced by sampling benefits the analyses in (Bassily et al., 2014; Foulds et al., 2016). Recently, Wang et al. (2019) and Mironov et al. (2019) studied the (sub)sampled Gaussian mechanism, a combination of subsampling and the Gaussian mechanism, and delivered privacy analysis under the RDP framework. We follow Opacus (Yousefpour et al., 2021) in exploiting (Mironov et al., 2019)'s results in our proof and experiments, and we provide a detailed convergence analysis for (Mironov et al., 2019)'s computation method.

F ADDITIONAL DISCUSSION

Additional Discussion on LDPDC and DPMix Similar to LDPDC, DPMix (Lee et al., 2019) is a linear algorithm for differentially private data generation. However, there are some differences between LDPDC and DPMix: 1. LDPDC does not need to randomize the labels, thanks to the parallel composition law, while DPMix needs to randomize the labels. We note that adding noise to the labels may hurt the model performance. 2. LDPDC adds noise to the sum of the samples and divides the result by the fixed group size L, while DPMix directly adds noise to the mean of the samples. For LDPDC, dividing the randomized result by the group size L helps dilute the negative effects of the random noise on model performance. In fact, we find that if LDPDC instead directly adds noise to the mean of the samples, as DPMix does, its performance becomes much worse. 3. DPMix uses sampling without replacement, while LDPDC uses Poisson sampling. Note that Poisson sampling is the standard sampling method in the state-of-the-art PyTorch library for differentially private deep learning (Yousefpour et al., 2021). We have reproduced DPMix and observe that LDPDC outperforms DPMix under the settings for dataset condensation. Specifically, with σ_X = σ_Y = 1, L = 50, and M = 50, the accuracy we reproduce for DPMix is only 10.22% ± 1.23% on CIFAR10. We conjecture that this may be because (i) the operations of adding noise to the labels and to the mean of the samples in DPMix cause more negative effects on model performance than LDPDC's design; (ii) DPMix may only be able to achieve the results reported in (Lee et al., 2019) with a large number of synthetic samples; (iii) some details missing from the published version of (Lee et al., 2019) may affect the effectiveness of DPMix. If we instead refer to the results reported in (Lee et al., 2019), we also observe that LDPDC achieves comparable performance to DPMix with lower DP budgets.
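The noise-dilution effect behind difference 2 is easy to illustrate: adding $\mathcal{N}(0,\sigma^2)$ to a sum and then dividing by $L$ leaves per-sample noise of standard deviation $\sigma/L$, whereas adding the same noise directly to a mean leaves $\sigma$. The simulation below is ours, with illustrative constants (not the two papers' actual calibrations, which also differ in sensitivity):

```python
import random
import statistics

# LDPDC perturbs the SUM and divides by the fixed group size L, so the
# effective noise on a synthetic sample has std sigma/L; perturbing the MEAN
# directly (as DPMix does) leaves std sigma.  Quick simulation at equal sigma:
random.seed(0)
sigma, L, trials = 4.0, 50, 20000
sum_then_divide = [random.gauss(0.0, sigma) / L for _ in range(trials)]
noise_on_mean = [random.gauss(0.0, sigma) for _ in range(trials)]
assert statistics.stdev(sum_then_divide) < statistics.stdev(noise_on_mean) / 10
```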

General DP Data Generation vs. DP Dataset Condensation

There is a major difference between general DP data generation methods and DP dataset condensation methods: general DP data generation methods typically aim to train generative models that can generate new synthetic data samples, while DP dataset condensation methods aim to condense the original dataset into a small synthetic dataset while maintaining the data utility for training models. To achieve their respective goals, DP generative methods optimize the generative model parameters, while NDPDC directly optimizes the small synthetic dataset. Thus, DP data generation methods may be more useful for generating new data samples with the learned generators, but DP dataset condensation is more useful for data-efficient learning (Zhao & Bilen, 2021a; Zhao et al., 2021; Zhao & Bilen, 2021b). This is because DP dataset condensation significantly reduces the cost of data storage and model training, and the models trained on a small amount of data produced by DP dataset condensation methods have better utility than the models trained on a small amount of data generated by DP generative methods.

Practical Use of DP Dataset Condensation A useful and practical application of DP dataset condensation is the following: data owners with privacy concerns can condense their datasets into small synthetic datasets using DP dataset condensation methods. With DP protection, the data owners would feel more comfortable sharing the synthetic datasets with other users or devices. Even users or devices without much computational resources or storage can store the small synthetic datasets and train models with reasonable utility. The data owners could also share their data with trusted entities such as governments, hospitals, or banks. To mitigate the data owners' concerns, the trusted entities can execute DP dataset condensation methods on the collected data and share the small synthetic datasets with other users or devices.
In this application, DP dataset condensation is a better choice than general DP data generation, since models trained on the small synthetic datasets produced by DP dataset condensation methods have better utility than models trained on the data generated by DP generative methods.



† Similar to the accuracy achieved by DM with linear extractors reported in (Dong et al., 2022)'s appendix. § We refer interested readers to (Zhao & Bilen, 2021b) for more details on A_{w_c}(·). ¶ https://github.com/nv-tlabs/DP-Sinkhorn_code || https://github.com/ParkLabML/DP-MERF/tree/master/code_balanced



Definition 2.1 (Differential Privacy (DP)) For two adjacent datasets D and D ′ , and every possible output set O, if a randomized mechanism M satisfies P[M(D) ∈ O] ≤ e ϵ P[M(D ′ ) ∈ O] + δ, then M obeys (ϵ, δ)-DP.
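For intuition, the standard Gaussian mechanism from the DP literature satisfies Definition 2.1 when its noise is calibrated to the query's L2 sensitivity. The sketch below is ours and purely illustrative; it is not the mechanism analyzed in this paper, which instead tracks the sampled Gaussian mechanism under RDP.

```python
import math
import random

def gaussian_mechanism(value, sensitivity, eps, delta):
    """Classic Gaussian mechanism: for a query with the given L2 sensitivity,
    adding N(0, sigma^2) noise with sigma = sensitivity*sqrt(2 ln(1.25/delta))/eps
    satisfies (eps, delta)-DP for eps < 1.  Illustrative sketch only."""
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps
    return value + random.gauss(0.0, sigma), sigma

random.seed(0)
noisy, sigma = gaussian_mechanism(42.0, sensitivity=1.0, eps=0.5, delta=1e-5)
assert sigma == math.sqrt(2.0 * math.log(1.25e5)) / 0.5
```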

REVISITING THE ICML'22 WORK (DONG ET AL., 2022)

Dong et al. (2022) first established a connection between DC and DP via Proposition 4.10 in their paper, which is unfortunately proved upon two controversial assumptions. One explicitly controversial assumption concerns the posterior model parameter distribution, as stated in Assumption 3.1.

Assumption 3.1 (Assumption 4.8 in (Dong et al., 2022)) Given the training dataset S and loss function ℓ, the distribution of the model parameters is

Linear Differentially Private Dataset Condensation (LDPDC)
Require: Original dataset T = T_1 ∪ T_2 ∪ ... ∪ T_C; the number of classes C; the number of data samples per class N_c; the number of synthetic samples per class M; group size L.
for each class c do
  for j = 1 to M do
    Take a randomly sampled subset D_c = {x^c_k, c}_{k=1}^{L^c_j} from T_c with sampling probability L/N_c (by Poisson sampling, similar to (Abadi et al., 2016; Yousefpour et al., 2021)).
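Only a fragment of Algorithm 1 is reproduced above; as a rough, hypothetical sketch of the per-class step it describes (Poisson-sample, sum, noise, divide by the fixed group size L; function name and simplifications are ours):

```python
import random

def ldpdc_class(data, L, M, sigma):
    """Illustrative sketch of LDPDC for one class: each synthetic sample is a
    Poisson-sampled, Gaussian-noised sum of real samples divided by the fixed
    group size L.  `data` is a list of N_c flattened d-dimensional samples."""
    N_c, d = len(data), len(data[0])
    q = L / N_c                                  # Poisson sampling probability
    synthetic = []
    for _ in range(M):
        # Poisson sampling: each sample joins the group independently with prob. q.
        subset = [x for x in data if random.random() < q]
        total = [sum(x[i] for x in subset) for i in range(d)]
        # Noise the SUM, then divide by the fixed group size L (not |subset|).
        synthetic.append([(t + random.gauss(0.0, sigma)) / L for t in total])
    return synthetic

random.seed(1)
data = [[random.random() for _ in range(8)] for _ in range(100)]
syn = ldpdc_class(data, L=20, M=5, sigma=2.0)
assert len(syn) == 5 and all(len(s) == 8 for s in syn)
```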

Nonlinear Differentially Private Dataset Condensation (NDPDC)
Require: Original dataset T = T_1 ∪ T_2 ∪ ... ∪ T_C; the number of classes C; the number of data samples per class N_c; the number of synthetic samples per class M; feature extractors Φ_θ (not pretrained); parameter distribution P_θ; group size L; number of iterations I.
Initialize S = {{s^c_j}_{j=1}^M}_{c=1}^C with random noise from N(0, I_d)
for each iteration (total number of iterations is I) do
  Randomly sample θ from P_θ and initialize the loss as ℓ = 0
  for each class c do
    Sample the augmentation parameters w_c.
    Take a randomly sampled subset D_c from T_c with sampling probability L/N_c (by Poisson sampling, similar to (Abadi et al., 2016; Yousefpour et al., 2021)).
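Only a fragment of Algorithm 2 is reproduced above; the toy sketch below (ours) illustrates the per-class step it implies: Poisson-sample real data, clip features to norm G, sum, add Gaussian noise, and move the synthetic samples toward the noised feature mean. An identity map stands in for the random feature extractor Φ_θ, so this is not the actual NDPDC implementation.

```python
import math
import random

def clip_to_G(v, G):
    """Scale v so its L2 norm is at most G (the min(1, G/||r||_2) factor)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x * min(1.0, G / n) for x in v] if n > 0 else list(v)

def ndpdc_step(real, syn, L, sigma, G, lr):
    """One NDPDC-style update for a single class, with an identity feature map
    standing in for Phi_theta.  Returns the updated synthetic samples and the
    noised target mean used in the matching loss."""
    d, M = len(real[0]), len(syn)
    q = L / len(real)
    subset = [x for x in real if random.random() < q]      # Poisson sampling
    target = [(sum(clip_to_G(x, G)[i] for x in subset)
               + random.gauss(0.0, sigma)) / L for i in range(d)]
    mean_syn = [sum(s[i] for s in syn) / M for i in range(d)]
    # Gradient of ||target - mean_syn||^2 w.r.t. each synthetic sample s_j.
    grad = [-2.0 * (target[i] - mean_syn[i]) / M for i in range(d)]
    new_syn = [[s[i] - lr * grad[i] for i in range(d)] for s in syn]
    return new_syn, target

random.seed(0)
real = [[random.gauss(5.0, 1.0) for _ in range(4)] for _ in range(200)]
syn = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(10)]
new_syn, target = ndpdc_step(real, syn, L=40, sigma=1.0, G=50.0, lr=1.0)

def dist(samples, t):
    m = [sum(s[i] for s in samples) / len(samples) for i in range(len(t))]
    return math.sqrt(sum((m[i] - t[i]) ** 2 for i in range(len(t))))

assert dist(new_syn, target) < dist(syn, target)  # the matching loss decreased
```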


For each class c, we flatten and concatenate the data from T_c into a data matrix X_c with |T_c| rows and d columns; each row of X_c is a data sample of dimension d. We then compute the transformation matrix Q_c by QR decomposition, i.e., X_c^⊤ = Q_c R_c. After that, we randomly sample M Gaussian noise samples {s̃_{c,i}}_{i=1}^M from N(0, I_d) and compute M synthetic samples for class c by Eq. 8, i.e., s*_{c,i} = Q_c s̃_{c,i} + (1/|T_c|) Σ_{j=1}^{|T_c|} x_{c,j}, where T_c = {x_{c,j}, c}_{j=1}^{|T_c|} and x_{c,j} is also flattened. Finally, we reshape s*_{c,i} into the original data shape. The synthetic dataset is {{s*_{c,i}}_{i=1}^M}_{c=1}^C.
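The steps above can be sketched in a few lines. Note that this is our illustrative reading, not Dong et al. (2022)'s code: we assume the orthogonal matrix Q_c comes from a QR factorization of X_c^⊤ (the only dimensionally consistent reading, since Q_c must act on d-dimensional noise vectors).

```python
import numpy as np

def synth_from_basis(Xc, M, rng):
    """Sketch of the Eq. 8 transform: obtain an orthogonal Q_c from a QR
    factorization of X_c^T (an assumed reading of the basis construction),
    map standard Gaussian noise back to the standard basis, and re-center
    the result at the class mean."""
    d = Xc.shape[1]
    Q, _ = np.linalg.qr(Xc.T, mode='complete')   # Q: d x d orthogonal
    mean = Xc.mean(axis=0)
    noise = rng.standard_normal((M, d))          # s~_{c,i} ~ N(0, I_d)
    return noise @ Q.T + mean                    # rows are s*_{c,i} = Q_c s~ + mean

rng = np.random.default_rng(0)
Xc = rng.standard_normal((50, 16))               # 50 flattened samples, d = 16
S = synth_from_basis(Xc, M=10, rng=rng)
assert S.shape == (10, 16)
```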

Figure 1: The privacy budgets of NDPDC with different σ, L, and I.

(a) Original images. (b) DM generated synthetic images (with random initialization). (c) DP-Sinkhorn generated synthetic images (d) DP-MERF generated synthetic images. (e) LDPDC generated synthetic images. (f) NDPDC generated synthetic images.

Figure 2: Visualizing the synthetic CelebA images. Female synthetic images are listed in the first row, and male synthetic images are listed in the second row.



40% ± 0.33%   83.32% ± 3.03%   83.22% ± 0.67%   85.81% ± 0.74%   88.14% ± 0.73%
FMNIST    68.84% ± 0.58%   68.49% ± 1.36%   66.45% ± 0.94%   67.31% ± 1.01%   67.58% ± 0.92%
CIFAR10   30.60% ± 0.18%   29.82% ± 0.81%   29.65% ± 0.82%   24.86% ± 0.29%   23.79% ± 0.59%
CelebA    69.48% ± 1.51%   66.95% ± 0.53%   67.61% ± 0.87%   66.87% ± 2.23%   65.88% ± 1.42%

Performance on varied model architectures with default settings: For NDPDC, the synthetic data is learned on ConvNet-based feature extractors and evaluated on those model architectures. The privacy budgets and the results on ConvNet are given in Table

Table 4: The averaged testing accuracy of ConvNets trained on the synthetic data generated by NDPDC with different noise multipliers σ.

Effects of Noise Multiplier σ on Privacy and Utility We plot the DP budgets of NDPDC with different σ in Fig. 1 in Appendix D and report the corresponding testing accuracy in Table 4. Fig. 1 in Appendix D and Table 4 indicate that, as σ increases, ε decreases, and the testing accuracy also decreases. However, the testing accuracy does not decrease much as σ increases. Thus, if we are unsatisfied with the DP budget, we can simply increase σ to obtain a low DP budget with only a small loss of synthetic data utility. Additionally, we do not recommend setting σ/G ≤ 0.75; otherwise ε will be larger than 10, and the DP guarantee is not very useful.
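The qualitative trend that ε shrinks as σ grows can be illustrated with the integer-α closed form from Appendix B; the sketch below is ours and is only a stand-in for the full RDP accountant behind Fig. 1 (the constants are illustrative, not the paper's settings).

```python
import math

def rdp_eps_integer_alpha(q, sigma, alpha):
    """Omega_{q,sigma}(alpha) for integer alpha (the closed form from
    Appendix B), used as a proxy for the accountant's per-step epsilon."""
    s = sum(math.comb(alpha, k) * (1 - q) ** (alpha - k) * q ** k
            * math.exp((k * k - k) / (2 * sigma ** 2)) for k in range(alpha + 1))
    return math.log(s) / (alpha - 1)

# A larger noise multiplier sigma gives a smaller RDP epsilon at fixed alpha, q.
eps_values = [rdp_eps_integer_alpha(0.01, s, 8) for s in (0.75, 1.0, 2.0, 4.0)]
assert eps_values == sorted(eps_values, reverse=True)
```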

Table 5: The averaged testing accuracy of ConvNets trained on the synthetic data generated by NDPDC with different group sizes L.

Effects of Group Size L on Privacy and Utility We plot the DP budgets with different L in Fig. 1 in Appendix D and show the testing accuracy in Table 5. If we increase L, more original data are expected to be sampled in each step for learning the synthetic data, and thus both ε and the testing accuracy increase. According to Fig. 1 in Appendix D and Table 5, if we are unsatisfied with the DP budgets, we can also decrease L to 25 to obtain better DP budgets with a minor utility loss.

The averaged testing accuracy of ConvNets trained on the synthetic data generated by NDPDC with different numbers of iterations I.

The averaged testing accuracy of ConvNets achieved by NDPDC with different numbers of synthetic samples per class M.

The performance of NDPDC with a variety of DP budgets.

annex

Therefore, $(\frac{q}{1-q})^k \int_{-\infty}^{z_0} u(z)(\frac{\nu(z)}{u(z)})^k dz$ is non-increasing w.r.t. $k$. Given that $|\binom{\alpha}{k+1}| < |\binom{\alpha}{k}|$, we have one condition for the alternating series test, i.e., $|\binom{\alpha}{k}(\frac{q}{1-q})^k \int_{-\infty}^{z_0} u(z)(\frac{\nu(z)}{u(z)})^k dz|$ is decreasing w.r.t. $k$. Since $(\frac{q}{1-q})^k$ is positive and, when $k \geq \lfloor\alpha\rfloor+1$, $\binom{\alpha}{k}$ and $\binom{\alpha}{k+1}$ have different signs, we know that the consecutive terms $\binom{\alpha}{k}(\frac{q}{1-q})^k \int_{-\infty}^{z_0} u(z)(\frac{\nu(z)}{u(z)})^k dz$ and $\binom{\alpha}{k+1}(\frac{q}{1-q})^{k+1} \int_{-\infty}^{z_0} u(z)(\frac{\nu(z)}{u(z)})^{k+1} dz$ eventually have different signs, which is the other condition. Therefore, according to the alternating series test, we can conclude that the first half inside $\ln(\cdot)$ in Eq. 30 is a convergent series.

Next, we prove that the second half inside $\ln(\cdot)$ in Eq. 30, i.e., $\sum_{k=0}^{\infty}\binom{\alpha}{k}(1-q)^k q^{\alpha-k} \int_{z_0}^{+\infty} u(z)(\frac{\nu(z)}{u(z)})^{\alpha-k} dz$, is a convergent series. We rewrite the series as
$$q^{\alpha} \sum_{k=0}^{\infty}\binom{\alpha}{k}\Big(\frac{1-q}{q}\Big)^k \int_{z_0}^{+\infty} u(z)\Big(\frac{\nu(z)}{u(z)}\Big)^{\alpha-k} dz.$$
Given that $z_0 = \frac{1}{2}+\sigma^2\ln(q^{-1}-1)$, we have $(\frac{1-q}{q})^k = \exp(\frac{2kz_0-k}{2\sigma^2})$. Thus,
$$\Big(\frac{1-q}{q}\Big)^k \int_{z_0}^{+\infty} u(z)\Big(\frac{\nu(z)}{u(z)}\Big)^{\alpha-k} dz = \frac{1}{\sigma\sqrt{2\pi}}\int_{z_0}^{+\infty} \exp\Big\{\frac{-z^2+\alpha(2z-1)-2k(z-z_0)}{2\sigma^2}\Big\} dz.$$
Since $z - z_0 \geq 0$ for $z \in [z_0, +\infty)$, $\exp(\frac{-2(z-z_0)}{2\sigma^2}) \leq 1$. We then have
$$\frac{1}{\sigma\sqrt{2\pi}}\int_{z_0}^{+\infty} \exp\Big\{\frac{-z^2+\alpha(2z-1)-2(k+1)(z-z_0)}{2\sigma^2}\Big\} dz \leq \frac{1}{\sigma\sqrt{2\pi}}\int_{z_0}^{+\infty} \exp\Big\{\frac{-z^2+\alpha(2z-1)-2k(z-z_0)}{2\sigma^2}\Big\} dz.$$
As a result, the term magnitudes of the second half are also decreasing w.r.t. $k$, and, as before, consecutive terms eventually have different signs. Therefore, according to the alternating series test, the second half inside $\ln(\cdot)$ in Eq. 30 is also a convergent series.
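As a concrete check of the series computation (Eqs. 30-31), the sketch below (ours, illustrative) evaluates a truncation of the split series with the erfc integrals. Plain floating point overflows for large k, so production accountants (e.g., Opacus) work in log space; at integer α the series terminates and must match the closed form from the beginning of Appendix B, which the assertion verifies.

```python
import math

def binom_real(alpha, k):
    """Generalized binomial coefficient prod_{m=0}^{k-1}(alpha-m) / k!."""
    out = 1.0
    for m in range(k):
        out *= (alpha - m) / (m + 1)
    return out

def omega_series(q, sigma, alpha, terms=30):
    """Truncated evaluation of Omega_{q,sigma}(alpha) via the split series
    (Eq. 30) and the erfc integrals (Eq. 31).  Plain floats, so only usable
    for modest `terms`; real accountants compute this in log space."""
    z0 = 0.5 + sigma ** 2 * math.log(1.0 / q - 1.0)
    total = 0.0
    for k in range(terms):
        b = binom_real(alpha, k)
        # First half: integral over (-inf, z0] of u(z)(nu/u)^k.
        i0 = 0.5 * math.exp((k * k - k) / (2 * sigma ** 2)) \
             * math.erfc((k - z0) / (math.sqrt(2) * sigma))
        # Second half: integral over [z0, +inf) of u(z)(nu/u)^(alpha-k).
        a_k = alpha - k
        i1 = 0.5 * math.exp((a_k * a_k - a_k) / (2 * sigma ** 2)) \
             * math.erfc((z0 - a_k) / (math.sqrt(2) * sigma))
        total += b * ((1 - q) ** (alpha - k) * q ** k * i0
                      + (1 - q) ** k * q ** a_k * i1)
    return math.log(total) / (alpha - 1)

# At integer alpha the series terminates and matches the integer closed form.
q, sigma = 0.01, 1.0
closed = math.log(sum(math.comb(3, k) * (1 - q) ** (3 - k) * q ** k
                      * math.exp((k * k - k) / 2) for k in range(4))) / 2
assert math.isclose(omega_series(q, sigma, 3.0), closed, rel_tol=1e-6)
```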

