DIFFERENTIALLY PRIVATE DATASET CONDENSATION

Abstract

Recent work in ICML'22 builds a theoretical connection between dataset condensation (DC) and differential privacy (DP) and claims that DC can provide privacy protection for free. However, the connection is problematic because it rests on two controversial assumptions. In this paper, we revisit the ICML'22 work and elucidate the issues in the two controversial assumptions. To correctly connect DC and DP, we propose two differentially private dataset condensation (DPDC) algorithms: LDPDC and NDPDC. Through extensive evaluations, we demonstrate that LDPDC has comparable performance to recent DP generative methods despite its simplicity, and that NDPDC provides acceptable DP guarantees with a mild utility loss compared to distribution matching (DM). Additionally, NDPDC allows a flexible trade-off between synthetic data utility and the DP budget.

1. INTRODUCTION

Although deep learning has pushed forward the frontiers of many applications in the past decade, it still needs to overcome some challenges for broader academic and commercial use (Dilmegani, 2022). One challenge is the costly process of algorithm design and practical implementation in deep learning, which typically requires inspection and evaluation by training many models on ample data. The growing privacy concern is another challenge: an increasing number of customers are reluctant to provide their data for academia or industry to train deep learning models, and regulations have been created to further restrict access to sensitive data. Recently, dataset condensation (DC) has emerged as a potential technique to address both challenges in one shot (Zhao et al., 2021; Zhao & Bilen, 2021b). The main objective of dataset condensation is to condense the original dataset into a small synthetic dataset while maintaining the utility of the synthetic data to the greatest extent for training deep learning models. For the first challenge, both academia and industry can save computation and storage costs by using DC-generated small synthetic datasets to develop their algorithms and debug their implementations. For the second challenge, since DC-generated synthetic data may not exist in the real world, sharing DC-generated data seems less risky than sharing the original data. Nevertheless, DC-generated data may memorize a fair amount of sensitive information during the optimization process on the original data. In other words, sharing DC-generated data still exposes the data owners to privacy risk. Moreover, this privacy risk is unknown, since the prior literature on DC has not proved any rigorous connection between DC and privacy.
Although an ICML'22 outstanding paper (Dong et al., 2022) proved a proposition to support the claim that DC can provide a certain differential privacy (DP) guarantee for free, the proof is problematic because of two controversial assumptions, as discussed in Section 3.1. To correctly connect DC and DP for bounding the privacy risk of DC, we propose two differentially private dataset condensation (DPDC) algorithms: LDPDC and NDPDC. LDPDC (Algorithm 1) adds random noise to the sum of randomly sampled original data and then divides the randomized sum by the fixed expected sample size to construct the synthetic data. Based on the framework of Rényi Differential Privacy (RDP) (Mironov, 2017; Mironov et al., 2019), we prove Theorem 3.1 to bound the privacy risk of LDPDC. NDPDC (Algorithm 2) optimizes randomly initialized synthetic data by matching the norm-clipped representations of the synthetic data with the randomized norm-clipped representations of the original data. For NDPDC, we prove Theorem 3.2 to bound its privacy risk. The potential benefits brought by DPDC algorithms include (i) reducing the cost of data storage and model training; (ii) mitigating the privacy concerns of data owners; and (iii) providing DP guarantees for the hyperparameter fine-tuning process and for models trained on the DPDC-generated synthetic datasets, due to the post-processing property of DP. We conduct extensive experiments to evaluate our DPDC algorithms on multiple datasets, including MNIST, FMNIST, CIFAR10, and CelebA. We mainly compare our DPDC algorithms with a non-private dataset condensation method, i.e., distribution matching (Zhao & Bilen, 2021a), and two recent differentially private generative methods, i.e., DP-MERF (Harder et al., 2021) and DP-Sinkhorn (Cao et al., 2021).
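Under our reading of the LDPDC description above, one per-class step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's exact Algorithm 1: the function name, the Poisson-style subsampling, and all parameter values are hypothetical.

```python
import numpy as np

def ldpdc_synthetic_sample(class_data, sample_rate, sigma, rng=None):
    """Sketch of one LDPDC-style synthetic sample for a single class:
    noise the sum of a random subsample of the class's data, then divide
    by the *expected* subsample size (not the realized one)."""
    rng = np.random.default_rng(rng)
    n, d = class_data.shape
    # Poisson-style subsampling: include each record independently (assumed here).
    mask = rng.random(n) < sample_rate
    noisy_sum = class_data[mask].sum(axis=0) + rng.normal(0.0, sigma, size=d)
    # Dividing by the fixed expected sample size keeps the divisor
    # independent of which records were actually chosen.
    return noisy_sum / (sample_rate * n)

# Toy usage with random stand-in data for one class.
data = np.random.default_rng(0).normal(size=(100, 8))
synthetic = ldpdc_synthetic_sample(data, sample_rate=0.5, sigma=1.0, rng=1)
print(synthetic.shape)  # (8,)
```

The division by the expected sample size is what makes the mechanism a simple post-processing of a noisy sum, which is the quantity whose privacy Theorem 3.1-style analyses bound.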
We demonstrate that (i) LDPDC shows comparable performance to DP-MERF and DP-Sinkhorn despite its simplicity; (ii) NDPDC can provide DP protection with a mild utility loss, compared to distribution matching; (iii) NDPDC allows a flexible trade-off between privacy and utility and can use low DP budgets to achieve better performance than DP-MERF and DP-Sinkhorn.
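As a rough illustration of the NDPDC idea (clipped representation matching with noise only on the real-data side), the sketch below is our own simplification, not the paper's exact Algorithm 2: the function names, the random-ReLU feature extractor standing in for Φ_θ, and all hyperparameters are hypothetical.

```python
import numpy as np

def ndpdc_loss(feat_fn, real_batch, syn_batch, clip_norm, sigma, rng=None):
    """Sketch of an NDPDC-style matching loss: clip each real representation
    to L2 norm <= clip_norm, add Gaussian noise to the real mean
    representation, and measure the squared distance to the (clipped)
    synthetic mean representation."""
    rng = np.random.default_rng(rng)

    def clip(F):
        norms = np.linalg.norm(F, axis=1, keepdims=True)
        return F * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

    real_feat = clip(feat_fn(real_batch))
    # Noise scaled to clip_norm randomizes the real-side statistic.
    noisy_mean = (real_feat.sum(axis=0)
                  + rng.normal(0.0, sigma * clip_norm, real_feat.shape[1])
                  ) / len(real_feat)
    syn_feat = clip(feat_fn(syn_batch))  # no noise needed on synthetic side
    return float(((noisy_mean - syn_feat.mean(axis=0)) ** 2).sum())

# Toy usage: a fixed random-feature extractor and random stand-in batches.
rng0 = np.random.default_rng(0)
W = rng0.normal(size=(16, 8))
feat = lambda X: np.maximum(X @ W.T, 0.0)
real, syn = rng0.normal(size=(64, 8)), rng0.normal(size=(10, 8))
loss = ndpdc_loss(feat, real, syn, clip_norm=1.0, sigma=1.0, rng=1)
```

In the actual algorithm the synthetic data would be optimized to minimize this loss; clipping bounds each record's contribution so the added noise yields a DP guarantee.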

2. BACKGROUND AND RELATED WORK

2.1 DATASET CONDENSATION

We denote a data sample by x and its label by y. In this paper, we mainly study classification problems, where f_θ(·) refers to the model with parameters θ, and ℓ(f_θ(x), y) refers to the cross-entropy between the model output f_θ(x) and the label y. Let T and S denote the original dataset and the synthetic dataset, respectively. We can then formulate the dataset condensation problem as

arg min_S E_{(x,y)∼T} [ℓ(f_{θ(S)}(x), y)], where θ(S) = arg min_θ E_{(x,y)∼S} [ℓ(f_θ(x), y)], |S| ≪ |T|.  (1)

An intuitive method to solve the above objective is meta-learning (Wang et al., 2018), with an inner optimization step to update θ and an outer optimization step to update S. However, this intuitive method is costly because it implicitly uses second-order terms in the outer optimization step. Nguyen et al. (2020) considered the classification task as a ridge regression problem and derived an algorithm called kernel inducing points (KIP) to simplify the meta-learning process. Gradient matching (Zhao et al., 2021) is an alternative method for dataset condensation, which minimizes a matching loss between the model gradients on the original and synthetic data, i.e.,

min_S E_{θ_0∼P_{θ_0}} [ Σ_{i=0}^{I−1} Π_M(∇_θ L^S(θ_i), ∇_θ L^T(θ_i)) ].  (2)

Here, Π_M refers to the matching loss in (Zhao et al., 2021); ∇_θ L^S(θ_i) and ∇_θ L^T(θ_i) refer to the model gradients on the synthetic and original data, respectively; and θ_i is updated on the synthetic data to obtain θ_{i+1}. To boost the performance, Zhao & Bilen (2021b) further applied differentiable Siamese augmentation A_w(·) with parameters w to the original and synthetic data samples. Recently, Zhao & Bilen (2021a) proposed to match the feature distributions of the original and synthetic data for dataset condensation.
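To make the gradient-matching objective concrete, the sketch below computes the gradients of a linear softmax classifier on a real batch and a synthetic batch and compares them with a row-wise cosine distance. This is a simplified stand-in for Π_M (the actual loss in Zhao et al. (2021) is layer-wise over a deep network); all function names and the toy data are hypothetical.

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def ce_grad(W, X, y):
    """Gradient of the mean cross-entropy w.r.t. W for a linear model
    f_θ(x) = W x (a stand-in for ∇_θ L(θ) on one batch)."""
    P = softmax(X @ W.T)              # (n, C) predicted probabilities
    P[np.arange(len(y)), y] -= 1.0    # subtract one-hot labels
    return P.T @ X / len(y)           # (C, d), same shape as W

def matching_loss(g_syn, g_real, eps=1e-8):
    """Simplified Π_M: sum of per-row cosine distances between gradients."""
    num = (g_syn * g_real).sum(axis=1)
    den = np.linalg.norm(g_syn, axis=1) * np.linalg.norm(g_real, axis=1) + eps
    return float((1.0 - num / den).sum())

# Toy usage: 3 classes, 8 features, random real data, 6 synthetic samples.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
X_real, y_real = rng.normal(size=(64, 8)), rng.integers(0, 3, 64)
X_syn, y_syn = rng.normal(size=(6, 8)), np.array([0, 0, 1, 1, 2, 2])
loss = matching_loss(ce_grad(W, X_syn, y_syn), ce_grad(W, X_real, y_real))
```

In the full method this loss is minimized over the synthetic pixels S, averaged over random initializations θ_0, while θ_i is advanced by training on S.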
Zhao & Bilen (2021a) used an empirical estimate of the maximum mean discrepancy (MMD) as the matching loss, i.e.,

E_{θ∼P_θ} ∥ (1/|T|) Σ_{i=1}^{|T|} Φ_θ(A_w(x_i)) − (1/|S|) Σ_{j=1}^{|S|} Φ_θ(A_w(s_j)) ∥²₂,

where Φ_θ(·) is the feature extractor, P_θ is a parameter distribution, x_i denotes an original sample, and s_j denotes a synthetic sample. With the help of differentiable data augmentation (Zhao & Bilen, 2021b), the distribution matching method (DM) (Zhao & Bilen, 2021a) achieves state-of-the-art performance on dataset condensation.
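A minimal sketch of this matching loss follows. Note the simplifications: the real DM objective averages over randomly initialized extractors θ ∼ P_θ and applies the augmentation A_w, whereas this sketch uses a single fixed random-feature extractor and omits augmentation; the names and shapes are hypothetical.

```python
import numpy as np

def dm_loss(feat_fn, real_batch, syn_batch):
    """Empirical distribution-matching loss (sketch): squared L2 distance
    between the mean features of the real and synthetic batches."""
    mu_real = feat_fn(real_batch).mean(axis=0)
    mu_syn = feat_fn(syn_batch).mean(axis=0)
    return float(((mu_real - mu_syn) ** 2).sum())

# Hypothetical random-ReLU feature extractor standing in for Φ_θ.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
feat = lambda X: np.maximum(X @ W.T, 0.0)

real = rng.normal(size=(64, 8))
syn = rng.normal(size=(10, 8))
print(dm_loss(feat, real, syn))
```

Because only first moments of features are matched (no second-order terms through model training), DM avoids the bi-level optimization of Equation (1), which is why it is comparatively cheap.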



We note that, in this paper, performance specifically refers to the performance in producing high-utility synthetic data for data-efficient model training.



2.2 DIFFERENTIAL PRIVACY

Differential privacy (DP) (Dwork et al., 2006) is the most widely-used mathematical definition of privacy, so we first introduce the definition of DP in the following.

Definition 2.1 (Differential Privacy (DP)) For two adjacent datasets D and D′ and every possible output set O, if a randomized mechanism M satisfies P[M(D) ∈ O] ≤ e^ϵ P[M(D′) ∈ O] + δ, then M obeys (ϵ, δ)-DP.

Based on the framework of DP, Abadi et al. (2016) developed DP-SGD with a moments accountant for learning differentially private models.
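To make the clip-then-noise pattern of DP-SGD concrete, here is a minimal sketch of one gradient-aggregation step in the style of Abadi et al. (2016); the function name, the flattened per-example gradient representation, and the toy batch are all hypothetical.

```python
import numpy as np

def dp_sgd_step_grad(per_example_grads, clip_norm, sigma, rng=None):
    """One DP-SGD aggregation step (sketch): clip each per-example gradient
    to L2 norm <= clip_norm, sum the clipped gradients, add Gaussian noise
    calibrated to clip_norm, then average over the batch."""
    rng = np.random.default_rng(rng)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(
        1.0, clip_norm / np.maximum(norms, 1e-12))
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        0.0, sigma * clip_norm, size=per_example_grads.shape[1])
    return noisy_sum / len(per_example_grads)

# Toy batch: 32 per-example gradients of dimension 5, deliberately large
# so that clipping visibly bounds each example's contribution.
grads = np.random.default_rng(0).normal(size=(32, 5)) * 10.0
private_grad = dp_sgd_step_grad(grads, clip_norm=1.0, sigma=1.0, rng=1)
```

Clipping bounds each example's influence (the sensitivity), and the Gaussian noise scaled to that bound is what yields the (ϵ, δ) guarantee tracked across steps by the moments accountant.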

