FORGET UNLEARNING: TOWARDS TRUE DATA-DELETION IN MACHINE LEARNING

Abstract

Unlearning has emerged as a technique to efficiently erase information about deleted records from learned models. We show, however, that the influence created by the original presence of a data point in the training set can still be detected after running certified unlearning algorithms (which can result in its reconstruction by an adversary). Thus, under realistic assumptions about the dynamics of model releases over time and in the presence of adaptive adversaries, we show that unlearning is not equivalent to data deletion and does not guarantee the "right to be forgotten." We then propose a more robust data-deletion guarantee and show that it is necessary to satisfy differential privacy to ensure true data deletion. Under our notion, we propose an accurate, computationally efficient, and secure data-deletion machine learning algorithm in the online setting, based on the noisy gradient descent algorithm.

1. INTRODUCTION

Many corporations today collect their customers' private information to train Machine Learning (ML) models that power a variety of services, encompassing recommendations, searches, targeted ads, and more. To prevent any unintended use of personal data, privacy regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) require that these corporations provide the "right to be forgotten" (RTBF) to their data subjects: if a user wishes to revoke access to their data, an organization must comply by erasing all information about the user without undue delay (typically within a month). This obligation extends to ML models trained in standard ways, since model inversion (Fredrikson et al., 2015) and membership inference attacks (Shokri et al., 2017; Carlini et al., 2019) demonstrate that individual training data can be exfiltrated from such models.

Periodic retraining of models after excluding deleted users can be costly. So, there is a growing interest in designing computationally cheap Machine Unlearning algorithms as an alternative to retraining for erasing the influence of deleted data from (and registering the influence of added data to) trained models. Since it is generally difficult to tell how a specific data point affects a model, Ginart et al. (2019) propose quantifying the worst-case information leakage from an unlearned model through an unlearning guarantee on the mechanism, defined as a differential privacy (DP) like (ε, δ)-indistinguishability between its output and that of retraining on the updated database. With some minor variations in this definition, several mechanisms have been proposed and certified as unlearning algorithms in the literature (Ginart et al., 2019; Izzo et al., 2021; Sekhari et al., 2021; Neel et al., 2021; Guo et al., 2019; Ullah et al., 2021). However, is indistinguishability to retraining a sufficient guarantee of data deletion? We argue that it is not.
In the real world, a user's decision to remove his information is often affected by what a deployed model reveals about him. The same revealed information may also affect other users' decisions. Such adaptive requests make the records in a database interdependent, causing a retrained model to contain influences of a record even if the record is no longer in the training set. We demonstrate on a certified unlearning mechanism that if an adversary is allowed to design an adaptive requester that interactively generates database edit requests as a function of published models, she can re-encode a target record in the curator's database before its deletion. We argue that under adaptive requests, measuring data deletion via indistinguishability to retraining (as proposed by Gupta et al. (2021)) is fundamentally flawed because it does not capture the influence a record might have previously had on the rest of the database. Our example shows a clear violation of the RTBF: even after retraining on the database with the original record removed, a model can reveal substantial information about the deleted record due to the possibility of re-encodings.

Is an unlearning guarantee a sound and complete measure of data deletion when requests are non-adaptive? Again, we argue that it is neither. A sound data-deletion guarantee must ensure the non-recovery of deleted records from an unbounded number of model releases after deletion. However, approximate indistinguishability to retraining implies an inability to accurately recover deleted data from a single unlearned model only, which we argue is not sufficient. We show that certain algorithms can satisfy an unlearning guarantee yet blatantly reveal the deleted data eventually over multiple releases. The vulnerability arises in algorithms that maintain partial computations in internal data structures for speeding up subsequent deletions.
These internal states can retain information even after record deletion and influence multiple future releases, making the myopic unlearning guarantee unreliable in an online setting. Several proposed unlearning algorithms in the literature (Ginart et al., 2019; Neel et al., 2021) are stateful (i.e., rely on internal states) and, therefore, cannot be trusted. Secondly, unlearning is an incomplete notion of data deletion as it excludes valid data-deletion mechanisms that do not imitate retraining. For instance, a (useless) mechanism that outputs a fixed untrained model on any request is a valid deletion algorithm. However, since its output is easily distinguishable from retraining, it fails to satisfy any meaningful unlearning guarantee.

This paper proposes a sound definition of data deletion that does not suffer from the above-mentioned shortcomings. According to our notion, a data-deletion mechanism is reliable if A) it is stateless (i.e., it maintains no internal data structures), and B) it generates models that are indistinguishable from some random variable that is independent of the deleted records. Statelessness thwarts the danger of sustained information leakage through internal data structures after deletion. Moreover, by defining data deletion as indistinguishability from any deleted-record-independent random variable, as opposed to the output of retraining, we ensure reliability in the presence of adaptive requests that create dependence between current and deleted records in the database. In general, we show that data-deletion mechanisms must be differentially private with respect to the remaining records to be reliable when requests are adaptive. DP also protects against membership inference attacks that extract deleted records by comparing models before and after deletion (Chen et al., 2021).
We emphasize that we are not advocating for doing data deletion through differentially private mechanisms simply because they cap the information content of all records equally, deleted or otherwise. Instead, a data-deletion mechanism should provide two differing information-retention bounds: one for records currently in the database, in the form of a differential privacy guarantee, and the other for records previously deleted, in the form of a data-deletion guarantee. We also prove a reduction theorem: if a mechanism is differentially private with respect to the remaining records and satisfies a data-deletion guarantee under non-adaptive edit requests, then it also satisfies a data-deletion guarantee under adaptive requests. Based on this reduction, we redefine the problem of data deletion as designing a mechanism that (1.) satisfies a data-deletion guarantee against non-adaptive deletion requests, (2.) is differentially private for remaining records, and (3.) has the same utility guarantee as retraining under identical differential privacy constraints. We judge the usefulness of a data-deletion mechanism based on its computational savings over retraining.

For our refined problem formulation, we provide a data-deletion solution based on Noisy Gradient Descent (Noisy-GD), a popular differentially private learning algorithm (Bassily et al., 2014; Abadi et al., 2016; Chourasia et al., 2021). Our solution demonstrates a powerful synergy between data deletion and differential privacy, as the same noise needed for the privacy of records in the database also rapidly erases information regarding records deleted from the database. We provide a data-deletion guarantee for Noisy-GD in terms of a Rényi divergence (Rényi, 1961) bound (which implies (ε, δ)-indistinguishability (Mironov, 2017)).
For convex and smooth losses, we certify that under a $(q, \varepsilon_{dd})$-Rényi data-deletion and $(q, \varepsilon_{dp})$-Rényi DP constraint, our Noisy-GD based deletion mechanism for $d$-dimensional models over $n$-sized databases under adaptive edit requests that modify no more than $r$ records can maintain an optimal excess empirical risk of the order $O\big(\frac{qd}{\varepsilon_{dp} n^2}\big)$ while saving $\Omega\big(n \log \min\{\frac{n}{r}, \frac{n\varepsilon_{dd}}{qd}\}\big)$ computations in gradient complexity. Our utility guarantee matches the known lower bound on private empirical risk minimization under the same privacy budget (Bassily et al., 2014). We also provide a data-deletion guarantee in the non-convex setting under the assumption that the loss function is bounded and smooth, and show a computational saving of $\Omega\big(dn \log \frac{n}{r}\big)$ in gradient complexity while maintaining an excess risk of $\tilde{O}\big(\frac{qd}{\varepsilon_{dp} n^2} + \frac{1}{n}\sqrt{\frac{q}{\varepsilon_{dp}}}\big)$.

2. MODEL AND PRELIMINARIES

2.1. INDISTINGUISHABILITY NOTIONS AND DIFFERENTIAL PRIVACY

We provide the basics of indistinguishability of random variables (with more details in Appendix D). Let $\Theta, \Theta'$ be two random variables in space $\mathcal{O}$ with probability densities $\nu, \nu'$ respectively.

Definition 2.1 ((ε, δ)-indistinguishability (Dwork et al., 2014)). We say $\Theta$ and $\Theta'$ are (ε, δ)-indistinguishable (denoted $\Theta \overset{\varepsilon,\delta}{\approx} \Theta'$) if, for all events $O \subseteq \mathcal{O}$,
$$\mathbb{P}[\Theta \in O] \le e^{\varepsilon}\,\mathbb{P}[\Theta' \in O] + \delta \quad \text{and} \quad \mathbb{P}[\Theta' \in O] \le e^{\varepsilon}\,\mathbb{P}[\Theta \in O] + \delta. \tag{1}$$

Definition 2.2 (Rényi divergence (Rényi, 1961)). The Rényi divergence of $\nu$ w.r.t. $\nu'$ of order $q > 1$ is defined as
$$R_q(\nu \,\|\, \nu') = \frac{1}{q-1} \log E_q(\nu \,\|\, \nu'), \quad \text{where} \quad E_q(\nu \,\|\, \nu') = \mathbb{E}_{\theta \sim \nu'}\Big[\Big(\frac{\nu(\theta)}{\nu'(\theta)}\Big)^{q}\Big], \tag{2}$$
when $\nu$ is absolutely continuous w.r.t. $\nu'$ (denoted as $\nu \ll \nu'$). If $\nu \not\ll \nu'$, we say $R_q(\nu \,\|\, \nu') = \infty$.

Remark 1. Rényi divergence is asymmetric (i.e., $R_q(\nu \,\|\, \nu') \neq R_q(\nu' \,\|\, \nu)$) and implies indistinguishability only in one direction. Mironov (2017, Proposition 3) shows that $R_q(\nu \,\|\, \nu') \le \varepsilon_0$ implies $\mathbb{P}[\Theta \in O] \le e^{\varepsilon}\,\mathbb{P}[\Theta' \in O] + \delta$ with $\varepsilon = \varepsilon_0 + \frac{\log 1/\delta}{q-1}$ for any $0 < \delta < 1$.

Definition 2.3 (Differential Privacy (Dwork et al., 2014; Mironov, 2017)). A randomized mechanism $\mathcal{M} : \mathcal{X}^n \to \mathcal{O}$ is (ε, δ)-differentially private if $\mathcal{M}(D) \overset{\varepsilon,\delta}{\approx} \mathcal{M}(D')$ for all neighbouring databases $D, D' \in \mathcal{X}^n$. Similarly, $\mathcal{M}$ is (q, ε)-Rényi differentially private if $R_q(\mathcal{M}(D) \,\|\, \mathcal{M}(D')) \le \varepsilon$ for all neighbouring $D, D'$.
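As a concrete sanity check of Definition 2.2 and Remark 1, the standard closed form for two equal-variance Gaussians, $R_q(\mathcal{N}(\mu, \sigma^2) \,\|\, \mathcal{N}(\mu', \sigma^2)) = q(\mu - \mu')^2 / (2\sigma^2)$, can be verified against direct numerical integration (a standalone sketch; the function names are ours):

```python
import numpy as np

def renyi_gauss(q, mu1, mu2, sigma2):
    """Closed-form Renyi divergence R_q(N(mu1, s2) || N(mu2, s2)), equal variances."""
    return q * (mu1 - mu2) ** 2 / (2 * sigma2)

def renyi_numeric(q, mu1, mu2, sigma2):
    """R_q via direct numerical integration of E_q = E_{x ~ nu'}[(nu/nu')^q]."""
    xs = np.linspace(-20.0, 20.0, 400001)
    dx = xs[1] - xs[0]
    def pdf(x, mu):
        return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    nu, nu_p = pdf(xs, mu1), pdf(xs, mu2)
    e_q = np.sum((nu / nu_p) ** q * nu_p) * dx   # Riemann sum of the integrand
    return np.log(e_q) / (q - 1)

def rdp_to_dp(q, eps0, delta):
    """(q, eps0)-Renyi bound -> eps of the implied one-sided (eps, delta) bound
    per Remark 1 (Mironov, 2017, Proposition 3)."""
    return eps0 + np.log(1.0 / delta) / (q - 1)
```

For example, $R_2(\mathcal{N}(1,1) \,\|\, \mathcal{N}(0,1)) = 1$ exactly, and both routines agree on this value.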

2.2. LEARNING FRAMEWORK: ERM

Let D be a database of n ordered records taken from a data universe $\mathcal{X}$ and let $\mathcal{O}$ be the space of learnable parameters and any associated auxiliary metadata (what constitutes metadata is clarified later). Let $\ell(\theta; x) : \mathcal{O} \times \mathcal{X} \to \mathbb{R}$ be a loss function of a parameter $\theta \in \mathcal{O}$ for a record $x \in \mathcal{X}$. In this paper, we consider the problem of empirical risk minimization (ERM) of the average loss $\ell(\theta; x)$ over records in the database D under the L2 regularizer $r(\theta) = \frac{\lambda}{2}\|\theta\|_2^2$, i.e., minimizing $L_D(\theta) = \frac{1}{n}\sum_{x \in D} \ell(\theta; x) + r(\theta)$. We build on a popular DP-ERM algorithm called Noisy-GD (Abadi et al., 2016), described in Algorithm 1 below, and provide Rényi DP guarantees on it in Appendix G.3.

Algorithm 1 Noisy-GD: Noisy Gradient Descent
Require: Database $D \in \mathcal{X}^n$, start model $\Theta_0 \in \mathcal{O}$, number of steps $K \in \mathbb{N}$.
1: for $k = 0, 1, \cdots, K-1$ do
2:   $\nabla L_D(\Theta_{\eta k}) = \frac{1}{n}\sum_{x \in D} \nabla \ell(\Theta_{\eta k}; x) + \nabla r(\Theta_{\eta k})$
3:   $\Theta_{\eta(k+1)} = \Theta_{\eta k} - \eta \nabla L_D(\Theta_{\eta k}) + \sqrt{2\eta}\,\mathcal{N}(0, \sigma^2 I_d)$
4: Output $\Theta_{\eta K}$
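Algorithm 1 can be sketched in a few lines of NumPy (a standalone illustration; the function name, toy loss, and hyperparameter values are ours, not values mandated by the paper):

```python
import numpy as np

def noisy_gd(D, theta0, K, grad_loss, lam=0.1, eta=0.05, sigma2=1.0, rng=None):
    """Sketch of Algorithm 1: noisy gradient descent on the regularized
    empirical risk. grad_loss(theta, x) is the per-record loss gradient; the
    L2 regularizer r(theta) = (lam/2) * ||theta||^2 contributes lam * theta."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    d = theta.shape[0]
    for _ in range(K):
        g = np.mean([grad_loss(theta, x) for x in D], axis=0) + lam * theta
        theta = theta - eta * g + np.sqrt(2 * eta * sigma2) * rng.standard_normal(d)
    return theta

# Example: quadratic loss l(theta; (a, b)) = (<theta, a> - b)^2 / 2.
D = [(np.array([1.0]), 1.0), (np.array([2.0]), 2.0)]
grad = lambda theta, x: (theta @ x[0] - x[1]) * x[0]
theta_hat = noisy_gd(D, np.zeros(1), K=300, grad_loss=grad,
                     lam=0.01, eta=0.05, sigma2=1e-8,
                     rng=np.random.default_rng(0))
```

With near-zero noise the iterate converges to the regularized minimizer; the theorems later in the paper dictate how large $\sigma^2$ must be for privacy.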

2.3. ONLINE EDIT REQUESTS AND MACHINE UNLEARNING

Suppose that any database $D \in \mathcal{X}^n$ can be modified by edit requests that replace r distinct records.

Definition 2.4 (Edit request). A replacement operation $\langle \mathrm{ind}, y \rangle \in [n] \times \mathcal{X}$ on a database $D = (x_1, \cdots, x_n)$ performs the following modification:
$$D \circ \langle \mathrm{ind}, y \rangle = (x_1, \cdots, x_{\mathrm{ind}-1}, y, x_{\mathrm{ind}+1}, \cdots, x_n). \tag{3}$$
Let $r \le n$. An edit request $u = \{\langle \mathrm{ind}_1, y_1 \rangle, \cdots, \langle \mathrm{ind}_r, y_r \rangle\} \in \mathcal{U}_r$ is a batch of r replacement operations on distinct indices, applied atomically, i.e., $D \circ u = D \circ \langle \mathrm{ind}_1, y_1 \rangle \circ \cdots \circ \langle \mathrm{ind}_r, y_r \rangle$, where $\mathrm{ind}_i \neq \mathrm{ind}_j$ for all $i \neq j$; we denote the space of all such requests by $\mathcal{U}_r$.

Let $\mathcal{O}$ denote the space of model parameters plus any state variables or data structures that may be leveraged for processing edit requests, and let $\Phi$ be the space of publishable outcomes, such as sanitized models or predictions. For an initial database $D_0 \in \mathcal{X}^n$ and an unbounded sequence of edit requests $(u_i)_{i \ge 1}$, the job of a data curator is to train a model $\tilde{\Theta}_i \in \mathcal{O}$ with small empirical risk for each database $D_i = D_0 \circ u_1 \circ \cdots \circ u_i$ and use it to release a corresponding publishable outcome $\phi_i \in \Phi$. For this task, the trivial approach of retraining from scratch by executing a learning algorithm on each $D_i$ could be computationally expensive. This initiated the study of machine unlearning algorithms (Cao & Yang, 2015; Ginart et al., 2019; Guo et al., 2019; Izzo et al., 2021) that avoid the cost of retraining by instead fine-tuning an already trained model $\tilde{\Theta}_{i-1}$ to cheaply erase (introduce) the influence of records deleted from (added to) $D_{i-1}$ by edit request $u_i$, producing the next model $\tilde{\Theta}_i$. In this paper, we adopt the machine unlearning formulation of Gupta et al. (2021) and Neel et al. (2021), described as follows. The curator comprises three mechanisms: a learning algorithm $A : \mathcal{X}^n \to \mathcal{O}$, a data-deletion or unlearning algorithm $\bar{A} : \mathcal{X}^n \times \mathcal{U}_r \times \mathcal{O} \to \mathcal{O}$, and a publish function $f_{\mathrm{pub}} : \mathcal{O} \to \Phi$.
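In code, the replacement semantics of Definition 2.4 look as follows (a sketch using 1-indexed positions as in the definition; the helper names are ours). Because the indices within one request are distinct, the replacement operations commute:

```python
def apply_replacement(D, ind, y):
    """Single replacement <ind, y>: computes D o <ind, y> (1-indexed)."""
    D = list(D)
    D[ind - 1] = y
    return tuple(D)

def apply_edit(D, u):
    """Edit request u = {<ind_1, y_1>, ..., <ind_r, y_r>} applied atomically;
    the indices must be distinct, so the order of operations is immaterial."""
    inds = [ind for ind, _ in u]
    assert len(set(inds)) == len(inds), "replacement indices must be distinct"
    for ind, y in u:
        D = apply_replacement(D, ind, y)
    return D

D0 = ('x1', 'x2', 'x3')
D1 = apply_edit(D0, [(1, 'y1'), (3, 'y3')])   # D_1 = D_0 o u_1
```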
To generate the initial model with accompanying state metadata $\tilde{\Theta}_0 \in \mathcal{O}$ on the initial database $D_0 \in \mathcal{X}^n$, the curator executes the learning algorithm $A(D_0)$. Thereafter, to process an incoming edit request $u_i \in \mathcal{U}_r$ at any step $i \ge 1$, the curator executes the data-deletion algorithm $\bar{A}(D_{i-1}, u_i, \tilde{\Theta}_{i-1})$ that maps the current database $D_{i-1}$, the edit request $u_i$, and the current model with metadata $\tilde{\Theta}_{i-1}$ to the next model with metadata $\tilde{\Theta}_i \in \mathcal{O}$. While all the generated models and corresponding metadata are kept secret, the curator uses the publish function to generate the publishable outcome $\phi_i = f_{\mathrm{pub}}(\tilde{\Theta}_i)$ at every step $i \ge 0$.

Gupta et al. (2021) note that edit requests in the real world could often be adaptive, i.e., a request $u_i$ may depend on (a subset of) the history of prior published outcomes $\phi_{<i} = (\phi_0, \cdots, \phi_{i-1})$. For instance, a voter may decide to change his inclination after seeing pre-election results. They model such an interactive environment through an adaptive requester defined as follows.

Definition 2.5 (Update requester (Gupta et al., 2021)). A p-adaptive r-requester is a mapping $Q : \Phi^{\le p} \times \mathcal{U}_r^* \to \mathcal{U}_r$ that takes as input a maximum of p of the published outcomes generated by the curator at arbitrary edit steps $s_1 < s_2 < \cdots < s_p$, and the entire history of previously generated edit requests, to generate the next edit request. For a p-adaptive r-requester Q, the edit request $u_i$ at any step $i \ge 1$ can be written as $u_i = Q(\phi_{s_1}, \phi_{s_2}, \cdots, \phi_{s_j}; u_1, u_2, \cdots, u_{i-1})$, such that $s_j < i$. We refer to 0-adaptive requesters as non-adaptive. And, by ∞-adaptive requesters, we mean requesters that have access to the entire history of the interaction transcript $(\phi_{<i}; u_{<i})$.

Since, unlike retraining, an unlearning algorithm fine-tunes a model containing the information of the records to be deleted, we need statistical guarantees on the worst-case amount of information that might still remain in the unlearned model. Ginart et al. (2019) and Guo et al. (2019) propose quantifying the data-deletion ability of an algorithm $\bar{A}$ based on its (ε, δ)-indistinguishability w.r.t. the fresh-retraining algorithm A, calling it an unlearning guarantee (more details in Appendix E.1). In this paper, we mainly consider the extension of the unlearning definitions by Neel et al. (2021) and Gupta et al. (2021) to the online setting of arbitrarily long and adaptive edit sequences.

Definition 2.6 ((ε, δ)-unlearning (Neel et al., 2021; Gupta et al., 2021)). We say that $\bar{A}$ is an (ε, δ)-unlearning algorithm for A under a publish function $f_{\mathrm{pub}}$ if, for all initial databases $D_0 \in \mathcal{X}^n$ and all non-adaptive 1-requesters Q, the following condition holds: for every edit step $i \ge 1$, and for all generated edit sequences $u_{\le i} \overset{\mathrm{def}}{=} (u_1, \cdots, u_i)$,
$$f_{\mathrm{pub}}\big(\bar{A}(D_{i-1}, u_i, \tilde{\Theta}_{i-1})\big) \,\big|\, u_{\le i} \;\overset{\varepsilon,\delta}{\approx}\; f_{\mathrm{pub}}\big(A(D_i)\big).$$
If the same condition holds for all ∞-adaptive 1-requesters Q, we say that $\bar{A}$ is an (ε, δ)-adaptive-unlearning algorithm for A.

3. EXISTING UNLEARNING GUARANTEES ARE UNSOUND AND INCOMPLETE

Data deletion under the law of the "right to be forgotten" (RTBF) is an obligation to permanently erase all information about an individual upon a verified request. To comply, a corporation's actions must not reveal any information identifiable or linkable to a deleted user in the future. In this section, we argue that the unlearning guarantee in Definition 2.6 is neither a sound nor a complete measure of the data deletion from ML models that RTBF enforces.

Threat model. Suppose, for an arbitrary step $i \ge 1$, an adversary is interested in finding out the identity of a record in the database $D_{i-1}$ that was deleted by the edit request $u_i$. Since RTBF is violated only when the curator reveals information after the deletion request, we assume that the adversary only has access to the post-deletion releases by the curator, i.e., she observes $\phi_i, \phi_{i+1}, \cdots$. Additionally, we assume that the adversary knows how users might react to a published outcome. That is to say, our adversary knows some dependence relationship between the random variables $\phi_0, \cdots, \phi_{i-1}$ and $u_1, \cdots, u_{i-1}$, but does not explicitly observe these random variables. For instance, an adversary might know that if the outcome $\phi_1$ predicts that "Donald Trump is winning the election," then some Democratic users might delete their data while some new Republican users might contribute their data to the curator. So, even though the adversary does not observe the actual outcome $\phi_1$ or the ensuing edit request $u_2$, and so on, she can still exploit knowledge about this dependence a posteriori to infer the identity of a deleted record. To capture the worst-case knowledge about the dependence between unobserved outcomes and unobserved edit requests, we model our adversary as having the power to design an adaptive requester Q that interacts with the curator in the first $i-1$ steps. However, the adversary does not observe the interaction transcript $(\phi_0, u_1, \phi_1, \cdots, u_{i-1}, \phi_{i-1})$ of Q.
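As a toy illustration of this threat model (entirely our own construction, far simpler than the ones used in our proofs): a stateful curator whose every single release is ε-indistinguishable between the two worlds "deleted bit = 0" and "deleted bit = 1" can still surrender the deleted bit to an adversary that aggregates many post-deletion releases.

```python
import numpy as np

def make_stateful_curator(deleted_bit, eps, rng):
    """Toy stateful curator: the 'deleted' bit survives in a secret internal
    state, and each release publishes it through eps-randomized response, so
    any SINGLE release is eps-indistinguishable across the two worlds."""
    p_truth = np.exp(eps) / (1.0 + np.exp(eps))
    def release():
        return deleted_bit if rng.random() < p_truth else 1 - deleted_bit
    return release

rng = np.random.default_rng(0)
eps = 0.1                                  # tiny per-release leakage
release = make_stateful_curator(deleted_bit=1, eps=eps, rng=rng)

# Adversary observing many post-deletion releases: simple majority vote.
observations = [release() for _ in range(20_000)]
recovered_bit = int(np.mean(observations) > 0.5)
```

With ε = 0.1, each release alone reveals almost nothing, but the empirical mean concentrates around $e^{\varepsilon}/(1+e^{\varepsilon}) \approx 0.525$, so the majority vote recovers the deleted bit with overwhelming probability.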
Unsoundness due to secret states. The unlearning Definition 2.6 is a bound on the information leakage about a deleted record through a single released outcome. However, our adversary can observe multiple (potentially infinitely many) releases after deletion. We argue that algorithms satisfying Definition 2.6 can lead to blatant non-privacy of a deleted record under our threat model, even for a weaker adversary that cannot design an adaptive requester (i.e., knows nothing about the dependence between unobserved outcomes and edit requests). The vulnerability arises because Definition 2.6 permits algorithms to maintain secret states while using a publishing function $f_{\mathrm{pub}}$ for releases. These internal states may propagate encoded information about records even after their deletion from the database. So, every subsequent release by an unlearning algorithm can reveal new information about a record that was purportedly erased multiple edits earlier. We demonstrate in the following theorem that a certified unlearning algorithm can reveal a limited amount of information about a deleted record per release, so as not to break the unlearning certification, yet eventually reveal everything about the record to an adversary that observes enough future releases.

Theorem 1. For every $\varepsilon > 0$, there exists a pair $(A, \bar{A})$ of algorithms that satisfy (ε, 0)-unlearning under a publish function $f_{\mathrm{pub}}$ such that, for all non-adaptive 1-requesters Q, there exists an adversary that can correctly infer the identity of a record deleted at any arbitrary edit step $i \ge 1$ by observing only the post-edit releases $\phi_{\ge i} = (\phi_i, \phi_{i+1}, \cdots)$.

Unsoundness due to adaptivity.
For an adversary that knows some dependence relationship between the unobserved outcomes and edit requests, a much more severe violation of RTBF may occur, even when an unlearning algorithm does not maintain secret internal states and perfectly imitates retraining (i.e., satisfies a (0, 0)-adaptive-unlearning guarantee for the identity publish function $f_{\mathrm{pub}}(\theta) = \theta$). This vulnerability arises because the indistinguishability in Definition 2.6 protects the privacy of deleted records but not that of records currently present. A certified adaptive unlearning algorithm is allowed to reveal unbounded information about a target record before its deletion. This revealed information can have a major influence on the subsequent edit requests in the worst case, potentially causing the curator's database to retain patterns specific to the identity of the target record even after its deletion. An adversary knowing the possible patterns and their causes (i.e., the dependence relationship between unobserved outcomes and requests) can therefore infer the target record's identity from post-deletion releases. We concretize this vulnerability in the following theorem.

Theorem 2. There exists a pair $(A, \bar{A})$ of algorithms that satisfies (0, 0)-adaptive-unlearning for the identity publish function $f_{\mathrm{pub}}(\theta) = \theta$ such that, by designing a 1-adaptive 1-requester Q, an adversary, even with no access to Q's interaction transcript, can infer the identity of a record deleted at any arbitrary edit step $i > 3$ with probability at least $1 - (1/2)^{i-3}$ from the post-edit release $\phi_i$.

Incompleteness. Another issue with unlearning guarantees is that valid data-deletion algorithms may fail to satisfy them. As per Definition 2.6, the publishable outcome generated by a data-deletion mechanism must imitate that of a retraining algorithm to satisfy an unlearning guarantee. However, imitating retraining is not necessary for data deletion.
For instance, consider a (useless) mechanism $\bar{A}$ that outputs a fixed untrained model in $\mathcal{O}$ regardless of its inputs. This $\bar{A}$ would be an unacceptable unlearning algorithm for most retraining algorithms A under the identity publish function $f_{\mathrm{pub}}(\theta) = \theta$. However, $f_{\mathrm{pub}}(\bar{A}(\cdot))$ contains no information about the deleted records and should be acceptable under the data-deletion paradigm.

Remark 2. Several prior works on machine unlearning propose data-deletion definitions similar to Definition 2.6 but without an explicit $f_{\mathrm{pub}}$ and for the offline setting of a single-stage deletion to remove r records from a learned model (Ginart et al., 2019; Guo et al., 2019; Sekhari et al., 2021). In Appendix E.1, we show that these offline unlearning definitions are also unreliable when deletion requests are adaptive. In light of our demonstration of the unsoundness of these data-deletion guarantees, we remark that several certified unlearning algorithms in the literature should not be trusted to ensure RTBF.
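The adaptivity vulnerability can likewise be seen in a minimal toy (again our own simplification, not the requester constructed in Theorem 2's proof): the curator retrains exactly and publishes the model itself, yet an adaptive requester copies the target bit into a surviving record before requesting its deletion, so the post-deletion retrained model still reveals it.

```python
import numpy as np

def retrain(D):
    """(0,0)-adaptive-unlearning 'curator': publishes the exact empirical mean
    of the database (identity publish function f_pub)."""
    return float(np.mean(D))

# Target record: a secret bit at index 0 of a small database of bits.
secret = 1
D = [secret, 0, 0, 0]

# Adaptive requester: infers the target bit from a published model, then
# re-encodes it into another user's record before requesting the deletion.
phi = retrain(D)
inferred = int(phi * len(D) > 0.5)    # only the target can make the sum positive
D[1] = inferred                       # re-encoding edit <2, inferred>
D[0] = 0                              # deletion edit <1, 0> removes the target

# Adversary only observes the post-deletion retrained model.
phi_post = retrain(D)
recovered = int(phi_post * len(D) > 0.5)
```

Although the final model is an exact retraining on a database that no longer contains the target record, its output is fully determined by that record's original value.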

4. REDEFINING DATA-DELETION IN MACHINE LEARNING

In this section, we redefine data deletion in machine learning to address the problems with the notion of unlearning demonstrated in the preceding section. The first change we propose is to rule out the possibility of information leakage through internal data structures (as shown in Theorem 1) by requiring deletion mechanisms to be stateless. That is, the models produced by the learning or the data-deletion algorithm are directly released without applying any publish function $f_{\mathrm{pub}}$.

Secondly, the following definition of a data-deletion guarantee fixes the security blind spot of an adaptive unlearning guarantee. As demonstrated in Theorem 2, an adaptive requester can encode patterns specific to a target record in the database by making edit decisions in response to the observed outcomes. Thus, being indistinguishable from retraining on the edited database does not guarantee data deletion, as the target's information remains extractable even after the target record's deletion, potentially revealing its identity. In our definition, we account for an adaptive adversary's influence by measuring the indistinguishability of a data-deletion mechanism's output from some random variable that is independent of the deleted record.

Definition 4.1 ((q, ε)-data-deletion under p-adaptive r-requesters). Let $q > 1$, $\varepsilon \ge 0$, and $p, r \in \mathbb{N}$. We say that an algorithm pair $(A, \bar{A})$ satisfies (q, ε)-data-deletion under p-adaptive r-requesters if the following condition holds for all p-adaptive r-requesters Q. For every step $i \ge 1$, there exists a randomized mapping $\pi_i^Q : \mathcal{X}^n \to \mathcal{O}$ such that, for all initial databases $D_0 \in \mathcal{X}^n$,
$$R_q\big(\bar{A}(D_{i-1}, u_i, \tilde{\Theta}_{i-1}) \,\big\|\, \pi_i^Q(D_0 \circ \langle \mathrm{ind}, y \rangle)\big) \le \varepsilon, \quad \text{for all } u_i \in \mathcal{U}_r \text{ and all } \langle \mathrm{ind}, y \rangle \in u_i. \tag{7}$$

We argue that the above definition is a sound guarantee of data deletion. Suppose that an adversary is interested in identifying a record at index 'ind' in $D_0$ that is being replaced with record 'y' by one of the replacement operations in edit request $u_i \in \mathcal{U}_r$.
The inequality (7) above implies that, even with the power of designing an adaptive requester Q, no adversary observing the unlearned model $\bar{A}(D_{i-1}, u_i, \tilde{\Theta}_{i-1})$ can be too confident that the observation was not from $\pi_i^Q(D_0 \circ \langle \mathrm{ind}, y \rangle)$, a distribution that contains no information about the target record $D_0[\mathrm{ind}]$ by construction. More formally, we provide the following soundness guarantee for Definition 4.1.

Theorem 3 (Data-deletion Definition 4.1 is sound). If the algorithm pair $(A, \bar{A})$ satisfies a (q, ε)-data-deletion guarantee under all p-adaptive r-requesters, then even with the power of designing a p-adaptive r-requester Q that interacts with the curator before deletion of a target record at any step $i \ge 1$, any adversary observing only the post-deletion releases $(\tilde{\Theta}_i, \tilde{\Theta}_{i+1}, \cdots)$ has its membership inference advantage for inferring a deleted target bounded as
$$\mathrm{Adv(MI)} \le \frac{q\,e^{\varepsilon(q-1)/q}}{q-1}\,\big[2(q-1)\big]^{1/q} - 1. \tag{8}$$
Note that the bound in (8) approaches 0 as $q \to \infty$ and $\varepsilon \to 0$, implying Definition 4.1 is sound.

Remark 3. A non-adaptive requester Q is equivalent to fixing the request sequence $(u_i)_{i \ge 1}$ a priori. Hence, given a non-adaptive Q, the database $D_i \circ \langle \mathrm{ind}, y \rangle$ is a deterministic function of the database $D_0 \circ \langle \mathrm{ind}, y \rangle$ for any $i \ge 1$, thanks to the commutativity of '∘'. Since $\langle \mathrm{ind}, y \rangle \in u_i$, we remark that for a non-adaptive requester Q, the random variable $\pi_i^Q(D_0 \circ \langle \mathrm{ind}, y \rangle)$ in (7) can be the output $\pi(D_i)$ of any randomized map $\pi : \mathcal{X}^n \to \mathcal{O}$, including the learning algorithm A.

Connection with DP. We highlight that our data-deletion guarantee on the pair $(A, \bar{A})$ is an information bound on the records that were deleted, while the standard DP guarantee on A and $\bar{A}$ is an information bound on the records currently present in the database. In the following theorem, we show that DP with respect to existing records is a necessary condition for $(A, \bar{A})$ to satisfy adaptive data-deletion.

Theorem 4 (DP is necessary for adaptive data-deletion).
If the learning algorithm $A : \mathcal{X}^n \to \mathcal{O}$ is not $(0, \delta)$-DP with respect to the replacement of a single record, and the deletion algorithm $\bar{A} : \mathcal{X}^n \times \mathcal{U}_r \times \mathcal{O} \to \mathcal{O}$ is not $(0, \delta)$-DP with respect to the replacement of a single record that is not being deleted, then the pair $(A, \bar{A})$ cannot satisfy $(q, \delta^4/2)$-data-deletion under 1-adaptive 1-requesters for any $q > 1$.

Additionally, if A and $\bar{A}$ do satisfy DP with respect to existing records, then a data-deletion bound under non-adaptive requesters reduces to a data-deletion bound under adaptive requesters.

Theorem 5 (Non-adaptive data-deletion with Rényi DP implies adaptive data-deletion). If an algorithm pair $(A, \bar{A})$ satisfies $(q, \varepsilon_{dd})$-data-deletion under all non-adaptive r-requesters and $(q, \varepsilon_{dp})$-Rényi DP, then it also satisfies $(q, \varepsilon_{dd} + p\varepsilon_{dp})$-data-deletion under all p-adaptive r-requesters.

We provide additional discussion on this reduction theorem in Appendix F.1. By virtue of our reduction Theorem 5, we reformulate the data-deletion problem in ML as follows.

Problem Definition. Let constants $q > 1$, $0 < \varepsilon_{dd} \le \varepsilon_{dp}$, and $\alpha > 0$. The goal is to design a learning mechanism $A : \mathcal{X}^n \to \mathcal{O}$ and a deletion mechanism $\bar{A} : \mathcal{X}^n \times \mathcal{U}_r \times \mathcal{O} \to \mathcal{O}$ such that (1.) both A and $\bar{A}$ satisfy $(q, \varepsilon_{dp})$-Rényi DP with respect to records in the input database, (2.) the pair $(A, \bar{A})$ satisfies a $(q, \varepsilon_{dd})$-data-deletion guarantee for all non-adaptive r-requesters Q, (3.) and all models $(\tilde{\Theta}_i)_{i \ge 0}$ produced by $(A, \bar{A}, Q)$ on any $D_0 \in \mathcal{X}^n$ have $\mathrm{err}(\tilde{\Theta}_i; D_i) \le \alpha$.

We judge the benefit of data deletion based on the computation saving that $\bar{A}$ offers over A per request. That is, if we want $\mathrm{err}(\tilde{\Theta}_i; D_i) \le \alpha$ for all $i \ge 0$, where $\tilde{\Theta}_i$ is trained from scratch by $A(D_i)$, how large is $\mathrm{Cost}(A)$ compared to $\mathrm{Cost}(\bar{A})$? Note that α should ideally be the optimal excess risk attainable under the $(q, \varepsilon_{dp})$-RDP constraint. We remark that one may use any reasonable measure of utility in the third constraint, such as population risk instead of excess empirical risk.
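To see why a reduction of the shape in Theorem 5 is plausible, one can charge the adaptivity to the p observed outcomes (an informal accounting of ours; the formal proof is in Appendix F.1):

```latex
% Informal accounting for Theorem 5 (our sketch, not the formal proof).
% Step 1: condition on the p outcomes (\phi_{s_1}, \dots, \phi_{s_p}) observed
% by Q. Given these, Q's edit sequence is a deterministic function of its
% inputs, i.e., effectively non-adaptive, so the non-adaptive guarantee
% applies conditionally:
R_q\!\left(\bar{A}(D_{i-1}, u_i, \tilde{\Theta}_{i-1})
      \,\middle\|\, \pi^Q_i(D_0 \circ \langle \mathrm{ind}, y\rangle)
      \;\middle|\; \phi_{s_1}, \dots, \phi_{s_p}\right) \le \varepsilon_{dd}.
% Step 2: each conditioning outcome is released by a (q, \varepsilon_{dp})-RDP
% mechanism, so removing the conditioning costs at most \varepsilon_{dp} per
% outcome by adaptive composition of Renyi divergences:
R_q\!\left(\bar{A}(D_{i-1}, u_i, \tilde{\Theta}_{i-1})
      \,\middle\|\, \pi^Q_i(D_0 \circ \langle \mathrm{ind}, y\rangle)\right)
  \le \varepsilon_{dd} + p\,\varepsilon_{dp}.
```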

5. DATA DELETION USING NOISY GRADIENT DESCENT

This section proposes a simple and effective data-deletion solution based on the Noisy-GD Algorithm 1. Our proposed approach falls under the Descent-to-Delete framework of Neel et al. (2021), wherein, after each edit request $u_i$, we run Noisy-GD starting from the previous model $\tilde{\Theta}_{i-1}$ and perform a small number of gradient descent steps over records in the modified database $D_i = D_{i-1} \circ u_i$; sufficient to ensure that the information about the deleted records remaining in the subsequent model $\tilde{\Theta}_i$ is reduced within a desired bound. Formally, our data-deletion solution is defined as follows.

Definition 5.1 (Data-deletion based on Noisy-GD). Let the space of model parameters be $\mathcal{O} = \mathbb{R}^d$. Our learning algorithm $A_{\mathrm{Noisy\text{-}GD}}(D_0) = \mathrm{Noisy\text{-}GD}(D_0, \Theta_0, K_A)$ runs the Noisy-GD algorithm for $K_A$ iterations to generate the first learned model $\tilde{\Theta}_0$ on the input database $D_0 \in \mathcal{X}^n$, with $\Theta_0$ sampled from a weight initialization distribution ρ. Our data-deletion algorithm $\bar{A}_{\mathrm{Noisy\text{-}GD}}(D_{i-1}, u_i, \tilde{\Theta}_{i-1}) = \mathrm{Noisy\text{-}GD}(D_{i-1} \circ u_i, \tilde{\Theta}_{i-1}, K_{\bar{A}})$ processes an edit request $u_i \in \mathcal{U}_r$ by running $K_{\bar{A}}$ iterations of the Noisy-GD algorithm on the updated database $D_i = D_{i-1} \circ u_i$ to generate the updated model $\tilde{\Theta}_i$ from the current model $\tilde{\Theta}_{i-1} \in \mathbb{R}^d$.

For this setup, our objective is to provide conditions under which the algorithm pair $(A_{\mathrm{Noisy\text{-}GD}}, \bar{A}_{\mathrm{Noisy\text{-}GD}})$ satisfies objectives (1.), (2.), and (3.) as stated in the problem definition above, and to analyze the computational savings of using $\bar{A}_{\mathrm{Noisy\text{-}GD}}$ over $A_{\mathrm{Noisy\text{-}GD}}$ in terms of the gradient complexity $n(K_A - K_{\bar{A}})$.
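The interplay between $A_{\mathrm{Noisy\text{-}GD}}$ and $\bar{A}_{\mathrm{Noisy\text{-}GD}}$, and the gradient-complexity accounting $n(K_A - K_{\bar{A}})$, can be sketched as follows (a self-contained toy with a 1-D loss; all names and constants are illustrative, not the values mandated by Theorems 6 and 7):

```python
import numpy as np

def noisy_gd_steps(D, theta, K, eta=0.1, lam=0.1, sigma2=1e-4, rng=None):
    """Minimal 1-D Noisy-GD on the toy loss l(theta; x) = (theta - x)^2 / 2
    with regularizer (lam/2) * theta^2."""
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(K):
        g = np.mean([theta - x for x in D]) + lam * theta
        theta = theta - eta * g + np.sqrt(2 * eta * sigma2) * rng.standard_normal()
    return theta

rng = np.random.default_rng(1)
D = [0.0, 1.0, 2.0, 3.0]
K_A, K_Abar = 200, 20          # many steps to learn, few steps per edit

# A_Noisy-GD: learn the initial model from scratch on D_0.
theta = noisy_gd_steps(D, 0.0, K_A, rng=rng)
grads_learning = len(D) * K_A            # gradient complexity of learning

# Abar_Noisy-GD: process edit u_1 = {<1, 5.0>} by fine-tuning on D_1 = D_0 o u_1.
D[0] = 5.0
theta = noisy_gd_steps(D, theta, K_Abar, rng=rng)
grads_per_edit = len(D) * K_Abar         # gradient complexity of one edit

saving = grads_learning - grads_per_edit  # n * (K_A - K_Abar)
```

Per edit, fine-tuning costs $n K_{\bar{A}} = 80$ gradient evaluations instead of $n K_A = 800$ for retraining; the theorems below quantify how small $K_{\bar{A}}$ can be while still meeting the privacy and deletion targets.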

5.1. DATA-DELETION AND ACCURACY GUARANTEES UNDER CONVEXITY

We give the following guarantee for the pair $(A_{\mathrm{Noisy\text{-}GD}}, \bar{A}_{\mathrm{Noisy\text{-}GD}})$ under convexity of the loss $\ell(\theta; x)$.

Theorem 6 (Accuracy, privacy, deletion, and computation tradeoffs). Let constants $\lambda, \beta, L > 0$, $q > 1$, and $0 < \varepsilon_{dd} \le \varepsilon_{dp}$. Define the constant $\kappa = \frac{\lambda+\beta}{\lambda}$. Let the loss function $\ell(\theta; x)$ be twice differentiable, convex, L-Lipschitz, and β-smooth, and the regularizer be $r(\theta) = \frac{\lambda}{2}\|\theta\|_2^2$. If the learning rate is $\eta = \frac{1}{2(\lambda+\beta)}$, the gradient noise variance is $\sigma^2 = \frac{4qL^2}{\lambda \varepsilon_{dp} n^2}$, and the weight initialization distribution is $\rho = \mathcal{N}\big(0, \frac{\sigma^2}{\lambda(1-\eta\lambda/2)} I_d\big)$, then
(1.) both $A_{\mathrm{Noisy\text{-}GD}}$ and $\bar{A}_{\mathrm{Noisy\text{-}GD}}$ are $(q, \varepsilon_{dp})$-RDP for any $K_A, K_{\bar{A}} \ge 0$,
(2.) the pair $(A_{\mathrm{Noisy\text{-}GD}}, \bar{A}_{\mathrm{Noisy\text{-}GD}})$ satisfies $(q, \varepsilon_{dd})$-data-deletion under all non-adaptive r-requesters if $K_{\bar{A}} \ge 4\kappa \log \frac{\varepsilon_{dp}}{\varepsilon_{dd}}$,
(3.) and all models in $(\tilde{\Theta}_i)_{i \ge 0}$ produced by $(A_{\mathrm{Noisy\text{-}GD}}, \bar{A}_{\mathrm{Noisy\text{-}GD}}, Q)$ on any $D_0 \in \mathcal{X}^n$, where Q is any r-requester, have an excess empirical risk $\mathrm{err}(\tilde{\Theta}_i; D_i) = O\big(\frac{qd}{\varepsilon_{dp} n^2}\big)$ if
$$K_A \ge 4\kappa \log \frac{\varepsilon_{dp} n^2}{4qd}, \quad \text{and} \quad K_{\bar{A}} \ge 4\kappa \log \max\Big\{5\kappa,\ \frac{8\varepsilon_{dp} r^2}{qd}\Big\}. \tag{10}$$

The proof of Theorem 6 can be found in Appendix G.4. Our utility upper bound for data deletion matches the theoretical lower bound of $\Omega\big(\min\{1, \frac{d}{\varepsilon^2 n^2}\}\big)$ in Bassily et al. (2014) for the best attainable utility of (ε, δ)-DP algorithms on Lipschitz, smooth, strongly-convex loss functions³. Thus, our data-deletion algorithm $\bar{A}_{\mathrm{Noisy\text{-}GD}}$ incurs no additional cost in utility but saves substantial computation: it offers a computation saving of $\Omega\big(n \log \min\{\frac{n}{r}, \frac{n\varepsilon_{dd}}{qd}\}\big)$ per request while guaranteeing privacy, deletion, and optimal utility, without unsafe internal data structures. This saving is better than that of all existing unlearning algorithms in the literature that we know of, and we present a detailed comparison in Table 1.
Also, observe that for satisfying (q, ε_dp)-RDP and (q, ε_dd)-data-deletion under non-adaptive r-requesters, the number of iterations K_Ā needed is independent of the size r of the deletion batch, depending solely on the ratio ε_dp/ε_dd. However, the number of iterations required for ensuring optimal utility with differential privacy grows with r. As such, we highlight that when deletion batches are sufficiently large, i.e., r ≥ √(qd/ε_dd), ensuring optimal utility under (q, ε_dp)-RDP results in (q, ε_dd)-data-deletion for free, demonstrating a synergy between privacy and data deletion.
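The interplay between the deletion requirement on K_Ā and the "deletion for free" batch-size threshold can be made concrete with a small calculation. The formulas below follow our reading of the (reconstructed) bounds in Theorem 6, so treat the constants as indicative rather than exact:

```python
import math

def k_deletion(kappa, eps_dp, eps_dd):
    """Iterations sufficient for (q, eps_dd)-data-deletion: 4*kappa*log(eps_dp/eps_dd)."""
    return math.ceil(4 * kappa * math.log(eps_dp / eps_dd))

def free_deletion_batch_size(q, d, eps_dd):
    """Batch size r beyond which optimal-utility RDP training already deletes: sqrt(q*d/eps_dd)."""
    return math.sqrt(q * d / eps_dd)
```

For example, with κ = 2 and budgets ε_dp = 1, ε_dd = 0.1, about 19 noisy steps suffice for the deletion requirement regardless of the batch size r.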

5.2. DATA-DELETION AND UTILITY GUARANTEES UNDER NON-CONVEXITY

We give the following guarantee for the pair (A_Noisy-GD, Ā_Noisy-GD) under non-convexity of the loss ℓ(θ; x).

Theorem 7 (Accuracy, privacy, deletion, and computation tradeoffs). Let constants λ, β, L, σ², η > 0, q, B > 1, and 0 < ε_dd ≤ ε_dp < d. Let the loss function ℓ(θ; x) be (σ² log(B)/4)-bounded, L-Lipschitz, and β-smooth, the regularizer be r(θ) = (λ/2)‖θ‖²₂, and the weight initialization distribution be ρ = N(0, (σ²/λ) I_d). Then:

(1.) both A_Noisy-GD and Ā_Noisy-GD are (q, ε_dp)-RDP for any η ≥ 0 and any K_A, K_Ā ≥ 0 if σ² ≥ (qL²/(ε_dp n²)) · η max{K_A, K_Ā};

(2.) the pair (A_Noisy-GD, Ā_Noisy-GD) satisfies (q, ε_dd)-data-deletion under all non-adaptive r-requesters for any σ² > 0 if the learning rate is η ≤ λε_dd/(64dqB(β + λ)²) and the numbers of iterations satisfy

K_A ≥ (2B/(λη)) log(q log(B)/ε_dd), K_Ā ≥ K_A − (2B/(λη)) log( log(B) / (2(ε_dd + (r/n) log(B))) );   (11)

(3.) all models in the sequence (Θ_i)_{i≥0} output by (A_Noisy-GD, Ā_Noisy-GD, Q) on any D_0 ∈ X^n, where Q is an r-requester, satisfy err(Θ_i; D_i) = Õ(dq/(ε_dp n²) + (1/n)√(qε_dd/ε_dp)) when the inequalities in (11) hold.

³Recall from Remark 1 that (q, ε_dp)-RDP implies (ε, δ)-DP for q = 1 + (2/ε) log(1/δ) and ε_dp = ε/2. When ε = Θ(log(1/δ)), one can evaluate that q/ε_dp = Θ(log(1/δ)/ε²).

Table 1: Comparison of the computation savings in gradient complexity per edit request, along with the requirement of secret states, against prior unlearning algorithms. For a fair comparison, we require that each of them satisfies an (ε, δ)-data-deletion guarantee (ignoring the statelessness requirement) and has the same excess empirical risk bound. Edit requests are non-adaptive and modify r = 1 record in n-sized databases. The models' dimension is d. We assume the loss ℓ(θ; x) to be convex, 1-Lipschitz, and O(1)-smooth, and the L2 regularization constant to be O(1).

Unlearning Algorithm | Requires secret states? | Compute savings for i-th edit
Noisy-m-A-SGD [Thm. 1, (Ullah et al., 2021)] | No | Ω(√d (1 − √d/n))
Perturbed-GD [Thm. 9, (Neel et al., 2021)] | Yes | Ω(n log(εn/√d))
Perturbed-GD [Thm. 28, (Neel et al., 2021)] | No | Ω(n log(εn/(log²(id)√d)))
Noisy-GD [Thm. 6, Ours] | No | Ω(n log min{n, εn/√d})

Under non-convexity, all prior works on deletion have focused on empirical analysis of utility. As far as we know, we are the first to provide utility guarantees in this setting. Moreover, our non-convex utility bound exceeds the optimal privacy-preserving utility under convexity by only a factor of Õ((1/n)√(qε_dd/ε_dp)), which becomes small for large databases or a small deletion-to-privacy budget ratio.

Also, observe a strict computational benefit in using Ā_Noisy-GD whenever the fraction of edited records in a single update request satisfies r/n ≤ 1/2 − ε_dd/log(B). In the deletion regime where ε_dd = log(B)/4, relying on Ā_Noisy-GD rather than retraining with A_Noisy-GD is Ω(dn log(n/r)) cheaper.

Remark 4. The assumption that ℓ(θ; x) is L-Lipschitz in both Theorems 6 and 7 can be removed if the gradients ∇ℓ(Θ_ηk; x) computed in step 2 of Algorithm 1 are clipped to L (discussed in Appendix C).

6. CONCLUSIONS

We showed that unlearning guarantees are an unsuitable measure of data deletion in the online setting of adaptive edit requests and proposed a proper notion of data deletion in line with the "right to be forgotten." We also showed that differential privacy is necessary for data deletion when requests are adaptive, and proved that, under DP, secure deletion against non-adaptive requests implies secure deletion against adaptive requests. Our theoretical results on Noisy-GD based data-deletion, for both convex and non-convex losses, show a substantial computation saving over retraining, with a reliable deletion guarantee and at no additional cost in utility.

A TABLE OF NOTATIONS

D, D_i — An example database, and the database after the i-th update.
x, y — Singular data records from universe X.
η — Step size or learning rate in Noisy-GD.
σ² — Variance scaling used in the weight initialization distribution or gradient noise.
ℓ(θ; x) — Twice continuously differentiable loss function on models in R^d.
r(θ) — L2 regularizer λ‖θ‖²₂/2.
L(θ), L_D(θ) — Arbitrary optimization objective, and an r(θ)-regularized objective on D over ℓ(θ; x).
err(Θ; D) — Excess empirical risk of a random model Θ over the objective L_D.
π(D) — A mapping from X^n to distributions on R^d; sometimes these distributions are Gibbs.
Λ_D — Normalization constant of the Gibbs distribution π(D).
π_{u_i} — A distribution independent of the record deleted by request u_i on database D_{i−1}.
T_k — A map over R^d.
ρ — Weight initialization distribution for Noisy-GD.
Clip_L(·) — Operator that clips vectors in R^d to a magnitude of L.
R_q(ν ‖ ν′) — Rényi divergence of distribution ν w.r.t. ν′.
E_q(ν ‖ ν′) — q-th moment of the likelihood ratio random variable between ν and ν′.
I(ν ‖ ν′), I_q(ν ‖ ν′) — Fisher and q-Rényi information of distribution ν w.r.t. ν′.
W_2(ν, ν′) — Wasserstein distance between distributions ν and ν′.
KL(ν ‖ ν′) — Kullback-Leibler divergence of distribution ν w.r.t. ν′.
P_t, G, G* — Markov semigroup, its infinitesimal generator, and its Fokker-Planck operator.
Ent_π(f²) — Entropy of the function f² under an arbitrary distribution π.
H(·) — Differential entropy of a distribution.
LS(c) — Log-Sobolev inequality with constant c.
prox_L — Proximal mapping for objective L.

B CALCULUS REFRESHER

Given a twice continuously differentiable function L : O → R, where O is a closed subset of R^d, its gradient ∇L : O → R^d is the vector of partial derivatives

∇L(θ) = (∂L(θ)/∂θ_1, …, ∂L(θ)/∂θ_d).

Its Hessian ∇²L : O → R^{d×d} is the matrix of second partial derivatives ∇²L(θ) = [∂²L(θ)/∂θ_i∂θ_j]_{1≤i,j≤d}. Its Laplacian ∆L : O → R is the trace of its Hessian, i.e., ∆L(θ) = Tr(∇²L(θ)). Given a differentiable vector field v = (v_1, …, v_d) : O → R^d, its divergence div(v) : O → R is

div(v)(θ) = Σ_{i=1}^d ∂v_i(θ)/∂θ_i.

Some identities that we rely on:

1. The divergence of a gradient is the Laplacian, i.e., div(∇L)(θ) = Σ_{i=1}^d ∂²L(θ)/∂θ_i² = ∆L(θ).

2. For any function f : O → R and a vector field v : O → R^d with sufficiently fast decay at the border of O,
∫_O ⟨v(θ), ∇f(θ)⟩ dθ = −∫_O f(θ) div(v)(θ) dθ.

3. For any two functions f, g : O → R, at least one of whose gradients decays sufficiently fast at the border of O, the following also holds:
∫_O f(θ)∆g(θ) dθ = −∫_O ⟨∇f(θ), ∇g(θ)⟩ dθ = ∫_O g(θ)∆f(θ) dθ.

4. By Young's inequality, for two vector fields v_1, v_2 : O → R^d and any a, b > 0 such that ab = 1,
⟨v_1, v_2⟩(θ) ≤ (1/2a)‖v_1(θ)‖²₂ + (1/2b)‖v_2(θ)‖²₂.

Wherever it is clear, we drop (θ) for brevity; for example, we write div(v)(θ) as div(v).
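Identity 1 (div(∇L) = ∆L) can be sanity-checked numerically with finite differences. A small sketch on the quadratic L(θ) = ‖θ‖², whose Hessian is 2I and whose Laplacian is therefore 2d everywhere:

```python
import numpy as np

def laplacian_fd(L, theta, h=1e-4):
    """Finite-difference Laplacian: sum of second partial derivatives of L at theta."""
    d = len(theta)
    total = 0.0
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        # central second difference along coordinate i
        total += (L(theta + e) - 2 * L(theta) + L(theta - e)) / h**2
    return total

# L(theta) = ||theta||^2 has Laplacian 2d at every point
L = lambda th: float(np.dot(th, th))
```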

C LOSS FUNCTION PROPERTIES

In this section, we provide the formal definitions of various properties that we assume in the paper. Let ℓ(θ; x) : R^d × X → R be a loss function on R^d for any record x ∈ X.

Definition C.1 (Lipschitzness). A function ℓ(θ; x) is said to be L-Lipschitz continuous if for all θ, θ′ ∈ R^d and any x ∈ X, |ℓ(θ; x) − ℓ(θ′; x)| ≤ L‖θ − θ′‖₂. If ℓ(θ; x) is differentiable, then it is L-Lipschitz if and only if ‖∇ℓ(θ; x)‖ ≤ L for all θ ∈ R^d.

Definition C.2 (Boundedness). A function ℓ(θ; x) is said to be B-bounded if for all x ∈ X, its output takes values in the range [−B, B].

Definition C.3 (Convexity). A continuously differentiable function ℓ(θ; x) is said to be convex if for all θ, θ′ ∈ R^d and x ∈ X, ℓ(θ′; x) ≥ ℓ(θ; x) + ⟨∇ℓ(θ; x), θ′ − θ⟩, and is said to be λ-strongly convex if ℓ(θ′; x) ≥ ℓ(θ; x) + ⟨∇ℓ(θ; x), θ′ − θ⟩ + (λ/2)‖θ′ − θ‖²₂.

Theorem 8 ((Nesterov, 2003, Theorem 2.1.4)).

Gradient clipping with Clip_L(v) = v/max{1, ‖v‖₂/L} is equivalent to optimizing L_D(θ) = (1/|D|) Σ_{x∈D} ℓ̄(θ; x) + r(θ), where ℓ̄(θ; x) is a surrogate loss function that satisfies ∇ℓ̄(θ; x) = Clip_L(∇ℓ(θ; x)). This surrogate loss function inherits the convexity, boundedness, and smoothness properties of ℓ(θ; x), as shown below.

Lemma 10 (Gradient clipping retains convexity). If ℓ(θ; x) is a twice continuously differentiable convex function for every x ∈ X, then the surrogate loss ℓ̄(θ; x) resulting from gradient clipping is also convex for every x ∈ X.

Proof. Note that the clip operation Clip_L(v) is a closed-form solution of the orthogonal projection onto the closed ball of radius L centered at the origin, i.e., Clip_L(v) = argmin_{‖v′‖₂≤L} ‖v − v′‖₂. By properties of orthogonal projections onto closed convex sets, for every v, v′ ∈ R^d, ⟨v′ − Clip_L(v), v − Clip_L(v)⟩ ≤ 0 if and only if ‖v′‖₂ ≤ L. Therefore, for any θ ∈ R^d and x ∈ X, we have

⟨∇ℓ̄(θ + hv; x) − ∇ℓ̄(θ; x), ∇ℓ(θ; x) − ∇ℓ̄(θ; x)⟩ ≤ 0,
⟨∇ℓ̄(θ; x) − ∇ℓ̄(θ + hv; x), ∇ℓ(θ + hv; x) − ∇ℓ̄(θ + hv; x)⟩ ≤ 0,

for all unit vectors v ∈ R^d and magnitudes h > 0.
For the directional derivative of the vector field ∇ℓ̄(θ; x) along v, defined as ∇_v∇ℓ̄(θ; x) = lim_{h→0⁺} (∇ℓ̄(θ + hv; x) − ∇ℓ̄(θ; x))/h, the above two inequalities imply ⟨∇_v∇ℓ̄(θ; x), ∇ℓ(θ; x) − ∇ℓ̄(θ; x)⟩ = 0 for all v. Therefore, when ∇ℓ̄(θ; x) ≠ ∇ℓ(θ; x), we must have ∇²ℓ̄(θ; x) = 0. And when ∇ℓ(θ; x) = ∇ℓ̄(θ; x), gradients aren't clipped, which implies that the rate of change of ℓ̄(θ; x) along any direction v is

⟨∇_v∇ℓ̄(θ; x), v⟩ = lim_{h→0⁺} ⟨(∇ℓ̄(θ + hv; x) − ∇ℓ(θ; x))/h, v⟩ = vᵀ∇²ℓ(θ; x)v if ∃h > 0 s.t. ∇ℓ̄(θ + hv; x) = ∇ℓ(θ + hv; x), and 0 otherwise; in both cases this is ≥ 0.

Lemma 11 (Gradient clipping retains boundedness). If ℓ(θ; x) is a continuously differentiable and B-bounded function for every x ∈ X, then the surrogate loss ℓ̄(θ; x) resulting from gradient clipping is also B-bounded.

Proof. Since ℓ(θ; x) is continuously differentiable, its B-boundedness implies that the path integral of ∇ℓ(θ; x) along any curve between θ, θ′ ∈ R^d is less than 2B. Since the Clip_L(·) operation only shrinks the gradient magnitude, the path integral of ∇ℓ̄(θ; x) is also less than 2B. That is, the maximum and minimum values that ℓ̄(θ; x) takes differ by no more than 2B. By adjusting the constant of the path integral, we can always ensure that ℓ̄(θ; x) takes values in the range [−B, B] without affecting first-order optimization algorithms.

Lemma 12 (Gradient clipping retains smoothness). If ℓ(θ; x) is a continuously differentiable and β-smooth function for every x ∈ X, then the surrogate loss ℓ̄(θ; x) resulting from gradient clipping is also β-smooth for every x ∈ X.

Proof. Note that the gradient clipping operation is equivalent to an orthogonal projection onto the ball of radius L, i.e., Clip_L(v) = argmin_{v′}{‖v − v′‖₂ : v′ ∈ R^d, ‖v′‖₂ ≤ L}. Since orthogonal projection onto a closed convex set is a 1-Lipschitz operation, for any θ, θ′ ∈ R^d,

‖∇ℓ̄(θ; x) − ∇ℓ̄(θ′; x)‖₂ ≤ ‖∇ℓ(θ; x) − ∇ℓ(θ′; x)‖₂ ≤ β‖θ − θ′‖₂.
Additionally, the surrogate loss ℓ̄(θ; x) is twice differentiable almost everywhere if ℓ(θ; x) is smooth, which follows from the following Rademacher's theorem.

Theorem 13 (Rademacher's Theorem (Nekvinda & Zajíček, 1988)). If f : R^n → R^m is Lipschitz continuous, then f is differentiable almost everywhere in R^n.

All our results in Section 5 rely only on the above four properties of losses and therefore apply with gradient clipping instead of the Lipschitzness assumption.
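The clipping operator used throughout this appendix, Clip_L(v) = v/max{1, ‖v‖₂/L}, is the Euclidean projection onto the L-ball. A short sketch, with a numeric check of the non-expansive (1-Lipschitz) property that the proof of Lemma 12 relies on:

```python
import numpy as np

def clip_L(v, L):
    """Project v onto the closed L2 ball of radius L: v / max(1, ||v||_2 / L)."""
    return v / max(1.0, np.linalg.norm(v) / L)
```

Vectors inside the ball are left untouched; vectors outside are rescaled to the boundary, and the map never increases pairwise distances.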

D DIVERGENCE MEASURES AND THEIR PROPERTIES

Let Θ, Θ′ ∈ O be two random variables with probability measures ν, ν′ respectively. We abuse notation to denote the respective probability densities by ν, ν′ as well. We say that ν is absolutely continuous with respect to ν′ (denoted ν ≪ ν′) if for all measurable sets O ⊂ O, ν(O) = 0 whenever ν′(O) = 0.

Definition D.1 ((ε, δ)-indistinguishability (Dwork et al., 2014)). We say ν and ν′ are (ε, δ)-indistinguishable if for all O ⊂ O,
P_{Θ∼ν}[Θ ∈ O] ≤ e^ε P_{Θ′∼ν′}[Θ′ ∈ O] + δ and P_{Θ′∼ν′}[Θ′ ∈ O] ≤ e^ε P_{Θ∼ν}[Θ ∈ O] + δ.

In this paper, we measure indistinguishability in terms of Rényi divergence.

Definition D.2 (Rényi divergence (Rényi et al., 1961)). The Rényi divergence of ν w.r.t. ν′ of order q > 1 is defined as

R_q(ν ‖ ν′) = (1/(q − 1)) log E_q(ν ‖ ν′), where E_q(ν ‖ ν′) = E_{θ∼ν′}[(ν(θ)/ν′(θ))^q],

when ν is absolutely continuous w.r.t. ν′ (denoted ν ≪ ν′). If ν is not absolutely continuous w.r.t. ν′, we say R_q(ν ‖ ν′) = ∞. We abuse the notation R_q(Θ ‖ Θ′) to denote the divergence R_q(ν ‖ ν′) between the measures of Θ and Θ′.

A bound on Rényi divergence implies a one-directional (ε, δ)-indistinguishability, as described below.

Theorem 14 (Conversion theorem for Rényi divergence (Mironov, 2017, Proposition 3)). Let q > 1 and ε₀ > 0. If distributions ν, ν′ satisfy R_q(ν ‖ ν′) ≤ ε₀, then for any O ⊂ O,
P_{Θ∼ν}[Θ ∈ O] ≤ e^ε P_{Θ′∼ν′}[Θ′ ∈ O] + δ, where ε = ε₀ + log(1/δ)/(q − 1), for any 0 < δ < 1.

We use the following properties of Rényi divergence in some of our proofs.

Theorem 15 (Monotonicity of Rényi divergence (Mironov, 2017, Proposition 9)). For 1 ≤ q₀ < q and arbitrary probability measures ν, ν′ over O, R_{q₀}(ν ‖ ν′) ≤ R_q(ν ‖ ν′).

Theorem 16 (Rényi composition (Mironov, 2017, Proposition 1)). If A_1, …, A_k are randomized algorithms satisfying, respectively, (q, ε₁)-RDP, …, (q, ε_k)-RDP, then their composed mechanism defined as (A_1(D), …, A_k(D)) is (q, ε₁ + ⋯ + ε_k)-RDP. Moreover, the i-th algorithm can be chosen on the basis of the outputs of algorithms A_1, …, A_{i−1}.
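For isotropic Gaussians with a shared covariance, the Rényi divergence has the closed form q‖μ − μ′‖²/(2σ²) (Van Erven & Harremos, 2014), which makes Theorem 14's conversion easy to exercise. A small sketch (the function names are ours):

```python
import math

def renyi_gaussians(mu, mu_prime, sigma2, q):
    """R_q( N(mu, sigma2*I) || N(mu', sigma2*I) ) = q * ||mu - mu'||^2 / (2 * sigma2)."""
    diff2 = sum((a - b) ** 2 for a, b in zip(mu, mu_prime))
    return q * diff2 / (2 * sigma2)

def rdp_to_eps(eps0, q, delta):
    """Theorem 14: an R_q bound of eps0 yields one-sided (eps0 + log(1/delta)/(q-1), delta)."""
    return eps0 + math.log(1 / delta) / (q - 1)
```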
Theorem 17 (Weak triangle inequality for Rényi divergence (Mironov, 2017, Proposition 12)). For any distribution ρ on O, the Rényi divergence of ν w.r.t. ν′ satisfies the weak triangle inequality

R_q(ν ‖ ν′) ≤ R_q(ν ‖ ρ) + R_∞(ρ ‖ ν′).

Another popular notion of information divergence is the Kullback-Leibler divergence.

Definition D.3 (Kullback-Leibler divergence (Kullback & Leibler, 1951)). The Kullback-Leibler (KL) divergence of ν w.r.t. ν′ is defined as

KL(ν ‖ ν′) = E_{θ∼ν}[log(ν(θ)/ν′(θ))].

Rényi divergence generalizes Kullback-Leibler divergence (Van Erven & Harremos, 2014), as lim_{q→1} R_q(ν ‖ ν′) = KL(ν ‖ ν′). Some other divergence notions that we rely on are the following.

Definition D.4 (Wasserstein distance (Vaserstein, 1969)). The Wasserstein distance between ν and ν′ is

W_2(ν, ν′) = inf_Π ( E_{(Θ,Θ′)∼Π}[‖Θ − Θ′‖²₂] )^{1/2},

where Π is any joint distribution on O × O with ν and ν′ as its marginal distributions.

Definition D.5 (Relative Fisher information (Otto & Villani, 2000)). If ν ≪ ν′ and ν/ν′ is differentiable, then the relative Fisher information of ν with respect to ν′ is defined as

I(ν ‖ ν′) = E_{θ∼ν}[ ‖∇ log(ν(θ)/ν′(θ))‖²₂ ].

Definition D.6 (Relative Rényi information (Vempala & Wibisono, 2019)). Let q > 1. If ν ≪ ν′ and ν/ν′ is differentiable, then the relative Rényi information of ν with respect to ν′ is defined as

I_q(ν ‖ ν′) = (4/q²) E_{θ∼ν′}[ ‖∇(ν(θ)/ν′(θ))^{q/2}‖²₂ ] = E_{θ∼ν′}[ (ν(θ)/ν′(θ))^{q−2} ‖∇(ν(θ)/ν′(θ))‖²₂ ].

E PROOFS FOR SECTION 3

Theorem 1. For every ε > 0, there exists a pair (A, Ā) of algorithms that satisfies (ε, 0)-unlearning under a publish function f_pub such that for all non-adaptive 1-requesters Q, there exists an adversary that can correctly infer the identity of a record deleted at any arbitrary edit step i ≥ 1 by observing only the post-edit releases φ_{≥i} = (φ_i, φ_{i+1}, …).

Proof. For a query h : X → {0, 1}, consider the task of learning the count over a database that is being edited online by a non-adaptive 1-requester Q.
Since Q is non-adaptive by assumption, the setting is equivalent to the entire edit sequence {u_i}_{i≥1} being fixed before the interaction. We design an algorithm pair (A, Ā) for this task with secret model space O = N³ and published outcome space Φ = R, with the publish function f_pub(⟨a, b, c⟩) = a + b/c + Lap(1/ε) (with the convention that b/c = 0 if b = c = 0). At any step i ≥ 0, our internal model Θ_i = ⟨cnt_i, del_i, i⟩ encodes the current count of h on database D_i, the count of h on records previously deleted by u_{≤i}, and the current step index i. Our learning algorithm initializes the secret model as Θ_0 = A(D_0) = ⟨Σ_{x∈D_0} h(x), 0, 0⟩, and, for an edit request u_i = {⟨ind_i, y_i⟩}, our algorithm Ā updates the secret model Θ_{i−1} → Θ_i following the rule Θ_i = Ā(D_{i−1}, u_i, Θ_{i−1}) = ⟨cnt_i, del_i, i⟩, where

cnt_i = cnt_{i−1} + h(y_i) − h(D_{i−1}[ind_i]), del_i = del_{i−1} + h(D_{i−1}[ind_i]).

Note that for all i ≥ 1, ∆_i := del_i/i ∈ [0, 1]. Therefore, from the properties of the Laplace mechanism (Dwork et al., 2014), it is straightforward to see that for all i ≥ 1,

f_pub(Ā(D_{i−1}, u_i, Θ_{i−1})) | u_{≤i} = Σ_{x∈D_i} h(x) + ∆_i + Lap(1/ε) ≈_{ε,0} Σ_{x∈D_i} h(x) + Lap(1/ε) = f_pub(A(D_i)).

Hence, Ā is an (ε, 0)-unlearning algorithm for A under f_pub. To show that an adversary can still infer the identity of the record deleted by update u_i = (ind_i, ·), consider a database D′_{i−1} that differs from D_{i−1} only at index ind_i, such that h(D_{i−1}[ind_i]) ≠ h(D′_{i−1}[ind_i]). Let the random variable sequences φ_{≥i} and φ′_{≥i} denote the releases by Ā in the scenarios where the (i−1)-th database was D_{i−1} and D′_{i−1} respectively. The divergence between these two sequences reflects the capacity of any adversary to infer the record deleted by u_i. Since the databases are identical after u_i, i.e., D_{j−1} • u_j = D′_{j−1} • u_j for all j ≥ i, both φ_j and φ′_j are independent Laplace distributions shifted from one another by exactly 1/j units.
Therefore,

max_{O⊂Φ*} log( P[φ_{≥i} ∈ O] / P[φ′_{≥i} ∈ O] ) = Σ_{j=i}^∞ max_{O_j⊂R} log( P[φ_j ∈ O_j] / P[φ′_j ∈ O_j] ) = Σ_{j=i}^∞ ε/j = ∞.

Theorem 2. There exists a pair (A, Ā) of algorithms that satisfies (0, 0)-adaptive-unlearning for the identity publish function f_pub(θ) = θ such that, by designing a 1-adaptive 1-requester Q, an adversary, even with no access to Q's interaction transcript, can infer the identity of a record deleted at any arbitrary edit step i > 3 with probability at least 1 − (1/2)^{i−3} from the post-edit release φ_i.

Proof. Let the data universe X, the internal state space O, and the publishable outcome space Φ all be R. Consider the task of releasing a sequence of medians using the function med : R* → R in the online setting, where the initial database D_0 ∈ X^n is modified by some adaptive requester Q. Given a database D ∈ X^n, our learning algorithm is defined as A(D) = med(D). For an arbitrary edit request u ∈ U^r, our unlearning algorithm is defined as Ā(D, u, ·) = med(D • u). Let the publish function f_pub : O → Φ be the identity function, i.e., f_pub(θ) = θ. For any initial database D_0 ∈ X^n and an adaptive sequence (u_i)_{i≥1} generated by any ∞-adaptive 1-requester Q, note that

f_pub(Ā(D_{i−1}, u_i, ·)) = f_pub(A(D_i)), for all i ≥ 1 and any · ∈ O.

Therefore, Ā is a (0, 0)-adaptive unlearning algorithm for A under f_pub. Now suppose that n is odd and D_0 consists of unique entries. W.l.o.g., assume that the median record med(D_0) is at index ind_m and that its owner deletes it at step i by sending the non-adaptive edit request u_i = {⟨ind_m, y⟩} with y ≠ med(D_0). We design the following 1-adaptive 1-requester Q that sends edit requests in the first i − 1 steps to ensure, with high probability, that the published outcome at step i remains the deleted record, i.e., med(D_i) = med(D_0):

Q(φ_0; u_1, u_2, …, u_{j−1}) = {⟨ind_j, φ_0⟩} for all 1 ≤ j < i,

where ind_j is randomly sampled from [n] \ {ind_1, …, ind_{j−1}} without replacement.
Note that by the end of the interaction, Q replaces at least i − 2 unique records in D_0 with φ_0 = med(D_0). If one of those original records was larger than med(D_0) and another was smaller than med(D_0), then it is guaranteed that med(D_i) = med(D_0). Therefore, P[med(D_i) = med(D_0)] is at least

P[ ∃ ind_l, ind_u ∈ {ind_1, …, ind_{i−1}} s.t. D_0[ind_l] < D_0[ind_m] < D_0[ind_u] ] ≥ 1 − 2 · C(⌊n/2⌋, i−2)/C(n, i−2) ≥ 1 − (1/2)^{i−3},

where C(·, ·) denotes the binomial coefficient.
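The final combinatorial inequality in this proof (as we reconstruct it, with binomial coefficients 2·C(⌊n/2⌋, k)/C(n, k) and k = i − 2) can be checked exhaustively for small parameters:

```python
from math import comb

def miss_probability_bound(n, k):
    """2 * C(floor(n/2), k) / C(n, k): union bound on the event that all k
    replaced indices land on one side of the median of an n-record database."""
    return 2 * comb(n // 2, k) / comb(n, k)
```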

E.1 UNSOUNDNESS AND INCOMPLETENESS OF OFFLINE UNLEARNING DEFINITIONS

In this subsection, we show that our criticisms of both the soundness and the completeness of unlearning notions in Section 3 also apply to the one-stage unlearning definitions of Guo et al. (2019), Sekhari et al. (2021), and Neel et al. (2021).

Definition E.2 ((ε, δ)-unlearning (Sekhari et al., 2021)). For all D ⊂ X of size n and deletion subsets S ⊂ D such that |S| ≤ m, a learning algorithm A and an unlearning algorithm Ā satisfy (ε, δ)-unlearning if

Ā(T(D), S, A(D)) ≈_{ε,δ} Ā(T(D \ S), ∅, A(D \ S)).

Unsoundness. Definitions E.1 and E.2 make no assumptions about the dependence between the deletion request S and the learned model A(D). So, the request S can depend on A(D). This dependence is common in the real world; for example, a user deletes her information if she doesn't like what the model A(D) reveals about her. We present the following construction, along the same lines as our proof of Theorem 2, to show that Definitions E.1 and E.2 are unsound. For the universe of records X = {−2, −1, 1, 2}, consider a learning algorithm A together with the model-dependent deletion request

S = {x < 0 : x ∈ X} if A(D) < 0, and S = {x > 0 : x ∈ X} otherwise.

Note that Definition E.3 assumes that the request S is selected independently of the learned model A(D). So, our construction does not apply to it, leaving open the possibility that that definition is sound. We remark, however, that algorithms satisfying it cannot be trusted in settings where we expect some dependence between deletion requests and the learned models.

Incompleteness. Definitions E.1, E.2, and E.3 are also incomplete. Consider an unlearning algorithm Ā that outputs a fixed output x_1 ∈ X if the deletion request S = ∅ and another fixed output x_2 ≠ x_1 if S ≠ ∅. It is easy to see that Ā is a valid deletion algorithm, as its output does not depend on the input database D or the learned model A(D). However, note that Ā does not satisfy the unlearning Definition E.2 for any learning algorithm A.
And, for the learning algorithm A(D) = Σ_{x∈D} x, one can also verify that the pair (A, Ā) does not satisfy Definitions E.1 and E.3 either.
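The incompleteness counterexample above is almost trivial to express in code; the names `x1` and `x2` below are placeholders for the two fixed outputs:

```python
def perfect_deleter(D, S, learned_model, x1="model_1", x2="model_2"):
    """The incompleteness example: an unlearner that ignores the database D and
    the learned model entirely, so its output carries no record information."""
    return x1 if len(S) == 0 else x2
```

Although this algorithm leaks nothing about any record, Ā(T(D), S, A(D)) = x2 while Ā(T(D\S), ∅, A(D\S)) = x1 whenever S ≠ ∅, so the indistinguishability required by Definition E.2 fails for every reasonable (ε, δ).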

F PROOFS FOR SECTION 4

Theorem 3 (Data-deletion Definition 4.1 is sound). If the algorithm pair (A, Ā) satisfies the (q, ε)-data-deletion guarantee under all p-adaptive r-requesters, then, even with the power of designing a p-adaptive r-requester Q that interacts with the curator before the deletion of a target record at any step i ≥ 1, any adversary observing only the post-deletion releases (Θ_i, Θ_{i+1}, …) has its membership inference advantage for inferring a deleted target bounded as

Adv(MI) ≤ (q/(q − 1)) · e^{ε(q−1)/q} · [2(q − 1)]^{1/q} − 1.   (47)

Proof. For an arbitrary step i ≥ 1, suppose one of the replacement operations in the edit request u_i ∈ U^r replaces a record at index 'ind' of the database D_{i−1} with 'y'. In the worst case, this record might have been present from the start, i.e., D_{i−1}[ind] = D_0[ind], and influenced all the decisions of the adaptive requester Q in the edit steps 1, …, i − 1. To prove soundness, we need to show that if (A, Ā) satisfies (q, ε)-data-deletion, then even in this worst-case scenario, no adaptive adversary can design a membership inference test MI(Θ_i, Θ_{i+1}, …) ∈ {0, 1} that distinguishes with high probability the null hypothesis H_0 = {D_0[ind] = x} from the alternate hypothesis H_1 = {D_0[ind] = x′} for any x, x′ ∈ X. That is, the advantage of any test MI, defined as

Adv(MI) := P[MI(Θ_i, Θ_{i+1}, …) = 1 | H_0] − P[MI(Θ_i, Θ_{i+1}, …) = 1 | H_1],

must be small. Since (A, Ā) satisfies (q, ε)-data-deletion for any p-adaptive r-requester Q, we know from Definition 4.1 that there exists a mapping π^Q_i such that for all D_0 ∈ X^n, the model Θ_i generated by the interaction of (A, Ā, Q) on D_0 after the i-th edit satisfies

R_q(Θ_i ‖ π^Q_i(D_0 • ⟨ind, y⟩)) ≤ ε.

As the database D_0 • ⟨ind, y⟩ is identical under both hypotheses H_0 and H_1, we have R_q(Θ_i | H_b ‖ Θ) ≤ ε for b ∈ {0, 1}, where Θ ∼ π^Q_i(D_0 • ⟨ind, y⟩).
From the Rényi-divergence-to-(ε, δ)-indistinguishability conversion described in Remark 1, we get

P[MI*(Θ_i) = 1 | H_0] ≤ e^{ε′(δ)} P[MI*(Θ) = 1] + δ, and P[MI*(Θ_i) = 0 | H_1] ≤ e^{ε′(δ)} P[MI*(Θ) = 0] + δ,

where ε′(δ) = ε + log(1/δ)/(q − 1), for any 0 < δ < 1. Adding the two inequalities, we get:

Adv(MI) ≤ Adv(MI*) = P[MI*(Θ_i) = 1 | H_0] − P[MI*(Θ_i) = 1 | H_1] ≤ min_δ ( e^{ε′(δ)} − 1 + 2δ ) = (q/(q − 1)) · e^{ε(q−1)/q} · [2(q − 1)]^{1/q} − 1.

Theorem 4 (DP is necessary for data-deletion). If a learning algorithm A : X^n → O is not (0, δ)-DP with respect to the replacement of a single record, and a deletion algorithm Ā : X^n × U × O → O is not (0, δ)-DP with respect to the replacement of a single record that is not being deleted, then the pair (A, Ā) cannot satisfy (q, δ⁴/2)-data-deletion under 1-adaptive 1-requesters for any q > 1.

Proof. If A is not (0, δ)-DP with respect to the replacement of a single record, then there exists a pair of neighbouring databases D, D′ such that

TV(A(D); A(D′)) = sup_{O∈O} |P[A(D) ∈ O] − P[A(D′) ∈ O]| > δ.

Similarly, if Ā is not (0, δ)-DP with respect to the replacement of a single record that is not being deleted, then there exists a pair of databases D̄, D̄′ and an edit request ū ∈ U¹ such that D̄ • ū and D̄′ • ū are neighbouring, and for all θ ∈ O,

TV(Ā(D̄, ū, θ); Ā(D̄′, ū, θ)) = sup_{O∈O} |P[Ā(D̄, ū, θ) ∈ O] − P[Ā(D̄′, ū, θ) ∈ O]| > δ.

Since the TV distance is bounded from below in both cases, there exist tests Test, T̄est : O → {0, 1} such that

Adv(Test; D, D′) := P[Test(A(D)) = 1] − P[Test(A(D′)) = 1] > δ,

and, for all θ ∈ O,

Adv(T̄est; D̄, D̄′, ū) := P[T̄est(Ā(D̄, ū, θ)) = 1] − P[T̄est(Ā(D̄′, ū, θ)) = 1] > δ.

Assume w.l.o.g. that ū replaces the record at index n and that the edited databases D̄ • ū, D̄′ • ū still differ at index 1. Also assume that D, D′ differ at index n.
Recall from Definition 4.1 that satisfying (q, δ⁴/2)-data-deletion under 1-adaptive 1-requesters requires the existence of a map π^Q_n : X^n → O for each Q such that for all D_0 ∈ X^n,

R_q( Ā(D_{n−1}, u_n, Θ_{n−1}) ‖ π^Q_n(D_0 • u_n) ) ≤ δ⁴/2.   (55)

To prove the theorem statement, we show that for a starting database D_0 ∈ {D, D′} and an edit request u_n = ū that deletes the differing record of the two choices of D_0 at edit step n, there exists a 1-adaptive 1-requester Q sending adaptive edit requests u_1, …, u_{n−1} in the first n − 1 steps such that no map π^Q_n satisfies (55) for both choices of D_0. Consider the following construction of a 1-adaptive 1-requester Q that only observes the first model Θ_0 = A(D_0) and generates the edit requests (u_1, …, u_{n−1}) as

Q(Θ_0; u_1, u_2, …, u_{i−1}) = ⟨i, D̄[i]⟩ if Test(Θ_0) = 1, and ⟨i, D̄′[i]⟩ otherwise.

So it must be true that the total variation distance between Θ_n given D_0 = D and given D_0 = D′ is lower bounded as

TV( Θ_n |_{D_0=D}; Θ_n |_{D_0=D′} ) > δ².

From the triangle inequality for total variation distance, Pinsker's inequality, and the monotonicity of Rényi divergence w.r.t. the order q, note that for a random variable Θ ∈ O with any arbitrary distribution,

δ² < TV( Θ_n |_{D_0=D}; Θ_n |_{D_0=D′} ) ≤ TV( Θ_n |_{D_0=D}; Θ ) + TV( Θ_n |_{D_0=D′}; Θ ) ≤ √(KL( Θ_n |_{D_0=D} ‖ Θ )/2) + √(KL( Θ_n |_{D_0=D′} ‖ Θ )/2) ≤ 2 √( max{ R_q( Θ_n |_{D_0=D} ‖ Θ ), R_q( Θ_n |_{D_0=D′} ‖ Θ ) } / 2 ).

This implies that for all random variables Θ and all q > 1,

max{ R_q( Ā(D_{n−1}, u_n, Θ_{n−1}) |_{D_0=D} ‖ Θ ), R_q( Ā(D_{n−1}, u_n, Θ_{n−1}) |_{D_0=D′} ‖ Θ ) } > δ⁴/2.   (58)

But to satisfy (q, δ⁴/2)-data-deletion under 1-adaptive 1-requesters, there must exist a mapping π^Q_n for which (55) holds for both choices of D_0 ∈ {D, D′}. Since the random variable π^Q_n(D_0 • u_n) is identically distributed in our construction for both choices of D_0 ∈ {D, D′}, from (58) we have that no such map π^Q_n can exist.

Theorem 5 (Reduction from Adaptive to Non-adaptive Data Deletion).
If an algorithm pair (A, Ā) satisfies (q, ε_dd)-data-deletion under all non-adaptive r-requesters and (q, ε_dp)-Rényi differential privacy, then it also satisfies (q, ε_dd + pε_dp)-data-deletion under all p-adaptive r-requesters.

Proof. To prove this theorem, we need to show that for any p-adaptive r-requester Q, there exists a construction of a map π^Q_i : X^n → O such that for all D_0 ∈ X^n, the sequence of models (Θ_i)_{i≥0} generated by the interaction of (Q, A, Ā) on D_0 satisfies the following inequality for all i ≥ 1:

R_q( Ā(D_{i−1}, u_i, Θ_{i−1}) ‖ π^Q_i(D_0 • ⟨ind, y⟩) ) ≤ ε_dd + pε_dp, for all u_i ∈ U^r and ⟨ind, y⟩ ∈ u_i.   (59)

Fix a database D_0 ∈ X^n and an edit request u_i ∈ U^r. Let D′_0 ∈ X^n be a neighbouring database defined as D′_0 = D_0 • ⟨ind, y⟩ for an arbitrary replacement operation ⟨ind, y⟩ ∈ u_i. Given any p-adaptive r-requester Q, let (Θ_i)_{i≥0} and (U_i)_{i≥1} be the sequences of released-model and edit-request random variables generated by Q's interaction with (A, Ā) on the initial database D_0. Similarly, let (Θ′_i)_{i≥0} and (U′_i)_{i≥1} be the corresponding sequences generated by the interaction of (Q, A, Ā) on D′_0. Since (A, Ā) is assumed to satisfy the (q, ε_dd)-data-deletion guarantee under non-adaptive r-requesters, recall from Remark 3 that there exists a mapping π : X^n → O such that for any fixed edit sequence u_{≤i} := (u_1, u_2, …, u_i),

R_q( Θ_i |_{U_{≤i}=u_{≤i}} ‖ π(D_0 • u_{≤i}) ) ≤ ε_dd   (60)
⟹ R_q( Ā(D_0 • U_{<i}, u_i, Θ_{i−1}) |_{U_{<i}=u_{<i}} ‖ π(D_0 • U_{<i} • u_i) |_{U_{<i}=u_{<i}} ) ≤ ε_dd.   (61)

Note that since the replacement operation ⟨ind, y⟩ is part of the edit request u_i, we have D_0 • U_{<i} • u_i = D′_0 • U_{<i} • u_i. Moreover, since the sequence U′_{<i} of edit requests is generated by the interaction of (Q, A, Ā) on D′_0 = D_0 • ⟨ind, y⟩ and the i-th edit request u_i is fixed beforehand, we can define a valid construction of a map π^Q_i : X^n → O as per Definition 4.1 as follows:

π^Q_i(D_0 • ⟨ind, y⟩) = π(D′_0 • U′_{<i} • u_i).
For brevity, let Θ_u = Ā(D_0 • U_{<i}, u_i, Θ_{i−1}) and Θ′_u = π^Q_i(D_0 • ⟨ind, y⟩). For this construction, we prove the requisite bound in (59) as follows:

R_q( Θ_u ‖ Θ′_u )
≤ R_q( (Θ_u, U_{<i}) ‖ (Θ′_u, U′_{<i}) )   (data processing inequality (Van Erven & Harremos, 2014, Theorem 1))
= (1/(q−1)) log Σ_{u_{<i}} ∫_θ J(θ, u_{<i})^q / J′(θ, u_{<i})^{q−1} dθ   (J and J′ are the joint PDFs of (Θ_u, U_{<i}) and (Θ′_u, U′_{<i}))
= (1/(q−1)) log Σ_{u_{<i}} ( P[U_{<i} = u_{<i}]^q / P[U′_{<i} = u_{<i}]^{q−1} ) ∫_θ p_{Θ_u | U_{<i}=u_{<i}}(θ)^q / p_{Θ′_u | U′_{<i}=u_{<i}}(θ)^{q−1} dθ
≤ (1/(q−1)) log Σ_{u_{<i}} ( P[U_{<i} = u_{<i}]^q / P[U′_{<i} = u_{<i}]^{q−1} ) exp((q − 1)ε_dd)   (from (61))
= ε_dd + R_q( U_{<i} ‖ U′_{<i} )
≤ ε_dd + R_q( (Θ_{s_1}, …, Θ_{s_p}) ‖ (Θ′_{s_1}, …, Θ′_{s_p}) )   (if Q sees outputs at steps s_1, …, s_p)
≤ ε_dd + pε_dp.   (via Rényi composition)

F.1 CONTRASTING OUR REDUCTION THEOREM 5 WITH GUPTA ET AL. (2021)'S RESULTS

The recent work of Gupta et al. (2021) also studies adaptive data deletion and proves a reduction from an adaptive unlearning guarantee to a non-adaptive unlearning guarantee (their Definition 2.6) under differential privacy. We remark that the reduction in Theorem 3.1 of Gupta et al. (2021) relies on DP with respect to a change in the description of the learning/unlearning algorithm's coins, and not with respect to the standard replacement of records. In contrast, our Theorem 5 presents a reduction from an adaptive to a non-adaptive data-deletion guarantee under DP with respect to the standard replacement of records. We emphasize that these two theorems are fundamentally different. The adaptive unlearning definition of Gupta et al. (2021) is designed to ensure, with high probability, that no adaptive requester Q can force the output distribution of the unlearning algorithm Ā(D_{i−1}, u_i, Θ_{i−1}) to diverge substantially from that of the retraining algorithm A(D_i). Such an attack is possible against stateful unlearning algorithms that rely on persistent structures randomized only once during initialization, for example, the initial partitioning of the start database D_0 in Bourtoule et al.
(2021)'s SISA algorithm. Gupta et al. (2021) show in their Theorem 5.1 that an adaptive update requester Q can interactively send deletion requests u_1, …, u_i to SISA so that the partitioning of the remaining records in D_i = D_0 • u_1 ⋯ • u_i follows a pattern that is unlikely to occur upon repartitioning of D_i when executing A(D_i). As proved in their reduction Theorem 3.1, a straightforward way to prevent this is to ensure that the uncertainty regarding the persistent structures remains private from an adversary observing the unlearned model for long periods of time. Hence the proof of their reduction from an adaptive to a non-adaptive unlearning guarantee relies on DP with respect to the coins of the unlearning algorithm.

Our work shows that satisfying the adaptive unlearning definition of Gupta et al. (2021) still does not guarantee data deletion. In Theorem 2, we demonstrate that there exists an algorithm pair (A, Ā) satisfying (a strictly stronger version of) adaptive unlearning (Gupta et al., 2021, Definition 2.3) that still causes blatant non-privacy of deleted records in post-deletion releases. The vulnerability we identify occurs because an adaptive requester can learn the identity of any target record before it is deleted and re-encode it back into the curator's database by sending edit requests. Because of this, an adversary (who knows how the adaptive requester works but does not have access to the requester's interaction transcript) can extract the identity of the target record from the model released after processing the deletion request. In our work, we argue that a reliable (and necessary) way to prevent this attack is to ensure that no adaptive requester ever learns the identity of a target record from the p pre-deletion model releases it has access to. Consequently, our reduction in Theorem 5 from adaptive to non-adaptive requests relies on differential privacy with respect to the replacement of records instead.
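Stepping back to Theorem 3, its closed-form advantage bound is the exact minimum of δ ↦ e^{ε + log(1/δ)/(q−1)} − 1 + 2δ. A numeric sketch confirming our reconstruction of that formula against a brute-force grid search:

```python
import math

def adv_closed_form(eps, q):
    """(q/(q-1)) * exp(eps*(q-1)/q) * (2*(q-1))**(1/q) - 1, the bound in (47)."""
    return q / (q - 1) * math.exp(eps * (q - 1) / q) * (2 * (q - 1)) ** (1 / q) - 1

def adv_grid_min(eps, q, steps=200000):
    """Brute-force minimum of exp(eps + log(1/delta)/(q-1)) - 1 + 2*delta over delta."""
    best = float("inf")
    for j in range(1, steps):
        delta = j / steps
        best = min(best, math.exp(eps + math.log(1 / delta) / (q - 1)) - 1 + 2 * delta)
    return best
```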
G ADDITIONAL PRELIMINARIES AND PROOFS FOR SECTION 5

G.1 LANGEVIN DIFFUSION AND MARKOV SEMIGROUPS

The Langevin diffusion process on R^d with noise variance σ² under the influence of a potential L : R^d → R is characterized by the Stochastic Differential Equation (SDE) dΘ_t = -∇L(Θ_t)dt + √(2σ²) dZ_t, where dZ_t = Z_{t+dt} - Z_t ∼ √dt · N(0, I_d) is the d-dimensional Wiener process. We present some preliminary knowledge from diffusion theory used in our analysis. Let p_t(θ_0, θ_t) denote the probability density function describing the distribution of Θ_t on starting from Θ_0 = θ_0 at time t = 0. For SDE (63), the associated Markov semigroup P is defined as a family of operators (P_t)_{t≥0} such that an operator P_t sends any real-valued bounded measurable function f : R^d → R to P_t f(θ_0) = E[f(Θ_t)|Θ_0 = θ_0] = ∫ f(θ_t) p_t(θ_0, θ_t) dθ_t. The infinitesimal generator G := lim_{s→0} (1/s)[P_{t+s} - P_t] for this diffusion semigroup is Gf = σ²Δf - ⟨∇L, ∇f⟩. This generator G, when applied to a function f(θ_t), gives the infinitesimal change in the value of f when θ_t undergoes diffusion as per (63) for dt time. That is, ∂_t P_t f(θ_0) = ∂_t ∫ p_t(θ_0, θ_t) f(θ_t) dθ_t = ∫ p_t(θ_0, θ_t) Gf(θ_t) dθ_t. The dual of G is the Fokker-Planck operator G*, defined as the adjoint of the generator G in the sense that ∫ f G*g dθ = ∫ g Gf dθ for all real-valued bounded measurable functions f, g : R^d → R. Note from (66) that this operator provides an alternative way to represent the rate of change of f at time t: ∂_t P_t f(θ_0) = ∫ f(θ_t) G* p_t(θ_0, θ_t) dθ_t. Put simply, the Fokker-Planck operator gives the infinitesimal change in the distribution of Θ_t with respect to time. For the Langevin diffusion SDE (63), the Fokker-Planck equation is the following: ∂_t p_t(θ) = G* p_t(θ) = div(p_t(θ)∇L(θ)) + σ²Δp_t(θ).
From this Fokker-Planck equation, one can verify that the stationary (or invariant) distribution π of Langevin diffusion, which is the solution of ∂_t p_t = 0, follows the Gibbs distribution π(θ) ∝ e^{-L(θ)/σ²}. Since π is the stationary distribution, note that for any measurable bounded function f : R^d → R, E_π[Gf] = ∫ f G*π dθ = 0.
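The convergence of the Langevin SDE to its Gibbs stationary distribution can be checked numerically with a simple Euler-Maruyama discretization. The sketch below, with illustrative constants, uses the quadratic potential L(θ) = (λ/2)‖θ‖², for which π ∝ e^{-L(θ)/σ²} is the Gaussian N(0, (σ²/λ)I):

```python
import numpy as np

# Euler-Maruyama simulation of the Langevin SDE
#   dTheta_t = -grad L(Theta_t) dt + sqrt(2 sigma^2) dZ_t
# with quadratic potential L(theta) = (lam/2)||theta||^2, whose Gibbs
# stationary distribution pi ∝ exp(-L/sigma^2) is N(0, (sigma^2/lam) I).
# All constants are illustrative.
rng = np.random.default_rng(0)
lam, sigma2, dt, steps, chains, d = 1.0, 0.5, 1e-2, 5000, 2000, 2

theta = np.zeros((chains, d))              # start all chains at the origin
for _ in range(steps):
    grad = lam * theta                     # grad L(theta) = lam * theta
    noise = rng.standard_normal((chains, d))
    theta = theta - grad * dt + np.sqrt(2 * sigma2 * dt) * noise

emp_var = theta.var()                      # approaches sigma^2 / lam = 0.5
```

The per-coordinate empirical variance of the chains approaches σ²/λ, matching the Gibbs distribution, up to a small O(dt) discretization bias.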

G.2 ISOPERIMETRIC INEQUALITIES AND THEIR PROPERTIES

Convergence properties of various diffusion semigroups have been extensively analyzed in the literature under certain isoperimetric assumptions on the stationary distribution π (Bakry et al., 2014). One such property of interest is the logarithmic Sobolev (LS) inequality (Gross, 1975), which we define next. The carré du champ operator Γ of a diffusion semigroup with invariant measure µ is defined using its infinitesimal generator G as Γ(f, g) = (1/2)[G(fg) - fGg - gGf] for every f, g ∈ L²(µ). The carré du champ operator captures fundamental properties of a Markov semigroup that affect its convergence behaviour. One can verify that the Langevin diffusion semigroup's carré du champ operator (on differentiable f, g) is Γ(f, g) = σ²⟨∇f, ∇g⟩. We use the shorthand notation Γ(f) = Γ(f, f) = σ²‖∇f‖².

Definition G.1 (Logarithmic Sobolev Inequality (see Bakry et al. (2014, p. 24))). A distribution with probability density π is said to satisfy a logarithmic Sobolev inequality LS(c) (with respect to Γ in (73)) if for all functions f ∈ L²(µ) with continuous derivatives ∇f, Ent_π(f²) ≤ (1/2c) ∫ (Γ(f²)/f²) π dθ = (2σ²/c) ∫ ‖∇f‖²_2 π dθ, where the entropy Ent_π is defined as Ent_π(f²) = E_π[f² log f²] - E_π[f²] log E_π[f²].

The logarithmic Sobolev inequality is a very non-restrictive assumption and is satisfied by a large class of distributions. The following well-known result shows that Gaussians satisfy the LS inequality.

Lemma 18 (LS inequality of Gaussian distributions (see Bakry et al. (2014, p. 258))). Let ρ be a Gaussian distribution on R^d with covariance (σ²/λ)I_d (i.e., the Gibbs distribution (70) with L(·) being the L2 regularizer r(θ) = (λ/2)‖θ‖²_2). Then ρ satisfies LS(λ) tightly (with respect to Γ in (73)), i.e., Ent_ρ(f²) = (2σ²/λ) ∫ ‖∇f‖²_2 ρ dθ.
Additionally, if µ is a distribution on R^d that satisfies LS(c), then the convolution µ ∗ ρ, defined as the distribution of Θ + Z where Θ ∼ µ and Z ∼ ρ, satisfies the LS inequality with constant (1/c + 1/λ)^{-1}. Bobkov (2007) shows that, like Gaussians, all strongly log-concave distributions (or, more generally, log-concave distributions with finite second-order moments) satisfy the LS inequality (e.g., the Gibbs distribution π with any strongly convex L). The LS inequality can also hold without log-concavity. For example, the LS inequality is stable under Lipschitz maps, although such maps can destroy log-concavity.

Lemma 19 (LS inequality under Lipschitz maps (see Ledoux (2001))). If π is a distribution on R^d that satisfies LS(c), then for any L-Lipschitz map T : R^d → R^d, the pushforward distribution T#π, representing the distribution of T(Θ) when Θ ∼ π, satisfies LS(c/L²).

The LS inequality is also stable under bounded perturbations to the distribution, as shown in the following lemma by Holley & Stroock (1986).

Lemma 20 (LS inequality under bounded perturbations (see Holley & Stroock (1986))). If π is the probability density of a distribution that satisfies LS(c), then any probability distribution with density π' such that 1/√B ≤ π(θ)/π'(θ) ≤ √B everywhere in R^d, for some constant B > 1, satisfies LS(c/B).

The logarithmic Sobolev inequality is of interest to us due to its equivalence to the following inequalities on Kullback-Leibler and Rényi divergences.

Lemma 21 (LS inequality in terms of KL divergence (Vempala & Wibisono, 2019)). The distribution π satisfies the LS(c) inequality (with respect to Γ in (73)) if and only if for all distributions µ on R^d such that µ/π ∈ L²(π) with continuous derivatives ∇(µ/π), KL(µ ‖ π) ≤ (σ²/2c) I(µ ‖ π).

Proof. Set f² in (74) to µ/π to obtain (77). Alternatively, set µ = f²π/E_π[f²] in (77) to obtain (74).

Lemma 22 (Wasserstein distance bound under LS inequality (Otto & Villani, 2000, Theorem 1)).
If a distribution π satisfies the LS(c) inequality (with respect to Γ in (73)), then for all distributions µ on R^d, W_2(µ, π)² ≤ (2σ²/c) KL(µ ‖ π).

Lemma 23 (LS inequality in terms of Rényi divergence (Vempala & Wibisono, 2019)). The distribution π satisfies the LS(c) inequality (with respect to Γ in (73)) if and only if for all distributions µ on R^d such that µ/π ∈ L²(π) with continuous derivatives ∇(µ/π), and any q > 1, R_q(µ ‖ π) + q(q-1)∂_q R_q(µ ‖ π) ≤ (q²σ²/2c) · I_q(µ ‖ π)/E_q(µ ‖ π).

Proof. For brevity, let R(q) = R_q(µ ‖ π), E(q) = E_q(µ ‖ π), and I(q) = I_q(µ ‖ π), and let f²(θ) = (µ(θ)/π(θ))^q. Then, E_π[f²] = E_π[(µ/π)^q] = E(q) (from (34)), and E_π[f² log f²] = E_π[(µ/π)^q log (µ/π)^q] = q E_π[(µ/π)^q log (µ/π)] = q∂_q E_π[(µ/π)^q] = q∂_q E(q) (from the Leibniz rule and (34)). Moreover, E_π[‖∇f‖²_2] = E_π[‖∇(µ/π)^{q/2}‖²_2] = (q²/4) I(q) (from (40)). On substituting the above equalities into (74), we get: Ent_π(f²) ≤ (2σ²/c) E_π[‖∇f‖²_2] ⟺ q∂_q E(q) - E(q) log E(q) ≤ (q²σ²/2c) I(q) ⟺ q∂_q log E(q) - log E(q) ≤ (q²σ²/2c) · I(q)/E(q) ⟺ q∂_q((q-1)R(q)) - (q-1)R(q) ≤ (q²σ²/2c) · I(q)/E(q) (from (34)) ⟺ R(q) + q(q-1)∂_q R(q) ≤ (q²σ²/2c) · I(q)/E(q).
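The tightness claimed in Lemma 18 can be checked numerically: for a one-dimensional Gaussian, exponential test functions achieve equality in the LS inequality. The Monte-Carlo sketch below uses illustrative constants and the test function f(t) = e^{at/2}, for which both sides of the inequality equal (a²s²/2)e^{a²s²/2} with s² = σ²/λ (a standard computation, assumed here):

```python
import numpy as np

# Monte-Carlo check of Lemma 18 for the 1-d Gaussian rho = N(0, sigma^2/lam):
#   Ent_rho(f^2) <= (2 sigma^2 / lam) * E_rho ||grad f||^2,
# with equality for the exponential test function f(t) = exp(a t / 2).
# Constants and the test function are illustrative choices.
rng = np.random.default_rng(1)
sigma2, lam, a, m = 0.5, 2.0, 0.7, 1_000_000
s2 = sigma2 / lam
theta = rng.normal(0.0, np.sqrt(s2), size=m)

f2 = np.exp(a * theta)                     # f^2 = e^{a t}
ent = np.mean(f2 * np.log(f2)) - np.mean(f2) * np.log(np.mean(f2))
grad_sq = (a / 2) ** 2 * f2                # ||grad f||^2 = (a/2)^2 e^{a t}
rhs = (2 * sigma2 / lam) * np.mean(grad_sq)

# Analytically both sides equal (a^2 s2 / 2) * exp(a^2 s2 / 2).
exact = (a ** 2 * s2 / 2) * np.exp(a ** 2 * s2 / 2)
```

Both `ent` and `rhs` agree with `exact` up to Monte-Carlo error, illustrating that the constant λ in Lemma 18 cannot be improved.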

G.3 (RÉNYI) DIFFERENTIAL PRIVACY GUARANTEES ON NOISY-GD

In this section, we recap the differential privacy bounds from the literature for the Noisy-GD Algorithm 1.

Theorem 24 (Rényi DP guarantee for Noisy-GD Algorithm 1). If ℓ(θ; x) is L-Lipschitz, then Noisy-GD satisfies (q, ε)-RDP with ε = (qL²/σ²n²) · ηK.

Proof. The L2 sensitivity of the gradient ∇L_D(θ) := (1/n) Σ_{x∈D} ∇ℓ(θ; x) + ∇r(θ) computed in step 2 of Algorithm 1 for neighboring databases in X^n that differ in a single record is 2L/n, since ℓ(θ; x) is L-Lipschitz. Conditioned on observing the intermediate model Θ_{ηk} = θ_k at step k, the next model Θ_{η(k+1)} after the noisy gradient update is a Gaussian mechanism with noise variance 2σ²/η. So, for neighboring databases D, D' ∈ X^n, we have from the Rényi DP bound for Gaussian mechanisms by Mironov (2017, Proposition 7) that R_q(Θ_{η(k+1)}|Θ_{ηk}=θ_k ‖ Θ'_{η(k+1)}|Θ'_{ηk}=θ_k) ≤ ηqL²/(n²σ²), where (Θ_{ηk})_{0≤k≤K} and (Θ'_{ηk})_{0≤k≤K} are the intermediate parameters in Algorithm 1 when run on databases D and D' respectively. Finally, from Rényi composition (Mironov, 2017, Proposition 1), we have R_q(Θ_{ηK} ‖ Θ'_{ηK}) ≤ R_q((Θ_0, Θ_η, …, Θ_{ηK}) ‖ (Θ'_0, Θ'_η, …, Θ'_{ηK})) ≤ Σ_{k=0}^{K-1} R_q(Θ_{η(k+1)}|Θ_{ηk}=θ_k ‖ Θ'_{η(k+1)}|Θ'_{ηk}=θ_k) ≤ (qL²/n²σ²) · ηK.

Remark 5. Different papers discussing Noisy-GD variants adopt different notational conventions for the total noise added to the gradients. The noise variance in our Algorithm 1 is 2ησ²; it is η²σ̄²L²/n² in the full-batch setting of DP-SGD by Abadi et al. (2016). To translate the bound in Theorem 24, one can simply rescale σ across the different conventions to have the same noise variance, i.e., 2ησ² = η²σ̄²L²/n². Our Theorem 24 essentially matches Abadi et al. (2016)'s (ε, δ)-DP bound. To verify this, note from the Rényi divergence to (ε, δ)-indistinguishability conversion discussed in Remark 1 that (1 + (2/ε) log(1/δ), ε/2)-Rényi DP implies (ε, δ)-DP.
So, setting the bound in Theorem 24 to be smaller than ε/2 and substituting q = 1 + (2/ε) log(1/δ), we get (1 + (2/ε) log(1/δ)) · (L²/n²σ²) · ηK ≤ ε/2 ⟺ σ ≥ (L/nε) √(2ηK(ε + 2 log(1/δ))). For ε ≤ 2 log(1/δ), we get the same noise bound as in Abadi et al. (2016, Theorem 1) for their (full-batch) DP-SGD algorithm. Next, we recap the tighter Rényi DP guarantee of Chourasia et al. (2021) under stronger assumptions on the loss function.

Theorem 25 (Rényi DP guarantee for Noisy-GD Algorithm 1 (Chourasia et al., 2021)). If ℓ(θ; x) is convex, L-Lipschitz, and β-smooth, and r(θ) is the L2 regularizer with constant λ, then Noisy-GD with learning rate η < 1/(β+λ) satisfies (q, ε)-RDP with ε = (4qL²/λσ²n²)(1 - e^{-ληK/2}).
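The qualitative difference between the two recapped guarantees is easy to see numerically: the generic bound of Theorem 24 grows linearly in K, while the convex-case bound of Theorem 25 saturates at 4qL²/(λσ²n²). The sketch below plugs illustrative constants into both formulas:

```python
import math

# Compare the two RDP bounds for Noisy-GD recapped above:
#   Theorem 24: eps = q L^2 / (sigma^2 n^2) * eta K        (linear in K)
#   Theorem 25: eps = 4 q L^2 / (lam sigma^2 n^2)
#                     * (1 - exp(-lam eta K / 2))          (saturating)
# All constants are illustrative.
q, L, lam, beta, n, sigma2 = 2.0, 1.0, 0.1, 1.0, 1000, 4.0
eta = 0.5 / (beta + lam)   # eta < 1/(beta + lam), as Theorem 25 requires

def eps_linear(K):
    return q * L ** 2 / (sigma2 * n ** 2) * eta * K

def eps_convex(K):
    return (4 * q * L ** 2 / (lam * sigma2 * n ** 2)
            * (1 - math.exp(-lam * eta * K / 2)))

for K in (10, 1000, 100000):
    print(K, eps_linear(K), eps_convex(K))
```

For small K the linear bound is smaller, but for long runs the convex bound stays below its ceiling 4qL²/(λσ²n²) while the linear bound grows without limit, which is exactly why Theorem 25 is the tighter guarantee in the regime this paper uses.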

G.4 PROOFS FOR SUBSECTION 5.1

In this appendix, we provide a proof of our Theorem 6, which applies to convex losses ℓ(θ; x) under the L2 regularizer r(θ). Let D_0 ∈ X^n be any arbitrary database, and let Q be any non-adaptive r-requester. Our first goal in this section is to prove a (q, ε_dd)-data-deletion guarantee on our proposed algorithm pair (A_Noisy-GD, Ā_Noisy-GD) (in Definition 5.1) under Q. That is, if (Θ̃_i)_{i≥0} is the sequence of models produced by the interaction between (A_Noisy-GD, Ā_Noisy-GD, Q) on D_0, we need to show that there exists a mapping π^Q_i such that for all i ≥ 1 and any u_i ∈ U_r, R_q(Ā(D_{i-1}, u_i, Θ̃_{i-1}) ‖ π^Q_i(D_0 ∘ ⟨ind, y⟩)) ≤ ε_dd for all ⟨ind, y⟩ ∈ u_i. For an arbitrary replacement operation ⟨ind, y⟩ in u_i, we define the map π^Q_i(D_0 ∘ ⟨ind, y⟩) = Θ̃'_i, where the model sequence (Θ̃'_i)_{i≥0} is produced by the interaction between (A_Noisy-GD, Ā_Noisy-GD, Q) on D_0 ∘ ⟨ind, y⟩. Since a non-adaptive requester Q is equivalent to fixing the edit sequence (u_i)_{i≥1} a priori, note that showing the data-deletion guarantee reduces to proving the following DP-like bound: R_q(Ā(D_{i-1}, u_i, Θ̃_{i-1}) ‖ Ā(D'_{i-1}, u_i, Θ̃'_{i-1})) ≤ ε_dd, for all u_{≤i} and for all neighbouring databases D_0, D'_0 such that D'_0 = D_0 ∘ ⟨ind, y⟩ with ⟨ind, y⟩ ∈ u_i. Note from our Definition 5.1 that the sequence of models (Θ̃_0, …, Θ̃_i) can be seen as being generated from a continuous run of Noisy-GD, where: 1. for iterations 0 ≤ k < K_A, the loss function is L_{D_0}; 2. for iterations K_A + (j-1)K_Ā ≤ k < K_A + jK_Ā, for any 1 ≤ j ≤ i-1, the loss function is L_{D_j}; and 3. for iterations K_A + (i-1)K_Ā ≤ k < K_A + iK_Ā, the loss function is L_{D_{i-1}∘u_i}. Let (Θ_{ηk})_{0≤k≤K_A+iK_Ā} be the sequence of intermediate parameters of this extended Noisy-GD run. Similarly, let (Θ'_{ηk})_{k≥0} be the parameter sequence corresponding to the extended run on the neighbouring database D'_0.
Since ⟨ind, y⟩ ∈ u_i, note from the construction that D_{i-1} ∘ u_i = D'_{i-1} ∘ u_i, meaning that the loss functions while processing request u_i are identical for the two processes, i.e., L_{D_{i-1}∘u_i} = L_{D'_{i-1}∘u_i}. For brevity, we refer to the databases seen in iteration k of the two respective extended runs as D(k) and D'(k). In short, these two discrete processes induced by Noisy-GD follow the update rule, for any 0 ≤ k < K_A + iK_Ā: Θ_{η(k+1)} = Θ_{ηk} - η∇L_{D(k)}(Θ_{ηk}) + √(2ησ²) Z_k and Θ'_{η(k+1)} = Θ'_{ηk} - η∇L_{D'(k)}(Θ'_{ηk}) + √(2ησ²) Z'_k, where Z_k, Z'_k ∼ N(0, I_d), and Θ_0 and Θ'_0 are sampled from the same weight initialization distribution ρ. To prove the bound in (82), we follow the approach proposed in Chourasia et al. (2021) of interpolating the two discrete stochastic processes of Noisy-GD with two piecewise-continuous tracing diffusions Θ_t and Θ'_t in the duration ηk < t ≤ η(k+1), defined as follows: Θ_t = T_k(Θ_{ηk}) - ((t-ηk)/2) ∇[L_{D(k)}(Θ_{ηk}) - L_{D'(k)}(Θ_{ηk})] + √(2σ²)(Z_t - Z_{ηk}), and Θ'_t = T_k(Θ'_{ηk}) + ((t-ηk)/2) ∇[L_{D(k)}(Θ'_{ηk}) - L_{D'(k)}(Θ'_{ηk})] + √(2σ²)(Z'_t - Z'_{ηk}), where Z_t, Z'_t are two independent Wiener processes, and T_k is a map on R^d defined as T_k = I_d - (η/2) ∇[L_{D(k)} + L_{D'(k)}]. Note that equation (84) is identical to (83) when t = η(k+1), and can be expressed by the following stochastic differential equations (SDEs): dΘ_t = -g_k(Θ_{ηk})dt + √(2σ²) dZ_t and dΘ'_t = +g_k(Θ'_{ηk})dt + √(2σ²) dZ'_t, where g_k(Θ) = (1/2n) ∇[ℓ(Θ; D(k)[ind]) - ℓ(Θ; D'(k)[ind])], (86) with initial conditions lim_{t→ηk+} Θ_t = T_k(Θ_{ηk}) and lim_{t→ηk+} Θ'_t = T_k(Θ'_{ηk}). These two SDEs can be equivalently described by the following pair of Fokker-Planck equations.

Lemma 26 (Fokker-Planck equation for SDE (86)).
The Fokker-Planck equations for the SDEs in (86) at time ηk < t ≤ η(k+1) are ∂_t µ_t(θ) = div(µ_t(θ) E[g_k(Θ_{ηk})|Θ_t = θ]) + σ²Δµ_t(θ) and ∂_t µ'_t(θ) = div(µ'_t(θ) E[-g_k(Θ'_{ηk})|Θ'_t = θ]) + σ²Δµ'_t(θ), where µ_t and µ'_t are the densities of Θ_t and Θ'_t respectively.

Lemma 28. If ℓ(θ; x) is a twice continuously differentiable, convex, and β-smooth loss function and the regularizer is r(θ) = (λ/2)‖θ‖²_2, then the map T_k defined in (85) is: 1. a differentiable bijection for any η < 1/(λ+β), and 2. (1-ηλ)-Lipschitz for any η ≤ 2/(2λ+β).

Proof. Differentiable bijection. To see that T_k is injective, assume T_k(θ) = T_k(θ') for some θ, θ' ∈ R^d. Then, by the (β+λ)-smoothness of L := (L_{D(k)} + L_{D'(k)})/2, ‖θ - θ'‖_2 = ‖T_k(θ) + η∇L(θ) - T_k(θ') - η∇L(θ')‖_2 = η‖∇L(θ) - ∇L(θ')‖_2 ≤ η(λ+β)‖θ - θ'‖_2. Since η < 1/(λ+β), we must have ‖θ - θ'‖_2 = 0. To show that T_k is surjective, consider the proximal mapping prox_L(θ) = argmin_{θ'∈R^d} ‖θ - θ'‖²_2/2 - ηL(θ'). Note that the objective of prox_L(·) is strongly convex for η < 1/(λ+β). Therefore, from the KKT conditions, we have θ = prox_L(θ) - η∇L(prox_L(θ)) = T_k(prox_L(θ)). Differentiability of T_k follows from the twice continuous differentiability assumption on ℓ(θ; x). Lipschitzness. Let L := (L_{D(k)} + L_{D'(k)})/2. For any θ, θ' ∈ R^d, ‖T_k(θ) - T_k(θ')‖²_2 = ‖θ - η∇L(θ) - θ' + η∇L(θ')‖²_2 = ‖θ - θ'‖²_2 + η²‖∇L(θ) - ∇L(θ')‖²_2 - 2η⟨θ - θ', ∇L(θ) - ∇L(θ')⟩. We define the function g(θ) = L(θ) - (λ/2)‖θ‖²_2, which is convex and β-smooth. By the co-coercivity property of convex and β-smooth functions, we have ⟨θ - θ', ∇g(θ) - ∇g(θ')⟩ ≥ (1/β)‖∇g(θ) - ∇g(θ')‖²_2 ⟹ ⟨θ - θ', ∇L(θ) - ∇L(θ')⟩ - λ‖θ - θ'‖²_2 ≥ (1/β)[‖∇L(θ) - ∇L(θ')‖²_2 + λ²‖θ - θ'‖²_2 - 2λ⟨θ - θ', ∇L(θ) - ∇L(θ')⟩] ⟹ ⟨θ - θ', ∇L(θ) - ∇L(θ')⟩ ≥ (1/(2λ+β))‖∇L(θ) - ∇L(θ')‖²_2 + (λ(λ+β)/(2λ+β))‖θ - θ'‖²_2.
Substituting this into the above inequality, and noting that η ≤ 2/(2λ+β), we get ‖T_k(θ) - T_k(θ')‖²_2 ≤ (1 - 2ηλ(λ+β)/(2λ+β))‖θ - θ'‖²_2 + (η² - 2η/(β+2λ))‖∇L(θ) - ∇L(θ')‖²_2 ≤ (1 - 2ηλ(λ+β)/(2λ+β))‖θ - θ'‖²_2 + (η²λ² - 2ηλ²/(β+2λ))‖θ - θ'‖²_2 = (1-ηλ)²‖θ - θ'‖²_2.

The second set of lemmas presented below describes how R_q(Θ_t ‖ Θ'_t) evolves with time in both phases I and II. Central to our analysis is the following lemma, which bounds the rate of change of the Rényi divergence for any pair of diffusion processes characterized by their Fokker-Planck equations.

Lemma 29 (Rate of change of Rényi divergence (Chourasia et al., 2021)). Let V_t, V'_t : R^d → R^d be two time-dependent vector fields such that max_{θ∈R^d} ‖V_t(θ) - V'_t(θ)‖_2 ≤ L for all θ ∈ R^d and t ≥ 0. For diffusion processes (Θ_t)_{t≥0} and (Θ'_t)_{t≥0} defined by the Fokker-Planck equations ∂_t µ_t(θ) = div(µ_t(θ)V_t(θ)) + σ²Δµ_t(θ) and ∂_t µ'_t(θ) = div(µ'_t(θ)V'_t(θ)) + σ²Δµ'_t(θ), respectively, where µ_t and µ'_t are the densities of Θ_t and Θ'_t, the rate of change of the Rényi divergence between the two at any t ≥ 0 is upper bounded as ∂_t R_q(µ_t ‖ µ'_t) ≤ qL²/(2σ²) - (qσ²/2) · I_q(µ_t ‖ µ'_t)/E_q(µ_t ‖ µ'_t).

We will apply the above lemma to the Fokker-Planck equations (87) of our pair of tracing diffusion SDEs (84) and solve the resulting differential inequality to prove the bound in (82). To assist our proof, we rely on the following lemma showing that our two tracing diffusions satisfy the LS inequality described in Definition G.1, which enables the use of inequality (79) in Lemma 23.

Lemma 30. If the loss ℓ(θ; x) is convex and β-smooth, the regularizer is r(θ) = (λ/2)‖θ‖²_2, and the learning rate is η ≤ 2/(2λ+β), then the tracing diffusions (Θ_t)_{0≤t≤η(K_A+iK_Ā)} and (Θ'_t)_{0≤t≤η(K_A+iK_Ā)} defined in (84), with Θ_0, Θ'_0 ∼ ρ = N(0, σ²/(λ(1-ηλ/2)) I_d), satisfy the LS inequality with constant λ(1-ηλ/2).

Proof.
For any iteration 0 ≤ k < K_A + iK_Ā in the extended run of Noisy-GD, and any 0 ≤ s ≤ η, let us define two functions L_s, L'_s : R^d → R as follows: L_s = ((1 + s/η)/2) L_{D(k)} + ((1 - s/η)/2) L_{D'(k)}, and L'_s = ((1 - s/η)/2) L_{D(k)} + ((1 + s/η)/2) L_{D'(k)}. Since r(·) is the L2(λ) regularizer and ℓ(θ; x) is convex and β-smooth, both L_s and L'_s are λ-strongly convex and (λ+β)-smooth for all 0 ≤ s ≤ η and any k. We define maps T_s, T'_s : R^d → R^d as T_s(θ) = θ - η∇L_s(θ) and T'_s(θ) = θ - η∇L'_s(θ). By an argument similar to Lemma 28, both T_s and T'_s are (1-ηλ)-Lipschitz for learning rate η ≤ 2/(2λ+β). Note that the densities of Θ_t and Θ'_t of the tracing diffusions for t = ηk + s can be respectively expressed as µ_t = T_s#(µ_{ηk}) ∗ N(0, 2sσ² I_d) and µ'_t = T'_s#(µ'_{ηk}) ∗ N(0, 2sσ² I_d), where µ_{ηk} and µ'_{ηk} represent the distributions of Θ_{ηk} and Θ'_{ηk}. We prove the lemma via induction. Base step: Since Θ_0, Θ'_0 are both Gaussian distributed with variance σ²/(λ(1-ηλ/2)), from Lemma 18 they satisfy the LS inequality with constant λ(1-ηλ/2). Induction step: Suppose µ_{ηk} and µ'_{ηk} satisfy the LS inequality with constant λ(1-ηλ/2). Since equation (95) shows that µ_t, µ'_t are both Gaussian convolutions of a pushforward of µ_{ηk}, µ'_{ηk} respectively under a (1-ηλ)-Lipschitz map, from Lemma 18 and Lemma 19, both µ_t, µ'_t satisfy the LS inequality with constant [(1-ηλ)²/(λ(1-ηλ/2)) + 2s]^{-1} ≥ λ(1-ηλ/2) · [(1-ηλ)² + λη(2-ηλ)]^{-1} = λ(1-ηλ/2), for all ηk ≤ t ≤ η(k+1).

We are now ready to prove the data-deletion bound in (82).

Theorem 31 (Data-deletion guarantee on (A_Noisy-GD, Ā_Noisy-GD) under convexity). Let the weight initialization distribution be ρ = N(0, σ²/(λ(1-ηλ/2)) I_d), the loss function ℓ(θ; x) be convex, β-smooth, and L-Lipschitz, the regularizer be r(θ) = (λ/2)‖θ‖²_2, and the learning rate be η < 1/(λ+β).
Then the algorithm pair (A, Ā) satisfies a (q, ε_dd)-data-deletion guarantee under all non-adaptive r-requesters for any noise variance σ² > 0 and K_A ≥ 0 if K_Ā ≥ (2/ηλ) log(4qL²/(λε_dd σ²n²)).

Proof. Following the preceding discussion, to prove this theorem it suffices to show that inequality (82) holds under the stated conditions. Consider the Fokker-Planck equations described in Lemma 26 for the pair of tracing diffusion SDEs in (86): at any time t in the duration ηk < t ≤ η(k+1) for any iteration 0 ≤ k < K_A + iK_Ā, ∂_t µ_t(θ) = div(µ_t(θ) E[g_k(Θ_{ηk})|Θ_t = θ]) + σ²Δµ_t(θ) and ∂_t µ'_t(θ) = div(µ'_t(θ) E[-g_k(Θ'_{ηk})|Θ'_t = θ]) + σ²Δµ'_t(θ), where µ_t and µ'_t are the distributions of Θ_t and Θ'_t. Since ℓ(θ; x) is L-Lipschitz, and for any K_A + (i-1)K_Ā ≤ k < K_A + iK_Ā we have D(k)[ind] = D'(k)[ind], note from the definition of g_k(θ) in (86) that ‖E[g_k(Θ_{ηk})|Θ_t = θ] - E[-g_k(Θ'_{ηk})|Θ'_t = θ]‖_2 ≤ 2L/n if k < K_A + (i-1)K_Ā, and 0 otherwise. Therefore, applying Lemma 29 to the above pair of Fokker-Planck equations gives that for any t in the duration ηk < t ≤ η(k+1), ∂_t R_q(µ_t ‖ µ'_t) ≤ (2qL²/σ²n²) 1{t ≤ η(K_A + (i-1)K_Ā)} - (qσ²/2) · I_q(µ_t ‖ µ'_t)/E_q(µ_t ‖ µ'_t). Equation (100) suggests a phase change in the dynamics at iteration k = K_A + (i-1)K_Ā. In phase I, the divergence bound increases with time due to the effect of the differing record in the database pairs (D_j, D'_j)_{0≤j≤i-1}. In phase II, however, the update request u_i makes D_{i-1} ∘ u_i = D'_{i-1} ∘ u_i, and so running gradient descent rapidly shrinks the divergence bound. This phase change is illustrated in Figure 1. For brevity, we denote R(q, t) = R_q(µ_t ‖ µ'_t). Since η < 1/(λ+β) < 2/(2λ+β), from Lemma 30 the distribution µ'_t satisfies the LS inequality with constant λ(1-λη/2). So, we can apply Lemma 23 to simplify the above partial differential inequality as follows: ∂_t R(q, t) + λ(1-λη/2)[R(q, t)/q + (q-1)∂_q R(q, t)] ≤ (2qL²/σ²n²) 1{t ≤ η(K_A + (i-1)K_Ā)}.
(101) For brevity, let c_1 = λ(1-λη/2) and c_2 = 2L²/(σ²n²). We define u(q, t) = R(q, t)/q. Then, ∂_t R(q, t) + c_1[R(q, t)/q + (q-1)∂_q R(q, t)] ≤ c_2 q · 1{t ≤ η(K_A + (i-1)K_Ā)} ⟹ ∂_t u(q, t) + c_1 u(q, t) + c_1(q-1)∂_q u(q, t) ≤ c_2 · 1{t ≤ η(K_A + (i-1)K_Ā)}. For some constant q > 1, let q(s) = (q-1) exp[c_1{s - η(K_A + iK_Ā)}] + 1 and t(s) = s. Note that dq(s)/ds = c_1(q(s) - 1) and dt(s)/ds = 1. Therefore, for any ηk < s ≤ η(k+1), the differential inequality followed along the path u(s) = u(q(s), t(s)) is du(s)/ds + c_1 u(s) ≤ c_2 · 1{s ≤ η(K_A + (i-1)K_Ā)} (102) ⟹ (d/ds){e^{c_1 s} u(s)} ≤ c_2 e^{c_1 s} · 1{s ≤ η(K_A + (i-1)K_Ā)}. Since the map T_k(·) in (85) is a differentiable bijection for η < 1/(λ+β) as per Lemma 28, note that Lemma 27 implies that lim_{s→ηk+} u(s) = u(ηk). Therefore, we can directly integrate over the duration 0 ≤ t ≤ η(K_A + iK_Ā) to get [e^{c_1 s} u(s)]_0^{η(K_A+iK_Ā)} ≤ ∫_0^{η(K_A+(i-1)K_Ā)} c_2 e^{c_1 s} ds ⟹ e^{c_1 η(K_A+iK_Ā)} u(η(K_A + iK_Ā)) - u(0) ≤ (c_2/c_1)[e^{c_1 η(K_A+(i-1)K_Ā)} - 1] ⟹ u(η(K_A + iK_Ā)) ≤ (c_2/c_1) e^{-c_1 ηK_Ā}. (Since u(0) = R(q(0), 0)/q(0) = 0.) Noting that q(0) ≥ 1, on reverting the substitution we get R_q(µ_{η(K_A+iK_Ā)} ‖ µ'_{η(K_A+iK_Ā)}) ≤ (2qL²/(λσ²n²(1-ηλ/2))) exp(-ηλK_Ā(1-ηλ/2)) ≤ (4qL²/λσ²n²) exp(-ηλK_Ā/2). (Since η < 1/(λ+β).) Recall from our construction that µ_{η(K_A+iK_Ā)} and µ'_{η(K_A+iK_Ā)} are the distributions of Ā(D_{i-1}, u_i, Θ̃_{i-1}) and Ā(D'_{i-1}, u_i, Θ̃'_{i-1}) respectively. Therefore, choosing K_Ā as specified in the theorem statement concludes the proof.

Our next goal in this section is to provide utility guarantees for the algorithm pair (A_Noisy-GD, Ā_Noisy-GD) in the form of excess empirical risk bounds. For that, we first introduce some additional auxiliary results. The following Lemma 32 shows that the excess empirical risk does not increase too much on replacing r records in a database, and Lemma 33 provides a convergence guarantee on the excess empirical risk of the Noisy-GD algorithm under convexity.

Lemma 32.
Suppose the loss function ℓ(θ; x) is convex, L-Lipschitz, and β-smooth, and the regularizer is r(θ) = (λ/2)‖θ‖²_2. Then, the excess empirical risk of any randomly distributed parameter Θ on any database D ∈ X^n after applying any edit request u ∈ U_r that modifies no more than r records is bounded as err(Θ; D ∘ u) ≤ (1 + β/λ)(2 err(Θ; D) + 16r²L²/(λn²)).

Proof. Let θ*_D and θ*_{D∘u} be the minimizers of the objectives L_D(·) and L_{D∘u}(·) as defined in (??). From the λ-strong convexity of L_D, L_D(θ*_{D∘u}) - L_D(θ*_D) ≥ (λ/2)‖θ*_{D∘u} - θ*_D‖²_2. From the optimality of θ*_{D∘u} and the L-Lipschitzness of ℓ(θ; x), we have L_D(θ*_{D∘u}) = L_{D∘u}(θ*_{D∘u}) + (1/n)[Σ_{x∈D} ℓ(θ*_{D∘u}; x) - Σ_{x∈D∘u} ℓ(θ*_{D∘u}; x)] ≤ L_{D∘u}(θ*_D) + (1/n)[Σ_{x∈D} ℓ(θ*_{D∘u}; x) - Σ_{x∈D∘u} ℓ(θ*_{D∘u}; x)] = L_D(θ*_D) + (1/n)Σ_{x∈D}(ℓ(θ*_{D∘u}; x) - ℓ(θ*_D; x)) + (1/n)Σ_{x∈D∘u}(ℓ(θ*_D; x) - ℓ(θ*_{D∘u}; x)) ≤ L_D(θ*_D) + (2rL/n)‖θ*_{D∘u} - θ*_D‖_2. Combining the two inequalities gives ‖θ*_{D∘u} - θ*_D‖_2 ≤ 4rL/(λn). Therefore, from the (λ+β)-smoothness of L_{D∘u} and the λ-strong convexity of L_D, we have err(Θ; D∘u) = E[L_{D∘u}(Θ) - L_{D∘u}(θ*_{D∘u})] ≤ ((λ+β)/2) E‖Θ - θ*_{D∘u}‖²_2 ≤ (λ+β)[E‖Θ - θ*_D‖²_2 + ‖θ*_D - θ*_{D∘u}‖²_2] ≤ (1 + β/λ)(2E[L_D(Θ) - L_D(θ*_D)] + 16r²L²/(λn²)).

Lemma 33 (Accuracy of Noisy-GD). For a convex, L-Lipschitz, and β-smooth loss function ℓ(θ; x) and regularizer r(θ) = (λ/2)‖θ‖²_2, if the learning rate η < 1/(λ+β), the excess empirical risk of Θ_{ηK} = Noisy-GD(D, Θ_0, K) for any D ∈ X^n is bounded as err(Θ_{ηK}; D) ≤ err(Θ_0; D)e^{-ληK/2} + (1 + β/λ)dσ².

Proof. Let Θ_{ηk} denote the k-th iterate of the Noisy-GD run. Recall that the (k+1)-th noisy gradient update step is Θ_{η(k+1)} = Θ_{ηk} - η∇L_D(Θ_{ηk}) + √(2ησ²) Z_k.
From the (β+λ)-smoothness of L_D, we have L_D(Θ_{η(k+1)}) ≤ L_D(Θ_{ηk}) + ⟨∇L_D(Θ_{ηk}), Θ_{η(k+1)} - Θ_{ηk}⟩ + ((β+λ)/2)‖Θ_{η(k+1)} - Θ_{ηk}‖²_2 = L_D(Θ_{ηk}) - η‖∇L_D(Θ_{ηk})‖²_2 + √(2ησ²)⟨∇L_D(Θ_{ηk}), Z_k⟩ + (η²(β+λ)/2)‖∇L_D(Θ_{ηk})‖²_2 + ησ²(β+λ)‖Z_k‖²_2 - η√(2ησ²)(β+λ)⟨∇L_D(Θ_{ηk}), Z_k⟩. On taking expectations over the joint distribution of Θ_{ηk}, Θ_{η(k+1)}, Z_k, the above simplifies to E[L_D(Θ_{η(k+1)})] ≤ E[L_D(Θ_{ηk})] - η(1 - η(λ+β)/2) E‖∇L_D(Θ_{ηk})‖²_2 + ηdσ²(β+λ). By the λ-strong convexity of L_D, the gradient satisfies ‖∇L_D(θ)‖²_2 ≥ 2λ(L_D(θ) - L_D(θ*_D)). Let γ = λη(2 - η(λ+β)). Plugging this into the above inequality and subtracting L_D(θ*_D) from both sides, for η < 1/(λ+β), we get E[L_D(Θ_{η(k+1)})] - L_D(θ*_D) ≤ (1-γ)E[L_D(Θ_{ηk}) - L_D(θ*_D)] + ηdσ²(β+λ) ≤ (1-γ)^{k+1} E[L_D(Θ_0) - L_D(θ*_D)] + ηdσ²(β+λ)(1 + ⋯ + (1-γ)^{k+1}) ≤ e^{-γ(k+1)/2} E[L_D(Θ_0) - L_D(θ*_D)] + ηdσ²(β+λ)/γ. For η < 1/(λ+β), note that γ ≥ λη, and so err(Θ_{ηK}; D) ≤ err(Θ_0; D)e^{-ληK/2} + (1 + β/λ)dσ².

Finally, we are ready to prove our main Theorem 6, showing that the algorithm pair (A_Noisy-GD, Ā_Noisy-GD) solves the data-deletion problem as described in Section 4. We combine the RDP guarantee in Theorem 25 with the non-adaptive data-deletion guarantee in Theorem 31, and prove the excess empirical risk bound using Lemma 33 and Lemma 32.

Theorem 6 (Accuracy, privacy, deletion, and computation tradeoffs). Let constants λ, β, L > 0, q > 1, and 0 < ε_dd ≤ ε_dp. Define the constant κ = (λ+β)/λ. Let the loss function ℓ(θ; x) be twice differentiable, convex, L-Lipschitz, and β-smooth, and the regularizer be r(θ) = (λ/2)‖θ‖²_2. If the learning rate is η = 1/(2(λ+β)), the gradient noise variance is σ² = 4qL²/(λε_dp n²), and the weight initialization distribution is ρ = N(0, σ²/(λ(1-ηλ/2)) I_d), then (1.) both A_Noisy-GD and Ā_Noisy-GD are (q, ε_dp)-RDP for any K_A, K_Ā ≥ 0, (2.) the pair (A_Noisy-GD, Ā_Noisy-GD) satisfies (q, ε_dd)-data-deletion under all non-adaptive r-requesters if K_Ā ≥ 4κ log(ε_dp/ε_dd), (3.)
and all models in (Θ̃_i)_{i≥0} produced by (A_Noisy-GD, Ā_Noisy-GD, Q) on any D_0 ∈ X^n, where Q is any r-requester, have an excess empirical risk err(Θ̃_i; D_i) = O(qd/(ε_dp n²)) if K_A ≥ 4κ log(ε_dp n²/(4qd)) and K_Ā ≥ 4κ log max{5κ, 8ε_dp r²/(qd)}. (113)

Proof. (1.) Privacy. By Theorem 25, Noisy-GD with K iterations is (q, ε_dp)-RDP for the stated choice of loss function, regularizer, and learning rate as long as σ² ≥ (4qL²/(λε_dp n²))(1 - e^{-ληK/2}). Therefore, if we set σ² = 4qL²/(λε_dp n²), Noisy-GD is (q, ε_dp)-RDP for any K. For the same σ², both A_Noisy-GD and Ā_Noisy-GD are also (q, ε_dp)-RDP for any K_A and K_Ā, as they run Noisy-GD on the respective databases to generate their output.

(2.) Deletion. By Theorem 31, for the stated choice of loss function, regularizer, learning rate, and weight initialization distribution, the algorithm pair (A_Noisy-GD, Ā_Noisy-GD) satisfies (q, ε_dd)-data-deletion under all non-adaptive r-requesters Q if K_Ā ≥ (2/ηλ) log(4qL²/(λε_dd σ²n²)). By plugging in σ² = 4qL²/(λε_dp n²) and η = 1/(2(λ+β)), this constraint simplifies to K_Ā ≥ 4κ log(ε_dp/ε_dd).

(3.) Accuracy. We prove the induction hypothesis that, under the conditions stated in the theorem, err(Θ̃_i; D_i) ≤ 10κqdL²/(λε_dp n²) for all i ≥ 0. Base case: The minimizer θ*_{D_0} of L_{D_0} satisfies ∇L_{D_0}(θ*_{D_0}) = (1/n)Σ_{x∈D_0} ∇ℓ(θ*_{D_0}; x) + λθ*_{D_0} = 0 ⟹ ‖θ*_{D_0}‖_2 ≤ L/λ. (114) As a result, the excess empirical risk of the initialization weights Θ_0 ∼ ρ = N(0, σ²/(λ(1-ηλ/2)) I_d) on L_{D_0} is bounded as err(Θ_0; D_0) = E[L_{D_0}(Θ_0) - L_{D_0}(θ*_{D_0})] ≤ ((λ+β)/2) E‖Θ_0 - θ*_{D_0}‖²_2 (from the (λ+β)-smoothness of L_{D_0}) = ((λ+β)/2)[‖θ*_{D_0}‖²_2 + E‖Θ_0‖²_2 - 2E⟨θ*_{D_0}, Θ_0⟩] ≤ (1 + β/λ)(L²/(2λ) + σ²d/(2-λη)) (from (114) and E‖Z‖²_2 = d if Z ∼ N(0, I_d)) ≤ κ(L²/(2λ) + dσ²). Since Θ̃_0 = A_Noisy-GD(D_0) = Noisy-GD(D_0, Θ_0, K_A), by Lemma 33, running K_A ≥ 4κ log(ε_dp n²/(4qd)) iterations gives err(Θ̃_0; D_0) ≤ err(Θ_0; D_0)e^{-ληK_A/2} + κdσ² ≤ κ(L²/(2λ) + dσ²)(4qd/(ε_dp n²)) + 4κqdL²/(λε_dp n²) ≤ 10κqdL²/(λε_dp n²). Induction step: Assume that err(Θ̃_{i-1}; D_{i-1}) ≤ 10κqdL²/(λε_dp n²). Since Θ̃_i = Ā_Noisy-GD(D_{i-1}, u_i, Θ̃_{i-1}) = Noisy-GD(D_i, Θ̃_{i-1}, K_Ā), by Lemma 33 and Lemma 32, running K_Ā ≥ 4κ log max{5κ, 8ε_dp r²/(qd)} iterations gives err(Θ̃_i; D_i) ≤ κ(2 err(Θ̃_{i-1}; D_{i-1}) + 16r²L²/(λn²)) e^{-ληK_Ā/2} + κdσ² ≤ (20κ²qdL²/(λε_dp n²) + 16κr²L²/(λn²)) e^{-ληK_Ā/2} + 4κqdL²/(λε_dp n²) (substituting σ²) ≤ 16κr²L²/(λn²) e^{-ληK_Ā/2} + 8κqdL²/(λε_dp n²) (since K_Ā ≥ 4κ log(5κ)) ≤ 10κqdL²/(λε_dp n²). (since K_Ā ≥ 4κ log(8ε_dp r²/(qd)))

G.5 PROOFS FOR SUBSECTION 5.2

In this appendix, we provide a proof of our data-deletion and utility guarantee in Theorem 7, which applies to non-convex but bounded losses ℓ(θ; x) under the L2 regularizer r(θ). Suppose D_0 ∈ X^n is an arbitrary database, Q is any non-adaptive r-requester, and (Θ̃_i)_{i≥0} is the model sequence generated by the interaction of (A_Noisy-GD, Ā_Noisy-GD, Q). Our first goal will be to prove a (q, ε_dd)-data-deletion guarantee on (A_Noisy-GD, Ā_Noisy-GD), which we will later use to argue utility as well. Recall from Definition 4.1 that to prove (q, ε_dd)-data-deletion, we need to construct a map π^Q_i : X^n → O such that for all i ≥ 1 and any u_i ∈ U_r, R_q(Ā(D_{i-1}, u_i, Θ̃_{i-1}) ‖ π^Q_i(D_0 ∘ ⟨ind, y⟩)) ≤ ε_dd for all ⟨ind, y⟩ ∈ u_i. Our construction of π^Q_i for this proof is completely different from the one described in Appendix G.4. As discussed in Remark 3, since Q is non-adaptive, it suffices to show that there exists a map π : X^n → O such that for all i ≥ 1, R_q(Ā(D_{i-1}, u_i, Θ̃_{i-1}) ‖ π(D_i)) ≤ ε_dd, for all D_0 ∈ X^n and all edit sequences (u_i)_{i≥1} from U_r. Our mapping of choice for this purpose is the Gibbs distribution with the following density: π(D)(θ) ∝ e^{-L_D(θ)/σ²}. (117) The high-level intuition for this construction is that Noisy-GD can be interpreted as the Unadjusted Langevin Algorithm (ULA) (Roberts & Tweedie, 1996), a discretization of the Langevin diffusion (described in (63)) that eventually converges to this Gibbs distribution (see Appendix G.1 for a quick refresher).
However, showing convergence of ULA (in indistinguishability notions like Rényi divergence) to this Gibbs distribution, especially in the form of non-asymptotic bounds on the mixing time and discretization error, has been a long-standing open problem. Recent breakthrough results by Vempala & Wibisono (2019), followed by Chewi et al. (2021), resolved this problem with an elegant argument relying solely on isoperimetric assumptions over (117) that hold for non-convex losses. Our data-deletion argument leverages this rapid convergence result to show that once Noisy-GD reaches near-indistinguishability from its Gibbs mixing distribution, maintaining indistinguishability from the subsequent Gibbs distributions corresponding to database edits requires far fewer Noisy-GD iterations than fresh retraining (i.e., data deletion is faster than retraining). We start by presenting Chewi et al. (2021)'s convergence argument adapted to our Noisy-GD formulation, with a slightly tighter analysis that results in a log(q) improvement in the discretization error over the original. Consider the discrete stochastic process (Θ_{ηk})_{0≤k≤K} induced by the parameter update step in the Noisy-GD algorithm when run for K iterations on a database D with an arbitrary start distribution Θ_0 ∼ µ_0. We interpolate each discrete update from Θ_{ηk} to Θ_{η(k+1)} via a diffusion process Θ_t defined over time ηk ≤ t ≤ η(k+1) as Θ_t = Θ_{ηk} - (t - ηk)∇L_D(Θ_{ηk}) + √(2σ²)(Z_t - Z_{ηk}), where Z_t is a Wiener process. Note that if Θ_{ηk} models the parameter distribution after the k-th update, then Θ_{η(k+1)} models the parameter distribution after the (k+1)-th update. On repeating this construction for all k = 0, …, K, we get a tracing diffusion (Θ_t)_{t≥0} for Noisy-GD (which is different from (84)). We denote the distribution of the random variable Θ_t by µ_t. The tracing diffusion during the interval ηk ≤ t ≤ η(k+1) is characterized by the following Fokker-Planck equation.
Lemma 34 (Proposition 14, Chewi et al. (2021)). For the tracing diffusion $\Theta_t$ defined in (118), the equivalent Fokker-Planck equation in the interval $\eta k \le t \le \eta(k+1)$ is

$$\partial_t \mu_t(\theta) = \mathrm{div}\bigg(\Big(\mathbb{E}\big[\nabla L_D(\Theta_{\eta k}) - \nabla L_D(\Theta_t) \,\big|\, \Theta_t = \theta\big] + \sigma^2 \nabla \log \frac{\mu_t(\theta)}{\pi(D)(\theta)}\Big)\mu_t(\theta)\bigg), \tag{119}$$

where $\pi(D)$ is the Gibbs distribution defined in (117).

Proof. Conditioned on observing the parameter $\Theta_{\eta k} = \theta_{\eta k}$, the process $(\Theta_t)_{\eta k \le t \le \eta(k+1)}$ is a Langevin diffusion along the constant vector field $\nabla L_D(\theta_{\eta k})$. Therefore, the conditional probability density $\mu_{t|\eta k}(\cdot|\theta_{\eta k})$ of $\Theta_t$ given $\theta_{\eta k}$ follows the Fokker-Planck equation

$$\partial_t \mu_{t|\eta k}(\cdot|\theta_{\eta k}) = \sigma^2 \Delta \mu_{t|\eta k}(\cdot|\theta_{\eta k}) + \mathrm{div}\big(\mu_{t|\eta k}(\cdot|\theta_{\eta k})\, \nabla L_D(\theta_{\eta k})\big).$$

Taking expectation over $\Theta_{\eta k}$, we have

$$\begin{aligned}
\partial_t \mu_t(\cdot) &= \int \mu_{\eta k}(\theta_{\eta k})\Big[\sigma^2 \Delta \mu_{t|\eta k}(\cdot|\theta_{\eta k}) + \mathrm{div}\big(\mu_{t|\eta k}(\cdot|\theta_{\eta k})\, \nabla L_D(\theta_{\eta k})\big)\Big]\, d\theta_{\eta k} \\
&= \sigma^2 \Delta \mu_t(\cdot) + \mathrm{div}\big(\mu_t(\cdot)\,\nabla L_D(\cdot)\big) + \mathrm{div}\bigg(\mu_t(\cdot)\int \big[\nabla L_D(\theta_{\eta k}) - \nabla L_D(\cdot)\big]\,\mu_{\eta k|t}(\theta_{\eta k}|\cdot)\,d\theta_{\eta k}\bigg) \\
&= \sigma^2\,\mathrm{div}\bigg(\mu_t(\cdot)\,\nabla\log\frac{\mu_t(\cdot)}{\pi(D)(\cdot)}\bigg) + \mathrm{div}\Big(\mathbb{E}\big[\nabla L_D(\Theta_{\eta k}) - \nabla L_D(\cdot) \,\big|\, \Theta_t = \cdot\big]\,\mu_t(\cdot)\Big),
\end{aligned}$$

where $\mu_{\eta k|t}$ is the conditional density of $\Theta_{\eta k}$ given $\Theta_t$. For the last equality, we used the fact that $\nabla L_D = -\sigma^2 \nabla\log\pi(D)$ from (117).

The following lemma provides a partial differential inequality that bounds the rate of change of the Rényi divergence $R_q(\mu_t \,\|\, \pi(D))$ using the Fokker-Planck equation (119) of Noisy-GD's tracing diffusion.

Lemma 35 (Proposition 15, Chewi et al. (2021)). Let $\rho_t := \mu_t/\pi(D)$, where $\pi(D)$ is the Gibbs distribution defined in (117), and $\psi_t := \rho_t^{q-1}/E_q(\rho_t \,\|\, \pi(D))$. The rate of change of $R_q(\mu_t \,\|\, \pi(D))$ along the tracing diffusion in time $\eta k \le t \le \eta(k+1)$ is bounded as

$$\partial_t R_q(\mu_t \,\|\, \pi(D)) \le -\frac{3q\sigma^2}{4}\,\frac{I_q(\mu_t \,\|\, \pi(D))}{E_q(\mu_t \,\|\, \pi(D))} + \frac{q}{\sigma^2}\,\mathbb{E}\Big[\psi_t(\Theta_t)\,\big\|\nabla L_D(\Theta_{\eta k}) - \nabla L_D(\Theta_t)\big\|_2^2\Big]. \tag{121}$$

Proof. For brevity, let $\Delta_t(\cdot) = \mathbb{E}\big[\nabla L_D(\Theta_{\eta k}) - \nabla L_D(\Theta_t) \,\big|\, \Theta_t = \cdot\big]$ in the context of this proof.
From the Leibniz integral rule, we have

$$\begin{aligned}
\partial_t R_q(\mu_t \,\|\, \pi(D)) &= \frac{q}{(q-1)E_q(\mu_t \,\|\, \pi(D))} \int \Big(\frac{\mu_t}{\pi(D)}\Big)^{q-1} \partial_t \mu_t\, d\theta \\
&= \frac{q}{(q-1)E_q(\mu_t \,\|\, \pi(D))} \int \rho_t^{q-1}\, \mathrm{div}\big(\big(\Delta_t + \sigma^2 \nabla \log \rho_t\big)\mu_t\big)\, d\theta && \text{(from (119))} \\
&= -\frac{q}{(q-1)E_q(\mu_t \,\|\, \pi(D))} \int \big\langle \nabla \rho_t^{q-1},\; \Delta_t + \sigma^2 \nabla \log \rho_t \big\rangle\, \mu_t\, d\theta \\
&= -\frac{q}{E_q(\mu_t \,\|\, \pi(D))} \int \rho_t^{q-2} \Big\langle \nabla \rho_t,\; \Delta_t + \sigma^2 \frac{\nabla \rho_t}{\rho_t} \Big\rangle\, \mu_t\, d\theta \\
&= -\frac{q}{E_q(\mu_t \,\|\, \pi(D))} \bigg(\sigma^2 I_q(\mu_t \,\|\, \pi(D)) + \underbrace{\tfrac{2}{q}\,\mathbb{E}_{\mu_t}\Big[\rho_t^{q/2-1} \big\langle \nabla \rho_t^{q/2},\; \Delta_t \big\rangle\Big]}_{\stackrel{\mathrm{def}}{=} F_1}\bigg). && \text{(from (40))}
\end{aligned}$$

Note that the expectation in $\Delta_t(\cdot)$ is over the conditional distribution $\mu_{\eta k|t}$, while the expectation in $F_1$ is over $\mu_t$. Therefore, we can combine the two into an expectation over the unconditional joint distribution of $\Theta_t$ and $\Theta_{\eta k}$ as follows:

$$\begin{aligned}
-F_1 &= \mathbb{E}_{\Theta_t \sim \mu_t}\Big[\rho_t^{q/2-1}(\Theta_t)\, \big\langle \nabla \rho_t^{q/2}(\Theta_t),\; \mathbb{E}_{\Theta_{\eta k} \sim \mu_{\eta k|t}}\big[\nabla L_D(\Theta_t) - \nabla L_D(\Theta_{\eta k})\big] \big\rangle\Big] \\
&= \mathbb{E}_{\mu_{\eta k, t}}\Big[\rho_t^{q/2-1}(\Theta_t)\, \big\langle \nabla \rho_t^{q/2}(\Theta_t),\; \nabla L_D(\Theta_t) - \nabla L_D(\Theta_{\eta k}) \big\rangle\Big] \\
&\le \frac{\sigma^2}{2q}\, \mathbb{E}\Big[\rho_t^{-1}(\Theta_t)\big\|\nabla \rho_t^{q/2}(\Theta_t)\big\|_2^2\Big] + \frac{q}{2\sigma^2}\, \mathbb{E}\Big[\rho_t^{q-1}(\Theta_t)\big\|\nabla L_D(\Theta_t) - \nabla L_D(\Theta_{\eta k})\big\|_2^2\Big] \\
&= \frac{q\sigma^2}{8}\, I_q(\mu_t \,\|\, \pi(D)) + \frac{q}{2\sigma^2}\, \mathbb{E}\Big[\rho_t^{q-1}(\Theta_t)\big\|\nabla L_D(\Theta_t) - \nabla L_D(\Theta_{\eta k})\big\|_2^2\Big]. && \text{(from (40))}
\end{aligned}$$

Substituting this into the preceding inequality proves the proposition.

We need to solve the PDI (121) to obtain a convergence bound for Noisy-GD. To that end, we first introduce the change-of-measure inequalities shown in Chewi et al. (2021).

Lemma 36 (Change of measure inequality, Chewi et al. (2021)). If $\ell(\theta; x)$ is $\beta$-smooth and the regularizer is $r(\theta) = \frac{\lambda}{2}\|\theta\|_2^2$, then for any probability density $\mu$ on $\mathbb{R}^d$,

$$\mathbb{E}_\mu\big[\|\nabla L_D\|_2^2\big] \le 4\sigma^4\, \mathbb{E}_{\pi(D)}\Bigg[\bigg\|\nabla \sqrt{\frac{\mu}{\pi(D)}}\bigg\|_2^2\Bigg] + 2d\sigma^2(\beta + \lambda), \tag{122}$$

where $\pi(D)$ is the Gibbs distribution defined in (117).

Proof. Consider the Langevin diffusion (63) described in Appendix G.1 over the potential $L_D$. The Gibbs distribution $\pi(D)$ is its stationary distribution, and the diffusion's infinitesimal generator $\mathcal{G}$ applied to $L_D$ gives

$$\mathcal{G}L_D = \sigma^2 \Delta L_D - \|\nabla L_D\|_2^2. \tag{123}$$
Therefore,

$$\begin{aligned}
\mathbb{E}_\mu\big[\|\nabla L_D\|_2^2\big] &= \sigma^2\,\mathbb{E}_\mu[\Delta L_D] - \mathbb{E}_\mu[\mathcal{G}L_D] && \text{(from (123))} \\
&\le d\sigma^2(\beta+\lambda) - \int \mathcal{G}L_D\,\Big(\frac{\mu}{\pi(D)} - 1\Big)\pi(D)\,d\theta && \text{(from } \beta\text{-smoothness and (71))} \\
&= d\sigma^2(\beta+\lambda) + \int\big(\|\nabla L_D\|_2^2 - \sigma^2\Delta L_D\big)\Big(\frac{\mu}{\pi(D)} - 1\Big)\pi(D)\,d\theta \\
&= d\sigma^2(\beta+\lambda) + \int\|\nabla L_D\|_2^2\,(\mu - \pi(D))\,d\theta + \sigma^2\int\Big\langle\nabla L_D,\; \nabla\Big[\Big(\frac{\mu}{\pi(D)} - 1\Big)\pi(D)\Big]\Big\rangle\,d\theta && \text{(from (19))} \\
&= d\sigma^2(\beta+\lambda) + \int\|\nabla L_D\|_2^2\,(\mu - \pi(D))\,d\theta + \sigma^2\int\Big\langle\nabla L_D,\; -\frac{\nabla L_D}{\sigma^2}\Big\rangle(\mu - \pi(D))\,d\theta \\
&\qquad + \sigma^2\int\Big\langle\nabla L_D,\; \nabla\frac{\mu}{\pi(D)}\Big\rangle\pi(D)\,d\theta && \text{(since } \nabla\pi(D) = -\tfrac{\nabla L_D}{\sigma^2}\pi(D)\text{)} \\
&= d\sigma^2(\beta+\lambda) + 0 + 2\sigma^2\int\sqrt{\frac{\mu}{\pi(D)}}\,\Big\langle\nabla L_D,\; \nabla\sqrt{\frac{\mu}{\pi(D)}}\Big\rangle\pi(D)\,d\theta \\
&\le d\sigma^2(\beta+\lambda) + \frac{1}{2}\,\mathbb{E}_\mu\big[\|\nabla L_D\|_2^2\big] + 2\sigma^4\,\mathbb{E}_{\pi(D)}\Bigg[\bigg\|\nabla\sqrt{\frac{\mu}{\pi(D)}}\bigg\|_2^2\Bigg].
\end{aligned}$$

Rearranging completes the proof.

We are now ready to prove the rate-of-convergence guarantee for Noisy-GD following Chewi et al. (2021)'s method, but with a more refined analysis that leads to an improvement of a $\log q$ factor in the discretization error compared to the original (Chewi et al., 2021, Theorem 4).

Theorem 38 (Convergence of Noisy-GD in Rényi divergence). Let constants $\beta, \lambda, \sigma^2 \ge 0$ and $q, B > 1$. Suppose the loss function $\ell(\theta; x)$ is $(\sigma^2\log(B)/4)$-bounded and $\beta$-smooth, and the regularizer is $r(\theta) = \frac{\lambda}{2}\|\theta\|_2^2$. If the step size is $\eta \le \frac{\lambda}{64Bq^2(\beta+\lambda)^2}$, then for any database $D \in \mathcal{X}^n$ and any weight initialization distribution $\mu_0$ for $\Theta_0$, the Rényi divergence of the distribution $\mu_{\eta K}$ of the output model $\Theta_{\eta K} = \mathrm{Noisy\text{-}GD}(D, \Theta_0, K)$ with respect to the Gibbs distribution $\pi(D)$ defined in (117) shrinks as follows:

$$R_q(\mu_{\eta K}\,\|\,\pi(D)) \le q\exp\Big({-\frac{\lambda\eta K}{2B}}\Big)\, R_q(\mu_0\,\|\,\pi(D)) + \frac{32 d\eta q B(\beta+\lambda)^2}{\lambda}.$$

Proof. From the $(\beta+\lambda)$-smoothness of the loss $L_D$, we have that for any $\eta k \le t \le \eta(k+1)$,

$$\begin{aligned}
\big\|\nabla L_D(\Theta_{\eta k}) - \nabla L_D(\Theta_t)\big\|_2^2 &\le (\beta+\lambda)^2\,\|\Theta_{\eta k} - \Theta_t\|_2^2 \\
&= (\beta+\lambda)^2\,\big\|(t - \eta k)\nabla L_D(\Theta_{\eta k}) - \sqrt{2(t-\eta k)\sigma^2}\,Z_k\big\|_2^2 && \text{(from (118))} \\
&\le 2\eta^2(\beta+\lambda)^2\,\|\nabla L_D(\Theta_{\eta k})\|_2^2 + 4\eta\sigma^2(\beta+\lambda)^2\,\|Z_k\|_2^2 \\
&\le 4\eta^2(\beta+\lambda)^2\,\|\nabla L_D(\Theta_{\eta k}) - \nabla L_D(\Theta_t)\|_2^2 + 4\eta^2(\beta+\lambda)^2\,\|\nabla L_D(\Theta_t)\|_2^2 + 4\eta\sigma^2(\beta+\lambda)^2\,\|Z_k\|_2^2.
\end{aligned}$$

Let $\rho_t := \frac{\mu_t}{\pi(D)}$ and $\psi_t := \rho_t^{q-1}/E_q(\rho_t \,\|\, \pi(D))$.
If $\eta \le \frac{1}{2\sqrt{2}(\beta+\lambda)}$, we can rearrange the above to obtain the following bound on the discretization error in (121):

$$\mathbb{E}\Big[\psi_t(\Theta_t)\,\big\|\nabla L_D(\Theta_{\eta k}) - \nabla L_D(\Theta_t)\big\|_2^2\Big] \le 8\eta^2(\beta+\lambda)^2 \underbrace{\mathbb{E}\big[\psi_t(\Theta_t)\,\|\nabla L_D(\Theta_t)\|_2^2\big]}_{\stackrel{\mathrm{def}}{=} F_1} + 32\eta\sigma^2(\beta+\lambda)^2 \underbrace{\mathbb{E}\big[\psi_t(\Theta_t)\,\|Z_k\|_2^2/4\big]}_{\stackrel{\mathrm{def}}{=} F_2}.$$

Hence, to solve the PDI (121), we have to bound the two expectations $F_1$ and $F_2$.

1. Bounding $F_1$. Note that $\mathbb{E}_{\Theta_t\sim\mu_t}[\psi_t(\Theta_t)] = \int \psi_t(\theta)\mu_t(\theta)\,d\theta = \frac{1}{E_q(\rho_t\|\pi(D))}\int\frac{\mu_t^q}{\pi(D)^{q-1}}\,d\theta = 1$. So $\psi_t\mu_t(\theta) := \psi_t(\theta)\mu_t(\theta)$ is a probability density function on $\mathbb{R}^d$. Applying the measure-change Lemma 36 to it, we get

$$F_1 = \mathbb{E}_{\psi_t\mu_t}\big[\|\nabla L_D\|_2^2\big] \le 4\sigma^4\,\mathbb{E}_{\pi(D)}\Bigg[\bigg\|\nabla\sqrt{\frac{\psi_t\mu_t}{\pi(D)}}\bigg\|_2^2\Bigg] + 2d\sigma^2(\beta+\lambda) = 4\sigma^4\,\mathbb{E}_{\pi(D)}\Bigg[\frac{\big\|\nabla\sqrt{\rho_t^q}\big\|_2^2}{E_q(\mu_t\,\|\,\pi(D))}\Bigg] + 2d\sigma^2(\beta+\lambda) = \sigma^4 q^2\,\frac{I_q(\mu_t\,\|\,\pi(D))}{E_q(\mu_t\,\|\,\pi(D))} + 2d\sigma^2(\beta+\lambda),$$

where the first step uses (122) and the last uses (40).

2. Bounding $F_2$. Since $\psi_t\mu_t$ is a valid density on $\mathbb{R}^d$, the joint density $\psi_t\mu_{t,z}(\theta, z) := \psi_t(\theta)\mu_{t,z}(\theta, z)$, where $\mu_{t,z}$ is the joint density of $\Theta_t$ and $Z_k$, is also a valid density. Note that $F_2$ is an expectation of $\|Z_k\|_2^2/4$ taken over the joint density $\psi_t\mu_{t,z}$. We can perform a change of measure using the Donsker-Varadhan principle (Lemma 37) to get

$$F_2 = \mathbb{E}_{\psi_t\mu_{t,z}}\big[\|Z_k\|_2^2/4\big] \le \mathrm{KL}\big(\psi_t\mu_{t,z}\,\|\,\mu_{t,z}\big) + \log\mathbb{E}_{\mu_z}\big[\exp(\|Z_k\|_2^2/4)\big],$$

where we simplified the second term using the fact that the marginal $\mu_z$ of $\mu_{t,z}$ is a standard normal Gaussian. The random variable $\|Z_k\|_2^2$ is distributed according to the chi-squared distribution $\chi^2_d$ with $d$ degrees of freedom. Since the moment generating function of the chi-squared distribution is $M_{\chi^2_d}(t) = \mathbb{E}_{X\sim\chi^2_d}[\exp(tX)] = (1-2t)^{-d/2}$ for $t < \frac{1}{2}$, we can simplify the second term in $F_2$ as

$$\log\mathbb{E}_{\mu_z}\big[\exp(\|Z_k\|_2^2/4)\big] = \log M_{\chi^2_d}\Big(\frac{1}{4}\Big) = \frac{d\log 2}{2}.$$

The KL divergence term can be simplified as follows.

$$\begin{aligned}
\mathrm{KL}\big(\psi_t\mu_{t,z}\,\|\,\mu_{t,z}\big) &= \int \psi_t\mu_{t,z}(\theta_t, z)\,\log\psi_t(\theta_t)\,d\theta_t\,dz \\
&= \int \psi_t\mu_t\,\log\frac{\rho_t^{q-1}}{E_q(\mu_t\,\|\,\pi(D))}\,d\theta_t && \text{(on marginalizing out } z\text{)} \\
&= \frac{q-1}{q}\int \mu_t\psi_t\bigg[\log\frac{\rho_t^q}{E_q(\mu_t\,\|\,\pi(D))} - \log E_q(\mu_t\,\|\,\pi(D))^{1/(q-1)}\bigg]\,d\theta_t \\
&= \frac{q-1}{q}\Big\{\mathrm{KL}\big(\mu_t\psi_t\,\|\,\pi(D)\big) - R_q(\mu_t\,\|\,\pi(D))\Big\} \\
&\le \mathrm{KL}\big(\mu_t\psi_t\,\|\,\pi(D)\big). && \text{(since } R_q(\mu_t\,\|\,\pi(D)) > 0\text{)}
\end{aligned}$$

Note that under the assumptions of the theorem, $\pi(D)$ satisfies the log-Sobolev inequality (74) with constant $\lambda/B$ (i.e., it satisfies LS$(\lambda/B)$). To see this, recall from Lemma 18 that the Gaussian distribution $\rho(\theta) = \mathcal{N}\big(0, \frac{\sigma^2}{\lambda}I_d\big)$ satisfies the LS$(\lambda)$ inequality. Since the loss $\ell(\theta; x)$ is $(\sigma^2\log(B)/4)$-bounded, the density ratio $\frac{\pi(D)(\theta)}{\rho(\theta)} \in \big[\frac{1}{\sqrt{B}}, \sqrt{B}\big]$. The claim therefore follows from Lemma 20. Using this inequality, from Lemma 21 we have

$$\begin{aligned}
\mathrm{KL}\big(\mu_t\psi_t\,\|\,\pi(D)\big) &\le \frac{\sigma^2 B}{2\lambda}\int \mu_t\psi_t\,\bigg\|\nabla\log\frac{\mu_t\psi_t}{\pi(D)}\bigg\|_2^2\,d\theta_t = \frac{\sigma^2 B}{2\lambda}\int \frac{\rho_t^q}{E_q(\mu_t\,\|\,\pi(D))}\,\big\|\nabla\log(\rho_t^q)\big\|_2^2\,\pi(D)\,d\theta_t \\
&= \frac{2\sigma^2 B}{\lambda}\,\frac{1}{E_q(\mu_t\,\|\,\pi(D))}\int\big\|\nabla(\rho_t^{q/2})\big\|_2^2\,\pi(D)\,d\theta_t = \frac{q^2\sigma^2 B}{2\lambda}\,\frac{I_q(\mu_t\,\|\,\pi(D))}{E_q(\mu_t\,\|\,\pi(D))}.
\end{aligned}$$

On combining the two bounds on $F_1$ and $F_2$ and rearranging, we have

$$\mathbb{E}\Big[\psi_t(\Theta_t)\,\big\|\nabla L_D(\Theta_{\eta k}) - \nabla L_D(\Theta_t)\big\|_2^2\Big] \le 8\eta q^2\sigma^4(\beta+\lambda)^2\,\frac{I_q(\mu_t\,\|\,\pi(D))}{E_q(\mu_t\,\|\,\pi(D))}\Big(\eta + \frac{2B}{\lambda}\Big) + 16\eta d\sigma^2(\beta+\lambda)^2\big(\eta(\beta+\lambda) + \log 2\big).$$

Let the step size be $\eta \le \min\big\{\frac{2B}{\lambda}, \frac{\lambda}{64Bq^2(\beta+\lambda)^2}\big\}$. Then, the first term above is bounded as

$$8\eta q^2\sigma^4(\beta+\lambda)^2\,\frac{I_q(\mu_t\,\|\,\pi(D))}{E_q(\mu_t\,\|\,\pi(D))}\Big(\eta + \frac{2B}{\lambda}\Big) \le \frac{\sigma^4}{2}\,\frac{I_q(\mu_t\,\|\,\pi(D))}{E_q(\mu_t\,\|\,\pi(D))}.$$

Let $\eta \le \frac{1}{4(\beta+\lambda)}$. Then, in the second term, $\eta(\beta+\lambda) + \log 2 \le 1$. Plugging the bound on the discretization error back into the PDI (121), we get

$$\partial_t R_q(\mu_t\,\|\,\pi(D)) \le -\frac{q\sigma^2}{4}\,\frac{I_q(\mu_t\,\|\,\pi(D))}{E_q(\mu_t\,\|\,\pi(D))} + 16\eta d q(\beta+\lambda)^2.$$

Since $\pi(D)$ satisfies the LS$(\lambda/B)$ inequality, from Lemma 23 this PDI reduces to

$$\partial_t R_q(\mu_t\,\|\,\pi(D)) + \frac{\lambda}{2B}\bigg[\frac{R_q(\mu_t\,\|\,\pi(D))}{q} + (q-1)\,\partial_q R_q(\mu_t\,\|\,\pi(D))\bigg] \le 16 d\eta q(\beta+\lambda)^2.$$

Let $c_1 = \frac{\lambda}{2B}$ and $c_2 = 16d\eta(\beta+\lambda)^2$. Additionally, let $u(q, t) = \frac{R_q(\mu_t\|\pi(D))}{q}$. Then,

$$\partial_t R_q + c_1\Big[\frac{R_q}{q} + (q-1)\partial_q R_q\Big] \le c_2 q \;\Longrightarrow\; \frac{\partial_t R_q}{q} + c_1\frac{R_q}{q} + c_1(q-1)\Big[\frac{\partial_q R_q}{q} - \frac{R_q}{q^2}\Big] \le c_2 \;\Longrightarrow\; \partial_t u(q, t) + c_1 u(q, t) + c_1(q-1)\,\partial_q u(q, t) \le c_2.$$

For some constant $q \ge 1$, let $q(s) = (q-1)\exp(c_1(s - \eta K)) + 1$ and $t(s) = s$. Note that $\frac{dq(s)}{ds} = c_1(q(s) - 1)$ and $\frac{dt(s)}{ds} = 1$. Therefore, for any $0 \le t \le \eta K$, the PDI above implies that along the path $u(s) = u(q(s), t(s))$ we have $\frac{d}{ds}u(s) + c_1 u(s) \le c_2$, whose solution (via the integrating factor $e^{c_1 s}$) gives $u(\eta K) \le e^{-c_1\eta K}u(0) + \frac{c_2}{c_1}\big(1 - e^{-c_1\eta K}\big)$. On reversing the parameterization of $q$ and $t$, we get

$$R_{q(\eta K)}(\mu_{\eta K}\,\|\,\pi(D)) \le \frac{q(\eta K)}{q(0)}\,e^{-c_1\eta K}\,R_{q(0)}(\mu_0\,\|\,\pi(D)) + \frac{c_2}{c_1}\,q(\eta K) \le \frac{q(\eta K)}{q(0)}\exp\Big({-\frac{\lambda\eta K}{2B}}\Big)R_{q(0)}(\mu_0\,\|\,\pi(D)) + \frac{32d\eta B(\beta+\lambda)^2}{\lambda}\,q(\eta K).$$
Since $q(0) > 1$ and $q = q(\eta K) > q(0)$, the monotonicity of the Rényi divergence in $q$ gives

$$R_q(\mu_{\eta K}\,\|\,\pi(D)) \le q\exp\Big({-\frac{\lambda\eta K}{2B}}\Big)R_q(\mu_0\,\|\,\pi(D)) + \frac{32 d\eta q B(\beta+\lambda)^2}{\lambda}.$$

Finally, noting that for constants $B, q > 1$ and $\beta, \lambda \ge 0$, the assumed step size $\eta \le \frac{\lambda}{64Bq^2(\beta+\lambda)^2}$ implies all the intermediate step-size conditions used above completes the proof.

We will use Theorem 38 to prove the data-deletion and utility guarantees on $(A_{\mathrm{Noisy\text{-}GD}}, \bar A_{\mathrm{Noisy\text{-}GD}})$. We also need the following result, which shows that Gibbs distributions enjoy strong indistinguishability under bounded perturbations of their potential function (which is essentially why the exponential mechanism satisfies $(\varepsilon, 0)$-DP (Wang et al., 2015; Dwork et al., 2014)).

Lemma 39 (Indistinguishability under bounded perturbations). For two potential functions $L, L' : \mathbb{R}^d \to \mathbb{R}$ and some constant $\sigma^2$, let $\nu \propto e^{-L/\sigma^2}$ and $\nu' \propto e^{-L'/\sigma^2}$ be the respective Gibbs distributions. If $|L(\theta) - L'(\theta)| \le c$ for all $\theta \in \mathbb{R}^d$, then $R_q(\nu\,\|\,\nu') \le \frac{2c}{\sigma^2}$ for all $q > 1$.

Proof. The Gibbs distributions $\nu, \nu'$ have densities $\nu(\theta) = \frac{1}{\Lambda}e^{-L(\theta)/\sigma^2}$ and $\nu'(\theta) = \frac{1}{\Lambda'}e^{-L'(\theta)/\sigma^2}$, where $\Lambda, \Lambda'$ are the respective normalization constants. If the potential difference satisfies $|L(\theta) - L'(\theta)| \le c$ for all $\theta \in \mathbb{R}^d$, then

$$\begin{aligned}
R_q(\nu\,\|\,\nu') &= \frac{1}{q-1}\log\int\frac{\nu^q}{\nu'^{q-1}}\,d\theta = \frac{1}{q-1}\log\Bigg[\Big(\frac{\Lambda'}{\Lambda}\Big)^{q-1}\int\exp\Big(\frac{q-1}{\sigma^2}\big(L'(\theta) - L(\theta)\big)\Big)\,\nu(\theta)\,d\theta\Bigg] \\
&\le \frac{1}{q-1}\Bigg[(q-1)\log\frac{\Lambda'}{\Lambda} + \log\int\exp\Big(\frac{c(q-1)}{\sigma^2}\Big)\,\nu\,d\theta\Bigg] \\
&= \frac{1}{q-1}\Bigg[(q-1)\log\frac{\int\exp\big({-\frac{L(\theta)}{\sigma^2}} + \frac{L(\theta) - L'(\theta)}{\sigma^2}\big)\,d\theta}{\int\exp\big({-\frac{L(\theta)}{\sigma^2}}\big)\,d\theta} + \frac{c(q-1)}{\sigma^2}\Bigg] \le \frac{2c}{\sigma^2}.
\end{aligned}$$

In Theorem 7, we show that $(A_{\mathrm{Noisy\text{-}GD}}, \bar A_{\mathrm{Noisy\text{-}GD}})$ solves the data-deletion problem described in Section 4 even for non-convex losses. Our proof uses the convergence Theorem 38 and the bounded-perturbation Lemma 39 to show that the data-deletion algorithm $\bar A_{\mathrm{Noisy\text{-}GD}}$ can consistently produce models indistinguishable from the corresponding Gibbs distribution (117) in the online setting, at a fraction of the computation cost of $A_{\mathrm{Noisy\text{-}GD}}$. As discussed in Remark 3, such indistinguishability is sufficient to ensure data-deletion for non-adaptive requests.
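Lemma 39 is easy to sanity-check numerically: for two Gibbs densities on a one-dimensional grid whose potentials differ by at most $c$ pointwise, the Rényi divergence computed by direct integration should not exceed $2c/\sigma^2$. The potentials below are illustrative choices of ours, not from the paper.

```python
import numpy as np

sigma2, q, c = 0.5, 3.0, 0.2
theta = np.linspace(-8.0, 8.0, 200_001)
dx = theta[1] - theta[0]

L = 0.5 * theta ** 2              # potential L
Lp = L + c * np.sin(theta)        # perturbed potential L' with |L - L'| <= c

nu = np.exp(-L / sigma2)
nu /= nu.sum() * dx               # normalized Gibbs density for L
nup = np.exp(-Lp / sigma2)
nup /= nup.sum() * dx             # normalized Gibbs density for L'

# R_q(nu || nu') = (1/(q-1)) * log integral of nu^q * nu'^(1-q)
renyi = np.log(np.sum(nu ** q * nup ** (1.0 - q)) * dx) / (q - 1.0)
bound = 2.0 * c / sigma2          # Lemma 39's bound
```

The computed `renyi` value should be strictly positive (the two densities differ) yet comfortably below `bound`; the lemma's bound is loose because it charges the full perturbation budget both to the normalization constants and to the exponent.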
As for adaptive requests, the well-known RDP guarantee of Abadi et al. (2016) combined with our reduction Theorem 5 offers a data-deletion guarantee for $(A_{\mathrm{Noisy\text{-}GD}}, \bar A_{\mathrm{Noisy\text{-}GD}})$ under adaptivity. Our proof of accuracy for the data-deleted models leverages the fact that the Gibbs distribution (117) is an almost-exact excess risk minimizer, as shown in Theorem 40 below. Since our data-deletion guarantee is based on near-indistinguishability to (117), this property also ensures near-optimal excess risk of data-deleted models.

Theorem 7 (Accuracy, privacy, deletion, and computation tradeoffs). Let constants $\lambda, \beta, L, \sigma^2, \eta > 0$, $q, B > 1$, and $0 < \varepsilon_{\mathrm{dd}} \le \varepsilon_{\mathrm{dp}} < d$. Let the loss function $\ell(\theta; x)$ be $(\sigma^2\log(B)/4)$-bounded, $L$-Lipschitz, and $\beta$-smooth, the regularizer be $r(\theta) = \frac{\lambda}{2}\|\theta\|_2^2$, and the weight initialization distribution be $\rho = \mathcal{N}\big(0, \frac{\sigma^2}{\lambda}I_d\big)$. Then,

(1.) both $A_{\mathrm{Noisy\text{-}GD}}$ and $\bar A_{\mathrm{Noisy\text{-}GD}}$ are $(q, \varepsilon_{\mathrm{dp}})$-RDP for any $\eta \ge 0$ and any $K_A, K_{\bar A} \ge 0$ if

$$\sigma^2 \ge \frac{qL^2}{\varepsilon_{\mathrm{dp}} n^2}\cdot\eta\max\{K_A, K_{\bar A}\};$$

(2.) the pair $(A_{\mathrm{Noisy\text{-}GD}}, \bar A_{\mathrm{Noisy\text{-}GD}})$ satisfies $(q, \varepsilon_{\mathrm{dd}})$-data-deletion under all non-adaptive $r$-requesters for any $\sigma^2 > 0$, if the learning rate is $\eta \le \frac{\lambda\varepsilon_{\mathrm{dd}}}{64dqB(\beta+\lambda)^2}$ and the numbers of iterations satisfy

$$K_A \ge \frac{2B}{\lambda\eta}\log\frac{q\log(B)}{\varepsilon_{\mathrm{dd}}}, \qquad K_{\bar A} \ge K_A - \frac{2B}{\lambda\eta}\log\frac{\log(B)/2}{\varepsilon_{\mathrm{dd}} + \frac{r}{n}\log(B)}.$$

Proof. (1.) Privacy. By Theorem 24, Noisy-GD with $K$ iterations on an $L$-Lipschitz loss function satisfies $(q, \varepsilon_{\mathrm{dp}})$-RDP for any initial weight distribution $\rho$ and learning rate $\eta \ge 0$ if $\sigma^2 = \frac{qL^2}{\varepsilon_{\mathrm{dp}} n^2}\cdot\eta K$. Since $A_{\mathrm{Noisy\text{-}GD}}$ and $\bar A_{\mathrm{Noisy\text{-}GD}}$ run Noisy-GD for $K_A$ and $K_{\bar A}$ iterations respectively, setting the noise variance given in the theorem statement ensures $(q, \varepsilon_{\mathrm{dp}})$-RDP for both.

(2.) Deletion. To show data-deletion under non-adaptive requests, recall that it suffices to show that there exists a map $\pi : \mathcal{X}^n \to \mathcal{O}$ such that for all $i \ge 1$,

$$R_q\Big(\bar A(D_{i-1}, u_i, \hat\Theta_{i-1}) \,\Big\|\, \pi(D_i)\Big) \le \varepsilon_{\mathrm{dd}}, \tag{137}$$

for all edit sequences $(u_i)_{i\ge 1}$ from $\mathcal{U}^r$, where $(\hat\Theta_i)_{i\ge 0}$ is the sequence of models generated by the interaction of $(A_{\mathrm{Noisy\text{-}GD}}, \bar A_{\mathrm{Noisy\text{-}GD}}, Q)$ on any database $D_0 \in \mathcal{X}^n$. For all $i \ge 0$, let $\hat\mu_i$ denote the distribution of $\hat\Theta_i$. We prove (137) via induction.
Base step: Note that the initial weight distribution $\rho = \mathcal{N}\big(0, \frac{\sigma^2}{\lambda}I_d\big)$ has a density proportional to $e^{-r(\theta)/\sigma^2}$, and the distribution $\pi(D_0)$ has a density proportional to $e^{-L_{D_0}(\theta)/\sigma^2}$. Since both of these are Gibbs distributions whose potential difference satisfies $|L_{D_0}(\theta) - r(\theta)| \le \sigma^2\log(B)/4$ for all $\theta \in \mathbb{R}^d$, due to the boundedness assumption on $\ell(\theta; x)$, we have from Lemma 39 that

$$R_q(\rho\,\|\,\pi(D_0)) \le \frac{2}{\sigma^2}\times\frac{\sigma^2\log(B)}{4} = \frac{\log(B)}{2}.$$

Under the stated assumptions on the loss $\ell(\theta; x)$ and learning rate $\eta$, the convergence Theorem 38 holds. Since $\hat\Theta_0 = A_{\mathrm{Noisy\text{-}GD}}(D_0) = \mathrm{Noisy\text{-}GD}(D_0, \Theta_0, K_A)$, where $\Theta_0 \sim \rho$, we have

$$\begin{aligned}
R_q(\hat\mu_0\,\|\,\pi(D_0)) &\le q\exp\Big({-\frac{\lambda\eta K_A}{2B}}\Big)R_q(\rho\,\|\,\pi(D_0)) + \frac{32d\eta qB(\beta+\lambda)^2}{\lambda} \\
&\le q\exp\Big({-\frac{\lambda\eta K_A}{2B}}\Big)\frac{\log(B)}{2} + \frac{\varepsilon_{\mathrm{dd}}}{2} && \text{(since } \eta \le \tfrac{\lambda\varepsilon_{\mathrm{dd}}}{64dqB(\beta+\lambda)^2}\text{)} \\
&\le \varepsilon_{\mathrm{dd}}. && \text{(since } K_A \ge \tfrac{2B}{\lambda\eta}\log\tfrac{q\log(B)}{\varepsilon_{\mathrm{dd}}}\text{)}
\end{aligned}$$

Induction step: Suppose $R_q(\hat\mu_{i-1}\,\|\,\pi(D_{i-1})) \le \varepsilon_{\mathrm{dd}}$. Again, from the boundedness of $\ell(\theta; x)$, we have $|L_{D_{i-1}}(\theta) - L_{D_i}(\theta)| \le \frac{r\sigma^2\log B}{2n}$ for all $\theta \in \mathbb{R}^d$. Therefore, from Lemma 39 we have, for all $q > 1$, that

$$R_q(\pi(D_{i-1})\,\|\,\pi(D_i)) \le \frac{r\log(B)}{n}. \tag{139}$$

So, from the weak triangle inequality for Rényi divergence (Theorem 17),

$$R_q(\hat\mu_{i-1}\,\|\,\pi(D_i)) \le R_q(\hat\mu_{i-1}\,\|\,\pi(D_{i-1})) + R_\infty(\pi(D_{i-1})\,\|\,\pi(D_i)) \le \varepsilon_{\mathrm{dd}} + \frac{r\log(B)}{n}.$$

Since $\hat\Theta_i = \bar A_{\mathrm{Noisy\text{-}GD}}(D_{i-1}, u_i, \hat\Theta_{i-1}) = \mathrm{Noisy\text{-}GD}(D_i, \hat\Theta_{i-1}, K_{\bar A})$, from the convergence Theorem 38 we have

$$R_q(\hat\mu_i\,\|\,\pi(D_i)) \le q\exp\Big({-\frac{\lambda\eta K_{\bar A}}{2B}}\Big)R_q(\hat\mu_{i-1}\,\|\,\pi(D_i)) + \frac{32d\eta qB(\beta+\lambda)^2}{\lambda} \le q\exp\Big({-\frac{\lambda\eta K_{\bar A}}{2B}}\Big)\Big(\varepsilon_{\mathrm{dd}} + \frac{r\log(B)}{n}\Big) + \frac{\varepsilon_{\mathrm{dd}}}{2} \le \varepsilon_{\mathrm{dd}},$$

where the last step uses the stated lower bound on $K_{\bar A}$. Hence, by induction, $R_q(\hat\mu_i\,\|\,\pi(D_i)) \le \varepsilon_{\mathrm{dd}}$ holds for all $i \ge 0$.

The second term is the suboptimality of the Gibbs distribution; by Theorem 40, it is bounded as

$$\mathbb{E}\big[L_{D_i}(\tilde\Theta_i)\big] - L_{D_i}(\theta^*_{D_i}) \le \frac{d\sigma^2}{2}\log\frac{\beta+\lambda}{\lambda} + \sigma^2\log\sqrt{B}.$$

Due to the $L$-Lipschitzness and $(\lambda+\beta)$-smoothness of $L_{D_i}$, for any coupling $\Pi$ of $\tilde\Theta_i$ and $\hat\Theta_i$, the first term is bounded as

$$\mathbb{E}\big[L_{D_i}(\hat\Theta_i)\big] - \mathbb{E}\big[L_{D_i}(\tilde\Theta_i)\big] \le \mathbb{E}_\Pi\Big[\big\langle\nabla L_{D_i}(\tilde\Theta_i),\, \hat\Theta_i - \tilde\Theta_i\big\rangle + \frac{\lambda+\beta}{2}\big\|\hat\Theta_i - \tilde\Theta_i\big\|_2^2\Big] \le L\sqrt{\mathbb{E}_\Pi\big\|\hat\Theta_i - \tilde\Theta_i\big\|_2^2} + \frac{\lambda+\beta}{2}\,\mathbb{E}_\Pi\big\|\hat\Theta_i - \tilde\Theta_i\big\|_2^2,$$

where the last step uses Jensen's inequality. Recall that the distribution $\pi(D)$ satisfies the LS$(\lambda/B)$ inequality.
On choosing $\Pi$ to be the infimal coupling, we get the following bound on the Wasserstein distance from Lemma 22:

$$\inf_\Pi\,\mathbb{E}_{(\tilde\Theta_i, \hat\Theta_i)\sim\Pi}\big\|\tilde\Theta_i - \hat\Theta_i\big\|_2^2 = W_2^2\big(\tilde\Theta_i, \hat\Theta_i\big) \le \frac{2B\sigma^2}{\lambda}\,\mathrm{KL}\big(\hat\mu_i\,\|\,\pi(D_i)\big) \le \frac{2\varepsilon_{\mathrm{dd}}B\sigma^2}{\lambda}.$$

The last inequality above follows from the monotonicity of the Rényi divergence in $q$ and the fact that $\lim_{q\to 1} R_q(\nu\,\|\,\nu') = \mathrm{KL}(\nu\,\|\,\nu')$. Therefore, on combining all the bounds, we get

$$\mathrm{err}(\hat\Theta; D) \le L\sigma\sqrt{\frac{2\varepsilon_{\mathrm{dd}}B}{\lambda}} + \frac{\varepsilon_{\mathrm{dd}}B\sigma^2(\lambda+\beta)}{\lambda} + \frac{d\sigma^2}{2}\log\frac{\beta+\lambda}{\lambda} + \sigma^2\log\sqrt{B} = O\big(\sigma\sqrt{\varepsilon_{\mathrm{dd}}} + d\sigma^2\big). \tag{144}$$

Note that if the constraints on $K_A$ and $K_{\bar A}$ in (136) and on $\sigma^2$ in (135) are equalities instead, we have

$$\sigma^2 = \frac{2qBL^2}{\lambda\varepsilon_{\mathrm{dp}}n^2}\log\frac{q\log(B)}{\varepsilon_{\mathrm{dd}}} = \tilde O\Big(\frac{q}{\varepsilon_{\mathrm{dp}}n^2}\Big), \tag{145}$$

where $\tilde O(\cdot)$ hides logarithmic factors. Therefore, the excess empirical risk has order

$$\mathrm{err}(\hat\Theta; D) = \tilde O\bigg(\frac{1}{n}\sqrt{\frac{q\varepsilon_{\mathrm{dd}}}{\varepsilon_{\mathrm{dp}}}} + \frac{dq}{\varepsilon_{\mathrm{dp}}n^2}\bigg). \tag{146}$$
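To get a feel for the tradeoffs in Theorem 7, the sketch below instantiates the constraints on $\eta$, $K_A$, $K_{\bar A}$, and $\sigma^2$ with concrete but arbitrary constants of our choosing. The formulas mirror the bounds derived in this appendix; treat the exact constants as illustrative rather than a drop-in implementation. The qualitative point is that $K_{\bar A}$, the per-edit deletion budget, is much smaller than the from-scratch training budget $K_A$.

```python
import math

def theorem7_schedule(n, d, q, B, beta, lam, L, eps_dp, eps_dd, r):
    """Iteration counts and noise level suggested by Theorem 7's constraints
    (a sketch using the constants derived in Appendix G.5)."""
    # Step-size cap from the data-deletion condition.
    eta = lam * eps_dd / (64 * d * q * B * (beta + lam) ** 2)
    # Iterations for training from scratch (K_A).
    K_learn = math.ceil((2 * B / (lam * eta))
                        * math.log(q * math.log(B) / eps_dd))
    # Iterations per deletion request (K_Abar), from the induction step.
    K_delete = math.ceil((2 * B / (lam * eta))
                         * math.log(2 * q * (eps_dd + (r / n) * math.log(B)) / eps_dd))
    # Noise variance for (q, eps_dp)-RDP of both algorithms.
    sigma2 = (q * L ** 2 / (eps_dp * n ** 2)) * eta * max(K_learn, K_delete)
    return eta, K_learn, K_delete, sigma2

eta, K_A, K_del, sigma2 = theorem7_schedule(
    n=10_000, d=50, q=2.0, B=4.0, beta=1.0, lam=0.5, L=1.0,
    eps_dp=1.0, eps_dd=0.1, r=1)
speedup = K_A / K_del   # deletion is cheaper than retraining by this factor
```

Because $K_{\bar A}$ only needs to shrink a divergence of $\varepsilon_{\mathrm{dd}} + \frac{r}{n}\log B$ rather than $\frac{\log B}{2}$, the computed `speedup` exceeds 1, which is the "deletion is faster than retraining" phenomenon the theorem quantifies.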



We consider replacement instead of separate addition/deletion to ensure that the database size doesn't change. Definition 2.6 is stronger than the adaptive unlearning definition of Gupta et al. (2021), since theirs requires satisfying only one-sided indistinguishability with at least $(1-\gamma)$ probability over the generated edit requests $u_{\le i}$. $\ldots\frac{\lambda}{2}\|\theta\|_2^2$, and the weight initialization distribution be $\rho = \mathcal{N}\big(0, \frac{\sigma^2}{\lambda}I_d\big)$. Then,



$\Theta'$ if, for all $O \subseteq \mathcal{O}$, $\mathbb{P}[\Theta \in O] \le e^{\varepsilon}\,\mathbb{P}[\Theta' \in O] + \delta$ and $\mathbb{P}[\Theta' \in O] \le e^{\varepsilon}\,\mathbb{P}[\Theta \in O] + \delta$.

.e., the minimization objective is $L_D(\theta) = \frac{1}{n}\sum_{x\in D}\ell(\theta; x) + r(\theta)$. The excess empirical risk of a model $\Theta$ for $D$ is defined as $\mathrm{err}(\Theta; D) = \mathbb{E}[L_D(\Theta) - L_D(\theta^*_D)]$, where $\theta^*_D = \arg\min_{\theta\in\mathcal{O}} L_D(\theta)$ and the expectation is taken over $\Theta$.

12) and (11) are equalities. The proof of Theorem 7 can be found in Appendix G.5. The Rényi DP result in (1.) is a restatement of Abadi et al. (2016, Theorem 1) (discussed further in Appendix G.3). We prove the deletion and utility results in (2.) and (3.) by building on recent rapid-convergence results for Noisy-GD by Vempala & Wibisono (2019) and Chewi et al. (2021). Specifically, we show bounds on the Rényi divergence of the generated models $\hat\Theta_i$ w.r.t. the Gibbs distribution $\pi(D_i) \propto \exp(-L_{D_i}/\sigma^2)$ for all $i \ge 0$.

); Sekhari et al. (2021); Ginart et al. (2019). Definition E.1 ($(\varepsilon, \delta)$-certified removal (Guo et al., 2019)). A removal mechanism $\bar A$ performs $(\varepsilon, \delta)$-certified removal for a learning algorithm $A$ if, for all databases $D \subset \mathcal{X}$ and deletion subsets $S \subset D$, $\bar A(D, S, A(D)) \approx_{\varepsilon,\delta} A(D \setminus S)$.

) where $\emptyset$ denotes the empty set and $T(D)$ denotes the data statistics available to $\bar A$ regarding $D$. Definition E.3 (Data-deletion operation (Ginart et al., 2019)). Fix any dataset $D \subset \mathcal{X}$ and learning algorithm $A$. An operation $\bar A$ is a deletion operation for $A$ if $\bar A(D, S, A(D)) \approx_{0,0} A(D \setminus S)$ for any set $S \subset D$ selected independently of $A(D)$.

For any $D \subset \mathcal{X}$ and any $S \subset D$, the above algorithm pair $(A, \bar A)$ satisfies Definitions E.1, E.2, and E.3 for $\varepsilon = \delta = 0$ and $T(D) = D$. Suppose the adversary is aware that the following dependence holds between the learned model $A(D)$ and the deletion request $S$:

Consider two neighbouring databases $D_{-1} = \{-2, -1, 2\}$ and $D_1 = \{-2, 1, 2\}$. Knowing the above dependence, an adversary can determine whether $D = D_{-1}$ or $D = D_1$ by looking only at $\bar A(D, S, A(D))$: if $D = D_{-1}$, the observation after unlearning is $2$, and if $D = D_1$, the observation after unlearning is $-2$. So, even though $(A, \bar A)$ satisfies the guarantees of Guo et al. (2019) and Sekhari et al. (2021), it blatantly reveals the identity ($-1$ or $1$) of a deleted record to an adversary observing only the post-deletion release.

must be small. Since, after processing the edit request $u_i$, the databases $D_i, D_{i+1}, \dots$ no longer contain the deleted record $D_{i-1}[\mathrm{ind}]$, the data-processing inequality implies that future models $\hat\Theta_{i+1}, \hat\Theta_{i+2}, \dots$ cannot carry more information about $D_{i-1}[\mathrm{ind}]$ than what is present in $\hat\Theta_i$. Therefore, any test $\mathrm{MI}(\hat\Theta_i, \hat\Theta_{i+1}, \dots)$ has a smaller advantage than the optimal test $\mathrm{MI}^*(\hat\Theta_i) \in \{0, 1\}$ that uses only $\hat\Theta_i$.

This requester $Q$ transforms any initial database $D_0$ into $D_{n-1} = D$ if the outcome $\mathrm{Test}(\hat\Theta_0) = 1$, and into $D_{n-1} = D'$ otherwise. Consider an adversary that does not observe the interaction transcript $(\hat\Theta_{<n}; u_{<n})$ but is interested in identifying whether $D_0$ was $D$ or $D'$. The adversary gets to observe only the output $\hat\Theta_n = \bar A(D_{n-1}, u_n, \hat\Theta_{n-1})$ generated after processing the edit request $u_n = \bar u$. On this observation, the adversary runs the membership inference test $\mathrm{MI}(\hat\Theta_n) = \mathrm{Test}(\hat\Theta_n)$. The membership inference advantage of MI is

$$\begin{aligned}
\mathrm{Adv}(\mathrm{MI}; D, D') &\stackrel{\mathrm{def}}{=} \mathbb{P}\big[\mathrm{MI}(\hat\Theta_n) = 1 \,\big|\, D_0 = D\big] - \mathbb{P}\big[\mathrm{MI}(\hat\Theta_n) = 1 \,\big|\, D_0 = D'\big] \\
&= \sum_{b\in\{0,1\}} \mathbb{P}\big[\mathrm{Test}(\hat\Theta_n) = 1 \,\big|\, \mathrm{Test}(\hat\Theta_0) = b\big]\Big(\mathbb{P}\big[\mathrm{Test}(\hat\Theta_0) = b \,\big|\, D_0 = D\big] - \mathbb{P}\big[\mathrm{Test}(\hat\Theta_0) = b \,\big|\, D_0 = D'\big]\Big) \\
&= \Big(\mathbb{P}\big[\mathrm{Test}(\hat\Theta_n) = 1 \,\big|\, D_{n-1} = D\big] - \mathbb{P}\big[\mathrm{Test}(\hat\Theta_n) = 1 \,\big|\, D_{n-1} = D'\big]\Big)\times\mathrm{Adv}(\mathrm{Test}; D, D') \\
&= \mathrm{Adv}(\mathrm{Test}; D, D', \bar u)\times\mathrm{Adv}(\mathrm{Test}; D, D') > \delta^2.
\end{aligned}$$

+ηdσ 2 (β+λ). (109) Let θ * D = arg min θ∈R d L D (θ). From λ-strong convexity of L D , for any θ ∈ R d , we have

20) with $a = 2\sigma^2$.) Another change-of-measure inequality needed for the proof is the Donsker-Varadhan variational principle. Lemma 37 (Donsker-Varadhan variational principle (Donsker & Varadhan, 1983)). If $\nu$ and $\nu'$ are two distributions on $\mathbb{R}^d$ such that $\nu \ll \nu'$, then for all functions $f : \mathbb{R}^d \to \mathbb{R}$,

$$\mathbb{E}_{\theta\sim\nu}[f(\theta)] \le \mathrm{KL}(\nu\,\|\,\nu') + \log\mathbb{E}_{\theta'\sim\nu'}\big[\exp(f(\theta'))\big].$$
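The Donsker-Varadhan principle of Lemma 37 is easy to verify numerically for finite-support distributions; the sketch below (a toy construction of our own) checks $\mathbb{E}_\nu[f] \le \mathrm{KL}(\nu\,\|\,\nu') + \log\mathbb{E}_{\nu'}[e^f]$ for randomly chosen $\nu$, $\nu'$, and $f$.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
nu = rng.random(m) + 1e-3
nu /= nu.sum()                   # distribution nu (strictly positive entries)
nup = rng.random(m) + 0.1
nup /= nup.sum()                 # distribution nu' (full support, so nu << nu')
f = rng.standard_normal(m)       # an arbitrary test function f

lhs = np.sum(nu * f)                         # E_nu[f]
kl = np.sum(nu * np.log(nu / nup))           # KL(nu || nu')
rhs = kl + np.log(np.sum(nup * np.exp(f)))   # KL + log E_nu'[e^f]
gap = rhs - lhs                              # nonnegative by Lemma 37
```

Equality holds only for the Gibbs-tilted choice $f = \log(\nu/\nu') + \text{const}$, so for a random $f$ the computed `gap` is strictly positive.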

$$\frac{d}{ds}u(s) + c_1 u(s) \le c_2 \;\Longrightarrow\; \frac{d}{ds}\big\{e^{c_1 s}u(s)\big\} \le c_2 e^{c_1 s} \;\Longrightarrow\; \big[e^{c_1 s}u(s)\big]_0^{\eta K} \le \int_0^{\eta K} c_2 e^{c_1 s}\,ds \;\Longrightarrow\; e^{c_1\eta K}u(\eta K) - u(0) \le \frac{c_2\big(e^{c_1\eta K} - 1\big)}{c_1} \;\Longrightarrow\; u(\eta K) \le e^{-c_1\eta K}u(0) + \frac{c_2}{c_1}\big(1 - e^{-c_1\eta K}\big).$$


(Near-optimality of Gibbs sampling). If the loss function $\ell(\theta; x)$ is $(\sigma^2\log(B)/4)$-bounded and $\beta$-smooth, and the regularizer is $r(\theta) = \frac{\lambda}{2}\|\theta\|_2^2$, then the excess empirical risk of a model $\tilde\Theta$ sampled from the Gibbs distribution $\pi(D) \propto e^{-L_D/\sigma^2}$ is $\mathrm{err}(\tilde\Theta; D) = \mathbb{E}\big[L_D(\tilde\Theta)\big] - L_D(\theta^*_D)$. We simplify the expected loss as

$$\mathbb{E}\big[L_D(\tilde\Theta)\big] = \int L_D\,\pi(D)\,d\theta = \sigma^2\big(H(\pi(D)) - \log\Lambda_D\big), \tag{132}$$

where

$$H(\pi(D)) = -\int\pi(D)\log\pi(D)\,d\theta = -\int\frac{e^{-L_D/\sigma^2}}{\Lambda_D}\log\frac{e^{-L_D/\sigma^2}}{\Lambda_D}\,d\theta \tag{133}$$

is the differential entropy of $\pi(D)$, and $\Lambda_D = \int e^{-L_D/\sigma^2}\,d\theta$ is the normalization constant. Since the potential function $L_D$ is $(\lambda+\beta)$-smooth, we have

$$-\sigma^2\log\Lambda_D = -\sigma^2\log\int e^{-L_D/\sigma^2}\,d\theta = L_D(\theta^*_D) - \sigma^2\log\int e^{(L_D(\theta^*_D) - L_D(\theta))/\sigma^2}\,d\theta \le L_D(\theta^*_D) - \sigma^2\log\int e^{-(\beta+\lambda)\|\theta - \theta^*_D\|_2^2/(2\sigma^2)}\,d\theta.$$

Since $\ell(\theta; x)$ is $(\sigma^2\log(B)/4)$-bounded, note that for the Gaussian distribution $\rho \sim \mathcal{N}\big(0, \frac{\sigma^2}{\lambda}I_d\big)$, the density ratio $\frac{\pi(D)(\theta)}{\rho(\theta)} \in \big[\frac{1}{\sqrt{B}}, \sqrt{B}\big]$ for all $\theta \in \mathbb{R}^d$. We decompose the entropy $H(\pi(D))$ into a cross-entropy and a KL divergence to get

$$H(\pi(D)) = -\int\pi(D)\log\rho\,d\theta - \mathrm{KL}(\pi(D)\,\|\,\rho).$$

) and all models in sequence ( Θi ) i≥0 output by (A Noisy-GD , ĀNoisy-GD , Q) on any D 0 ∈ X n , where Q is an r-requester, satisfy err( Θi ; D i ) and (135) are equalities.

(3.) Accuracy. Let $\theta^*_{D_i} = \arg\min_{\theta\in\mathbb{R}^d} L_{D_i}(\theta)$, and $\tilde\Theta_i \sim \pi(D_i)$. We decompose the excess empirical risk of Noisy-GD as follows:

$$\mathrm{err}(\hat\Theta_i; D_i) = \mathbb{E}\big[L_{D_i}(\hat\Theta_i) - L_{D_i}(\tilde\Theta_i)\big] + \mathbb{E}\big[L_{D_i}(\tilde\Theta_i) - L_{D_i}(\theta^*_{D_i})\big].$$

Symbol reference Data universe and Domain of all datasets of size n. ν, ν , π, µ Arbitrary distributions on O or on R d . Number of records replaced per edit and number of releases observable by a requester. U, U r Space of singular and batched replacement edits in [n] × X . u, u i , U i Arbitrary edit request, i th edit request in U r and its random variable. D, D i

). A twice continuously differentiable function $\ell(\theta; x)$ is convex if and only if, for all $\theta \in \mathbb{R}^d$ and $x \in \mathcal{X}$, its Hessian matrix $\nabla^2\ell(\theta; x)$ is positive semidefinite, i.e., $\nabla^2\ell(\theta; x) \succeq 0$, and is $\lambda$-strongly convex if its Hessian matrix satisfies $\nabla^2\ell(\theta; x) \succeq \lambda I_d$. Definition C.4 (Smoothness). A continuously differentiable function $\ell(\theta; x)$ is said to be $\beta$-smooth if, for all $\theta, \theta' \in \mathbb{R}^d$ and $x \in \mathcal{X}$, $\|\nabla\ell(\theta; x) - \nabla\ell(\theta'; x)\|_2 \le \beta\|\theta - \theta'\|_2$.

D 0 ) ≤ err(Θ 0 ; D 0 )e -ληKA/2 + κdσ 2

Note that KĀ≥ K A -2B λη log

Appendix

Here $\mu_{\eta k}$ and $\mu'_{\eta k}$ represent the $k$-th iterate parameter distributions of $\Theta_{\eta k}$ and $\Theta'_{\eta k}$ respectively. We interpolate the two discrete processes in two steps: (a) an identical transformation $T_k$ (as defined in (85)), and (b) a diffusion process. If the divergence before the descent step is $R_k = R_q(\mu_{\eta k}\,\|\,\mu'_{\eta k})$, the stochastic mapping $T_k$ in (a) doesn't increase the divergence, while the diffusion in (b) either increases it up to an asymptotic constant in phase I or decreases it exponentially to 0 in phase II.

Proof. Conditioned on observing the parameter $\Theta_{\eta k} = \theta_{\eta k}$, the process $(\Theta_t)_{\eta k < t \le \eta(k+1)}$ is a Langevin diffusion along a constant vector field (i.e., on conditioning, we get a Langevin SDE (63) with $\nabla L(\theta) = g_k(\theta_{\eta k})$ for all $\theta \in \mathbb{R}^d$). Therefore, as per (69), the conditional probability density $\mu_{t|\eta k}(\cdot|\theta_{\eta k})$ of $\Theta_t$ given $\Theta_{\eta k}$ follows the corresponding Fokker-Planck equation. Taking expectation over $\mu_{\eta k}$, which is the distribution of $\Theta_{\eta k}$, yields the stated equation, where $\mu_{\eta k|t}$ is the conditional density of $\Theta_{\eta k}$ given $\Theta_t$. The proof of the second Fokker-Planck equation is similar.

We provide an overview of how we bound equation (82) in Figure 1. Our analysis has two phases: in phase (I), we provide a bound on $R_q(\hat\Theta_{i-1}\,\|\,\hat\Theta'_{i-1})$ that holds for any choice of the iteration counts $K_A$ and $K_{\bar A}$, and in phase (II), we prove an exponential contraction of the divergence with the number of iterations $K_{\bar A}$. We first introduce a few lemmas that will be used in both phases. The first set of lemmas shows that the transformation $(\Theta_{\eta k}, \Theta'_{\eta k}) \to (T_k(\Theta_{\eta k}), T_k(\Theta'_{\eta k}))$ preserves the Rényi divergence. To prove this property, we show that $T_k$ is a differentiable bijective map in Lemma 28 and apply the following lemma from Vempala & Wibisono (2019).

Lemma 27 (Vempala & Wibisono (2019, Lemma 15)). If $T : \mathbb{R}^d \to \mathbb{R}^d$ is a differentiable bijective map, then for any random variables $\Theta, \Theta' \in \mathbb{R}^d$ and for all $q > 0$, $R_q(T(\Theta)\,\|\,T(\Theta')) = R_q(\Theta\,\|\,\Theta')$. (89)

