FROM ADAPTIVE QUERY RELEASE TO MACHINE UN-LEARNING

Abstract

We formalize the problem of machine unlearning as the design of efficient unlearning algorithms corresponding to learning algorithms that perform a selection of adaptive queries from structured query classes. We give efficient unlearning algorithms for linear and prefix-sum query classes. As applications, we show that unlearning in many problems, in particular stochastic convex optimization (SCO), can be reduced to the above, yielding improved guarantees for the problem. In particular, for smooth Lipschitz losses and any $\rho > 0$, our results yield an unlearning algorithm with excess population risk of $O\left(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\rho}\right)$ and unlearning query (gradient) complexity $O(\rho \cdot \text{Retraining Complexity})$, where $d$ is the model dimensionality and $n$ is the initial number of samples. For non-smooth Lipschitz losses, we give an unlearning algorithm with excess population risk $O\left(\frac{1}{\sqrt{n}} + \left(\frac{\sqrt{d}}{n\rho}\right)^{1/2}\right)$ with the same unlearning query (gradient) complexity. Furthermore, in the special case of Generalized Linear Models (GLMs), such as those in linear and logistic regression, we get dimension-independent rates of $O\left(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{2/3}}\right)$ and $O\left(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{1/3}}\right)$ for smooth Lipschitz and non-smooth Lipschitz losses respectively. Finally, we give generalizations of the above from one unlearning request to dynamic streams consisting of insertions and deletions.

1. INTRODUCTION

The problem of machine unlearning is concerned with updating trained machine learning models upon requests to delete points from the training dataset. This problem has recently gained attention owing to various data privacy laws, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), among others, which empower users to make such requests to the entity possessing user data. The entity is then required to update the state of the system such that it is indistinguishable from the state had the user data been absent to begin with. While there is, as of now, no universally accepted definition of indistinguishability as the unlearning criterion, in this work we consider the strictest definition, called exact unlearning (see Definition 1 for a formal definition).

Motivating Example:

The main objective of our work is to identify algorithmic design principles for unlearning such that it is more efficient than retraining, the naive baseline method. Towards this, we first discuss the example of unlearning for the Gradient Descent (GD) method, which highlights the key challenges and foreshadows the formal setup and techniques. GD and its variants are extremely popular optimization methods with numerous applications in machine learning and beyond. In a machine learning context, GD is typically used to minimize the training loss $L(w; S) = \frac{1}{n}\sum_{i=1}^{n} \ell(w; z_i)$, where $S = \{z_i\}_{i=1}^{n}$ is the training dataset and $w$ the model. Starting from an initial model $w_1$, in each iteration the model is updated as $w_{t+1} = w_t - \eta \nabla L(w_t; S) = w_t - \eta \frac{1}{n}\sum_{i=1}^{n} \nabla \ell(w_t; z_i)$. After training, a data point, say $z_n$ without loss of generality, is requested to be unlearnt, so the updated training set is $S' = \{z_i\}_{i=1}^{n-1}$. We now need to apply an efficient unlearning algorithm whose output equals that of running GD on $S'$. Observe that the first iteration of GD is simple enough to be unlearnt efficiently: compute the new gradient $\nabla L(w_1; S') = \frac{1}{n-1}\left(n \nabla L(w_1; S) - \nabla \ell(w_1; z_n)\right)$ and update as $w'_2 = w_1 - \eta \nabla L(w_1; S')$. However, in the second iteration (and onwards), the gradient is computed at $w'_2$, which can differ from $w_2$, so the above adjustment no longer applies and one may need to retrain from there onwards. This captures the key challenge for unlearning in problems solved by simple iterative procedures such as GD: adaptivity, i.e., the gradients (or more generally, the queries) computed in later iterations depend on the results of the previous iterations. We systematically formalize such procedures and design efficient unlearning algorithms for them. We summarize our key contributions below.
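The first-iteration adjustment in the motivating example can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the one-dimensional squared loss and the helper names `grad`, `avg_grad` and `unlearn_first_step` are assumptions made here for concreteness.

```python
def grad(w, z):
    """Gradient of the illustrative loss l(w; z) = 0.5 * (w - z)**2 (an assumption)."""
    return w - z


def avg_grad(w, data):
    """Average gradient over the dataset: the linear query used by GD."""
    return sum(grad(w, z) for z in data) / len(data)


def unlearn_first_step(w1, eta, full_avg_grad, n, z_deleted):
    """Recover the first GD iterate on S minus {z_deleted} with one gradient query.

    Uses grad L(w1; S') = (n * grad L(w1; S) - grad l(w1; z_deleted)) / (n - 1).
    """
    new_avg = (n * full_avg_grad - grad(w1, z_deleted)) / (n - 1)
    return w1 - eta * new_avg


S = [1.0, 2.0, 4.0, 9.0]
w1, eta = 0.0, 0.5
g_full = avg_grad(w1, S)  # assumed to be stored during training

# One gradient evaluation instead of n - 1.
w2_unlearned = unlearn_first_step(w1, eta, g_full, len(S), S[-1])
# Naive baseline: recompute the first step on the reduced dataset.
w2_retrained = w1 - eta * avg_grad(w1, S[:-1])
```

The same trick does not carry over to the second step, because the stored average gradient was evaluated at $w_2$ rather than at $w'_2$; this is the adaptivity obstacle described above.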

1.1. OUR RESULTS AND TECHNIQUES

Learning/Unlearning as Query Release: Iterative procedures are an integral constituent of the algorithmic toolkit for solving machine learning problems and beyond. As in the case of GD above, these often consist of a sequence of simple but adaptive computations. The simple computations are often efficiently undoable (as in the first iteration of GD), but their adaptive nature, where a change in the result of one iteration changes the trajectory of the algorithm, makes it difficult to undo computation, or unlearn, efficiently. As opposed to designing unlearning (and learning) algorithms for specific (machine learning) problems, we study the design of unlearning algorithms corresponding to (a class of) learning algorithms. We formalize this by considering learning algorithms which perform adaptive query release on datasets. Specifically, this consists of a selection of adaptive queries from structured classes like linear and prefix-sum queries (see Section 3 for details). The above example of GD is an instance of a linear query, since the query, which is the average gradient $\frac{1}{n}\sum_{i=1}^{n} \nabla \ell(w_t; z_i)$, is linear in the data points. With this view, we study how to design efficient unlearning algorithms for such methods. We use efficiency in the sense of the number of queries made (query complexity), ignoring the use of other resources, e.g., space, computation for selection of queries, etc. To elaborate on why this is interesting, first note that this does not make the problem trivial: even with unlimited access to other resources, it is still challenging to design an unlearning algorithm with query complexity smaller than that of retraining (the naive baseline). Secondly, let us revisit the motivation from solving optimization problems. The standard model to measure computation in optimization is the number of gradient queries a method makes for a target accuracy, often abstracted in an oracle-based setup (Nemirovskij & Yudin, 1983).
Importantly, this setup imposes no constraints on other resources, yet it witnesses the optimality of well-known simple procedures like (variants of) GD. We follow this paradigm, and as applications of our results to stochastic convex optimization (SCO), we make progress on the fundamental question of understanding the gradient complexity of unlearning in SCO. Interestingly, our proposed unlearning procedures are simple enough that the improvement over retraining in terms of query complexity also holds when accounting for the (arithmetic) complexity of all other operations in the learning and unlearning methods.

Linear queries: The simplest query class we consider is that of linear queries (details deferred to Appendix B). Herein, we show that the prior work of Ullah et al. (2021), which focused on unlearning in SCO and was limited to the stochastic gradient method, can be easily extended to general linear queries. This simple observation yields unlearning algorithms for federated optimization, k-means clustering, etc. Herein, we give a $\rho$-TV stable (see Definition 2) learning procedure with $T$ adaptive queries and a corresponding unlearning procedure with $O(\sqrt{T}\rho)$ relative unlearning complexity (the ratio of unlearning and retraining complexity; see Definition 4).

Prefix-sum queries: Our main contribution is the case of prefix-sum queries. These are a sub-class of interval queries, which have been extensively studied in differential privacy and are classically solved by the binary tree mechanism (Dwork et al., 2010). We note in passing that for differential privacy, the purpose of the tree is to enable tight privacy accounting, and no explicit tree need be maintained. In contrast, for unlearning, we show that maintaining the binary tree data structure aids efficient unlearning.
We give a binary-tree based $\rho$-TV stable learning procedure and a corresponding unlearning procedure with $O(\rho)$ relative unlearning complexity.

Unlearning in Stochastic Convex Optimization (SCO): Our primary motivation for considering prefix-sum queries is their application to unlearning in SCO (see Section 2 for preliminaries).

Table 1: Summary of results.

Problem | Algorithm | Excess population risk
Smooth, Lipschitz SCO | Variance Reduced Frank-Wolfe | $\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\rho}$
Lipschitz SCO | Dual Averaging | $\frac{1}{\sqrt{n}} + \frac{d^{1/4}}{\sqrt{n\rho}}$
Smooth, Lipschitz GLM | JL + Variance Reduced Frank-Wolfe | $\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{2/3}}$
Lipschitz GLM | JL + Dual Averaging | $\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{1/3}}$

1. Smooth SCO: The problem of unlearning in smooth SCO was studied in Ullah et al. (2021), which proposed algorithms with excess population risk of $O\left(\frac{1}{\sqrt{n}} + \left(\frac{\sqrt{d}}{n\rho}\right)^{2/3}\right)$, where $\rho$ is the relative unlearning complexity. We show that using a variant of variance-reduced Frank-Wolfe (Zhang et al., 2020), which uses prefix-sum queries, yields an improved excess population risk of $O\left(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\rho}\right)$.

2. Non-smooth SCO: In the non-smooth setting, both algorithms proposed in Ullah et al. (2021) have trivial guarantees. Instead, we give an algorithm based on Dual Averaging (Nesterov, 2009), which again uses prefix-sum query access and thus fits into the framework. This algorithm gives an excess population risk of $O\left(\frac{1}{\sqrt{n}} + \frac{d^{1/4}}{\sqrt{n\rho}}\right)$.

3. Generalized Linear Models (GLM): Generalized linear models are among the most basic machine learning problems; they include the squared loss (in linear regression), the logistic loss (in logistic regression), the hinge loss (in support vector machines), etc. We study unlearning in two classes of GLMs (see below), for which we combine recently proposed techniques based on dimensionality reduction (Arora et al., 2022) with the above prefix-sum query algorithms to get the following dimension-independent rates.

3(a). Smooth GLM: For the smooth GLM setting, we combine the Johnson-Lindenstrauss transform with variance-reduced Frank-Wolfe to get $O\left(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{2/3}}\right)$ excess population risk. Note that we incur no overhead in the statistical rate even with very small relative unlearning complexity, $\rho \approx n^{-1/4}$. This class of smooth GLMs contains the well-studied problem of logistic regression. Hence, our result demonstrates that it is possible to unlearn logistic regression with sub-linear (specifically, $O(n^{3/4})$) unlearning complexity with no sacrifice in the statistical rate.

3(b). Lipschitz GLM: Similarly, for the Lipschitz convex GLM setting, we combine the Johnson-Lindenstrauss transform with dual averaging, yielding a rate of $O\left(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{1/3}}\right)$. Please see Table 1 for a summary of the above results.

SCO in dynamic streams: Finally, we consider SCO in dynamic streams (details deferred to Appendix F), where we observe a sequence of insertions and deletions and are supposed to produce an output after each time-point. In this case, we present two methods: one which satisfies the exact unlearning guarantee with worse update time, and another which satisfies weak unlearning (see Definition 9) with improved update time. The exact unlearning method is inspired by the work of Ullah et al. (2021), which treated insertions in the same way as deletions. The weak unlearning method is motivated by the observation that this may be too pessimistic. To elaborate, inserting a new data item does not warrant an (unlearning) guarantee that the algorithm's state be indistinguishable from the state had the point not been inserted. Hence, insertions should require smaller (ideally, constant) update time, which is indeed the case for our proposed methods.

1.2. RELATED WORK

Our work is a direct follow-up of Ullah et al. (2021), which proposed the framework of Total Variation (TV) stability and maximal coupling for the exact machine unlearning problem.
They applied this to unlearning in smooth stochastic convex optimization (SCO) and obtained a guarantee of $\frac{1}{\sqrt{n}} + \left(\frac{\sqrt{d}}{n\rho}\right)^{2/3}$ on excess population risk, where $n$ is the number of data samples, $d$ the model dimensionality and $\rho$ the relative unlearning complexity (see Definition 4). We improve upon the results in that work in multiple ways, as described in the preceding section. Furthermore, the exact unlearning problem has been studied for k-means clustering (Ginart et al., 2019) and random forests (Brophy & Lowd, 2021). The work of Bourtoule et al. (2021) proposes a general methodology for exact unlearning for deep learning methods but does not provide rigorous theoretical guarantees on accuracy, even in simple settings. Besides this, there are works which consider unlearning in SCO; however, they use an approximate notion of unlearning inspired by differential privacy (Guo et al., 2019; Neel et al., 2021; Sekhari et al., 2021; Gupta et al., 2021), and are therefore incomparable to our work.

2. PROBLEM SETUP AND PRELIMINARIES

Let $\mathcal{Z}$ be the data space, $\mathcal{W}$ the model space and $\mathcal{M}$ the meta-data space, where meta-data is additional information a learning algorithm may save to aid unlearning. We consider a learning algorithm as a map $A : \mathcal{Z}^* \to \mathcal{W} \times \mathcal{M}$ and an unlearning algorithm as a map $U : \mathcal{W} \times \mathcal{M} \times \mathcal{Z} \to \mathcal{W} \times \mathcal{M}$. We use $\bar{A}$ and $\bar{U}$ to denote the first output (which belongs to $\mathcal{W}$) of $A$ and $U$ respectively. We recall the definition of exact unlearning from Ullah et al. (2021), which requires that the entire state after unlearning be indistinguishable from the state obtained if the learning algorithm were applied to the dataset without the deleted point.

Definition 1 (Exact unlearning). A procedure $(A, U)$ satisfies exact unlearning if for all datasets $S$, all $z \in \mathcal{Z}$, and all events $E \subseteq \mathcal{W} \times \mathcal{M}$, we have $\mathbb{P}\left(A(S \setminus \{z\}) \in E\right) = \mathbb{P}\left(U(A(S), z) \in E\right)$.

Unlearning request: We consider the setting where we start with a dataset of $n$ samples and observe one unlearning request.
We assume that the choice of unlearning request is oblivious to the learning process. In Appendix F, we generalize our result to the streaming setting.

Total Variation stability, maximal coupling and efficient unlearning: The Total Variation (TV) distance between two probability distributions $P$ and $Q$ is $\mathrm{TV}(P, Q) = \sup_{\text{measurable } E} |P(E) - Q(E)|$. Next, we define Total Variation (TV) stability, which was introduced in Ullah et al. (2021) to motivate algorithmic techniques for efficient unlearning.

Definition 2. An algorithm $A$ is said to be $\rho$ Total Variation stable if for all datasets $S$ and $S'$ differing in one point, i.e. $|S \Delta S'| = 1$, the total variation distance satisfies $\mathrm{TV}(A(S), A(S')) \le \rho$.

Given two distributions $P$ and $Q$, a coupling is a joint distribution $\pi$ with marginals $P$ and $Q$. Furthermore, a maximal coupling is a coupling $\pi$ such that the disagreement probability $\mathbb{P}_{(x,y) \sim \pi}\{x \neq y\} = \mathrm{TV}(P, Q)$. In the unlearning context, $P = A(S)$, the output on the initial dataset, and $Q = A(S')$, the output on the updated dataset. Hence, the unlearning problem becomes one of transporting $P$ to $Q$ at small computational cost, akin to optimal transport (Villani, 2009). Furthermore, when sampling from a maximal coupling between $P$ and $Q$ for a $\rho$-TV stable algorithm, we get the same sample for both $P$ and $Q$ except with probability at most $\rho$, while still satisfying the exact unlearning criterion. The main idea is that for certain learning algorithms of interest, during unlearning, we can efficiently construct a (near) maximal coupling of $P$ and $Q$, so that the same model output from $P$ suffices for $Q$ most of the time. In particular, the fraction of times that we need to change the model is the TV-stability parameter $\rho$ of the learning algorithm. The goal, therefore, is to design an (accurate) TV-stable learning algorithm and a corresponding efficient coupling-based unlearning algorithm. In this work, we use the technique of reflection coupling described below.
Reflection Coupling (Lindvall & Rogers, 1986): Reflection coupling is a classical technique in probability to maximally couple symmetric probability distributions. Consider two probability distributions $P$ and $Q$ with means $u$ and $u'$, and let $r$ be a sample from $P$. The process involves a rejection sampling step on the two distributions and the sample $r$ (see line 12 in Algorithm 3). If it results in accept, we use the same $r$ as the sample from $Q$; otherwise, we apply the map $\mathrm{Reflect}(u, u', r) = u - u' + r$, which gives the sample from $Q$ (see line 15 in Algorithm 3). Our algorithmic techniques borrow tools from differential privacy (Dwork et al., 2014), such as its relationship with Total Variation stability; we describe these in Appendix A.

Stochastic Convex Optimization (SCO): SCO is the dominant framework for computationally efficient machine learning. Consider a convex (constraint) set $\mathcal{W}$ and let $D$ denote its diameter. Let $\ell : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be a loss function which is convex in its first parameter for all $z \in \mathcal{Z}$. Given $n$ i.i.d. points from an unknown probability distribution $\mathcal{D}$ over $\mathcal{Z}$, the goal is to devise an algorithm whose output has small population risk, defined as $L(w; \mathcal{D}) := \mathbb{E}_{z \sim \mathcal{D}}\, \ell(w; z)$. The excess population risk is then $L(w; \mathcal{D}) - L(w^*; \mathcal{D})$, where $w^*$ denotes a population risk minimizer over $\mathcal{W}$.

Generalized Linear Models (GLM): Generalized Linear Models (GLMs) are loss functions typically encountered in supervised learning problems, like linear and logistic regression. Herein, $\ell(w; (x, y)) = \phi_y(\langle w, x \rangle)$, where $\phi_y : \mathbb{R} \to \mathbb{R}$ is some function. We use $\|\mathcal{X}\|$ to denote the radius bound on data points, i.e. for $x \in \mathcal{X} \subseteq \mathbb{R}^d$, $\|x\| \le \|\mathcal{X}\|$. In this case, we consider the unconstrained setup, i.e. $\mathcal{W} = \mathbb{R}^d$, as it allows us to get dimension-independent rates for GLMs, similar to what happens under differential privacy (Jain & Thakurta, 2014; Arora et al., 2022). We introduce the Johnson-Lindenstrauss property below, which is crucial to our construction.
Definition 3 (Johnson-Lindenstrauss property). A random matrix $\Phi \in \mathbb{R}^{k \times d}$ satisfies the $(\beta, \gamma)$-JL property if for any $u, v \in \mathbb{R}^d$, $\mathbb{P}\left(|\langle \Phi u, \Phi v \rangle - \langle u, v \rangle| \ge \beta \|u\| \|v\|\right) \le \gamma$. There exist many efficient constructions of such random matrices (Nelson, 2011).

3. UNLEARNING FOR ADAPTIVE QUERY RELEASE

We now set up the framework of adaptive query release, a lens through which to view (existing) iterative learning procedures; this view is useful in our design of corresponding unlearning algorithms. Iterative procedures run on datasets consist of a sequence of interactions with the dataset; each interaction computes a certain function, or query, on the dataset. The chosen query is typically adaptive, i.e., dependent on the prior query outputs. We consider iterative learning procedures which are composed of adaptive queries from a specified query class. Formally, consider a query class $\mathcal{Q} \subseteq \mathcal{W}^{\mathcal{W}^* \times \mathcal{Z}^*}$; herein, each query in $\mathcal{Q}$ is a function of a sequence $\{w_i\}_{i < t}$ (typically, prior query outputs) and the dataset $S$, with output in $\mathcal{W}$. With this view, we give a general template of a learning procedure as Algorithm 1, where $\{U_t\}_t$ are the update functions and $\mathcal{S}$ is the selector function internal to the algorithm.

Query Model: We describe the query model which we use to measure computational complexity. Under the model, a query function $q(\{w_i\}_i, S)$ takes $|S|$ unit computations (or queries, for brevity) for any $q$ and $\{w_i\}_i$. In our applications to SCO, this corresponds to the gradient oracle complexity.

Our algorithmic approach to unlearning is rooted in the relationship between TV stability and maximal couplings. With this view, for a specified query class, we have the following requirements.

1. TV-stability: We want a $\rho$-TV stable "modification" (clarified later) of the learning Algorithm 1.

2. Efficient unlearning algorithm: We measure efficiency as the average number of queries the unlearning algorithm makes relative to the learning algorithm (retraining), defined as follows.

Definition 4 (Relative Unlearning Complexity). The Relative Unlearning Complexity is defined as $\frac{\mathbb{E}_{(A, U)}\left[\text{Query complexity of unlearning algorithm } U\right]}{\mathbb{E}_{A}\left[\text{Query complexity of learning algorithm } A\right]}$.

For a $\rho$-TV stable learning algorithm, we want the relative unlearning complexity to be (close to) $\rho$. This is because our proposed unlearning algorithm constructs a (near) maximal coupling of the learning algorithm's output under the original and updated datasets. This means that the unlearning algorithm changes the original output (under the original dataset) with probability at most $\rho$; in this case, we want the unlearning algorithm to make a number of queries akin to retraining. In the other case, when it does not change the output, we want it to make a small (ideally, constant) number of queries. We note that relative unlearning complexity, in itself, does not completely capture whether the unlearning algorithm is good, since the corresponding learning algorithm may be computationally more expensive than other existing methods. However, in our applications to SCO (Section 5), all the learning algorithms are linear time, so the denominator in the definition above is as small as it can be (asymptotically), i.e. $\Theta(n)$.

3. Accuracy: We will primarily be concerned with the correctness of the unlearning algorithm and its efficiency. In the applications (Section 5), we give accuracy guarantees for specific problems, where we will see (roughly) that if the learning algorithm is noise-tolerant, then the solutions produced by the proposed TV-stable modified algorithm are still accurate.
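As a rough worked bound illustrating the efficiency requirement (a sketch under stated assumptions, not a statement from the paper): write $C_A$ for the expected query complexity of retraining and $c$ for the number of queries spent when the coupling keeps the original output. Both symbols are introduced here purely for illustration, and we assume the cost is at most $C_A$ when the output changes.

```latex
\mathbb{E}\left[\text{queries of } U\right]
  \;\le\; \rho \cdot C_A + (1 - \rho) \cdot c
\quad\Longrightarrow\quad
\frac{\mathbb{E}\left[\text{queries of } U\right]}{C_A}
  \;\le\; \rho + \frac{c}{C_A}.
```

Under these assumptions the relative unlearning complexity is at most $2\rho$ whenever $c / C_A \le \rho$, which is why a small cost in the unchanged case matters.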

4. PREFIX SUM QUERIES

We now consider prefix-sum queries, which are the main contribution of this work. The reason for considering this query class is that two powerful algorithms for SCO, dual averaging and recursive variance reduction, fit into this template. We start by defining a prefix-sum query.

Definition 5. A set of queries $\{q_t\}_{t \ge 1}$, where $q_t : \mathcal{W}^t \times \mathcal{Z}^n \to \mathcal{W}$, are called prefix-sum queries if $q_1(w_1, S) = p_1(w_1, z_1)$ and for all $t > 1$, $q_t(\{w_i\}_{i \le t}, S) = q_{t-1}(\{w_i\}_{i < t}, S) + p_t(\{w_i\}_{i \le t}, z_t)$ for some functions $\{p_t\}_{t \ge 1}$, where $p_t : \mathcal{W}^* \times \mathcal{Z} \to \mathcal{W}$.

Simply put, prefix-sum queries sequentially query new data points and add the results to the previously accumulated query. A simple example is computing partial sums of data points $(z_1, z_1 + z_2, \ldots)$. Note that in the above definition, we can equivalently represent the prefix-sum queries using the sequence $\{p_t\}_t$. We also assume that the queries have bounded sensitivity, defined as follows.

Definition 6. A query $q : \mathcal{W}^* \times \mathcal{Z}^n \to \mathcal{W}$ is $B$-sensitive if $\sup_{\{w_i\}_i} \sup_{S, S' : |S \Delta S'| = 1} \left\| q(\{w_i\}_i, S) - q(\{w_i\}_i, S') \right\| \le B$.

We note that the bounded sensitivity condition is satisfied in a variety of applications; see Section 5.
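To make Definition 5 concrete, the following minimal sketch answers a sequence of adaptive prefix-sum queries without any noise. The function name `prefix_sum_responses` and the scalar setting are assumptions for illustration; the paper's learning procedure additionally perturbs the responses.

```python
def prefix_sum_responses(p, update, w1, data):
    """Answer adaptive prefix-sum queries q_t = q_{t-1} + p(w_1..w_t, z_t).

    p(ws, z): per-point term computed from the models so far and one new point.
    update(ws, q): next model, computed from prior models and the latest response.
    """
    ws, q, responses = [w1], 0.0, []
    for z in data:
        q = q + p(ws, z)           # each query touches exactly one new data point
        responses.append(q)
        ws.append(update(ws, q))   # adaptivity: later queries depend on q
    return responses, ws


# Non-adaptive special case: partial sums z_1, z_1 + z_2, ...
partial_sums, _ = prefix_sum_responses(
    lambda ws, z: z, lambda ws, q: ws[-1], 0.0, [1.0, 2.0, 3.0]
)

# Adaptive case: p is a gradient at the latest model, and the next model is
# computed from the accumulated gradient sum.
eta = 0.1
grad_sums, models = prefix_sum_responses(
    lambda ws, z: ws[-1] - z,      # gradient of 0.5 * (w - z)**2
    lambda ws, q: ws[0] - eta * q,
    0.0,
    [1.0, 2.0, 3.0],
)
```

In the adaptive case, changing one early data point changes every later model, which is what makes unlearning non-trivial.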

4.1. LEARNING WITH BINARY TREE DATA-STRUCTURE:

The learning algorithm, given as Algorithm 2, is based on answering the adaptive prefix-sum queries with the binary tree mechanism (Dwork et al., 2010). For $n$ samples (assume $n$ is a power of two; otherwise we can append dummy "zero" samples without any change in asymptotic complexity), the binary tree mechanism constructs a complete binary tree $\mathcal{T}$ with the leaf nodes corresponding to the data samples. The key idea in the binary tree mechanism is that instead of adding fresh independent noise to each prefix-sum query, it is better to add correlated noise, where the correlation structure is described by a binary tree. For example, suppose we want to release the seventh prefix-sum query, $\sum_{i=1}^{7} p_i(\{w_j\}_{j \le i}, z_i)$. Consider the dyadic decomposition of 7 as 4, 2 and 1, and release the sum $\left(\sum_{i=1}^{4} p_i(\{w_j\}_{j \le i}, z_i) + \xi_1\right) + \left(\sum_{i=5}^{6} p_i(\{w_j\}_{j \le i}, z_i) + \xi_2\right) + \left(p_7(\{w_j\}_{j \le 7}, z_7) + \xi_3\right)$, where the $\xi_i$'s denote the added noise, which may also have been used in prior prefix-sum query responses. See Figure 1 for a simplified description of the process.

We index the nodes of the tree using binary numbers $\mathcal{B} = \{0, 1\}^{\log(n)}$, which describe the path from the root. Let the tree be $\mathcal{T} = \{v_b\}_{b \in \mathcal{B}}$, which denotes the contents stored by the learning algorithm. Herein, each node contains the tuple $(u, r, w, z)$, where $u \in \mathbb{R}^d$ is the query response, $r \in \mathbb{R}^d$ is the noisy response, $w \in \mathbb{R}^d$ a model and $z \in \mathcal{Z}$ a data point. In fact, only the leaf nodes store the model and data sample. Finally, define $\mathrm{leaf} : [n] \to \{0, 1\}^{\log(n)}$, which gives the binary representation of the input leaf node. This binary tree data structure supports the following operations:

1. Append($u$, $\sigma$; $\mathcal{T}$): Add a query response $u$, perturbed by noise of variance $\sigma^2$, to $\mathcal{T}$, which involves adding the noisy $u$ to $u_v$ for each $v$ in the path from leaf to root.
2. GetPrefixSum($t$; $\mathcal{T}$), where $t \in \mathbb{N}$: Get the $t$-th noisy response from $\mathcal{T}$.
3. Get($b$; $\mathcal{T}$), where $b \in \{0, 1\}^{\log(n)}$: Get all items in the vertex of $\mathcal{T}$ indexed by $b$.
4. Set($b$, $v$; $\mathcal{T}$), where $b \in \{0, 1\}^{\log(n)}$: Set the contents of vertex $b$ in $\mathcal{T}$ to $v$.

Following Guha Thakurta & Smith (2013), we give pseudo-code for the above operations in Appendix C, with minor modifications to aid the unlearning process.
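The dyadic-decomposition idea behind Append and GetPrefixSum can be sketched as follows. This is a simplified scalar illustration and not the pseudo-code of Appendix C: the class name `NoisyPrefixTree`, the level/index node layout and the choice to pre-sample one Gaussian per node are assumptions made here.

```python
import random


class NoisyPrefixTree:
    """Binary tree mechanism for scalar prefix sums (illustrative sketch)."""

    def __init__(self, n, sigma, seed=0):
        assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
        self.n = n
        self.levels = n.bit_length()  # level 0 = leaves, top level = root
        rng = random.Random(seed)
        # exact[h][i] holds the sum of leaves i * 2**h .. (i + 1) * 2**h - 1.
        self.exact = [[0.0] * (n >> h) for h in range(self.levels)]
        # One noise draw per node, reused by every prefix sum that uses the node.
        self.noise = [[rng.gauss(0.0, sigma) for _ in range(n >> h)]
                      for h in range(self.levels)]
        self.count = 0

    def append(self, u):
        """Add response u to every node on the path from the next leaf to the root."""
        assert self.count < self.n
        for h in range(self.levels):
            self.exact[h][self.count >> h] += u
        self.count += 1

    def prefix_sum(self, t):
        """Noisy sum of the first t responses, using one node per set bit of t."""
        assert 0 < t <= self.count
        total, start = 0.0, 0
        for h in reversed(range(self.levels)):
            if t & (1 << h):
                i = start >> h
                total += self.exact[h][i] + self.noise[h][i]
                start += 1 << h
        return total


tree = NoisyPrefixTree(8, sigma=1.0)
for u in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]:
    tree.append(u)
seventh = tree.prefix_sum(7)  # combines nodes covering leaves 1-4, 5-6 and 7
```

Each prefix sum touches at most $\log_2(n) + 1$ noisy nodes, and each data point influences only the nodes on one leaf-to-root path, which is the property the unlearning procedure exploits.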

4.2. UNLEARNING BY MAXIMALLY COUPLING BINARY TREES

The unlearning Algorithm 3 is based on constructing a (near) maximal coupling of the binary trees under the current and updated datasets. Let $z_j$ be the element to be deleted and let $v_s$ be the leaf node which contains $z_j$ (we use $z$ in place of $z_j$ from here on, for simplicity). During unlearning, we simulate (roughly speaking) the dynamics of the learning algorithm had the deleted point not been present to begin with. In that case, some other point would have been used in place of the deleted point. Since the dataset was randomly permuted, every point is equally likely to have been used, and thus we can use the point $z'$ in the last leaf node, say $v_l$, of the tree; this choice of the last point is important for unlearning efficiency. Firstly, the computations associated with the last point $z'$ need to be undone. Towards this, we update the contents of the nodes in the path from node $v_l$ to the root (line 5), finally removing node $v_l$ from the tree (line 6). Then, we need to replace all the computations which used the deleted point $z$ with the same computations under $z'$. Since the learning algorithm is based on the binary tree mechanism, the point $z$ was only explicitly used in the nodes from leaf $v_s$ to the root (so, at most $\log(n)$ nodes). We say explicitly because, due to the adaptive nature of the process, in principle all nodes after $v_s$ depend on it, in the sense that their contents would change if the response in $v_s$ were to change. Importantly, however, the binary tree structure of our learning algorithm and our coupling technique (details below) enable us to (mostly) only care about explicit computations.

We first compute two new queries, under the data points $z$ and $z'$, with responses $g = p_j(\{w_q\}_{q \le s}, z)$ and $g' = p_j(\{w_q\}_{q \le s}, z')$ respectively (line 3). Starting with leaf node $v_s$, we update the original unperturbed prefix-sum query response under $z$, i.e. $u$, to what it would have been under data point $z'$: $u' = u - g + g'$ (line 11). Further, since the training method adds noise $N(0, \sigma^2 I)$ to $u$ to produce the original noisy response $r$, we now need to produce a sample from $N(u', \sigma^2 I)$ to satisfy exact unlearning. Naively, we could simply draw a fresh independent sample from $N(u', \sigma^2 I)$; however, this would change the noisy response $r$ and hence require all subsequent computations to be redone (the adaptive nature). So, ideally, we want to reuse the same $r$ and yet generate a sample from $N(u', \sigma^2 I)$. This is precisely the problem of constructing a maximal coupling, discussed in Section 2, where we also discussed the method of reflection coupling to do it. This amounts to a rejection sampling step which (roughly) ascertains whether the response $r$ is still sufficient under the new distribution $N(u', \sigma^2 I)$. Specifically, we compute the ratio of the probability densities at $r$ under the noise added to $u$ and $u'$, i.e. $\frac{\phi_{N(u, \sigma^2 I)}(r)}{\phi_{N(u', \sigma^2 I)}(r)}$, and compare it against a randomly sampled $\mathrm{Unif}(0, 1)$; if it results in accept, we move to the parent of the node $v_s$ and repeat. If any step fails, we reflect, which generates a different noisy response $r'$, and continue retraining from the next leaf with respect to the post-order traversal of the tree (the variable ct in Algorithm 3 keeps track of this next node). See Figure 1 for a simplified description of the process.

Figure 1: A simplified schematic of the learning (left) and unlearning (right) procedures for prefix-sum queries. On the left, the leaves contain (noisy, if $+\xi_i$) prefix-sum queries applied to the randomly permuted data points ($z_i$'s) below them. The intermediate nodes marked with $+$ add the un-noised values of their children, whereas the others also add noise. On the right, the deleted point $z_4$ is replaced with $z_8$, which amounts to adjusting the queries with $-g + g'$ (see Algorithm 3 for details) and performing Rejection Sampling (abbreviated RS$_i$, where $i$ indicates the order of occurrence) along the height of the tree.

Algorithm 2 TreeLearn($t_0$; $\mathcal{T}$)
Input: Dataset $S$, steps $T$, $B$-sensitive prefix queries $\{p_t\}_{t \le T}$, update functions $\{U_t\}_{t \le T}$, $\sigma$
1: if $t_0 = 1$ then permute dataset and initialize $\mathcal{T}$ end if
2: $(\cdot, \cdot, w_{t_0}, \cdot)$ = Get(leaf($t_0$); $\mathcal{T}$)
3: for $t = t_0$ to $|S| - 1$ do
4:   $u_t = p_t(\{w_i\}_{i \le t}, z_t)$
5:   Append($u_t$, $\sigma$; $\mathcal{T}$)
6:   $r_t$ = GetPrefixSum($t$; $\mathcal{T}$)
7:   $w_{t+1} = U_t(\{w_i\}_{i \le t}, r_t)$
8:   Set(leaf($t$), $(u_t, r_t, w_t, z_t)$; $\mathcal{T}$)
9: end for
Output: $\hat{w} = \mathcal{S}(\{w_t\})$

Algorithm 3 TreeUnlearn
Input: $z_j$: data point to be deleted; $\mathcal{T}$: internal tree data structure saved during learning
1: $s$ = leaf($j$) and $l$ = leaf($|S|$)
2: $(\cdot, \cdot, w, z)$ = Get($s$; $\mathcal{T}$) and $(\cdot, \cdot, \cdot, z')$ = Get($l$; $\mathcal{T}$)
3: $g = p_j(\{w_q\}_{q \le s}, z)$ and $g' = p_j(\{w_q\}_{q \le s}, z')$
4: Let path = $\{l \to \cdots \to \text{root}\}$
...
10: $(u, r, \cdot, \cdot)$ = Get($b$; $\mathcal{T}$)
11: $u' = u - g + g'$
12: if $\mathrm{Unif}(0, 1) \le \frac{\phi_{N(u, \sigma^2 I)}(r)}{\phi_{N(u', \sigma^2 I)}(r)}$ then
13:   if $b = s$ then Set($b$, $(u', r, w, z')$; $\mathcal{T}$) else Set($b$, $(u', r, \emptyset, \emptyset)$; $\mathcal{T}$) end if
14: else
15:   $r'$ = Reflect($u$, $u'$, $r$)
16:   if $b = s$ then
17:     Set($b$, $(u', r', \emptyset, z')$; $\mathcal{T}$)
18:     $w' = U_j(\{w_q\}_{q \le b}, \text{GetPrefixSum}(j; \mathcal{T}))$
19:     Set($b$, $(u', r', w', z')$; $\mathcal{T}$)
...

The main result of this section is as follows.

Theorem 1. The following are true for Algorithms 2 and 3:
2. The corresponding unlearning Algorithm 3 satisfies exact unlearning.
3. The relative unlearning complexity is $O(\rho)$.

As discussed in the preceding section, the theorem above gives all the properties we need from the unlearning process. We now move on to applications and give accuracy guarantees.
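The accept-or-reflect step at the heart of the coupling can be sketched for Gaussian noise as below. This is the textbook construction of a maximal coupling of $N(u, \sigma^2 I)$ and $N(u', \sigma^2 I)$, stated under assumptions: the acceptance probability is taken as $\min\{1, q(r)/p(r)\}$ with $p$, $q$ the densities of the original and new distributions, and the reflected sample is the point reflection $u + u' - r$. Both choices, and the names `log_density` and `couple`, are assumptions of this sketch and may differ in form from the ratio and the Reflect map as printed in Algorithm 3.

```python
import math
import random


def log_density(r, mean, sigma):
    """Log-density of N(mean, sigma^2 I) at r, up to an additive constant."""
    return -sum((ri - mi) ** 2 for ri, mi in zip(r, mean)) / (2 * sigma ** 2)


def couple(r, u, u_new, sigma, rng):
    """Given r ~ N(u, sigma^2 I), return (sample from N(u_new, sigma^2 I), reused?).

    The original r is reused with probability min(1, q(r) / p(r)); otherwise
    r is reflected through the midpoint of u and u_new.
    """
    log_ratio = log_density(r, u_new, sigma) - log_density(r, u, sigma)
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
    if math.log(1.0 - rng.random()) <= log_ratio:
        return r, True
    return [a + b - ri for a, b, ri in zip(u, u_new, r)], False


rng = random.Random(0)
u, u_new, sigma = [0.0, 0.0], [0.1, 0.0], 1.0
r = [rng.gauss(m, sigma) for m in u]          # noisy response saved at training
r_new, reused = couple(r, u, u_new, sigma, rng)
```

When `reused` is true, everything computed downstream of `r` remains valid; only on rejection does retraining from that point become necessary. The rejection probability equals the TV distance between the two Gaussians, so it shrinks as $\sigma$ grows relative to $\|u - u'\|$.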

5. APPLICATIONS

In the following, we describe some problems and learning algorithms. The corresponding unlearning algorithms and their correctness follow by applying the result of the preceding section, provided we show that the learning algorithm uses bounded-sensitivity prefix-sum queries. The only other thing to show is the accuracy guarantee of the TV-stable modification of the learning algorithm (Algorithm 2). We use runtime to mean gradient complexity, as is standard in convex optimization (Nemirovskij & Yudin, 1983). As pointed out before, our proposed unlearning algorithm yields similar improvements over retraining even when accounting for the other operations in the method.

5.1. SMOOTH SCO WITH VARIANCE REDUCED FRANK-WOLFE

We assume that the loss function $w \mapsto \ell(w; z)$ is $H$-smooth and $G$-Lipschitz for all $z$. The algorithm we use is the variance-reduced Frank-Wolfe method, where the variance-reduced gradient estimate $u_t$ is the Hybrid-SARAH estimate (Tran-Dinh et al., 2019) with $\gamma_t = \frac{1}{t+1}$, given as

$u_t = (1 - \gamma_t)\left(u_{t-1} + \nabla \ell(w_t; z_t) - \nabla \ell(w_{t-1}; z_t)\right) + \gamma_t \nabla \ell(w_t; z_t) = \frac{1}{t+1} \sum_{i=1}^{t} \left((i+1) \nabla \ell(w_i; z_i) - i \nabla \ell(w_{i-1}; z_i)\right).$

We show that the above is a prefix-sum query with sensitivity $B = 2(HD + G)$, and thus fits into our framework. The full pseudo-code is given as Algorithm 12 in Appendix E. We state the main result below, where the accuracy guarantee follows from modifications to the analysis in Zhang et al. (2020).

Theorem 2. Let $\rho \le 1$ and let $\ell : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be an $H$-smooth, $G$-Lipschitz convex function over a convex set $\mathcal{W}$ of diameter $D$. Algorithm 12, as the learning algorithm, run with $\sigma^2 = \frac{64 (HD + G)^2 \log^2(n)}{\rho^2}$, $t_0 = 1$ and $\eta_t = \frac{1}{t+1}$ on a dataset $S$ of $n$ i.i.d. samples from $\mathcal{D}$, outputs $\hat{w}$ with excess population risk bounded as $\mathbb{E}\left[L(\hat{w}; \mathcal{D}) - L(w^*; \mathcal{D})\right] = O\left((G + HD) D \left(\frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\rho}\right)\right)$. Furthermore, the corresponding unlearning Algorithm 3 (with query and update functions as specified in the learning algorithm) satisfies exact unlearning with $O(\rho n)$ expected runtime.
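The equality between the recursive Hybrid-SARAH update and its prefix-sum form can be checked numerically. The sketch below is illustrative only: it assumes the initialization $u_0 = 0$ (which the closed form requires), uses the scalar loss $\ell(w; z) = \log\cosh(w - z)$ as a stand-in smooth Lipschitz loss, and feeds in arbitrary iterates rather than Frank-Wolfe iterates.

```python
import math
import random


def grad(w, z):
    """Gradient of the illustrative loss l(w; z) = log cosh(w - z) (an assumption)."""
    return math.tanh(w - z)


def hybrid_sarah_recursive(ws, zs):
    """u_t from the recursion with gamma_t = 1 / (t + 1) and u_0 = 0.

    ws holds w_0..w_T and zs holds z_1..z_T (stored at index t - 1).
    """
    u, out = 0.0, []
    for t in range(1, len(zs) + 1):
        gamma, z = 1.0 / (t + 1), zs[t - 1]
        u = (1 - gamma) * (u + grad(ws[t], z) - grad(ws[t - 1], z)) + gamma * grad(ws[t], z)
        out.append(u)
    return out


def hybrid_sarah_prefix_sum(ws, zs):
    """Same estimate as a rescaled prefix sum of p_i = (i+1) g_i(w_i) - i g_i(w_{i-1})."""
    total, out = 0.0, []
    for t in range(1, len(zs) + 1):
        z = zs[t - 1]
        total += (t + 1) * grad(ws[t], z) - t * grad(ws[t - 1], z)  # one new point
        out.append(total / (t + 1))
    return out


rng = random.Random(0)
zs = [rng.uniform(-1.0, 1.0) for _ in range(6)]
ws = [rng.uniform(-1.0, 1.0) for _ in range(7)]
recursive = hybrid_sarah_recursive(ws, zs)
closed_form = hybrid_sarah_prefix_sum(ws, zs)
```

The two lists agree up to floating-point error, since multiplying the recursion by $t + 1$ gives $(t+1) u_t = t\, u_{t-1} + (t+1) \nabla \ell(w_t; z_t) - t \nabla \ell(w_{t-1}; z_t)$, which telescopes.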

5.2. NON-SMOOTH SCO WITH DUAL AVERAGING

In this section, we only assume that the loss function $w \mapsto \ell(w; z)$ is $G$-Lipschitz and convex for all $z \in \mathcal{Z}$. Herein, we use the dual averaging method (Nesterov, 2009), where the model is updated as follows:

$$w_{t+1} = \Pi_{\mathcal{W}}\left(w_0 - \eta \sum_{i=1}^{t} \nabla \ell(w_i; z_i)\right),$$

where $\Pi_{\mathcal{W}}$ denotes the Euclidean projection onto the convex set $\mathcal{W}$. The above again is a prefix-sum query with sensitivity $G$, and thus fits into our framework. The full pseudo-code is given as Algorithm 13 in Appendix E. The accuracy guarantee mainly follows from Kairouz et al. (2021).

Theorem 3. Let $\rho \le 1$ and $\ell : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ be a $G$-Lipschitz convex function over a convex set $\mathcal{W}$ of diameter $D$. Consider Algorithm 13, as the learning algorithm, run with $\sigma^2 = \frac{64 G^2 \log^2(n)}{\rho^2}$, $t_0 = 1$ and $\eta = \frac{D\, d^{1/4} \sqrt{\log(n)}}{G \sqrt{n\rho}}$. The corresponding unlearning Algorithm 3 (with query and update functions as specified in the learning algorithm) satisfies exact unlearning with $O(\rho n)$ expected runtime.

5.3. CONVEX GLM WITH JL METHOD

Algorithm 4 JL method
Input: Dataset $S$, loss function $\ell$, algorithm $\mathcal{A}$, JL matrix $\Phi \in \mathbb{R}^{k \times d}$, noise variance $\sigma^2$
1: $\Phi S = \{\Phi x_i\}_{i=1}^{n}$
2: $\tilde{w} = \mathcal{A}(\ell, \Phi S, 2G\|\mathcal{X}\|, 2H\|\mathcal{X}\|^2, \sigma)$
Output: $\hat{w} = \Phi^\top \tilde{w}$

This JL method, proposed in Arora et al. (2022), is a general technique to get dimension-independent rates for unconstrained convex GLMs from algorithms giving dimension-dependent rates for constrained (general) convex losses. The method, described in Algorithm 4, simply embeds the dataset into a low-dimensional space via a JL matrix $\Phi$, and then runs a base algorithm on the low-dimensional dataset.

Smooth, Lipschitz GLMs: We assume that $\phi_y : \mathbb{R} \to \mathbb{R}$ is convex, $H$-smooth and $G$-Lipschitz for all $y \in \mathcal{Y}$. Using VR-Frank Wolfe as the base algorithm, we get a rate of $O\left(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{2/3}}\right)$ with relative unlearning efficiency of $\rho$; see Theorem 5 for a precise statement.

Lipschitz GLMs: We assume that $\phi_y : \mathbb{R} \to \mathbb{R}$ is convex and $G$-Lipschitz for all $y \in \mathcal{Y}$. Using Dual Averaging as the base algorithm, we get a rate of $O\left(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{1/3}}\right)$ with relative unlearning efficiency of $\rho$; see Theorem 6 for the precise statement.
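The embed-then-lift structure of the JL method can be sketched in a few lines. This is a minimal illustration under stated assumptions: the JL matrix is taken to be a scaled Gaussian matrix (one common construction, not specified in the text), and `ridge_base` is a hypothetical stand-in for the base algorithm $\mathcal{A}$ that ignores the Lipschitz, smoothness and noise arguments of Algorithm 4.

```python
import numpy as np


def ridge_base(X_low, y, lam=1e-3):
    """Hypothetical stand-in for the base algorithm A (not the paper's choice)."""
    k = X_low.shape[1]
    return np.linalg.solve(X_low.T @ X_low + lam * np.eye(k), X_low.T @ y)


def jl_method(X, y, base_algorithm, k, rng):
    """Embed the features with Phi, train in k dimensions, lift with Phi^T."""
    d = X.shape[1]
    # Assumed construction: i.i.d. Gaussian entries with variance 1/k.
    Phi = rng.normal(scale=1.0 / np.sqrt(k), size=(k, d))
    X_low = X @ Phi.T                    # rows are Phi x_i
    w_low = base_algorithm(X_low, y)     # model in the embedded space
    w_hat = Phi.T @ w_low                # output lives in R^d
    return w_hat, w_low, Phi


rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))
y = rng.normal(size=40)
w_hat, w_low, Phi = jl_method(X, y, ridge_base, k=5, rng=rng)
```

Because a GLM loss depends on the data only through $\langle w, x \rangle$, the lifted model makes the same predictions on the original data as the low-dimensional model does on the embedded data: $\langle \Phi^\top \tilde{w}, x \rangle = \langle \tilde{w}, \Phi x \rangle$.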

A ADDITIONAL PRELIMINARIES

We recall some concepts from differential privacy which will be useful in our algorithmic techniques.

Definition 7. An algorithm $\mathcal{A}$ satisfies $(\alpha, \epsilon(\alpha))$-Rényi Differential Privacy (RDP) if, for any two datasets $S$ and $S'$ which differ in one data point ($|S \Delta S'| = 1$), the $\alpha$-Rényi divergence between $\mathcal{A}(S)$ and $\mathcal{A}(S')$, with probability densities $\phi_{\mathcal{A}(S)}$ and $\phi_{\mathcal{A}(S')}$, defined as

$$D_\alpha(\mathcal{A}(S) \,\|\, \mathcal{A}(S')) = \frac{1}{\alpha - 1} \ln \int_{\mathrm{Range}(\mathcal{A})} \phi_{\mathcal{A}(S)}(x)^{\alpha}\, \phi_{\mathcal{A}(S')}(x)^{1-\alpha}\, dx,$$

is bounded as $D_\alpha(\mathcal{A}(S) \,\|\, \mathcal{A}(S')) \le \epsilon(\alpha)$.

RDP satisfies many desirable properties such as adaptive and parallel composition and amplification by sub-sampling (Mironov, 2017; Wang et al., 2019). Furthermore, we give the following lemma which relates TV stability to RDP.

Lemma 1 (RDP $\implies$ TV-stability). If an algorithm satisfies $(\alpha, \epsilon(\alpha))$-RDP, then it satisfies $\sqrt{1 - \exp\left(-\lim_{\alpha \downarrow 1} \epsilon(\alpha)\right)}$-TV stability.
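As a sanity check on the relation between RDP and TV stability, consider two Gaussians with the same variance. The sketch below assumes the standard facts that the limit $\alpha \downarrow 1$ of the Rényi divergence is the KL divergence, that the KL divergence between $N(0, \sigma^2)$ and $N(\Delta, \sigma^2)$ is $\Delta^2 / (2\sigma^2)$, and that their TV distance is $\mathrm{erf}(\Delta / (2\sqrt{2}\sigma))$; the function names are hypothetical.

```python
import math


def gaussian_tv(delta, sigma):
    """TV distance between N(0, sigma^2) and N(delta, sigma^2)."""
    return math.erf(abs(delta) / (2.0 * math.sqrt(2.0) * sigma))


def gaussian_kl(delta, sigma):
    """KL divergence between the same pair; the alpha -> 1 limit of RDP."""
    return delta ** 2 / (2.0 * sigma ** 2)


def tv_bound_from_kl(kl):
    """Bound of the form sqrt(1 - exp(-KL)) on the TV distance."""
    return math.sqrt(1.0 - math.exp(-kl))


for delta in [0.01, 0.5, 1.0, 3.0, 10.0]:
    tv = gaussian_tv(delta, sigma=1.0)
    bound = tv_bound_from_kl(gaussian_kl(delta, sigma=1.0))
    print(f"delta={delta}: TV={tv:.6f}, bound={bound:.6f}")
```

In every row the exact TV distance sits below the bound, and both approach 1 as the shift grows relative to the noise scale.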

B UNLEARNING FOR LINEAR QUERIES

A basic form of a query we consider is a linear query, defined as follows.

Definition 8. A query $q : \mathcal{W}^* \times \mathcal{Z}^n \to \mathcal{W}$ is a linear query if $q(\{w_i\}_i; S) = \sum_{j \in S} p^j(\{w_i\}_i; z_j)$ for some functions $p^j : \mathcal{W}^* \times \mathcal{Z} \to \mathcal{W}$.

We consider the class of $B$-sensitive linear queries. We give the TV-stable modified learning procedure in Algorithm 5, which releases the linear queries perturbed with Gaussian noise of appropriate variance.

Algorithm 5 LearnLinearQueries($w_{t_0}$, $t_0$)
Input: Dataset $S$, initial iteration $t_0$, steps $T$, query functions $\{q_t(\cdot)\}_{t \le T}$, update functions $\{U_t(\cdot)\}_{t \le T}$, selector function $\mathcal{S}(\cdot)$, noise variance $\sigma^2$
1: Initialize model $w_1 \in \mathcal{W}$
2: for $t = t_0$ to $T - 1$ do
3: Query the dataset: $u_t = q_t(\{w_i\}_{i \le t}; S)$
4: Perturb: $r_t = u_t + \xi_t$ where $\xi_t \sim N(0, \sigma^2 I_d)$
5: Update: $w_{t+1} = U_t(\{w_i\}_{i \le t}, r_t)$
6: Save($u_t$, $r_t$, $w_{t+1}$)
7: end for
Output: $\hat{w} = \mathcal{S}(\{w_t\}_{t \le T})$

Note that the underlying probability distribution that the above learning algorithm samples from is a Markov chain. The corresponding unlearning procedure, described in Algorithm 6, is based on constructing a coupling between the Markov chains for the current dataset and the dataset without the to-be-deleted point. In particular, we start from the first iteration and perform rejection sampling; if it results in acceptance, we proceed to the second iteration, and so on. If some iteration results in rejection, we do the reflection step and continue retraining from there on.

Algorithm 6 Unlearning algorithm for linear queries
Input: Deleted point $z_j$
1: for $t = 1$ to $T - 1$ do
2: $(u_t, r_t, w_t)$ = Load()
3: Compute $u'_t = u_t - p^j_t(\{w_i\}_{i \le t}; z_j)$
4: if $\mathrm{Unif}(0, 1) \le \frac{\phi_{N(u'_t, \sigma^2 I)}(r_t)}{\phi_{N(u_t, \sigma^2 I)}(r_t)}$ then
5: Save($u'_t$)
6: else
7: $r'_t = \mathrm{reflect}(r_t, u_t, u'_t)$
8: $w_{t+1} = U_t(\{w_i\}_{i \le t}, r'_t)$
9: LearnLinearQueries($w_{t+1}$, $t + 1$)
10: break
11: end if
12: end for

The above is essentially the same unlearning algorithm as that of Ullah et al. (2021), but presented in the general context of linear queries. Therefore, it generalizes the framework of Ullah et al. (2021), which was limited to the Stochastic Gradient Descent algorithm. We also remark that linear queries can often be augmented with a sub-sampling operator yielding amplified guarantees, as done in Ullah et al. (2021). However, we omit this extension for brevity. The main result of this section is as follows.

Theorem 4. The following are true for Algorithms 5 and 6:
1. The learning algorithm, Algorithm 5, with $\sigma^2 = \frac{64 B^2}{n^2 \rho^2}$ satisfies $\rho$-TV stability.
2. The unlearning algorithm, Algorithm 6, corresponding to Algorithm 5, satisfies exact unlearning.
3. The relative unlearning complexity is $O(\rho \sqrt{T})$.

Proof. This proof simply follows from the observation that the analysis of Ullah et al. (2021) only uses the bounded-sensitivity linear query structure of the stochastic gradient method for their TV stability bound, as well as for the correctness and runtime of the unlearning procedure.
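A single accept-or-reflect step of the coupling can be sketched as follows. This is a minimal illustration, not the authors' implementation: the acceptance test uses the density ratio of the target $N(u', \sigma^2 I)$ to the original $N(u, \sigma^2 I)$ at the stored response, and `reflect` is assumed to be the point reflection through the midpoint of $u$ and $u'$, which is one map that swaps the two Gaussian densities. All function names are hypothetical.

```python
import numpy as np


def log_density_ratio(r, u, u_prime, sigma):
    """log of N(r; u', sigma^2 I) / N(r; u, sigma^2 I)."""
    return (np.sum((r - u) ** 2) - np.sum((r - u_prime) ** 2)) / (2.0 * sigma ** 2)


def reflect(r, u, u_prime):
    """Assumed reflection map: point reflection through (u + u') / 2."""
    return u + u_prime - r


def unlearn_step(r, u, u_prime, sigma, rng):
    """Keep the stored response r if accepted, otherwise reflect it.

    Returns (response, accepted). A rejection is what triggers retraining
    from this iteration onwards in the unlearning procedure.
    """
    uniform = 1.0 - rng.uniform()  # value in (0, 1], avoids log(0)
    if np.log(uniform) <= log_density_ratio(r, u, u_prime, sigma):
        return r, True
    return reflect(r, u, u_prime), False
```

If $r \sim N(u, \sigma^2 I)$, the returned response is distributed as $N(u', \sigma^2 I)$, and the acceptance probability equals one minus the TV distance between the two Gaussians, so a smaller shift relative to $\sigma$ means fewer retraining events.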

B.1 APPLICATIONS

This generalization yields the following applications.

B.2 FEDERATED UNLEARNING FOR FEDERATED AVERAGING

In the federated learning setting, we have $C$ clients (which typically correspond to user devices) with their own datasets and a parameter server (aggregator). A typical, informal goal is to train a single globally shared model using all the datasets, with little communication between the clients and the server, and without (explicitly) moving any private data to the server. Federated Averaging (Konecnỳ et al., 2016), described in Algorithm 7, is a widely used method in federated learning. Note that in every round of the method, the client outputs $\{w^c_t\}_{c=1}^{C}$ are aggregated using an averaging operation: $w_t = \frac{1}{C} \sum_{c=1}^{C} w^c_t$. In Algorithm 7, ClientUpdate is a function which runs on the client's data using the current model $w_t$ and problem-specific parameters $P$ (such as the number of steps or the learning rate of some optimization routine). For brevity, we do not instantiate the ClientUpdate function, but usually some variant of stochastic gradient descent is used.

4: $w^c_{t+1} = \mathrm{ClientUpdate}(c, w_t, P)$
5: end for
6: $w_{t+1} = \frac{1}{C} \sum_{c=1}^{C} w^c_{t+1}$
7: end for
Output: $\hat{w} = \mathcal{S}(\{w_t\}_{t \le T})$

Federated Unlearning: In the federated unlearning problem, after a model is trained, one of the clients requests to remove themselves from the process. The parameter server then needs to update the model (and state) in such a way that it is indistinguishable from the state had the client been absent. Hence, this is analogous to the standard unlearning problem with the client playing the role of a data point. This analogy also occurs in private federated learning, wherein the widely used granularity of differential privacy is user-level differential privacy (McMahan et al., 2017). In this case, a client (potentially containing multiple data items) plays the role of a data item, the presence or absence of which is used in the differential privacy definition.
TV-stable learning and unlearning: The model aggregation step (line 6 in Algorithm 7) of the federated averaging method is a linear query over the clients. Moreover, if the clients output models that are bounded in norm, then it is a bounded sensitivity linear query (typically enforced by clipping the updates). Hence, this fits into the template of linear query release method and thus can be modified, as in Algorithm 5 to be TV stable. The corresponding unlearning method is the one given in Algorithm 6.
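The aggregation step viewed as a noisy bounded-sensitivity linear query can be sketched as below. This is an illustrative sketch under stated assumptions: client models are clipped to norm at most `bound` (the clipping rule and the names are hypothetical), the query is the sum over clients, and Gaussian noise is added to the sum before averaging, in the spirit of Algorithm 5.

```python
import numpy as np


def clip_to_norm(v, bound):
    """Scale v down so that its Euclidean norm is at most bound."""
    norm = np.linalg.norm(v)
    return v if norm <= bound else v * (bound / norm)


def noisy_federated_average(client_models, bound, sigma, rng):
    """Release the clipped sum over clients with Gaussian noise, then average.

    Returns (average, u, r) where u is the exact linear query and r is its
    noisy response; both would be saved for a later unlearning request.
    """
    clipped = [clip_to_norm(w, bound) for w in client_models]
    u = np.sum(clipped, axis=0)                       # linear query over clients
    r = u + rng.normal(scale=sigma, size=u.shape)     # Gaussian perturbation
    return r / len(client_models), u, r


rng = np.random.default_rng(0)
models = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]
average, u, r = noisy_federated_average(models, bound=1.0, sigma=0.5, rng=rng)
```

Removing a client changes `u` by that client's clipped model, whose norm is at most `bound`, which is the sensitivity that the noise scale has to be calibrated against.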

B.3 LLOYD'S ALGORITHM FOR k-MEANS CLUSTERING

In this section, we briefly discuss how an algorithm for k-means clustering fits into the linear query release framework. We remark that the prior work of Ginart et al. (2019) gave an unlearning method for this problem based on randomized quantization, which can also be seen as a specific TV-stable algorithm followed by a coupling-based unlearning method. Lloyd's algorithm is a widely used method for k-means clustering. Herein, starting with an arbitrary choice of centers, we construct a partition of the dataset, which thereby gives a new set of centers. This process is repeated for a certain number of rounds. The method is described as Algorithm 8. We notice again that the update for every cluster, line 7 in Algorithm 8, is a linear query; hence it fits into the linear query release template, and learning and unlearning algorithms based on linear queries readily follow.

end if
8: end while
Output: g

D MISSING PROOFS FROM SECTION 4

Proof of Theorem 1. The first part of the theorem follows from Lemma 2, followed by post-processing to argue that the same TV stability parameter holds for the final iterate. The second part, exact unlearning, follows from Lemma 5, wherein $Q$ denotes the distribution of the algorithm's output when run on the dataset without the to-be-deleted point. For the third part, note that the unlearning algorithm, Algorithm 3, makes two queries if no retraining is triggered. If a retraining is triggered, the number of queries it makes is at most the query complexity of the learning algorithm, $T = n$. Finally, the probability of retraining, from Lemma 6, is at most $\log(n)\rho$. Combining these gives the stated bound on the relative unlearning complexity.

D.1 LEMMAS FOR UNLEARNING

Additional notation: We first present some additional notation used in the statement and proof of the following lemmas. Let $S$ and $S'$ be the datasets before and after the unlearning request. Let $P$ and $Q$ denote the probability measures over the range of the tree data-structure, which is $\mathcal{T} = \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^d \times [n]^n$, induced by the output of the learning algorithm on $S$ and $S'$ respectively. We order the nodes of the binary tree w.r.t. the post-order traversal of the tree. Hence, given two nodes $v$ and $v'$, or their binary representations $s$ and $s'$, we use $v \le v'$ or $s \le s'$ w.r.t. the above ordering. Given a

Proof. The proof is based on induction on the nodes of $T^{(2)}$ in the post-order traversal. Let $v$ be the nodes of the tree arranged in the post-order traversal order. Given $j$, the index of the item deleted, let $s = \mathrm{leaf}(j)$. Define $\mathrm{prefix}(s)$ and $\mathrm{suffix}(s)$ as the sets of nodes before and after $s$ respectively in the $\le$ order. Given an event $E \subseteq \mathbb{R}^{d|T^{(2)}|}$ and $r_{\le b}$, define $E$

We now factorize the probability below as

$$\begin{aligned}
P\left[T^{(2)} \in E \mid p^{(1)}, p^{(2)}\right]
&= \prod_{b \in \mathrm{prefix}(s)} P\left[r^{(2)}_b \in E_b^{r^{(2)}_{<b}} \,\middle|\, p^{(2)}_b, r_{<b}\right] \cdot P\left[T^{(2)}_{\ge s} \in E_{\ge s}^{r^{(2)}_{<s}} \,\middle|\, T^{(2)}_{<s}, p^{(1)}, p^{(2)}\right] \\
&= \prod_{b \in \mathrm{prefix}(s)} P\left[r^{(1)}_b \in E_b^{r^{(1)}_{<b}} \,\middle|\, p^{(1)}_b, r_{<b}\right] \cdot P\left[T^{(2)}_{\ge s} \in E_{\ge s}^{r^{(2)}_{<s}} \,\middle|\, T^{(2)}_{<s}, p^{(1)}, p^{(2)}\right] \\
&= \prod_{b \in \mathrm{prefix}(s)} P_b\left(E_b^{r^{(1)}_{<b}} \,\middle|\, p^{(1)}_b, r_{<b}\right) \cdot P\left[T^{(2)}_{\ge s} \in E_{\ge s}^{r^{(2)}_{<s}} \,\middle|\, T^{(2)}_{<s}, p^{(1)}, p^{(2)}\right] \\
&= \prod_{b \in \mathrm{prefix}(s)} Q_b\left(E_b^{r^{(2)}_{<b}} \,\middle|\, p^{(2)}_b, r_{<b}\right) \cdot P\left[T^{(2)}_{\ge s} \in E_{\ge s}^{r^{(2)}_{<s}} \,\middle|\, T^{(2)}_{<s}, p^{(1)}, p^{(2)}\right] \\
&= Q_{<s}\left(E_{<s} \,\middle|\, p^{(2)}_{<s}, r^{(2)}_{<s}\right) \cdot P\left[T^{(2)}_{\ge s} \in E_{\ge s}^{r^{(2)}_{<s}} \,\middle|\, T^{(2)}_{<s}, p^{(1)}, p^{(2)}\right],
\end{aligned}$$

where the second equality follows since $r^{(1)}_{\le b} = r^{(2)}_{\le b}$ and $p^{(1)}_b = p^{(2)}_b$ for all $b < s$ by construction. The third equality follows since $r^{(1)}_b$ is distributed as $P_b$ conditionally, and the fourth and final equalities follow since, conditioned on the permutation being the same, the prefix is also distributed as $Q_{<s}$. We now start the induction: let $I$ (the induction variable) be $I = s$, i.e. the last item is deleted.
In this case, the unlearning algorithm simply removes the $s$-th node of the tree, and all we are left with is the tree with the $\mathrm{prefix}(s)$ nodes, which as argued above is distributed as $Q_{<s} = Q$. For the case $I = s + 1$, we simply focus on $T^{(2)}_{\ge s} = T^{(2)}_s = r^{(2)}_s$. Note that $r^{(1)}_s$ is distributed as $N(u^{(1)}, \sigma^2 I)$ and we want $r^{(2)}_s$ distributed as $N(u^{(2)}, \sigma^2 I)$. The operation in the algorithm is a one-step reflection coupling, which from Lemma 1 in Ullah et al. (2021) satisfies

$$P\left[r^{(2)}_s \in E_s^{r^{(2)}_{<s}} \,\middle|\, p^{(1)}, p^{(2)}\right] = Q_s^{p^{(2)}_s}\left(E_s^{r^{(2)}_{<s}}\right).$$

Therefore,

$$P\left[T^{(2)} \in E \mid p^{(1)}, p^{(2)}\right] = Q_{<s}\left(E_{<s} \,\middle|\, p^{(2)}_{<s}, r^{(2)}_{<s}\right) Q_s^{p^{(2)}_s}\left(E_s^{r^{(2)}_{<s}}\right) = Q^{p^{(2)}}(E).$$

This finishes the base cases. We now proceed to the induction step: suppose the following claim holds for nodes up to $I = k$: for any event $E$, the marginal distribution satisfies

$$P\left[T^{(2)}_{\le k} \in E \mid p^{(1)}, p^{(2)}\right] = Q_{\le k}\left(E \mid p^{(2)}\right).$$

For node $k + 1$, consider a few cases:
1. A: All rejection sampling steps prior to node $k$ resulted in accepts:
   (a) AP: Node $k + 1$ lies in the path from $s$ to the root.
      i. APA: The rejection sampling at this node succeeds.
      ii. APR: The rejection sampling at this node fails, i.e. a reflection step is performed.
   (b) AN: Node $k + 1$ doesn't lie in the path from $s$ to the root.
2. R: Some rejection sampling step resulted in rejection.

For case R, we have that $r^{(2)}_{k+1} \sim Q_{k+1}\left(\cdot \mid T^{(2)}_{\le k}, p^{(2)}\right)$. For the case AN, note that the random variable $r^{(2)}_{k+1} = r^{(1)}_{k+1}$, hence

$$P\left[r^{(2)}_{k+1} \in E_{k+1}^{r^{(2)}_{\le k}} \,\middle|\, \mathrm{AN}, T^{(2)}_{\le k}, p^{(1)}, p^{(2)}\right] = P_{k+1}\left(E_{k+1}^{r^{(2)}_{\le k}} \,\middle|\, p^{(2)}, T^{(2)}_{\le k}\right) = Q_{k+1}\left(E_{k+1}^{r^{(2)}_{\le k}} \,\middle|\, p^{(2)}, T^{(2)}_{\le k}\right),$$

where the last equality follows since $r^{(2)}_{k+1}$ depends only on the data points which are leaves of the sub-tree rooted at node $k + 1$. These, by assumption, do not contain the data point $s$, hence $Q_{k+1}$ is identical to $P_{k+1}$. For the event AP, we have,

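$$\begin{aligned}
P\left[r^{(2)}_{k+1} \in E_{k+1}^{r^{(2)}_{\le k}} \,\middle|\, \mathrm{AP}, p^{(1)}, p^{(2)}, T^{(2)}_{\le k}\right]
&= P\left[r^{(2)}_{k+1} \in E_{k+1}^{r^{(2)}_{\le k}}, \mathrm{APA} \,\middle|\, \mathrm{AP}, p^{(1)}, p^{(2)}, T^{(2)}_{\le k}\right] + P\left[r^{(2)}_{k+1} \in E_{k+1}^{r^{(2)}_{\le k}}, \mathrm{APR} \,\middle|\, \mathrm{AP}, p^{(1)}, p^{(2)}, T^{(2)}_{\le k}\right] \\
&= Q_{k+1}\left(E_{k+1}^{r^{(2)}_{\le k}} \,\middle|\, p^{(1)}, p^{(2)}, T^{(2)}_{\le k}\right),
\end{aligned}$$

where the last step follows from Lemma 1 in Ullah et al. (2021). Hence, combining the AP and AN cases,

$$P\left[r^{(2)}_{k+1} \in E_{k+1}^{r^{(2)}_{\le k}} \,\middle|\, \mathrm{A}, T^{(2)}_{\le k}, p^{(1)}, p^{(2)}\right] = Q_{k+1}\left(E_{k+1}^{r^{(2)}_{\le k}} \,\middle|\, p^{(2)}, T^{(2)}_{\le k}\right).$$

We now combine all the cases: let $\phi^{(A)}_{\le k}$ and $\phi^{(R)}_{\le k}$ denote the conditional densities of $T^{(2)}_{\le k}$ under events A and R respectively. Let $T_k = T^{(2)}_{\le k}$. For any event $E$,

$$\begin{aligned}
P\left[T^{(2)}_{\le k+1} \in E \mid p^{(1)}, p^{(2)}\right]
&= P\left[r^{(2)}_{k+1} \in E_{k+1}^{r^{(2)}_{\le k}} \,\middle|\, \mathrm{A}, T^{(2)}_{\le k} \in E_{\le k}, p^{(1)}, p^{(2)}\right] P\left[T^{(2)}_{\le k} \in E_{\le k}, \mathrm{A} \,\middle|\, p^{(1)}, p^{(2)}\right] \\
&\quad + P\left[r^{(2)}_{k+1} \in E_{k+1}^{r^{(2)}_{\le k}} \,\middle|\, \mathrm{R}, T^{(2)}_{\le k} \in E_{\le k}, p^{(1)}, p^{(2)}\right] P\left[T^{(2)}_{\le k} \in E_{\le k}, \mathrm{R} \,\middle|\, p^{(1)}, p^{(2)}\right] \\
&= \int_{\mathbb{R}^{d|T_{k+1}|}} 1\left\{r^{(2)}_{k+1} \in E_{k+1}^{r^{(2)}_{\le k}}\right\} 1\left\{T^{(2)}_{\le k} \in E_{\le k}\right\} \left(1\left\{T^{(2)}_{\le k} \in \mathrm{A}\right\} \phi^{(A)}_{\le k}\left(T^{(2)}_{\le k}\right) + 1\left\{T^{(2)}_{\le k} \in \mathrm{R}\right\} \phi^{(R)}_{\le k}\left(T^{(2)}_{\le k}\right)\right) \phi_{Q^{p^{(2)}}_{k+1}}\left(r^{(2)}_{k+1} \,\middle|\, T^{(2)}_{\le k}\right) dT^{(2)}_{\le k}\, dr^{(2)}_{k+1} \\
&= \int_{\mathbb{R}^{d|T_{k+1}|}} 1\left\{T^{(2)}_{\le k+1} \in E\right\} \phi_{Q^{p^{(2)}}_{\le k}}\left(T^{(2)}_{\le k}\right) \phi_{Q^{p^{(2)}}_{k+1}}\left(r^{(2)}_{k+1} \,\middle|\, T^{(2)}_{\le k}\right) dT^{(2)}_{\le k}\, dr^{(2)}_{k+1} \\
&= Q^{p^{(2)}}_{\le k+1}(E),
\end{aligned}$$

where in the third equality, we use the induction hypothesis. This completes the proof of the lemma.

Lemma 5. For any measurable event $E \subseteq \mathcal{T}$, $P[T^{(2)} \in E] = Q(E)$.

Proof. This follows primarily from Lemma 4 and the fact that the other elements in the nodes of $T$, namely $u_b$ and $w_b$, are deterministic functions of the prefix vertices in the tree $T$. Consider a decomposition of the event $E = E_u \times E_r \times E_w \times E_z$. Now,

$$\begin{aligned}
P\left[T^{(2)} \in E\right]
&= \mathbb{E}_{p^{(1)}}\left[P\left[T^{(2)} \in E_u \times E_r \times E_w \times E_z \,\middle|\, p^{(1)}, p^{(2)} \in E_z\right] P\left[p^{(2)} \in E_z\right]\right] \\
&= \mathbb{E}_{p^{(1)}}\left[P\left[T^{(2)} \in E_r \,\middle|\, p^{(1)}, p^{(2)}\right] \mu_{n-1}(E_z)\right] \\
&= \mathbb{E}_{p^{(1)}}\left[Q^{p^{(2)}}(E_r)\, \mu_{n-1}(E_z)\right] \\
&= \mathbb{E}_{p^{(1)}}\left[Q^{p^{(2)}}(E_u \times E_w \times E_r)\, \mu_{n-1}(E_z)\right] \\
&= Q(E),
\end{aligned}$$

where the second and fourth equalities follow since the variables $w_b$ and $u_b$ are deterministic functions of the responses $r_{\le b}$.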
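The second and third equalities also use Lemma 3 and Lemma 4 respectively.

Lemma 6. The probability of retraining is at most $\log(n)\rho$.

Proof. A retraining is triggered only when a rejection sampling step fails. Note that a rejection sampling step happens only when the node $b$ belongs to the path from $s$ to the root, say $\mathrm{path}$. Let Accept be the event that all rejection sampling steps succeed. Then

$$\begin{aligned}
P(\mathrm{Accept})
&= \mathbb{E}_{T^{(1)}, T^{(2)}, \{u_b\}_{b \in \mathrm{path}}}\left[\prod_{b \in \mathrm{path}} 1\left\{u_b \le \frac{\phi_{Q^{p^{(2)}}_b}\left(r^{(1)}_b \mid T^{(1)}_{<b}\right)}{\phi_{P^{p^{(1)}}_b}\left(r^{(1)}_b \mid T^{(1)}_{<b}\right)}\right\}\right] \\
&= \mathbb{E}_{T^{(1)}, p^{(1)}, p^{(2)}}\left[\prod_{b \in \mathrm{path}} P\left[u_b \le \frac{\phi_{Q^{p^{(2)}}_b}\left(r^{(1)}_b \mid T^{(1)}_{<b}\right)}{\phi_{P^{p^{(1)}}_b}\left(r^{(1)}_b \mid T^{(1)}_{<b}\right)}\right]\right] \\
&= \mathbb{E}_{p^{(1)}, p^{(2)}}\left[\prod_{b \in \mathrm{path}} \int_{\mathbb{R}^d} \min\left(\phi_{Q^{p^{(2)}}_b}\left(r^{(1)}_b \mid T^{(1)}_{<b}\right), \phi_{P^{p^{(1)}}_b}\left(r^{(1)}_b \mid T^{(1)}_{<b}\right)\right) dr^{(1)}_b\right] \\
&= \mathbb{E}_{p^{(1)}, p^{(2)}}\left[\prod_{b \in \mathrm{path}} \left(1 - \mathrm{TV}\left(Q^{p^{(2)}}_b, P^{p^{(1)}}_b \,\middle|\, T^{(1)}_{<b}\right)\right)\right] \\
&= \prod_{b \in \mathrm{path}} (1 - \rho_b) \\
&\ge 1 - \sum_{b \in \mathrm{path}} \rho_b \\
&\ge 1 - \log(n) \max_b \rho_b \\
&\ge 1 - \log(n)\rho,
\end{aligned}$$

where the fourth equality follows from the definition of TV distance and, in the last equality, $\rho_b$ denotes the (conditional) TV distance of node $b$. The third-to-last inequality follows from Lemma 7 and the second-to-last inequality follows from Hölder's inequality. For the last inequality, we simply upper bound $\rho_b \le \rho$ since the algorithm is $\rho$-TV stable (Lemma 2). This completes the proof.

Lemma 7. Let $\{a_i\}_{i=1}^{k}$ be real numbers such that $a_i \in (0, 1)$ for all $i$ and $\sum_{i=1}^{k} a_i \le 1$. Then

$$\prod_{i=1}^{k} (1 - a_i) \ge 1 - \sum_{i=1}^{k} a_i.$$

Proof. We prove this via induction on $k$. The base case $k = 1$ is immediate. For the induction step at $k$, we have

$$\prod_{i=1}^{k} (1 - a_i) = \prod_{i=1}^{k-1} (1 - a_i)\,(1 - a_k) \ge \left(1 - \sum_{i=1}^{k-1} a_i\right)(1 - a_k) = 1 - \sum_{i=1}^{k} a_i + \left(\sum_{i=1}^{k-1} a_i\right) a_k \ge 1 - \sum_{i=1}^{k} a_i.$$

This completes the proof.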
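The product bound of Lemma 7, and the way Lemma 6 uses it, can be checked on concrete numbers. The following is a small sketch with hypothetical helper names; the per-node TV distances and the path length are illustrative values, not quantities taken from the paper.

```python
import math


def product_lower_bound_holds(a):
    """Check prod(1 - a_i) >= 1 - sum(a_i) for a list of values in (0, 1)."""
    return math.prod(1.0 - x for x in a) >= 1.0 - sum(a) - 1e-12


def retrain_probability_bounds(rho_per_node):
    """Return (exact, union_bound) for the probability that some step rejects.

    exact is 1 - prod(1 - rho_b), assuming each node on the path is accepted
    with probability 1 - rho_b; union_bound is sum(rho_b).
    """
    accept_all = math.prod(1.0 - rho for rho in rho_per_node)
    return 1.0 - accept_all, sum(rho_per_node)


# Illustrative setting: n = 1024 leaves, so a leaf-to-root path of about
# log2(n) = 10 nodes, each with conditional TV distance 0.01.
exact, union_bound = retrain_probability_bounds([0.01] * 10)
print(exact, union_bound)
```

The exact retraining probability is slightly below the union bound, and the gap is the sum of the cross terms that the induction step of Lemma 7 discards.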
E MISSING PROOFS FROM SECTION 5

E.1 VARIANCE-REDUCED FRANK WOLFE

Algorithm 12 Variance-reduced Frank Wolfe($t_0$; $\mathcal{T}$)
Input: Dataset $S$, loss function $(w, z) \mapsto \ell(w, z)$, steps $T$, $\sigma$, $\{\eta_t\}_t$
1: if $t_0 = 1$ then Permute dataset, initialize $\mathcal{T}$, set $w_{t_0} = 0$ end if
2: for $t = 1$ to $T - 1$ do
3: $u_t = \sum_{i=1}^{t} \left((i+1)\nabla \ell(w_i; z_i) - i \nabla \ell(w_{i-1}; z_i)\right)$
4: Append($u_t$, $\sigma$; $\mathcal{T}$)
5: $r_t$ = GetPrefixSum($t$; $\mathcal{T}$)
6: $v_t = \arg\min_{w \in \mathcal{W}} \left\langle w, \frac{r_t}{t+1} \right\rangle$
7: $w_{t+1} = (1 - \eta_t) w_t + \eta_t v_t$
8: Set(leaf($t$), $(u_t, r_t, w_t, z_t)$; $\mathcal{T}$)
9: end for
Output: $\hat{w} = w_T$

Proof of Theorem 2. For the accuracy guarantee, we follow the proof of Theorem 1 in Zhang et al. (2020). Let $d_t = \frac{r_t}{t+1}$. From smoothness, we have

$$\begin{aligned}
L(w_{t+1}; \mathcal{D})
&\le L(w_t; \mathcal{D}) + \langle \nabla L(w_t; \mathcal{D}), w_{t+1} - w_t \rangle + \frac{H}{2} \|w_{t+1} - w_t\|^2 \\
&\le L(w_t; \mathcal{D}) + \eta_t \langle \nabla L(w_t; \mathcal{D}) - d_t, v_t - w_t \rangle + \eta_t \langle d_t, v_t - w_t \rangle + \frac{\eta_t^2 H D^2}{2} \\
&\le L(w_t; \mathcal{D}) + \eta_t \langle \nabla L(w_t; \mathcal{D}) - d_t, v_t - w_t \rangle + \eta_t \langle d_t, w^* - w_t \rangle + \frac{\eta_t^2 H D^2}{2} \\
&= L(w_t; \mathcal{D}) + \eta_t \langle \nabla L(w_t; \mathcal{D}), w^* - w_t \rangle + \eta_t \langle d_t - \nabla L(w_t; \mathcal{D}), w^* - v_t \rangle + \frac{\eta_t^2 H D^2}{2} \\
&\le (1 - \eta_t) L(w_t; \mathcal{D}) + \eta_t L(w^*; \mathcal{D}) + \frac{2D}{t+1} \|d_t - \nabla L(w_t; \mathcal{D})\| + \frac{\eta_t^2 H D^2}{2},
\end{aligned}$$

where the second inequality follows from the update and the fact that the iterates lie in the set of diameter $D$. The third inequality follows from the optimality of $v_t$ in the update in Algorithm 12. Finally, the last inequality follows from convexity, the Cauchy-Schwarz inequality and by substituting the step-size.
We now take expectation and use the bound on the gradient estimation error in Lemma 8 to get

$$\mathbb{E}[L(w_{t+1}; \mathcal{D}) - L(w^*; \mathcal{D})] \le (1 - \eta_t)\, \mathbb{E}[L(w_t; \mathcal{D}) - L(w^*; \mathcal{D})] + O\left((HD + G)\, D \left(\frac{1}{(t+1)^{3/2}} + \frac{\sqrt{d}}{(t+1)^2 \rho}\right)\right) + \frac{H D^2}{2 (t+1)^2}.$$

The above recursion gives us

$$\begin{aligned}
\mathbb{E}[L(w_T; \mathcal{D}) - L(w^*; \mathcal{D})]
&\le \left(L(w_1; \mathcal{D}) - L(w^*; \mathcal{D})\right) \prod_{t=1}^{T-1} (1 - \eta_t) + \sum_{i=1}^{T-1} \left[O\left((HD + G)\, D \left(\frac{1}{(i+1)^{3/2}} + \frac{\sqrt{d}}{(i+1)^2 \rho}\right)\right) + \frac{H D^2}{(i+1)^2}\right] \prod_{t=i+1}^{T-1} (1 - \eta_t) \\
&\le \frac{H D^2}{T} + \sum_{i=1}^{T-1} \left[O\left((HD + G)\, D \left(\frac{1}{(i+1)^{1/2}} + \frac{\sqrt{d}}{(i+1) \rho}\right)\right) + \frac{H D^2}{i+1}\right] \frac{1}{T} \\
&\le O\left((HD + G)\, D \left(\frac{1}{\sqrt{T}} + \frac{\sqrt{d}}{T \rho}\right)\right) + \frac{H D^2}{T} \\
&\le O\left((HD + G)\, D \left(\frac{1}{\sqrt{T}} + \frac{\sqrt{d}}{T \rho}\right)\right),
\end{aligned}$$

where the second inequality follows from smoothness and substituting $\prod_{t=i+1}^{T-1} (1 - \eta_t) = \frac{i+1}{T}$. Substituting the number of iterations $T = n$ completes the accuracy proof.

For the unlearning part, we start by showing that the algorithm falls into the template of bounded-sensitivity prefix-sum query release. Recall that the update is $u_t = \sum_{i=1}^{t} \left((i+1)\nabla \ell(w_i; z_i) - i \nabla \ell(w_{i-1}; z_i)\right)$. The sensitivity is then bounded as

$$\begin{aligned}
\left\| \left((i+1)\nabla \ell(w_i; z) - i \nabla \ell(w_{i-1}; z)\right) - \left((i+1)\nabla \ell(w_i; z') - i \nabla \ell(w_{i-1}; z')\right) \right\|
&\le 2 i H \|w_i - w_{i-1}\| + 2G \\
&\le 2 i H \eta_{i-1} \|v_{i-1} - w_{i-1}\| + 2G \\
&\le 2 (HD + G),
\end{aligned}$$

where the first inequality follows from smoothness and Lipschitzness of the loss. The second inequality follows from the update in Algorithm 12, and the last inequality follows from the fact that the iterates remain in the set of diameter $D$. Hence the correctness of the unlearning algorithm follows from Theorem 1. For runtime, the training time, in terms of gradient computations, is $\Theta(n)$. Therefore, using the fact that the relative unlearning complexity, from Theorem 1, is $O(\rho)$, we have an $O(\rho n)$ bound on the expected unlearning runtime.
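The product of step-size factors used when unrolling the recursion telescopes. As a short worked step, with $\eta_t = \frac{1}{t+1}$ we have $1 - \eta_t = \frac{t}{t+1}$, so

$$\prod_{t=i+1}^{T-1} (1 - \eta_t) = \frac{i+1}{i+2} \cdot \frac{i+2}{i+3} \cdots \frac{T-1}{T} = \frac{i+1}{T}, \qquad \prod_{t=1}^{T-1} (1 - \eta_t) = \frac{1}{T}.$$

Multiplying the $i$-th error term by $\frac{i+1}{T}$ lowers each power of $(i+1)$ in its denominator by one and leaves an overall factor of $\frac{1}{T}$, which is the form that appears in the second inequality of the unrolled bound.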

Lemma 8. The gradient estimation error satisfies

$$\mathbb{E}\left\|\frac{r_t}{t+1} - \nabla L(w_t; \mathcal{D})\right\|^2 \le O\left((HD + G)^2 \left(\frac{1}{t+1} + \frac{d}{(t+1)^2 \rho^2}\right)\right).$$

Proof. Note that $d_t := \frac{r_t}{t+1}$ comprises the original gradient estimate from Zhang et al. (2020), say $\tilde{d}_t$, and the noise added by the binary tree mechanism, say $\xi_t$. Hence,

$$\begin{aligned}
\mathbb{E}\|d_t - \nabla L(w_t; \mathcal{D})\|^2
&= \mathbb{E}\|\tilde{d}_t - \nabla L(w_t; \mathcal{D})\|^2 + \mathbb{E}\|\xi_t\|^2 \\
&\le O\left(\frac{(HD + G)^2}{t+1}\right) + \sum_{i=1}^{\log(n)} \frac{d \sigma^2}{(t+1)^2} \\
&= O\left((HD + G)^2 \left(\frac{1}{t+1} + \frac{d}{(t+1)^2 \rho^2}\right)\right),
\end{aligned}$$

where the first inequality follows from Lemma 2 in Zhang et al. (2020) with $\alpha = 1$ and the fact that in the binary tree mechanism we add noise of variance $\sigma^2$ at most $\log(n)$ times; the factor $1/(t+1)^2$ comes because the gradient estimate is $r_t/(t+1)$ and $r_t$ is the binary tree response. The final equality follows by plugging in the value of $\sigma$.

E.2 DUAL AVERAGING

Algorithm 13 Dual averaging($t_0$; $\mathcal{T}$)
Input: Dataset $S$, loss function $(w, z) \mapsto \ell(w, z)$, steps $T$, $\{\eta_t\}_t$
1: if $t_0 = 1$ then Permute dataset, initialize $\mathcal{T}$, set $w_{t_0} = 0$ end if
2: for $t = 1$ to $T - 1$ do
3: $u_t = \sum_{i=1}^{t} \nabla \ell(w_i; z_i)$
4: Append($u_t$, $\sigma$; $\mathcal{T}$)
5: $r_t$ = GetPrefixSum($t$; $\mathcal{T}$)
6: $w_{t+1} = \Pi_{\mathcal{W}}(w_0 - \eta_t r_t)$
7: Set(leaf($t$), $(u_t, r_t, w_t, z_t)$; $\mathcal{T}$)
8: end for
Output: $\hat{w} = w_T$

Proof of Theorem 3. The accuracy guarantee directly follows from Theorem 5.1 in Kairouz et al. (2021), replacing $\epsilon / \log^2(1/\delta)$ therein by $\rho$. To elaborate, we set $\sigma^2 = O\left(\frac{G^2}{\rho^2}\right)$ as opposed to $O\left(\frac{G^2 \log^4(1/\delta)}{\epsilon^2}\right)$; substituting this in the accuracy proof of Theorem 5.1 in Kairouz et al. (2021) gives the claimed guarantee.

For the unlearning part, we start by showing that the algorithm falls into the template of bounded-sensitivity prefix-sum query release. Recall that the update is $u_t = \sum_{i=1}^{t} \nabla \ell(w_i; z_i)$. The sensitivity is simply bounded by Lipschitzness as

$$\|\nabla \ell(w_i; z) - \nabla \ell(w_i; z')\| \le 2G.$$

Hence the correctness of the unlearning algorithm follows from Theorem 1. For runtime, the training time, in terms of gradient computations, is $\Theta(n)$. Therefore, using the fact that the relative unlearning complexity, from Theorem 1, is $O(\rho)$, we have an $O(\rho n)$ bound on the expected unlearning runtime.
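Both algorithms in this appendix read their noisy prefix sums from a binary tree. The following is a minimal sketch of such a mechanism, not the paper's data structure: it assumes the stream items are the per-step summands (so that the prefix sum is formed inside the tree), stores one noisy sum per dyadic block, and answers a prefix query from at most one block per set bit of $t$. The class and method names are hypothetical.

```python
import numpy as np


class BinaryTreeMechanism:
    """Noisy prefix sums over a stream of vectors using dyadic blocks."""

    def __init__(self, dim, sigma, rng):
        self.dim = dim
        self.sigma = sigma
        self.rng = rng
        self.items = []        # per-step summands (leaf values)
        self.noisy_nodes = {}  # (start, length) -> noisy block sum

    def append(self, value):
        self.items.append(np.asarray(value, dtype=float))

    def _node(self, start, length):
        """Noisy sum of items[start:start+length]; noise is drawn once and cached."""
        key = (start, length)
        if key not in self.noisy_nodes:
            exact = np.sum(self.items[start:start + length], axis=0)
            noise = self.rng.normal(scale=self.sigma, size=self.dim)
            self.noisy_nodes[key] = exact + noise
        return self.noisy_nodes[key]

    def prefix_sum(self, t):
        """Noisy sum of the first t items, one dyadic block per set bit of t."""
        total = np.zeros(self.dim)
        start = 0
        for bit in reversed(range(t.bit_length())):
            length = 1 << bit
            if t & length:
                total += self._node(start, length)
                start += length
        return total
```

Each prefix sum touches at most about $\log_2(n)$ noisy blocks, which is where the $\log(n)$ factor in the noise accounting comes from, and each data point sits under at most one block per tree level, which is what makes a deletion affect only the nodes on a single leaf-to-root path.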

E.3 CONVEX GLMS WITH THE JL METHOD

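Theorem 5. Let $\rho \le 1$ and $\ell : \mathcal{W} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be an $H$-smooth, $G$-Lipschitz convex GLM loss function. Algorithm 4 instantiated with Algorithm 12, as the learning algorithm, run with $\sigma^2 = O\left(\frac{(H\|\mathcal{X}\|^2 \|w^*\| + G\|\mathcal{X}\|)^2}{\rho^2}\right)$, $t_0 = 1$, $\eta_t = \frac{1}{t+1}$ and $k = O\left(\left(\frac{H\|\mathcal{X}\|^2 \|w^*\|}{H\|\mathcal{X}\|^2 \|w^*\| + G\|\mathcal{X}\|}\right)^{2/3} (n\rho)^{2/3}\right)$ on a dataset $S$ of $n$ samples, drawn i.i.d. from $\mathcal{D}$, outputs $\hat{w}$ with excess population risk bounded as

$$\mathbb{E}[L(\hat{w}; \mathcal{D}) - L(w^*; \mathcal{D})] = O\left(\frac{(G\|\mathcal{X}\| + H\|\mathcal{X}\|^2 \|w^*\|)\,\|w^*\|}{\sqrt{n}} + \frac{H^{1/3} G^{2/3} \|w^*\|^{4/3} \|\mathcal{X}\|^{4/3} + H\|\mathcal{X}\|^2 \|w^*\|^2}{(n\rho)^{2/3}}\right).$$

Furthermore, the corresponding unlearning Algorithm 3 (with query and update functions as specified in the learning algorithm) satisfies exact unlearning with $O(\rho n)$ expected runtime.

Proof of Theorem 5. We start with the accuracy guarantee. Let $\alpha \le 1$ be a parameter to be set later. From the JL property, with $k = O\left(\log(n/\beta)/\alpha^2\right)$, with probability at least $1 - \beta$, the norms of all data points in $S$ satisfy $\|\Phi x_i\| \le (1 + \alpha)\|x_i\| \le 2\|\mathcal{X}\|$. Hence, conditioned on the above event, the GLM loss function is $\tilde{G} = 2G\|\mathcal{X}\|$-Lipschitz and $\tilde{H} = 4H\|\mathcal{X}\|^2$-smooth. Let $\Phi\mathcal{D}$ denote the push-forward measure of $\mathcal{D}$ under the map $(x, y) \mapsto (\Phi x, y)$. With probability at least $1 - \beta$, the excess risk is

$$\begin{aligned}
\mathbb{E}[L(\hat{w}; \mathcal{D}) - L(w^*; \mathcal{D})]
&= \mathbb{E}[L(\Phi^\top \tilde{w}; \mathcal{D}) - L(\Phi w^*; \Phi\mathcal{D})] + \mathbb{E}[L(\Phi w^*; \Phi\mathcal{D}) - L(w^*; \mathcal{D})] \\
&= \mathbb{E}[L(\tilde{w}; \Phi\mathcal{D}) - L(\Phi w^*; \Phi\mathcal{D})] + \mathbb{E}[\phi_y(\langle \Phi w^*, \Phi x \rangle) - \phi_y(\langle w^*, x \rangle)] \\
&\le O\left(\left(\tilde{G} + \tilde{H}\|w^*\|\right)\|w^*\| \left(\frac{1}{\sqrt{n}} + \frac{\sqrt{k}}{n\rho}\right)\right) + \frac{H}{2}\, \mathbb{E}\left|\langle \Phi x, \Phi w^* \rangle - \langle x, w^* \rangle\right|^2 \\
&\le O\left(\left(\tilde{G} + \tilde{H}\|w^*\|\right)\|w^*\| \left(\frac{1}{\sqrt{n}} + \frac{\sqrt{k}}{n\rho}\right) + \frac{\tilde{H}\|w^*\|^2}{k}\right) \\
&= O\left(\frac{\left(\tilde{G} + \tilde{H}\|w^*\|\right)\|w^*\|}{\sqrt{n}} + \frac{\tilde{H}^{1/3} \tilde{G}^{2/3} \|w^*\|^{4/3} + \tilde{H}\|w^*\|^2}{(n\rho)^{2/3}}\right),
\end{aligned}$$

where in the first inequality, we use the accuracy guarantee of VR-Frank Wolfe (Theorem 2) and smoothness of $\phi_y$ together with the fact that $w^*$ is globally optimal. The second inequality follows from the JL property and the last step follows by the setting of $k$. For the in-expectation (over the JL matrix) bound, note that in the worst case, $L(\hat{w}; \mathcal{D}) - L(w^*; \mathcal{D}) \le \|\hat{w} - w^*\|$.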
From boundedness of the range of (typical) JL maps, $\|\hat{w} - w^*\| = \mathrm{poly}(n, d)$ w.p. 1. Hence, taking the failure probability $\beta$ to be small enough suffices to give an expectation bound which is the same as above up to polylogarithmic factors.

We now proceed to the unlearning guarantee. We first remark that the correctness of the unlearning algorithm (see Lemma 4) holds as long as the learning algorithm uses prefix-sum queries, even with unbounded sensitivity. Hence, the correctness follows. We now proceed to bound the unlearning runtime. We first bound the TV stability parameter of the learning algorithm using Lemma 9. The setting of the noise variance $\sigma$ in Algorithm 4 together with the stability guarantee of Theorem 2 ensures that $\gamma(\tilde{H}, \tilde{G}) \le \frac{\tau}{2}$. Hence, taking $\tau = \rho$, the JL method satisfies $\rho$-TV stability. Now, Lemma 6 gives us that the probability of retraining is at most $O(\rho)$. Since the training time, in terms of gradient computations, is $\Theta(n)$, we have an $O(\rho n)$ bound on the expected unlearning runtime.

Theorem 6. Let $\rho \le 1$ and $\ell : \mathcal{W} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be a $G$-Lipschitz convex GLM loss function. Algorithm 4 with Algorithm 13 as the sub-routine, as the learning algorithm, run with $\sigma^2 = O\left(\frac{G^2 \|\mathcal{X}\|^2}{\rho^2}\right)$, $t_0 = 1$, $\eta = \frac{\|w^*\|\, d^{1/4} \sqrt{\log(n)}}{G\|\mathcal{X}\| \sqrt{n\rho}}$ and $k = \sqrt{n\rho}$ on a dataset $S$ of $n$ samples drawn i.i.d. from $\mathcal{D}$ outputs $\hat{w}$ with excess population risk bounded as

$$\mathbb{E}[L(\hat{w}; \mathcal{D}) - L(w^*; \mathcal{D})] = O\left(G\|\mathcal{X}\|\|w^*\| \left(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{1/3}}\right)\right).$$

Furthermore, the corresponding unlearning Algorithm 3 (with query and update functions as specified in the learning algorithm) satisfies exact unlearning with $O(\rho n)$ expected runtime.

Proof of Theorem 6. We start with the accuracy guarantee; let $\alpha \le 1$ be a parameter to be set later. From the JL property, with $k = O\left(\log(n/\beta)/\alpha^2\right)$, with probability at least $1 - \beta$, the norms of all data points in $S$ satisfy $\|\Phi x_i\| \le (1 + \alpha)\|x_i\| \le 2\|\mathcal{X}\|$. Hence, conditioned on the above event, the GLM loss function is $\tilde{G} = 2G\|\mathcal{X}\|$-Lipschitz.
Let $\Phi\mathcal{D}$ denote the push-forward measure of $\mathcal{D}$ under the map $(x, y) \mapsto (\Phi x, y)$. With probability at least $1 - \beta$, the excess risk is

$$\begin{aligned}
\mathbb{E}[L(\hat{w}; \mathcal{D}) - L(w^*; \mathcal{D})]
&= \mathbb{E}[L(\Phi^\top \tilde{w}; \mathcal{D}) - L(\Phi w^*; \Phi\mathcal{D})] + \mathbb{E}[L(\Phi w^*; \Phi\mathcal{D}) - L(w^*; \mathcal{D})] \\
&= \mathbb{E}[L(\tilde{w}; \Phi\mathcal{D}) - L(\Phi w^*; \Phi\mathcal{D})] + \mathbb{E}[\phi_y(\langle \Phi w^*, \Phi x \rangle) - \phi_y(\langle w^*, x \rangle)] \\
&\le O\left(\tilde{G}\|w^*\| \left(\frac{1}{\sqrt{n}} + \frac{\sqrt{k}}{n\rho}\right)\right) + G\, \mathbb{E}\left|\langle \Phi x, \Phi w^* \rangle - \langle x, w^* \rangle\right| \\
&\le O\left(\tilde{G}\|w^*\| \left(\frac{1}{\sqrt{n}} + \frac{\sqrt{k}}{n\rho}\right) + \frac{\tilde{G}\|w^*\|}{\sqrt{k}}\right) \\
&\le O\left(\tilde{G}\|w^*\| \left(\frac{1}{\sqrt{n}} + \frac{1}{(n\rho)^{1/3}}\right)\right),
\end{aligned}$$

where in the first inequality, we use the accuracy guarantee of Dual Averaging (Theorem 3) together with Lipschitzness of $\phi_y$. The second inequality follows from the JL property and the last inequality follows by the setting of $k$. As in Theorem 5, the same bound in expectation (over the JL matrix) follows by taking the failure probability $\beta$ to be small enough. The correctness and runtime of the unlearning algorithm follow as in the proof of Theorem 5.

Lemma 9. Suppose $\mathcal{A}$ is an algorithm which, when run on $H$-smooth and $G$-Lipschitz functions, is $\gamma(H, G)$-TV stable. Then the JL method with $k = O(\log(2n/\tau))$ and $\mathcal{A}$ as input, run on $H$-smooth and $G$-Lipschitz GLMs, satisfies $\left(\frac{\tau}{2} + \gamma\left(4H\|\mathcal{X}\|^2, 2G\|\mathcal{X}\|\right)\right)$-TV stability.

Proof. Given a dataset $S$, let $G_S$ be the uniform bound on the Lipschitzness parameter of the class of loss functions $\{w \mapsto \ell(w; z)\}_{z \in S}$. We define $H_S$ similarly. Let $\alpha \le 1$ be a parameter to be set later. From the JL property, with $k = O(\log(n/\beta))$, with probability at least $1 - \beta$, the norms of all data points in $S$ satisfy $\|\Phi x_i\| \le 2\|\mathcal{X}\|$; we denote this event as $E_{JL}$. Since the loss function is a GLM, we have that conditioned on $E_{JL}$, the Lipschitzness and smoothness parameters $G_S$ and $H_S$ are bounded by $2G\|\mathcal{X}\|$ and $4H\|\mathcal{X}\|^2$ respectively. We therefore get a stability parameter $\gamma := \gamma\left(4H\|\mathcal{X}\|^2, 2G\|\mathcal{X}\|\right)$. We set $\beta = \tau/2$. We now incorporate the failure probability in the stability guarantee. Let $P_\Phi$ and $Q_\Phi$ denote the probability distributions of the output on datasets $S$ and $S'$.
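By definition of TV distance, writing $E'_{JL}$ for the complement of $E_{JL}$,

$$\begin{aligned}
\mathrm{TV}(P_\Phi, Q_\Phi)
&= \sup_E \left| P_{w \sim P}(w \in E) - P_{w \sim Q}(w \in E) \right| \\
&= \sup_E \left| P_{w \sim P}(w \in E \mid E_{JL}) P(E_{JL}) + P_{w \sim P}(w \in E \mid E'_{JL}) P(E'_{JL}) - P_{w \sim Q}(w \in E \mid E_{JL}) P(E_{JL}) - P_{w \sim Q}(w \in E \mid E'_{JL}) P(E'_{JL}) \right| \\
&\le \sup_E \left| P_{w \sim P}(w \in E \mid E_{JL}) - P_{w \sim Q}(w \in E \mid E_{JL}) \right| P(E_{JL}) + \sup_E \left| P_{w \sim P}(w \in E \mid E'_{JL}) - P_{w \sim Q}(w \in E \mid E'_{JL}) \right| P(E'_{JL}) \\
&\le \sup_E \left| P_{w \sim P}(w \in E \mid E_{JL}) - P_{w \sim Q}(w \in E \mid E_{JL}) \right| + \frac{\tau}{2} \\
&\le \gamma + \frac{\tau}{2},
\end{aligned}$$

which completes the proof.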

F SCO IN DYNAMIC STREAMS

In this section, we extend our previous results to dynamic streams wherein we observe a sequence of insertions and deletions, starting with potentially zero data points. We assume that the number of available points throughout is positive, that the data points are i.i.d. from an unknown distribution, and that the unlearning requests are chosen independently of the algorithm. To give a simple and unified presentation, let the expected accuracy, say excess population risk, of the $\rho$-TV stable Algorithm 2 with a dataset $S$ be $\alpha(|S|, \rho; P)$, where $P$ denotes problem-specific parameters such as Lipschitzness, diameter etc. We present two techniques for dynamic streams. The first satisfies exact unlearning but has a worse update time; this is similar to Ullah et al. (2021). The other satisfies weak unlearning (defined below) with a better update time. A key component of both is the anytime guarantees described below.

Anytime binary tree mechanism: In the previous section, the depth of the initialized tree and the noise variance $\sigma$ were both chosen as a function of the dataset size $n$. However, the tree can easily be built in an online manner, as in prior work (Guha Thakurta & Smith, 2013). For setting the noise variance, given a target of $\rho$-TV stability, we distribute the noise budget exponentially along the height of the tree; specifically, the leaf nodes contribute $\rho/2$ to the TV stability, the nodes above them $\rho/4$, and so on. In this way, the final tree satisfies $\rho$-TV stability for any value of $n$.
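The exponential budget split can be illustrated with a few lines. This sketch only computes the per-level share of the TV-stability budget described above (level 0 being the leaves); how each share is converted into a noise variance is not specified here and is left out. The function name is hypothetical.

```python
def level_budgets(rho, height):
    """TV-stability budget per tree level: rho/2 for leaves, rho/4 above, ..."""
    return [rho / 2 ** (level + 1) for level in range(height)]


# The total stays below rho however tall the tree grows, which is why
# the guarantee does not depend on knowing n in advance.
for height in (1, 5, 20):
    print(height, sum(level_budgets(0.1, height)))
```

Since the shares form a geometric series with sum $\rho(1 - 2^{-h})$ for a tree of height $h$, adding levels as the stream grows never exceeds the overall budget $\rho$.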

Anytime accuracy: The other problem with a changing data size is that the internal parameters of the algorithm (the step size, in our case) may be set as a function of $n$ for desirable accuracy guarantees. Fortunately, the two algorithms that we consider, VR-Frank Wolfe and Dual Averaging, have known horizon-oblivious parameter settings (Orabona, 2019). Their JL counterparts, on the other hand, require setting the embedding dimension as a function of $n$, and are thus not applicable unless we assume that the number of data points throughout the stream is $\Theta(n)$.

F.1 WEAK UNLEARNING

We first define weak unlearning, wherein only the model output, and not the entire state, is required to be indistinguishable.

Definition 9 (Weak unlearning). A procedure $(\mathcal{A}, \mathcal{U})$ satisfies weak unlearning if for all datasets $S$, all $z \in \mathcal{Z}$, and all events $E \subseteq \mathcal{W} \times \mathcal{M}$, we have

$$P\left(\mathcal{A}(S \setminus \{z\}) \in E\right) = P\left(\mathcal{U}(\mathcal{A}(S), z) \in E\right).$$

We now argue in what way the handling of insertions in Ullah et al. (2021) is deficient. The main reason is that they require insertions to also satisfy the unlearning criterion: the state of the system upon insertion is indistinguishable from the state had the inserted point been present to begin with. However, this is overkill; adding new points simply serves to yield improved statistical accuracy. Furthermore, methods which allow adding new points abound, particularly in the stochastic optimization setting, where they are popularly known as incremental or continual release methods. Importantly, the insertion time of these methods is constant (in $n$). Hence, a natural question is whether, for dynamic streams, we can design unlearning methods in which we pay for update time only in proportion to the number of deletions. Our result shows that we can, albeit under the weak unlearning guarantee (see Definition 9). Specifically, our procedure requires hiding the order in which data points are processed. Intuitively, an incremental method typically processes the newest data point last. This ordering is problematic for the unlearning guarantee, since if some point is to be deleted, then we can no longer replace it with the last point, as that would result in a different order. Our main result is as follows.

Theorem 7. In the dynamic streaming setting with $R$ requests, using anytime incremental learning and unlearning algorithms, Algorithms 2 and 3, without permuting the dataset, the following are true. 3.
The number of times retraining is triggered, for V unlearning requests is at most O(ρV ) Importantly, in the above guarantee, we only pay for the number of unlearning requests V rather than the number of requests R. Proof of Theorem 7. The first claim, weak unlearning guarantee of the unlearning algorithm, follows mainly from Lemma 4. Specifically, it shows that conditioned on the permutation of the dataset (in this case, since the dataset is not permuted, the permutation is simply identity), the distribution over the responses (r b ) b in the tree after unlearning, is transported to the distribution of the output under S ′ . Since the model output is a deterministic function of the responses, (weak unlearning) correctness follows for one request. For the streaming setting, we simply apply the above inductively over the requests. The second claim follows since, at every time point, the executed algorithm is indistinguishable from the base algorithm executed over the current dataset. Moreover, by assumption, the base algorithm, is anytime, i.e. no parameter is set which depends on the size of the dataset. Hence, the accuracy guarantee follows. For the last claim about the number of retraining, firstly, as motivated, by the assumption that the algorithm is incremental, the insertions are handled in O(1) time. For the unlearning requests, note that from ρ-TV stability at every point, using Lemma 6, we have a O(ρ) probability of retraining. We now apply Proposition 8 from Ullah et al. (2021) which converts this to a bound on the expected number of times a retraining is triggered. For V unlearning requests, this gives us a O(ρV ) bound on the number of retraining triggers. F.2 EXACT UNLEARNING Another way to extend the results for one unlearning request to dynamic streams is to modify the definition of unlearning (Definition 1) to also hold for insertions, as is done in Ullah et al. (2021) . 
This allows us to apply the same tree-based unlearning technique when handling insertions. Specifically, upon inserting a new point, we randomly choose a leaf and replace the leaf with the inserted point, and then insert the chosen leaf as the last leaf in the tree. We have the following guarantee for this method.

Theorem 8. In the dynamic streaming setting with R requests, using anytime learning and unlearning algorithms, Algorithms 2 and 3, the following are true.
1. Exact unlearning at every time point in the stream.
2. The accuracy of the output $w_i$ at time point i, with corresponding dataset $S_i$, is $\mathbb{E}[L(w_i; D)] - \min_w L(w; D) = \alpha(\rho, |S_i|; P)$.
3. The total number of times a retraining is triggered, for R requests, is at most O(ρR).

Proof. The arguments are similar to those in the proof of Theorem 7. The first part follows by applying the correctness of the unlearning algorithm, Theorem 1, inductively over the stream. We remark that handling insertions in the same way as deletions hardly changes anything in the proofs. The second claim follows from the anytime nature of the algorithm and the assumption on the accuracy guarantee. Finally, using the probability of retraining in Lemma 6 and Proposition 8 in Ullah et al. (2021) gives us the stated number of retraining triggers.
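The insertion step in F.2 (replace a randomly chosen leaf with the new point, then re-insert the displaced leaf at the end) can be sketched as follows. This is only an illustrative sketch, not the paper's Algorithm 2 or 3: the names `insert_point`, `random_insert` and `all_outcomes` are hypothetical, the tree is abstracted to the list of leaves in processing order, and we assume the random slot is uniform over the n + 1 positions including the new last one, which is the choice that keeps the order a uniformly random permutation.

```python
import itertools
import random


def insert_point(order, z, slot):
    """Place z at position `slot`; the displaced point, if any, becomes the last leaf."""
    if slot == len(order):
        order.append(z)             # the new point itself becomes the last leaf
    else:
        order.append(order[slot])   # displaced leaf is re-inserted at the end
        order[slot] = z
    return order


def random_insert(order, z, rng):
    """Assumption: the slot is uniform over len(order) + 1 positions."""
    return insert_point(order, z, rng.randrange(len(order) + 1))


def all_outcomes(points):
    """Enumerate the final leaf order for every possible sequence of slot choices."""
    outcomes = []
    slot_ranges = [range(k + 1) for k in range(len(points))]
    for slots in itertools.product(*slot_ranges):
        order = []
        for z, slot in zip(points, slots):
            insert_point(order, z, slot)
        outcomes.append(tuple(order))
    return outcomes


rng = random.Random(0)
stream_order = []
for point in ["a", "b", "c", "d"]:
    random_insert(stream_order, point, rng)
```

Enumerating all slot choices for three points yields each of the 3! orderings exactly once, so under the stated assumption the leaf order after insertions is distributed as if the whole dataset had been permuted up front.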



A real-valued function $x \mapsto f(x)$ is G-Lipschitz and H-smooth if, for all $x_1, x_2$, $|f(x_1) - f(x_2)| \le G \|x_1 - x_2\|$ and $\|\nabla f(x_1) - \nabla f(x_2)\| \le H \|x_1 - x_2\|$, respectively.
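As a concrete instance of these two conditions, the scalar logistic loss log(1 + exp(-x)) is 1-Lipschitz and 1/4-smooth. The sketch below checks both inequalities numerically on sampled pairs; the helper names and the choice of the logistic loss are ours for illustration and are not taken from the paper.

```python
import math
import random


def logistic_loss(x):
    # Numerically stable evaluation of log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def logistic_grad(x):
    # Derivative of log(1 + exp(-x)) is -1 / (1 + exp(x)).
    return -1.0 / (1.0 + math.exp(x))


def satisfies_bounds(f, grad, G, H, pairs, tol=1e-12):
    """Return True if the G-Lipschitz and H-smooth inequalities hold on all pairs."""
    for x1, x2 in pairs:
        gap = abs(x1 - x2)
        if abs(f(x1) - f(x2)) > G * gap + tol:
            return False
        if abs(grad(x1) - grad(x2)) > H * gap + tol:
            return False
    return True


rng = random.Random(0)
pairs = [(-5.0, -4.0)] + [
    (rng.uniform(-20, 20), rng.uniform(-20, 20)) for _ in range(1000)
]
holds = satisfies_bounds(logistic_loss, logistic_grad, 1.0, 0.25, pairs)
too_small = satisfies_bounds(logistic_loss, logistic_grad, 0.1, 0.25, pairs)
```

A finite sample can only refute a claimed constant, not prove it; here the check passes for (G, H) = (1, 1/4) and fails when G is lowered to 0.1.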



be the path from l to root.
5: for b ∈ path do $u_b = u_b - g'$ end for
6: Remove node l from T.
7: Let b = s and ct = 1
8: if j = |S| then let b = ∅ end if
9: while b ≠ ∅ do
10:

Set(b, (u′, r′, ∅, ∅); T)
if b is left sibling then $ct = ct + 2^{|s|-|b|-1}$ end if
27: Set (new) b as binary representation of parent of b
28: end while
29: Update dataset $S = S \setminus \{z_j\}$
Output: $w = S(\{w_b\}_b)$

1. The learning Algorithm 2 with $\sigma^2 = \frac{64 B^2 \log^2(n)}{\rho^2}$ satisfies ρ-TV stability.

on a dataset S of n samples, drawn i.i.d. from D, outputs w with excess population risk bounded as E [L( w; D) -L(w * ; D)] = O GD 1

TV stability.

Proof of Lemma 1. From Theorem 4 in Van Erven & Harremos (2014), we have that $\lim_{\alpha \downarrow 1} D_\alpha(P \| Q) = \mathrm{KL}(P \| Q)$, where KL(·∥·) denotes the Kullback-Leibler (KL) divergence between the two distributions. Finally, we relate the TV distance to the KL divergence using the Bretagnolle-Huber bound (Bretagnolle & Huber, 1979; Canonne, 2022), which gives the claimed bound.
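To illustrate the last step of this proof, the sketch below compares the exact TV distance between two equal-variance Gaussians with the Bretagnolle-Huber bound TV ≤ sqrt(1 − exp(−KL)). The one-dimensional Gaussian pair and the function names are assumptions chosen for illustration; the paper applies the bound to the full noisy tree, not to a single Gaussian.

```python
import math


def gaussian_kl(shift, sigma):
    # KL(N(0, sigma^2) || N(shift, sigma^2)) = shift^2 / (2 sigma^2).
    return shift ** 2 / (2.0 * sigma ** 2)


def gaussian_tv(shift, sigma):
    # TV(N(0, sigma^2), N(shift, sigma^2)) = erf(|shift| / (2 sqrt(2) sigma)).
    return math.erf(abs(shift) / (2.0 * math.sqrt(2.0) * sigma))


def bretagnolle_huber(kl):
    # Bretagnolle-Huber: TV <= sqrt(1 - exp(-KL)); never exceeds 1.
    return math.sqrt(1.0 - math.exp(-kl))


def pinsker(kl):
    # Pinsker: TV <= sqrt(KL / 2); becomes vacuous once KL > 2.
    return math.sqrt(kl / 2.0)


rows = []
for shift in [0.1, 0.5, 1.0, 2.0, 5.0]:
    kl = gaussian_kl(shift, sigma=1.0)
    rows.append((shift, gaussian_tv(shift, 1.0), bretagnolle_huber(kl), pinsker(kl)))
```

Each row holds the shift, the exact TV distance, and the two upper bounds. Unlike Pinsker's inequality, the Bretagnolle-Huber bound stays below 1 even when the KL divergence is large.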

Federated Averaging (Server side)
Input: Number of clients C, number of rounds T, client-specific parameters P
1: Initialize model $w_1 \in W$
2: for t = 1 to T − 1 do
3: for c = 1 to C do
4:

Lloyd's algorithm
Input: Number of clusters C, number of rounds T, dataset $S = \{z_i\}$
for
Output: $\{w_c\}_{c=1}^{C}$

C MISSING DETAILS FROM SECTION 4

In this section, we provide pseudo-code for the operations supported by the binary tree data structure.

Algorithm 9 Append(u, σ; T)
Input: Query response u, noise variance σ, Tree T
1: Let s be the first empty leaf index
2: path = {s → · · · root} be the path from s to root.
3: UpdateTree(u, path, σ; T)

Algorithm 10 UpdateTree(u, path, σ; T)
Input: Query response u, Set of nodes path, noise variance σ, Tree T
1: for b ∈ path do
2: $u_b = u_b + u$
3: if b is a left child or b is a leaf then
4: $\xi \sim N(0, \sigma^2 I)$

(t; T)
Input: t ∈ N, Tree T,
1: Initialize $g \in \mathbb{R}^p$ to 0
2: ← leaf(t)
3: Let path be the path from s to root.
4: while b ≠ ∅ do
5: if b is a left child or b is a leaf then
6: $g = g + r_b$
7:
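Since the pseudo-code above is only partially recoverable, the following sketch shows one standard way to realize a binary-tree prefix-sum structure with an append operation and a noisy prefix-sum query. It is not the paper's Algorithms 9 and 10: the class `PrefixSumTree`, the heap-array layout, the scalar (rather than vector) responses, the fixed power-of-two capacity, and the choice to draw one Gaussian noise value per node at construction time are all simplifying assumptions.

```python
import random


class PrefixSumTree:
    """Complete binary tree over `capacity` leaves; node 1 is the root."""

    def __init__(self, capacity, sigma, seed=0):
        if capacity < 1 or capacity & (capacity - 1):
            raise ValueError("capacity must be a power of two")
        rng = random.Random(seed)
        self.capacity = capacity
        self.size = 0
        self.u = [0.0] * (2 * capacity)  # exact partial sums u_b
        # Assumption: one fixed noise draw per node, so repeated queries agree.
        self.xi = [rng.gauss(0.0, sigma) for _ in range(2 * capacity)]

    def append(self, value):
        """Write `value` into the first empty leaf and update the path to the root."""
        if self.size == self.capacity:
            raise IndexError("tree is full")
        b = self.capacity + self.size
        while b >= 1:
            self.u[b] += value
            b //= 2
        self.size += 1

    def response(self, b):
        """Noisy response r_b = u_b + xi_b of node b."""
        return self.u[b] + self.xi[b]

    def prefix_sum(self, t):
        """Noisy sum of the first t leaves from O(log capacity) node responses."""
        if not 0 <= t <= self.size:
            raise ValueError("t must lie between 0 and the number of leaves")
        total = 0.0
        lo, hi = self.capacity, self.capacity + t  # half-open leaf range
        while lo < hi:
            if lo & 1:
                total += self.response(lo)
                lo += 1
            if hi & 1:
                hi -= 1
                total += self.response(hi)
            lo //= 2
            hi //= 2
        return total
```

With `sigma=0.0` the structure returns exact prefix sums, which makes the dyadic decomposition easy to verify. With `sigma > 0`, each prefix sum touches only logarithmically many noisy nodes, which is the property that tree aggregation exploits.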

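$\{ e \in \mathbb{R}^d : \exists e' \in \times_{>b} \mathbb{R}^d : (r_{\le b}, e, e') \in E \}$, where $\times_{>b} \mathbb{R}^d$ denotes the Cartesian product of the $\mathbb{R}^d$'s over indices greater than b, up to the number of elements of $T^{(1)}$. Similarly, define $E^{r_{\le b}}_{\ge b}$ as
$E^{r_{\le b}}_{\ge b} = \{ e \in \times_{\ge b} \mathbb{R}^d : (r_{\le b}, e) \in E \}$.
Finally, define $E_{<b}$ as
$E_{<b} = \{ e \in \times_{<b} \mathbb{R}^d : \exists e' \in \times_{\ge b} \mathbb{R}^d : (e, e') \in E \}$.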

It satisfies weak unlearning at every time point in the stream.
2. The accuracy of the output $w_i$ at time point i, with corresponding dataset $S_i$, is $\mathbb{E}[L(w_i; D)] - \min_w L(w; D) = \alpha(\rho, |S_i|; P)$.

The accuracy of the output $w_i$ at time point i, with corresponding dataset $S_i$, is $\mathbb{E}[L(w_i; D)] - \min_w L(w; D) = \alpha(\rho, |S_i|; P)$.
3. The total number of times a retraining is triggered, for R requests, is at most O(ρR).

Excess population risk guarantees for various problems as well as the base algorithm; ρ: relative unlearning complexity (see Definition 4), JL: Johnson-Lindenstrauss transform.

Algorithm 1 Template learning algorithm
Input: Dataset S, steps T, query functions $\{q_t(\cdot)\}_{t \le T}$ where $q_t \in Q$, a query class, update functions $\{U_t(\cdot)\}_{t \le T}$, selector function $S(\cdot)$


node b, let $P_b(\cdot \mid T_{\le b})$ denote the conditional distribution of the nodes given the prefix nodes of the tree.

Let p be a permutation over [n] and $p_b$ denote the index on the b-th node, when b is a leaf. Let µ denote the probability, and conditional probability, depending on context, of p and $p_b$, under the random permutation model. Specifically, we use $\mu(p)$ and $\mu(p_b \mid p_{\le b})$ to denote the probability of the sequence p and the conditional probability of $p_b$ given the previous values.

Let $T^{(1)}$ denote the initial binary tree, i.e. the one constructed after the algorithm is run on dataset S, and $T^{(2)}$ be the binary tree constructed after unlearning. Let $P_p$ and $Q_p$ denote the conditional distributions for P and Q respectively given permutation p.

We factor the probability density of P as: b |p(1)

Fixing the permutation sequence $p^{(1)}$, denote and factor the conditional distribution as,

Finally, define response trees $T^{(1)}$ and $T^{(2)}$ which only contain the response variables $(r_b)_b$. Moreover, define distributions $P$, $P_b$, $P_p$, $P_{p_b}$ and

We first show that the tree T produced by the learning algorithm is TV-stable.

, the response tree data structure T is ρ-TV stable.

Proof. The proof of privacy of tree aggregation is classical in differential privacy; see Guha Thakurta & Smith (2013) for example. The proof has three ingredients: the Gaussian mechanism guarantee, parallel composition (to argue that accounting along the height of the tree suffices) and adaptive composition (for accounting along the height of the tree). Since the noise is Gaussian and these composition properties also hold under RDP (Mironov, 2017), we can give an RDP guarantee of $\epsilon(\alpha) \le \log^2(n) \cdot \frac{64 \alpha B^2}{\sigma^2} \le \alpha \rho^2$. Finally, using Lemma 1 and a numerical simplification, since ρ ≤ 1, gives the claimed result.

Recall that j is the index of the data item (after permutation) which is deleted. Without loss of generality, assume that the original index of the deleted data point is n.
We first argue the following about the distribution of $p^{(1)}$ and $p^{(2)}$.

Lemma 3. For any set $E \subseteq [n]^n$ and any set $E' \subseteq [n-1]^{n-1}$, we have

Proof. Since $p^{(1)}$ and $p^{(2)}$ are discrete distributions, it suffices to argue the above for the atoms. Firstly, by construction, $p^{(1)} \sim \mu_n$ and hence the first part is done. For the second part, for any sequence denote the concatenation of h and j (the deleted index). By symmetry, the probability P p (2) (h) = 1 n + 1 P p (1) ([h, j]) = µ n-1 (h). This completes the proof.

We now show transport of the conditional distribution by the unlearning operation.

Lemma 4. For any measurable event $E \subseteq \mathbb{R}^{d |T^{(2)}|}$, $P(T^{(2)} \in E \mid p^{(1)}, p^{(2)}) = Q_{p^{(2)}}(E)$.
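The symmetry argument behind Lemma 3 can be checked by brute force for small n. The sketch below assumes, following the remark in F.1 about replacing a deleted point with the last point, that unlearning fills the deleted position with the last entry of the permutation; the function names are hypothetical and the deleted index is taken to be n, as in the text.

```python
from collections import Counter
from itertools import permutations


def delete_and_fill(perm, deleted):
    """Remove `deleted` from perm; the last entry moves into the vacated slot."""
    order = list(perm)
    j = order.index(deleted)
    last = order.pop()
    if j < len(order):  # the deleted item was not already the last entry
        order[j] = last
    return tuple(order)


def outcome_counts(n):
    """Count how often each permutation of [n-1] arises from a permutation of [n]."""
    return Counter(
        delete_and_fill(p, deleted=n) for p in permutations(range(1, n + 1))
    )


counts = outcome_counts(4)
```

For n = 4, every one of the 3! permutations of {1, 2, 3} arises from exactly 4 of the 4! input permutations, so a uniform $p^{(1)}$ is mapped to a uniform $p^{(2)}$ under this rule.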

