ROBUST ALGORITHMS ON ADAPTIVE INPUTS FROM BOUNDED ADVERSARIES

Abstract

We study dynamic algorithms robust to adaptive input generated from sources with bounded capabilities, such as sparsity or limited interaction. For example, we consider robust linear algebraic algorithms when the updates to the input are sparse but given by an adversary with access to a query oracle. We also study robust algorithms in the standard centralized setting, where an adversary queries an algorithm in an adaptive manner, but the number of interactions between the adversary and the algorithm is bounded. We first recall a unified framework of (Hassidim et al., 2020; Beimel et al., 2022; Attias et al., 2023) for answering $Q$ adaptive queries that incurs $O(\sqrt{Q})$ overhead in space, which is roughly a quadratic improvement over the naïve implementation, and only incurs a logarithmic overhead in query time. The general framework has diverse applications in machine learning and data science, such as adaptive distance estimation, kernel density estimation, linear regression, range queries, and point queries, and it serves as a preliminary benchmark. Beyond the framework, we demonstrate even better algorithmic improvements for (1) reducing the preprocessing time for adaptive distance estimation and (2) permitting an unlimited number of adaptive queries for kernel density estimation. Finally, we complement our theoretical results with additional empirical evaluations.

1. INTRODUCTION

Robustness to adaptive inputs or adversarial attacks has recently emerged as an important desirable characteristic for algorithm design. An adversarial input can be created using knowledge of the model to induce incorrect outputs on widely used models, such as neural networks (Biggio et Madry et al., 2018). Adversarial attacks against machine learning algorithms in practice have also been documented in applications such as network monitoring (Chandola et al., 2009), strategic classification (Hardt et al., 2016), and autonomous navigation (Papernot et al., 2016; Liu et al., 2017; Papernot et al., 2017). The need for sound theoretical understanding of adversarial robustness is also salient in situations where successive inputs to an algorithm can be correlated; even if the input is not adversarially generated, a user may need to repeatedly interact with a mechanism in a way such that future updates may depend on the outcomes of previous interactions (Mironov et al., 2011; Gilbert et (2023). More recently, there have also been a few initial results for dynamic algorithms on adaptive inputs for graph algorithms (Wajc, 2020; Beimel et al., 2021; Bernstein et al., 2022). These works explored the capabilities and limits of algorithms for adversaries that were freely able to choose the input based on previous outputs by the algorithm.

However, in many realistic settings, an adversary is limited in its abilities. For example, adversarial attacks in machine learning are often permitted to only alter the "true" input by a small amount bounded in norm. For the $L_0$ norm, this restriction means that the adversary can only add sparse noise to the true input. More generally, it seems reasonable to assume that adversarial input is generated from a source that has bounded computation time or bounded interactions with an honest algorithm.

1.1. OUR CONTRIBUTIONS

In this paper, we study algorithms robust to adaptive/adversarial input generated from sources with bounded capabilities. We first study dynamic algorithms for adaptive inputs from a source that is restricted in sparsity. Namely, we consider robust linear algebraic algorithms when the updates to the label can be adversarial but are restricted in sparsity. We then study robust algorithms in the standard centralized setting, where an adversary queries an algorithm in an adaptive manner, but the number of interactions between the adversary and the algorithm is bounded. We first show that combining novel subroutines for each of these problems in conjunction with a simple but elegant idea of using differential privacy to hide the internal randomness of various subroutines previously used by Hassidim et

Dynamic algorithms on adaptive input for regression. Motivated by the problem of label shift in machine learning, we consider a dynamic version of least-squares regression, where the labels get updated. In this model, we are given a fixed design matrix and a target label that receives a sequence of updates. After each one, the algorithm is asked to output an estimate of the optimal least-squares objective. The goal of the algorithm is to maintain the objective value within a multiplicative factor $(1 + \varepsilon)$ of the optimal. More specifically, the algorithm is given a fixed design matrix $A \in \mathbb{R}^{n \times d}$ with $n \ge d$ and an initial response vector (i.e., label) $b^{(1)}$, which receives updates over time. We are interested in estimating the least-squares objective value $F(A, b) = \min_{x \in \mathbb{R}^d} \|Ax - b\|_2^2$ as the target label $b$ undergoes updates. The updates to $b$ are adaptively chosen by an adversary but can only affect at most $K$ entries of $b$ per step. Formally, on the $i$-th round:
(1) The adversary provides an update to $K$ entries of $b^{(i-1)}$, possibly depending on all previous outputs of the algorithm.
(2) The algorithm updates its data structure and outputs an estimate $\hat{F}_i$ of $F_i = F(A, b^{(i)})$.
(3) The adversary observes and records the output $\hat{F}_i$.
The goal of the adversary is to create a sequence of labels $\{b^{(i)}\}_{i=1}^{T}$ that induces the algorithm to output an inaccurate estimate. To deal with adaptivity, a naïve idea is to treat each step as an independent least-squares regression problem. However, this approach uses a completely new approximation of the objective value for each update, which seems potentially wasteful. On the other hand, any randomness that is shared by computations over multiple updates can potentially be leveraged by the adversary to induce an incorrect output. Our main result is an algorithm that beats the naïve algorithm in this challenging, adaptively adversarial setting. We provide a general result with run-time dependence on $n$, $d$, $K$, and the number of nonzero entries in $A$, $\mathrm{nnz}(A)$.

Theorem 1.1 (Informal; see Theorem 2.1). Let $\kappa(A) = O(1)$ and $\varepsilon \in (0, 1)$. There exists a dynamic algorithm that, given adaptively chosen $K$-sparse updates to $b$ and a fixed design matrix $A \in \mathbb{R}^{n \times d}$, outputs a $(1 + \varepsilon)$-approximation to the least-squares objective $F(A, b^{(i)})$ every round with high probability. The algorithm uses $\tilde{O}\big(\sqrt{K\,\mathrm{nnz}(A)}/\varepsilon^3\big)$ amortized time per step of update.

Specifically, the update time is $d^{1.5}$ when $K \le d$ and $n = O(d)$, and the square root of the input sparsity when $K = O(1)$. Notice that this significantly improves on the naïve approach of treating each step independently and solving for the least-squares objective, which requires $\mathrm{nnz}(A) + \mathrm{poly}(d)$ time by sketching (Woodruff (2014)). We mention that a recent work by Jiang et al. (2022) considers a row-arrival model for dynamic linear regression. Our setting is different since we allow arbitrary updates to the target label, whereas in their setting the design matrix undergoes incremental change. We note that their algorithm maintains a solution vector, while we focus on the cost only.
Adaptive query framework. We then consider robust algorithms in the standard centralized setting, where an adversary queries an algorithm in an adaptive manner. In many key algorithmic applications, randomization is necessary to achieve fast query time and efficient storage. This creates the need for robust versions of these algorithms which can efficiently employ the power of randomness while also being accurate across multiple, possibly correlated, inputs. Our main parameters of interest are the query time and the space used by a robust algorithm compared to its naïve, non-robust counterpart. Formally, we define the model as a two-player game between an algorithm HonestAlg over a data set $X$ and an adversary $\mathcal{A}$ that makes adversarial queries about $X$ to HonestAlg. At the beginning of the game, HonestAlg uses pre-processing time to compute a data structure $D$ from $X$ to answer future queries from $\mathcal{A}$. The game then proceeds in at most $Q$ rounds for some predetermined $Q$, so that in the $t$-th round, where $t \in [Q]$:
(1) $\mathcal{A}$ computes a query $q_t$ on $X$, which depends on all previous responses from HonestAlg.
(2) HonestAlg uses $D$ to output a response $d_t$ to query $q_t$.
(3) $\mathcal{A}$ observes and records the response $d_t$.
The goal of $\mathcal{A}$ is to formulate a query $q_t$ for which the algorithm HonestAlg produces an incorrect response $d_t$. We remark that the algorithm may not have access to $X$, after constructing $D$, to respond to the query $q_t$. On the other hand, $\mathcal{A}$ can use previous outputs to possibly determine the internal randomness of the data structure $D$ and make future queries accordingly. In this case, the analysis of many randomized algorithms fails because it assumes that the randomness of the algorithm is independent of the input. Consequently, it is not evident how to handle $Q$ adaptive queries without implementing $Q$ instances of a non-adaptive data structure, i.e., each instance handles a separate query.
Thus, a natural question to ask is whether a space overhead of $\Omega(Q)$ is necessary. As a preliminary benchmark, we show that a space overhead of $\Omega(Q)$ is unnecessary by giving a unified framework with only an $O(\sqrt{Q})$ space overhead.

Theorem 1.2. Given a data structure $D$ that answers a query $q$ with probability at least $\frac{3}{4}$ using space $S$ and query time $T$, there exists a data structure that answers $Q$ adaptive queries, with high probability, i.e., 1 -

To concretely instantiate our framework and state an example, we consider the adaptive distance estimation problem defined as follows. In the adaptive distance estimation problem, there exists a set $X = \{x^{(1)}, \ldots, x^{(n)}\}$ of $n$ points in $\mathbb{R}^d$. We are also given an accuracy parameter $\varepsilon > 0$. A query is of the form $q$, and the algorithm must output a $(1 + \varepsilon)$-approximation to $\|x^{(i)} - q\|_p$ for all $i$. The trivial solution of storing all $n$ points and computing all $n$ distances to a query point uses space and query time $O(nd)$. pre-processing time, while permitting an arbitrary number of queries. By comparison, our data structure handles $Q$ queries of approximate distances from a specified point in $X$, using query time $O\big(\frac{n+d}{\varepsilon^2}\big)$, pre-processing time $O\big(\frac{nd\sqrt{Q}}{\varepsilon^2}\big)$, and space $O\big(\frac{(n+d)\sqrt{Q}}{\varepsilon^2}\big)$. Thus, in the regime where $d \gg n\sqrt{Q}$, our data structure already improves on the work of Cherapanamjeri & Nelson (2020). A noticeable weakness of our construction is that the $Q$ queries return only the approximate distance between a query point and a single point in $X$, whereas Cherapanamjeri & Nelson (2020) outputs approximate distances to all points in $X$. Moreover, Cherapanamjeri & Nelson (2022) subsequently improve the pre-processing time to $O\big(\frac{nd}{\varepsilon^2}\big)$. Thus we open up our framework to (1) show that it can be further improved to handle the case where we return the approximate distances of all points in $X$ from $Q$ adaptive query points and (2) achieve pre-processing time $O\big(\frac{nd}{\varepsilon^2}\big)$.

Theorem 1.3.
There is a data structure which, when instantiated with dataset $X = \{x_i\}_{i \in [n]} \subset \mathbb{R}^d$ and query bound $Q \le d$, answers any sequence of $Q$ adaptively chosen distance estimation queries correctly with probability at least 0.99. Furthermore, the space complexity of the data structure is $O(\varepsilon^{-2} \cdot n\sqrt{Q})$ and the setup and query times are $O(\varepsilon^{-2} \cdot nd)$ and $O(\varepsilon^{-2} \cdot (n + d))$, respectively.

Another application of our framework is the adaptive kernel density estimation problem, where there exists a set $X = \{x^{(1)}, \ldots, x^{(n)}\}$ of $n$ points in $\mathbb{R}^d$ and the goal is to output a $(1 + \varepsilon)$-approximation to the quantity $\frac{1}{n}\sum_{i \in [n]} k(x^{(i)}, q)$, for an accuracy parameter $\varepsilon > 0$, a query $q$, and a kernel function $k$, under the promise that the output is at least some threshold $\tau > 0$. $\tau\varepsilon^2$ points and then uses $O\big(\frac{d}{\tau\varepsilon^2}\big)$ query time to output the empirical kernel density. However, the analysis for both of these algorithms fails for the adaptive setting, where there can be dependencies between the query and the data structure. By using the data structure of Backurs et al. (2019) as a subroutine, our framework immediately implies an algorithm for adaptive kernel density estimation that uses $O\big(\frac{\sqrt{Q}}{\tau\varepsilon^2}\big)$ space and $O\big(\frac{d \log Q}{\sqrt{\tau}\,\varepsilon^2}\big)$ query time to answer each of $Q$ adaptive queries. In this case, we are again able to go beyond our framework and give a data structure that handles an unlimited number of adaptive kernel density queries:

Theorem 1.4. Suppose the kernel function $k$ is $L$-Lipschitz in the second variable for some $L > 0$, i.e., $|k(x, y) - k(x, z)| \le L\|y - z\|_2$ for all $x, y, z \in \mathbb{R}^d$. Moreover, suppose that for all $\|x - y\|_2 \le \rho$, we have $k(x, y) \le \frac{\tau}{3}$.
Then, given an algorithm that produces a kernel density estimation data structure $D$ that is $L$-Lipschitz over a set $X$ of points with diameter at most $\Delta$ and outputs a $(1+\varepsilon)$-approximation to KDE queries with value at least $\tau$ with probability at least $1 - \delta$ using space $S(n, \varepsilon, \tau, \log \delta)$ and query time $T(n, \varepsilon, \tau, \log \delta)$, there exists a KDE data structure that, with probability at least 0.99, outputs a $(1 + \varepsilon)$-approximation to any number of KDE queries with value at least $\tau$ using space

Additionally, we show that our framework guarantees adversarial robustness for a number of other important problems such as nearest neighbor search, range queries, point queries, matrix-vector norm queries, and linear regression. Finally, we supplement our theoretical results with a number of empirical evaluations, which are in the appendix.

1.2. OUR TECHNIQUES

Dynamic regression on adaptive inputs. Our dynamic algorithm for maintaining the least-squares objective exploits two main ideas. First, standard results in sketching and sampling show that it suffices to solve for the sketched objective $\min_{x \in \mathbb{R}^d} \|SAx - Sb\|_2^2$, where $S$ is an $\ell_2$ subspace embedding for $A$. Here, we exploit several techniques from numerical linear algebra and in particular use leverage score sampling to obtain a subspace embedding $S$ of $A$. By standard results in sketching, a $(1 + \varepsilon)$-optimal solution is given by $x^* = (SA)^\dagger Sb$. Moreover, since the goal is to output the objective value instead of the solution vector, we may take a Johnson-Lindenstrauss (JL) sketch to further reduce dimensionality and run-time. This allows us to focus on $\|GAx^* - Gb\|_2^2$, where $G \in \mathbb{R}^{O(\log d) \times n}$ is a JL sketch.

As a result, our algorithm dynamically maintains the quantity $GA(SA)^\dagger Sb$ in this sketched space. To achieve that, we first explicitly compute $GA(SA)^\dagger$ in pre-processing. Since $GA$ has few rows, this reduces to a small number of linear solves and can be computed fast via conjugate gradient-type methods. To handle the updates, we leverage their sparsity to efficiently maintain the solution and show that each round takes roughly $O(K)$ time. Amortizing the pre-processing with the update costs over all iterations yields our desired run-time.

Finally, we apply techniques from differential privacy to ensure adversarial robustness, by aggregating independent copies of the algorithm via a private median mechanism. Intuitively, the private mechanism hides the internal randomness of the algorithm and therefore prevents the adversary from choosing a "bad" input based on knowledge of internal parameters.

Adaptive query framework. Our framework maintains $O(\sqrt{Q})$ instances of the non-adaptive data structure and crucially uses differential privacy (DP) to protect the internal randomness of the data structures. In addition to our previous results for dynamic regression, the technique of using DP to hide randomness has recently been used in the streaming model (Hassidim et al., 2020; Kaplan et al., 2021) and the dynamic model (Beimel et al., 2021). These works elegantly use the advanced composition property of DP to bound the number of simultaneous algorithms that must be used in terms of the number of times the output changes "significantly" over the course of the stream. In the streaming model, the robust algorithms proceed by instantiating many "hidden" copies of a standard randomized algorithm. As the stream arrives, the algorithms are updated and an answer, aggregated using DP, is reported. Crucially, many of these results exploit the fact that the output answer is monotonic in the stream so that there is a known upper bound on the final output. Thus, the reported answers can only increase by a multiplicative factor at most a logarithmic number of times, which is used to bound the number of algorithms that are initialized. In our centralized setting, this can be imagined as setting the parameter $Q$. The main parameter of interest in the streaming literature is the space used by the streaming algorithms, whereas we are concerned with both space usage and query times. Furthermore, stream elements are only accessed one at a time and cannot be processed together unless memory is used. In our case, the dataset is given to us upfront and we can pre-process it to construct a data structure towards solving a centralized problem.

The work by Beimel et al. (2021) shares many of these ideas: the authors are concerned with dynamic graph algorithms where an adversary can update the graph in an adaptive fashion. Similar tools such as multiple randomized initializations and aggregated responses using DP are utilized.
The main difference is their parameters of interest: the goal of Beimel et al. (2021) is to have a fast amortized update time across many queries. This creates the need to "throw away" existing algorithms and start with fresh randomness at intermittent points. In contrast, we study a centralized setting where the underlying dataset is not updated but we wish to answer $Q$ adaptive queries on the dataset.

Inspired by these works, our main framework also uses advanced composition to show the sufficiency of maintaining $O(\sqrt{Q})$ data structures to answer $Q$ adaptive queries in the centralized setting, which gives a rich set of applications. Moreover, to improve the query time of our algorithms, we further invoke privacy amplification by sampling to show that it suffices to output the private median of a small subset, i.e., a subset of size $O(\log Q)$, of these $O(\sqrt{Q})$ data structures. Thus our framework only incurs a logarithmic overhead in query time and an $O(\sqrt{Q})$ overhead in space. Surprisingly, our simple framework gives diverse applications for adaptive algorithms on a number of important problems, including estimating matrix-vector norms, adaptive range query search, adaptive nearest neighbor search, and adaptive kernel density estimation, to name a few. These applications are discussed in depth in Section C.

Adaptive distance estimation. To achieve better pre-processing time for adaptive distance estimation, our main technique is to sample groups of rows from a Hadamard transform and argue that an interaction with a separate group should be charged to a separate privacy budget, effectively arguing that outputting $n$ approximate distances to a single adaptive query only uses one unit of privacy budget. By contrast, our black-box framework charges one unit of privacy budget per approximate distance, so that outputting $n$ approximate distances would use $n$ units of privacy budget.

Adaptive kernel density estimation.
Theorem 1.4 is based on showing that with constant probability, our data structure is accurate on all possible queries in R d . In particular, we first show that our data structure is accurate on a sufficiently fine net of points through a standard union bound argument, which incurs the d overhead compared to the space required to handle a single query. We then show that if the algorithm and the kernel function are both Lipschitz, which is true for sampling-based algorithms and a number of standard kernel functions, then accuracy on the net implies accuracy on all possible points in R d .
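The adaptive query framework described earlier in this subsection can be illustrated with a small sketch. The example below is a toy instance for matrix-vector norm queries, written under several assumptions that are not taken from the paper: the class name `AdaptiveNormOracle`, the Gaussian sketches, the concrete constants for the number of copies and the subset size, and, most importantly, the use of a plain median where the framework requires a differentially private median. It shows only the structure (about $\sqrt{Q}$ independent copies, a random subset of about $\log Q$ of them per query, a median over their answers), not the robustness guarantee.

```python
import math
import numpy as np


class AdaptiveNormOracle:
    """Toy structure: estimate ||A q||_2 from independent sketched copies of A.

    Hypothetical illustration only. A plain median is used as a stand-in for
    the private median that the framework actually needs.
    """

    def __init__(self, A, Q, m, rng):
        self.rng = rng
        # About sqrt(Q) independent copies (constants chosen for illustration).
        self.num_copies = math.ceil(math.sqrt(Q))
        # About log(Q) copies are consulted per query.
        self.subset_size = min(self.num_copies, math.ceil(math.log2(Q)) + 1)
        n = A.shape[0]
        # Each copy stores Pi_j A for an independent Gaussian sketch Pi_j.
        self.sketched = [
            (rng.standard_normal((m, n)) / math.sqrt(m)) @ A
            for _ in range(self.num_copies)
        ]

    def query(self, q):
        # Subsample a few copies, then aggregate their answers by a median.
        chosen = self.rng.choice(self.num_copies, size=self.subset_size, replace=False)
        answers = [np.linalg.norm(self.sketched[j] @ q) for j in chosen]
        return float(np.median(answers))


rng = np.random.default_rng(0)
A = rng.standard_normal((500, 20))
oracle = AdaptiveNormOracle(A, Q=100, m=200, rng=rng)
q = rng.standard_normal(20)
estimate = oracle.query(q)
exact = float(np.linalg.norm(A @ q))
```

Each query touches only `subset_size` of the `num_copies` sketches, which is where the logarithmic query-time overhead comes from, while the space grows with the number of copies.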

2. DYNAMIC REGRESSION UNDER LABEL UPDATES

In this section, we consider the dynamic problem of maintaining the cost of the least-squares regression, where the labels receive adaptively chosen updates. Let $A \in \mathbb{R}^{n \times d}$ be the design matrix and $b \in \mathbb{R}^n$ be the target label. A classic problem in numerical linear algebra and optimization is to solve the $\ell_2$ least-squares regression objective

$F(A, b) = \min_{x \in \mathbb{R}^d} \|Ax - b\|_2^2 = \|AA^\dagger b - b\|_2^2$. (2.1)

We consider a dynamic version of the problem, where the label receives adaptively chosen updates. We assume that each update can only affect $K$ entries of the label vector. In this setting, we show:

Theorem 2.1 (Main theorem; dynamic maintenance of regression cost). Let $\varepsilon \in (0, 1/4)$ be an error parameter and $b^{(1)}$ be the initial target label. Given $\varepsilon$, $A$, $b^{(1)}$, and a stream of $T$ adaptively chosen, $K$-sparse updates to the label, Algorithm 4 outputs an estimate $\hat{F}_i$ such that $\hat{F}_i = (1 \pm \varepsilon)F(A, b^{(i)})$ for all $i$ with high probability. Furthermore, the algorithm requires a preprocessing step in time $O(\mathrm{nnz}(A) + \mathrm{poly}(d))$. The amortized update time of the algorithm is $\tilde{O}\big(\sqrt{K\,\mathrm{nnz}(A)}\,(\kappa(A) + \varepsilon^{-3})\big)$.

We defer the technical details to Section 3. Here, we describe the main ideas of the algorithm. At a high level, our algorithm implements a sketch-and-solve strategy: it samples an $\ell_2$ subspace embedding $S$ for $A$ and solves the sketched problem $\min_{x} \|SAx - Sb\|_2^2$ (see Woodruff (2014)). Let $\hat{A} = SA$. A $(1 + \varepsilon)$-optimal solution is thus given by $x^* = \hat{A}^\dagger Sb$. Moreover, our goal is to maintain the regression cost, rather than this solution vector. Hence, we can apply the Johnson-Lindenstrauss lemma and focus on

$\min_{x \in \mathbb{R}^d} \|SAx - Sb\|_2^2 \approx \|GA(SA)^\dagger Sb - Gb\|_2^2$,

where $G \in \mathbb{R}^{O(\log n/\varepsilon^2) \times n}$ is a JL sketch. To track the cost value dynamically, the algorithm first computes and stores $M = GA(SA)^\dagger$ in pre-processing. In the first step, given the initial target label $b^{(1)}$, the algorithm computes $Sb^{(1)}$, $MSb^{(1)}$ and $Gb^{(1)}$. Then it outputs $\hat{F}_1 = \|MSb^{(1)} - Gb^{(1)}\|_2^2$ as an estimate of the regression cost.
For the later steps, we show how to maintain Gb (i) , Sb (i) efficiently, by exploiting the sparsity of the updates. Finally, to remain robust under adaptive inputs, we aggregate multiple copies using private median and carefully balance the parameters to achieve the run-time guarantee.
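The maintenance scheme described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's Algorithm 2: the class name `DynamicCostSketch` and the sketch sizes `k` and `m` are hypothetical, and dense Gaussian matrices stand in for both the leverage score sampling matrix $S$ and the JL sketch $G$. With a dense $S$ each update costs more than the $O(K)$ achieved by row sampling, and the robustness layer (private median over independent copies) is omitted.

```python
import numpy as np


class DynamicCostSketch:
    """Maintain an estimate of min_x ||Ax - b||_2^2 under sparse label updates.

    Illustrative only: Gaussian S and G replace the leverage-score sampling
    matrix and the JL sketch used in the text.
    """

    def __init__(self, A, b, k, m, rng):
        n, _ = A.shape
        self.S = rng.standard_normal((k, n)) / np.sqrt(k)  # subspace embedding stand-in
        self.G = rng.standard_normal((m, n)) / np.sqrt(m)  # JL sketch stand-in
        # M = G A (S A)^+ is computed once in preprocessing and then kept fixed.
        self.M = self.G @ A @ np.linalg.pinv(self.S @ A)
        self.Sb = self.S @ b
        self.Gb = self.G @ b

    def estimate(self):
        r = self.M @ self.Sb - self.Gb
        return float(r @ r)

    def update(self, idx, delta):
        """Apply b[idx] += delta and return the new cost estimate."""
        # Only the columns of S and G indexed by the changed entries are touched.
        self.Sb += self.S[:, idx] @ delta
        self.Gb += self.G[:, idx] @ delta
        return self.estimate()


rng = np.random.default_rng(1)
n, d, K = 2000, 5, 4
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + rng.standard_normal(n)
ds = DynamicCostSketch(A, b, k=400, m=400, rng=rng)

idx = rng.choice(n, size=K, replace=False)
delta = rng.standard_normal(K)
b_new = b.copy()
b_new[idx] += delta
est = ds.update(idx, delta)

x, *_ = np.linalg.lstsq(A, b_new, rcond=None)
exact = float(np.sum((A @ x - b_new) ** 2))
```

The point of the sketch is that `update` never revisits `A`: after preprocessing, a $K$-sparse change to the label only modifies the two short vectors `Sb` and `Gb`.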

3. DETAILS ON DYNAMIC REGRESSION

In this section, we consider the dynamic problem of maintaining the cost of the least-squares regression, where the labels receive adaptively chosen updates. We first introduce the basic setting of the problem in Section 3.1. In Section 3.2, we design a key subroutine under non-adaptive updates. The data structure enjoys a nearly linear update time. This allows us to aggregate multiple copies of the procedure and thereby efficiently ensure adversarial robustness against an adaptive adversary. The argument is via an application of differential privacy and detailed subsequently in Section 3.3.

3.1. BASIC SETTING

Let $A \in \mathbb{R}^{n \times d}$ be the design matrix and $b \in \mathbb{R}^n$ be the target label. A classic problem in numerical linear algebra and optimization is to solve the $\ell_2$ least-squares regression objective

$F(A, b) = \min_{x \in \mathbb{R}^d} \|Ax - b\|_2^2 = \|AA^\dagger b - b\|_2^2$. (3.1)

We consider a dynamic version of the problem, where the design matrix $A$ remains unchanged. However, at each step (at most) $K$ entries of $b$ undergo an update. Moreover, we assume that the updates are chosen adaptively by an adversary in the following manner.
• The algorithm starts by receiving the input $A \in \mathbb{R}^{n \times d}$ and $b^{(1)} \in \mathbb{R}^n$.
• In the $i$-th step, the algorithm outputs an estimate $\hat{F}_i$ of the cost $F(A, b^{(i)})$, where $b^{(i)}$ is the target label corresponding to the step.
• The adversary observes $\hat{F}_i$ and updates at most $K$ labels to form $b^{(i+1)}$.
Let $b^{(1)}, b^{(2)}, \ldots, b^{(T)} \in \mathbb{R}^n$ be the resulting sequence of labels over $T$ steps. The goal of the algorithm is to output a $(1 + \varepsilon)$-approximation to the optimal cost at every step, while minimizing the update time.
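As a concrete reference point for the setting above, the following sketch computes the exact cost $F(A, b)$ in two equivalent ways and applies one $K$-sparse label update. The function names and the random test data are illustrative assumptions; this is the naïve baseline that recomputes the cost from scratch at each step, not the dynamic algorithm.

```python
import numpy as np


def regression_cost(A, b):
    """Exact cost F(A, b) = min_x ||Ax - b||_2^2 via a least-squares solve."""
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    r = A @ x - b
    return float(r @ r)


def regression_cost_projection(A, b):
    """Same cost through the closed form ||A A^+ b - b||_2^2."""
    r = A @ (np.linalg.pinv(A) @ b) - b
    return float(r @ r)


rng = np.random.default_rng(0)
n, d, K = 200, 5, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
cost_before = regression_cost(A, b)

# One K-sparse update to the label: only K entries of b change.
idx = rng.choice(n, size=K, replace=False)
b_next = b.copy()
b_next[idx] += rng.standard_normal(K)
cost_after = regression_cost(A, b_next)
```

Recomputing the cost this way costs a full least-squares solve per step, which is the per-update expense the dynamic algorithm is designed to avoid.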

3.2. DYNAMIC ALGORITHM FOR OBLIVIOUS INPUTS

In this section, we provide a key subroutine that maintains a data structure under oblivious updates. At a high level, the data structure aims to enable a sketch-and-solve strategy dynamically. The main ideas are twofold: (1) apply randomized sketching to reduce dimensionality and therefore the run-time, and (2) exploit the sparsity of the updates to argue that the regression costs can be maintained efficiently. Before delving into the technical details, we give an overview of the algorithm.

Overview of the algorithm. We start by assuming that the algorithm has access to $D_{LS}$ (via Lemma A.11), the row leverage score sampling data structure for $A$. The algorithm samples a leverage score sampling matrix $S$, which is an $\ell_2$ subspace embedding for $A$, so that it suffices to solve the sketched problem $\min_x \|SAx - Sb\|_2^2$ (see Woodruff (2014)). Let $\hat{A} = SA$. Then a $(1 + \varepsilon)$-optimal solution is given by $x^* = \hat{A}^\dagger Sb$. Moreover, our goal is to maintain the regression cost, rather than this solution vector. Hence, we can apply the Johnson-Lindenstrauss lemma and focus on

$\min_{x \in \mathbb{R}^d} \|SAx - Sb\|_2^2 \approx \|GA(SA)^\dagger Sb - Gb\|_2^2$,

where $G \in \mathbb{R}^{O(\log n/\varepsilon^2) \times n}$ is a JL sketch.

Next, we describe how to track the cost value dynamically. We stress that the sketching matrices $S$ and $G$ are sampled upfront in the preprocessing stage and remain fixed afterwards. The algorithm stores $G$ and $M = GA(SA)^\dagger$, both computed in preprocessing. Meanwhile, it maintains $Gb^{(i)}$ and $Sb^{(i)}$, initialized at $i = 1$. In the first step, given the initial target label $b^{(1)}$, the algorithm computes $Sb^{(1)}$, $MSb^{(1)}$ and $Gb^{(1)}$. Then it outputs $\hat{F}_1 = \|MSb^{(1)} - Gb^{(1)}\|_2^2$ as an estimate of the regression cost.

Now consider the $i$-th step, where the label is updated to $b^{(i)}$. First, we read the $K$ labels that get changed and update $Sb^{(i-1)}$ to $Sb^{(i)}$ accordingly. This can be done in $O(K)$ time. Finally, we simply compute $M(Sb^{(i)})$ and $Gb^{(i)}$ and output $\hat{F}_i = \|MSb^{(i)} - Gb^{(i)}\|_2^2$. We store $Gb^{(i)}$ for the next iteration.

We now describe the algorithm formally, followed by an analysis of its run-time and accuracy.

Formal description of the algorithm. We assume $D_{LS}$ for $A$ is given.
The data structure is initialized by drawing the sketching matrices $G$ and $S$. We also compute $M = GA(SA)^\dagger$ in preprocessing. This matrix is stored explicitly throughout. At each step, the algorithm computes $Sb^{(i)}$ by reading all $K$ entries of $b^{(i-1)}$ that are updated in the step. After that, it computes $M(Sb^{(i)})$ and $Gb^{(i)}$ and outputs $\|MSb^{(i)} - Gb^{(i)}\|_2^2$. The algorithm is formally given by Algorithm 2.

Algorithm 1 Initialize the data structure, i.e., preprocessing
Input: Design matrix $A \in \mathbb{R}^{n \times d}$, initial label $b^{(1)} \in \mathbb{R}^n$, $D_{LS}$, $\varepsilon \in (0, 1)$
1: Let $k = \Theta(d \log d/\varepsilon^2)$
2: Sample a $(1 + \varepsilon/2)$ ℓ

Algorithm 2 Update data structure and maintain regression cost
Input: Matrices $G \in \mathbb{R}^{C\varepsilon^{-2}\log n \times n}$, $S \in \mathbb{R}^{k \times n}$, $M \in \mathbb{R}^{C\varepsilon^{-2}\log n \times k}$ and the label $b^{(i)}$
Output: Estimate of the regression cost $F(A, b^{(i)})$
1: Compute $Sb^{(i)}$ by reading all $K$ entries of $b^{(i-1)}$ that are updated.
2: Compute $MSb^{(i)}$ and $Gb^{(i)}$. ▷ Store $MSb^{(i)}$, $Sb^{(i)}$, $Gb^{(i)}$ for the next round.
3: Output $\hat{F}_i = \|MSb^{(i)} - Gb^{(i)}\|_2^2$.

Analysis of the algorithm. We now analyze the run-time of the algorithm. First, consider the preprocessing stage performed by Algorithm 1.

Lemma 3.1 (Preprocessing time). Assuming access to the leverage score sampling data structure $D_{LS}$, the preprocessing time of Algorithm 1 is
$O\big(\kappa(A)\,\mathrm{nnz}(A)\log\frac{1}{\varepsilon} + \frac{\mathrm{nnz}(A)}{\varepsilon^2}\log n + \frac{d}{\varepsilon^2}\log n\big)$.

Lemma 3.2 (Update time). Each update step of Algorithm 2 takes $O\big(\frac{K}{\varepsilon^2}\log n\big)$ time.

Proof. First, the algorithm reads the $K$ entries that are updated and computes $Sb^{(i)}$ from $Sb^{(i-1)}$. This step takes $O(K)$ time, since we just need to update the entries that lie in the support of the row sampling matrix $S$. Similarly, in step 2 of Algorithm 2 we can update $Gb^{(i-1)}$ to $Gb^{(i)}$ in $O(K \log n/\varepsilon^2)$ time. Since $S$ is a row sampling matrix and $b^{(i)}$ only has $K$ entries updated, $Sb^{(i)}$ has at most $K$ entries updated as well. It follows that given $MSb^{(i-1)}$ from the prior round, $MSb^{(i)}$ can be updated in $O\big(\frac{K}{\varepsilon^2}\log n\big)$ time.

Lemma 3.3 (Accuracy).
Given a stream of $T = O(d^2)$ non-adaptive updates and error parameter $\varepsilon \in (0, 1/4)$, Algorithm 2 outputs an estimate $\hat{F}_i$ of the regression cost $F(A, b^{(i)})$ such that $\hat{F}_i = (1 \pm \varepsilon)F(A, b^{(i)})$ for all $i$ with high probability.

Proof. First, we apply the subspace embedding property of $S$. This implies that with high probability,

$\min_x \|SAx - Sb^{(i)}\|_2^2 = (1 \pm \varepsilon/2) \min_x \|Ax - b^{(i)}\|_2^2$.

Next, we apply the JL lemma (Theorem A.6) to the collection of $O(d^2)$ many $(1 + \varepsilon)$-optimal predictions $\{y_i^*\}_{i=1}^{T}$ with $y_i^* = A(SA)^\dagger Sb^{(i)}$. Via a union bound, we have that with high probability for all $i \in [T]$,

$\|Gy_i^* - Gb^{(i)}\|_2^2 = (1 \pm \varepsilon/2)\|y_i^* - b^{(i)}\|_2^2$.

Our algorithm computes precisely $Gy_i^*$ in each iteration. Combining the two equations above finishes the proof.

3.3. DYNAMIC ALGORITHM WITH ADVERSARIAL ROBUSTNESS

To put everything together and ensure adversarial robustness, we use the standard approach of maintaining multiple independent copies of the data structure from Section 3.2. Then at each step, we output the private median of the outputs of these copies. Advanced composition of DP ensures robustness up to $T$ rounds. Afterwards, the algorithm reboots by rebuilding the copies, using fresh randomness independently for sampling and computing the sketching matrices.
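To make the aggregation step concrete, here is a minimal sketch of one standard way to release a differentially private median: the exponential mechanism over a finite grid of candidate outputs. This is an assumption chosen for illustration and is not necessarily the PrivMed routine used by the algorithm; the function name, the candidate grid, and the parameter values are all hypothetical.

```python
import bisect
import math
import random


def private_median(values, candidates, epsilon, rng):
    """Pick a candidate close to the median of `values` via the exponential mechanism.

    The utility of a candidate c is -|#{v <= c} - n/2|. Changing one value moves
    the count by at most 1, so the utility has sensitivity 1, and sampling with
    probability proportional to exp(epsilon * utility / 2) is epsilon-DP.
    """
    ordered = sorted(values)
    half = len(ordered) / 2
    utilities = [-abs(bisect.bisect_right(ordered, c) - half) for c in candidates]
    top = max(utilities)  # shift by the maximum for numerical stability
    weights = [math.exp(epsilon * (u - top) / 2) for u in utilities]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
# Outputs of 101 hypothetical copies, tightly clustered around 10.
copy_outputs = [9.5 + 0.01 * i for i in range(101)]
grid = [0.5 * j for j in range(41)]  # candidate answers 0.0, 0.5, ..., 20.0
released = private_median(copy_outputs, grid, epsilon=1.0, rng=rng)
```

When most copies agree, almost all of the sampling weight sits on candidates near their common value, so the released answer is accurate while revealing little about any single copy's randomness.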

Algorithm 3 Preprocessing step for Algorithm 4

Input: A design matrix $A \in \mathbb{R}^{n \times d}$, an approximation factor $\varepsilon \in (0, 1)$.
Output: The leverage score sampling data structure $D_{LS}$ for $A$.

For each query, return the output of PrivMed.

Theorem 3.4 (Main theorem; dynamic maintenance of regression cost). Let $\varepsilon \in (0, 1/4)$ be an error parameter and $b^{(1)}$ be the initial target label. Given $\varepsilon$, $A$, $b^{(1)}$, and a stream of $T$ adaptively chosen, $K$-sparse updates to the label, Algorithm 4 outputs an estimate $\hat{F}_i$ such that $\hat{F}_i = (1 \pm \varepsilon)F(A, b^{(i)})$ for all $i$ with high probability. Furthermore, the algorithm requires a preprocessing step in time $O(\mathrm{nnz}(A) + \mathrm{poly}(d))$. The amortized update time of the algorithm is $\tilde{O}\big(\sqrt{K\,\mathrm{nnz}(A)}\,(\kappa(A) + \varepsilon^{-3})\big)$ per round.

We defer the proof of Theorem 3.4 and a discussion on a deterministic algorithm to Section B.

2021)). Let $\varepsilon \in (0, 1/3)$, $\delta \in (0, \varepsilon/4)$, and $n \ge \frac{1}{\varepsilon^2}\log\frac{2\varepsilon}{\delta}$. Suppose $\mathcal{A} : X^n \to 2^X$ is an $(\varepsilon, \delta)$-differentially private algorithm that curates a database of size $n$ and produces a function $h : X \to \{0, 1\}$. Suppose $D$ is a distribution over $X$ and $S$ is a set of $n$ elements drawn independently and identically distributed from $D$. Then

$\Pr_{S \sim D,\, h \leftarrow \mathcal{A}(S)}\Big[\Big|\frac{1}{|S|}\sum_{x \in S} h(x) - \mathbb{E}_{x \sim D}[h(x)]\Big| \ge 10\varepsilon\Big] < \frac{\delta}{\varepsilon}$.

A.2 NUMERICAL LINEAR ALGEBRA

Our results on dynamic regression rely upon some tools in numerical linear algebra. We first recall the dimensionality reduction techniques.

Theorem A.6 (Johnson-Lindenstrauss transformation, $\varepsilon$-JL). Given $\varepsilon > 0$, there exists a family of random maps $\Pi_{m,d} \in \mathbb{R}^{m \times d}$ with $m = O\big(\frac{1}{\varepsilon^2}\big)$ such that for any $x \in \mathbb{R}^d$, we have

$\Pr_{\Pi \sim \Pi_{m,d}}\big[(1 - \varepsilon)\|x\|_2 \le \|\Pi x\|_2 \le (1 + \varepsilon)\|x\|_2\big] \ge \frac{3}{4}$.

Moreover, $\Pi x$ takes $O\big(\frac{\log d}{\varepsilon^2} + d \log d\big)$ time to compute.

A row sampling matrix $S$ has rows that are multiples of natural basis vectors, so that $SA$ is a (weighted) sample of the rows of $A$. A column sampling matrix is defined similarly. The size of a row/column sampling matrix is defined as the number of rows/columns it samples.
The leverage score of the $i$-th row $a_i^\top$ of $A$ is $\tau_i(A) \overset{\text{def}}{=} a_i^\top (A^\top A)^\dagger a_i$. For a survey on leverage scores and applications, we refer the reader to Mahoney (2011).

Definition A.8 (Leverage score sampling). Let $u$ be a vector of leverage score overestimates, i.e., $\tau_i(A) \le u_i$. Let $\alpha$ be a sampling rate parameter and $c$ be a fixed positive constant. For each row, we define a sampling probability $p_i = \min\{1, \alpha \cdot u_i c \log d\}$. The leverage score sampling matrix is a row sampling matrix $S$ with independently chosen entries such that $S_{ii} = \frac{1}{\sqrt{p_i}}$ with probability $p_i$ and 0 otherwise.

Definition A.9 (Subspace embedding). A $(1 \pm \varepsilon)$ $\ell_2$ subspace embedding for the column space of an $n \times d$ matrix $A$ is a matrix $S$ such that for all $x \in \mathbb{R}^d$, $\|SAx\|_2^2 = (1 \pm \varepsilon)\|Ax\|_2^2$.

Theorem A.10 (Leverage sampling implies subspace embedding, Theorem 17 of Woodruff (2014)). Let $\alpha = \varepsilon^{-2}$ and $c$ be a sufficiently large constant. With high probability, the leverage score sampling matrix is a $(1 \pm \varepsilon)$ $\ell_2$ subspace embedding. Furthermore, it has size $O(d \log d/\varepsilon^2)$.

The approximate leverage scores can be computed in input-sparsity time. Afterwards, repeated sampling from the leverage score distribution can be done efficiently using the binary tree data structure in quantum-inspired numerical linear algebra. Furthermore, there exists a sampling data structure $D_{LS}$ that stores the row leverage scores of $A$ such that, given a positive integer $m \le n$, it returns a leverage score sample of $A$ of size $m$ in $O(m \log(mn))$ time. In total, the pre-processing takes $O(\mathrm{nnz}(A) + \mathrm{poly}(d))$ time.
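The definitions above translate directly into a short NumPy sketch. It computes exact leverage scores from the pseudoinverse formula and draws the row sample of Definition A.8, using the exact scores as the overestimates $u$. The function names, the constant `c = 20`, and the test matrix are assumptions made for illustration; the exact computation here takes $O(nd^2)$ time and is not the input-sparsity-time routine or the $D_{LS}$ data structure referred to in the text.

```python
import numpy as np


def leverage_scores(A):
    """Exact leverage scores tau_i = a_i^T (A^T A)^+ a_i for every row a_i of A."""
    gram_pinv = np.linalg.pinv(A.T @ A)
    return np.einsum("ij,jk,ik->i", A, gram_pinv, A)


def leverage_score_sample(A, u, alpha, c, rng):
    """Keep row i independently with p_i = min(1, alpha * u_i * c * log d).

    Returns the kept row indices and the rescaling weights 1 / sqrt(p_i),
    so that S A equals weights[:, None] * A[rows].
    """
    _, d = A.shape
    p = np.minimum(1.0, alpha * u * c * np.log(d))
    rows = np.flatnonzero(rng.random(A.shape[0]) < p)
    weights = 1.0 / np.sqrt(p[rows])
    return rows, weights


rng = np.random.default_rng(0)
n, d, eps = 3000, 5, 0.5
A = rng.standard_normal((n, d))
tau = leverage_scores(A)
rows, weights = leverage_score_sample(A, tau, alpha=eps ** -2, c=20.0, rng=rng)
SA = weights[:, None] * A[rows]

# Compare ||S A x||^2 with ||A x||^2 on one test direction.
x = rng.standard_normal(d)
ratio = float(np.sum((SA @ x) ** 2) / np.sum((A @ x) ** 2))
```

Because the leverage scores sum to the rank of $A$, the expected number of sampled rows is at most $\alpha c\, d \log d$, independent of $n$.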

B ADDITIONAL DETAILS ON DYNAMIC REGRESSION

Theorem 3.4. [Main theorem; dynamic maintenance of regression cost] Let $\varepsilon \in (0, 1/4)$ be an error parameter and $b^{(1)}$ be the initial target label. Given $\varepsilon$, $A$, $b^{(1)}$, and a stream of $T$ adaptively chosen, $K$-sparse updates to the label, Algorithm 4 outputs an estimate $F_i$ such that $F_i = (1 \pm \varepsilon) F(A, b^{(i)})$ for all $i$ with high probability. Furthermore, the algorithm requires a preprocessing step in time $O(\mathrm{nnz}(A) + \mathrm{poly}(d))$. The amortized update time of the algorithm is $O\big(\sqrt{K\,\mathrm{nnz}(A)}\,(\sqrt{\kappa(A)} + \varepsilon^{-3})\big)$ per round.

Proof. We focus on any fixed epoch of $T$ iterations. Let $\{\mathcal{A}_i\}_{i=1}^{\Gamma}$ be the collection of $\Gamma$ data structures maintained by Algorithm 4 and let $\mathcal{T}_i$ be the transcript between Algorithm 4 and the adversary at round $i$, consisting of the algorithm's output and the update requested by the adversary.

To handle a sequence of $T$ adaptive queries, consider the transcript $\mathcal{T}(R) = \{\mathcal{T}_1, \ldots, \mathcal{T}_T\}$, where $R$ denotes the internal randomness of Algorithm 4. Note that for a fixed iteration, $\mathcal{T}_i$ is $\left(O\left(\frac{1}{\sqrt{T} \log(nT)}\right), 0\right)$-differentially private. By the advanced composition of differential privacy, i.e., Theorem A.4, the transcript $\mathcal{T}$ is $\left(O(1), \frac{1}{\mathrm{poly}(n)}\right)$-differentially private with respect to the algorithms $\mathcal{A}_1, \ldots, \mathcal{A}_\Gamma$.

Algorithm 4 runs $\Gamma$ instances of the data structure with error parameter $\varepsilon$. For any given round $i \in [T]$, we say that an instance $j \in [\Gamma]$ is correct if its output $f_{i,j}$ is within a $(1 \pm \varepsilon)$ factor of $F(A, b^{(i)})$ and incorrect otherwise. For a fixed $i$, let $Y_j$ be the indicator variable for whether $f_{i,j}$ is correct. From the generalization properties of differential privacy, i.e., Theorem A.5, we have that for any fixed iteration $i$,
\[ \Pr\left[ \left| \frac{1}{\Gamma} \sum_{j \in [\Gamma]} Y_j - \mathbb{E}[Y] \right| \ge \frac{1}{10} \right] < \frac{1}{\mathrm{poly}(m, T)}, \]
where $Y$ denotes the indicator random variable for whether a random instance of the algorithm $\mathcal{A}$ (not necessarily restricted to the $\Gamma$ instances maintained by the algorithm) is correct at the given round $i$.

Since a random instance $\mathcal{A}$ has randomness that is independent of the adaptive update, we have $\mathbb{E}[Y] \ge \frac{3}{4}$. Therefore,
\[ \Pr\left[ \frac{1}{\Gamma} \sum_{j \in [\Gamma]} Y_j > 0.6 \right] > 1 - \frac{1}{\mathrm{poly}(m, T)}, \]
which implies that the output on the $i$th round is correct with probability at least $1 - \frac{1}{\mathrm{poly}(m, T)}$, since $T = d$. Then by a union bound over $i \in [T]$ for all $T$ rounds within an epoch, we have that the data structure answers all $T$ queries with probability $1 - \frac{1}{m^2}$, under the adaptively chosen updates. Finally, by a union bound over all $m$ updates, we have that the algorithm succeeds with probability at least $1 - \frac{1}{m}$.

We now analyze the run-time of the algorithm. The preprocessing time follows from the guarantee of Lemma A.11. For update time, we amortize over each epoch. Within an epoch, we invoke $\Gamma = O\left(\sqrt{T} \log(nT)\right)$ copies of the data structure in Section 3.2, and so we consider the preprocessing and update time from there and amortize over the epoch length $T$. By Lemma 3.1, each copy takes
\[ \beta = O\left( \sqrt{\kappa(A)}\, \mathrm{nnz}(A) \log \frac{1}{\varepsilon} + \frac{\mathrm{nnz}(A)}{\varepsilon^2} \log n + \frac{d}{\varepsilon^2} \log n \right) \]
time to pre-process. For every step of update, each copy takes $O\left(\frac{K}{\varepsilon^2} \log n\right)$ time by Lemma 3.2. Therefore, the amortized update time for every epoch of length $T = O\left(\frac{\mathrm{nnz}(A)}{\varepsilon^2 K}\right)$ is
\[ O\left( \frac{1}{T} \left( \Gamma \beta + \Gamma T \frac{K}{\varepsilon^2} \log n \right) \right) = O\left( \sqrt{K\,\mathrm{nnz}(A)} \left( \sqrt{\kappa(A)} + \varepsilon^{-3} \right) \right). \]
This completes the proof.

B.1 AN EXACT AND DETERMINISTIC ALGORITHM

We now give a simple deterministic algorithm for the dynamic regression problem based on an SVD trick. Let $A = U \Sigma V^\top$ be the SVD of $A$, where $U \in \mathbb{R}^{n \times d}$, $\Sigma \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{d \times d}$. The starting observation is that for any solution vector $x$, we can write the regression cost as
\[ \|Ax - b\| = \left\| U \Sigma V^\top x - b \right\| = \left\| \Sigma V^\top x - U^\top b \right\|, \tag{B.1} \]
since $U$ is orthonormal. The goal is to maintain the solution vector $x = A^{\dagger} b$ and the associated right-side quantity $\left\| \Sigma V^\top x - U^\top b \right\|$. Now suppose we compute $A^{\dagger} \in \mathbb{R}^{d \times n}$ and $U^\top \in \mathbb{R}^{d \times n}$ in pre-processing, and $A^{\dagger} b^{(1)}$ and $U^\top b^{(1)}$ in the first round. Then since all subsequent updates to $b$ are $K$-sparse, we only pay $O(dK)$ time per step to maintain $A^{\dagger} b^{(i)}$ and $U^\top b^{(i)}$.

Algorithm 5 A simple SVD-based algorithm for dynamic regression

Input: Design matrix $A \in \mathbb{R}^{n \times d}$, its pseudoinverse $A^{\dagger} \in \mathbb{R}^{d \times n}$ and its SVD $A = U \Sigma V^\top$, a sequence of labels $b^{(i)} \in \mathbb{R}^n$
1: Compute and store SVD $A = U \Sigma V^\top$, where $U \in \mathbb{R}^{n \times d}$, $\Sigma \in \mathbb{R}^{d \times d}$, $V \in \mathbb{R}^{d \times d}$
2: Compute and store $A^{\dagger}$ from the SVD. ▷ In the 1st round, compute and store $A^{\dagger} b^{(1)}$, $U^\top b^{(1)}$.
3: for each update $b^{(i)}$ do
4:     Update and store $x^{(i)} = A^{\dagger} b^{(i)}$
5:     Update and store $U^\top b^{(i)}$
6:     Output $F_i = \left\| \Sigma V^\top x^{(i)} - U^\top b^{(i)} \right\|_2^2$

The algorithm is formally given by Algorithm 5. Observe that the algorithm always maintains the exact optimal regression cost. Moreover, the procedure does not require any randomness, and therefore it is adversarially robust to adaptive inputs. We formally claim the following guarantees of the algorithm.

Theorem B.1 (Deterministic maintenance of regression costs).

Given $A$, $b^{(1)}$ and a stream of adaptively chosen, $K$-sparse updates to the label, Algorithm 5 takes $O(dK)$ time to update and maintain the exact regression cost $F(A, b^{(i)})$ at all iterations $i$. The pre-processing requires an SVD of $A$, in $O(n^2 d)$ time.
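The incremental bookkeeping behind the $O(dK)$ update can be sketched in NumPy as follows. This is an illustrative sketch rather than the paper's implementation: the class and method names are hypothetical, $A$ is assumed to have full column rank, and update indices are assumed distinct. One further assumption should be read carefully: because $U$ here has only $d$ orthonormal columns, the sketch evaluates the squared cost through the least-squares identity $\min_x \|Ax - b\|_2^2 = \|b\|_2^2 - \|U^\top b\|_2^2$, which also accounts for the component of $b$ outside the column space of $A$. That identity is a choice of this sketch and is not the output expression printed in Algorithm 5.

```python
import numpy as np

class ExactRegressionCost:
    """Deterministic maintenance of min_x ||Ax - b||_2^2 under sparse label updates."""

    def __init__(self, A, b):
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        self.U = U                              # n x d, orthonormal columns
        self.pinv = (Vt.T / s) @ U.T            # A^+ = V Sigma^{-1} U^T (full column rank assumed)
        self.b = np.array(b, dtype=float)       # private copy of the labels
        self.x = self.pinv @ self.b             # A^+ b
        self.Utb = U.T @ self.b                 # U^T b
        self.b_sq = float(self.b @ self.b)      # ||b||_2^2

    def update(self, idx, new_vals):
        """Overwrite b[idx] with new_vals; idx holds K distinct indices. Costs O(dK)."""
        idx = np.asarray(idx)
        new_vals = np.asarray(new_vals, dtype=float)
        old_vals = self.b[idx]
        delta = new_vals - old_vals
        self.x += self.pinv[:, idx] @ delta     # only K columns of A^+ are touched
        self.Utb += self.U[idx].T @ delta       # only K rows of U are touched
        self.b_sq += float(new_vals @ new_vals - old_vals @ old_vals)
        self.b[idx] = new_vals

    def cost(self):
        """Squared regression cost ||b||^2 - ||U^T b||^2, clipped at zero for round-off."""
        return max(self.b_sq - float(self.Utb @ self.Utb), 0.0)
```

Since no randomness is involved, the answers cannot leak anything an adaptive adversary could exploit, which is the point made in the surrounding text.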

C A FRAMEWORK FOR ADVERSARIAL ROBUSTNESS

In this section, we describe the benchmark framework that enables $Q$ adaptive queries to a data structure by using $O\left(\sqrt{Q}\right)$ copies of a non-adaptive data structure. The framework and corresponding analysis of correctness are simply compartmentalizations of the techniques in Hassidim et al. (2020); Beimel et al. (2022); Attias et al. (2023). For the sake of completeness, we include them here and discuss additional applications. Namely, we show that through advanced composition of differential privacy, the private median of $O\left(\sqrt{Q}\right)$ copies protects the internal randomness of each non-adaptive data structure while still adding sufficiently small noise to guarantee accuracy. Moreover, we use amplification of privacy by sampling to only consider a small subset of the $O\left(\sqrt{Q}\right)$ non-adaptive data structures to further improve the runtime.

Algorithm 6 Adaptive Algorithm Interaction
1: $r \leftarrow O\left(\sqrt{Q} \log^2(nQ)\right)$, $k \leftarrow O(\log(nQ))$
2: for $i \in [r]$ do
3:     Implement data structure $D_i$ on the input
4: for each query $q_i$, $i \in [Q]$ do
5:     Let $S$ be a set of $k$ indices sampled (with replacement) from $[r]$
6:     For each $j \in [k]$, let $d_{i,j}$ be the output of $D_{S_j}$ on query $q_i$
7:     $d_i \leftarrow \mathrm{PrivMed}(\{d_{i,j}\}_{j \in [k]})$, where PrivMed is $(1, 0)$-DP

We first argue that Algorithm 6 maintains accuracy against $Q$ rounds of interaction with an adaptive adversary. Let $R = \{R^{(0)}, R^{(1)}, \ldots, R^{(r)}\}$, where $R^{(1)}, \ldots, R^{(r)}$ denote the random strings used by the oblivious data structures $D_1, \ldots, D_r$ and $R^{(0)}$ denotes the additional randomness used by Algorithm 6, such as in the private median subroutine PrivMed. Consider a transcript $\mathcal{T}(R) = \{\mathcal{T}_1, \ldots, \mathcal{T}_Q\}$ such that for each $i \in [Q]$, we define $\mathcal{T}_i = (q_i, d_i)$ to be the ordered pair consisting of the query $q_i$ and the corresponding answer $d_i$ by Algorithm 6 using the random string $R^{(0)}$, as well as the oblivious data structures $D_1, \ldots, D_r$ with random strings $R^{(1)}, \ldots, R^{(r)}$. We remark that $d_i$ is a random variable due to the randomness of each data structure, as well as the randomness of the private median subroutine PrivMed.
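The interaction in Algorithm 6 can be sketched in Python for one toy instantiation. Everything specific here is an assumption made for illustration: the oblivious data structures are Gaussian sketches that estimate $\|x\|_2$, PrivMed is instantiated as an exponential mechanism over a fixed public grid of candidate answers, queries are assumed to satisfy $\|x\|_2 \le$ `bound`, and all names and parameter values are hypothetical. The paper does not prescribe this particular private median.

```python
import numpy as np

def private_median(values, candidates, eps, rng):
    """Exponential-mechanism median over a fixed candidate grid.

    The utility of a candidate is minus the distance of its rank from the
    middle rank; changing one value moves any rank by at most 1, so the
    utility has sensitivity 1 and the mechanism is (eps, 0)-DP.
    """
    values = np.sort(np.asarray(values, dtype=float))
    ranks = np.searchsorted(values, candidates, side="right")
    utility = -np.abs(ranks - len(values) / 2.0)
    logits = eps * utility / 2.0
    probs = np.exp(logits - logits.max())       # subtract the max for stability
    probs /= probs.sum()
    return float(rng.choice(candidates, p=probs))

class RobustNormOracle:
    """r independent non-adaptive sketches; each query reads k sampled copies."""

    def __init__(self, d, r, k, m, bound, seed=0):
        self.rng = np.random.default_rng(seed)
        # Each copy is an m x d Gaussian sketch with E||Pi x||^2 = ||x||^2.
        self.copies = [self.rng.standard_normal((m, d)) / np.sqrt(m) for _ in range(r)]
        self.k = k
        self.candidates = np.linspace(0.0, bound, 801)   # public, data-independent grid

    def query(self, x):
        picks = self.rng.integers(0, len(self.copies), size=self.k)  # with replacement
        outputs = [np.linalg.norm(self.copies[j] @ x) for j in picks]
        return private_median(outputs, self.candidates, eps=1.0, rng=self.rng)
```

Sampling only $k$ of the $r$ copies per query is what the text calls amplification by sampling: each answer depends on a small random subset of the internal randomness, so it reveals little about any single copy.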
We will first argue that the transcript $\mathcal{T}(R)$ is differentially private with respect to $R$. We emphasize that similar arguments were made in the streaming model by Hassidim et al. (2020).

Lemma C.1. For a fixed iteration $i \in [Q]$, $\mathcal{T}_i$ is $\left(O\left(\frac{1}{\sqrt{Q} \log(nQ)}\right), 0\right)$-differentially private with respect to $R$.

Proof. We first observe that PrivMed is $(1, 0)$-differentially private on the outputs of the $r = O\left(\sqrt{Q} \log^2(nQ)\right)$ data structures. Algorithm 6 samples $k = O(\log(nQ))$ groups of data structures from the $r$ total data structures. Thus by amplification via sampling, i.e., Theorem A.2, PrivMed is $\left(O\left(\frac{1}{\sqrt{Q} \log(nQ)}\right), 0\right)$-differentially private. Therefore, $\mathcal{T}_i$ is $\left(O\left(\frac{1}{\sqrt{Q} \log(nQ)}\right), 0\right)$-differentially private with respect to $R$.

We next argue that the entire transcript is differentially private with respect to the randomness $R$.

Lemma C.2. $\mathcal{T}$ is $\left(O(1), \frac{1}{\mathrm{poly}(nQ)}\right)$-differentially private with respect to $R$.

Proof. By Lemma C.1, for each fixed iteration $i \in [Q]$, the transcript $\mathcal{T}_i$ is $\left(O\left(\frac{1}{\sqrt{Q} \log(nQ)}\right), 0\right)$-differentially private with respect to $R$. Note that the transcript $\mathcal{T}$ is an adaptive composition of the transcripts $\mathcal{T}_1, \ldots, \mathcal{T}_Q$. Thus, by the advanced composition of differential privacy, i.e., Theorem A.4, the transcript $\mathcal{T}$ is $\left(O(1), \frac{1}{\mathrm{poly}(nQ)}\right)$-differentially private with respect to $R$.

We now prove the correctness of our unifying framework.

Proof of Theorem 1.2: For a fixed query $q_i$ with $i \in [Q]$, let $S$ be the corresponding set of $k$ indices sampled from $[r]$. Let $V$ be the set of valid answers on query $q_i$. Let $I_j$ be an indicator variable for whether the output $d_{i,j}$ on query $q_i$ by $D_{S_j}$ is correct, so that $I_j = 1$ if $d_{i,j} \in V$ and $I_j = 0$ if $d_{i,j} \notin V$. By assumption, we have that for each $j \in [k]$, $\Pr[I_j = 1] \ge \frac{3}{4}$, so that $\mathbb{E}[I_j] \ge \frac{3}{4}$. We define the random variable $I = \frac{1}{k} \sum_{j \in [k]} I_j$ so that by linearity of expectation, $\mathbb{E}[I] = \frac{1}{k} \sum_{j \in [k]} \mathbb{E}[I_j] \ge \frac{3}{4}$.

To handle a sequence of $Q$ adaptive queries, we consider the transcript $\mathcal{T}(R) = \{\mathcal{T}_1, \ldots, \mathcal{T}_Q\}$ for the randomness $R = \{R^{(0)}, R^{(1)}, \ldots, R^{(r)}\}$ previously defined, i.e., for each $i \in [Q]$, $\mathcal{T}_i = (q_i, d_i)$ is the ordered pair consisting of the query $q_i$ and the corresponding answer $d_i$ by Algorithm 6 using the random string $R^{(0)}$, as well as the oblivious data structures $D_1, \ldots, D_r$ with random strings $R^{(1)}, \ldots, R^{(r)}$. By Lemma C.2, we have that $\mathcal{T}$ is $\left(O(1), \frac{1}{\mathrm{poly}(nQ)}\right)$-differentially private with respect to $R$.

For $j \in [k]$, we define the function $\mathrm{success}(R^{(S_j)})$ to be the indicator variable for whether the output $d_{i,S_j}$ by data structure $D_{S_j}$ is successful on query $q_i$. For example, if $D$ is supposed to answer queries within a $(1 + \alpha)$-approximation, then we define $\mathrm{success}(R^{(S_j)})$ to be one if $d_{i,S_j}$ is within a $(1 + \alpha)$-approximation to the true answer on query $q_i$, and zero otherwise. From the generalization properties of differential privacy, i.e., Theorem A.5, we have
\[ \Pr\left[ \left| \frac{1}{k} \sum_{j \in [k]} \mathrm{success}(R^{(S_j)}) - \mathbb{E}_R[\mathrm{success}(R)] \right| \ge \frac{1}{10} \right] < \frac{1}{\mathrm{poly}(n, Q)}, \]
for sufficiently small $O(1)$. Therefore,
\[ \Pr\left[ \frac{1}{k} \sum_{j \in [k]} I_j > 0.6 \right] > 1 - \frac{1}{\mathrm{poly}(n, Q)}, \]
which implies that $d_i$ is correct on query $q_i$. Then by a union bound over $i \in [Q]$ for all $Q$ adaptive queries, we have that the data structure answers all $Q$ adaptive queries with high probability. □

D APPLICATIONS OF OUR FRAMEWORK

Theorem 1.2 has applications to a number of central problems in data science and machine learning. In this section, we formally describe the range queries, point queries, matrix-vector norm queries, and linear regression problems; we defer discussion of adaptive distance estimation, kernel density estimation, and nearest neighbor search to the appendix.

D.1 APPLICATION: MATRIX-VECTOR NORM QUERIES

In the matrix-vector norm query problem, we are given a matrix $A \in \mathbb{R}^{n \times d}$ and we would like to handle $Q$ adaptive queries $x^{(1)}, \ldots, x^{(Q)}$ for an approximation parameter $\varepsilon > 0$ by outputting a $(1 + \varepsilon)$-approximation to $\|Ax^{(i)}\|_p$ for each query $x^{(i)} \in \mathbb{R}^d$ with $i \in [Q]$. Here we define $\|v\|_p^p = \sum_{i \in [d]} |v_i|^p$ for a vector $v \in \mathbb{R}^d$. Observe that computing $Ax^{(i)}$ explicitly and then computing its $p$-norm requires $O(nd)$ time. Thus for $n \gg d$, a much faster approach is to produce a subspace embedding, i.e., to compute a matrix $M \in \mathbb{R}^{m \times d}$ with $m \ll n$, such that for all $x \in \mathbb{R}^d$,
\[ (1 - \varepsilon)\|Ax\|_p \le \|Mx\|_p \le (1 + \varepsilon)\|Ax\|_p. \]
However, because subspace embeddings must be correct over all possible queries, the number of rows of $M$ is usually $m = \Omega\left(\frac{d}{\varepsilon^2}\right)$ due to requiring correctness over an $\varepsilon$-net.

Theorem D.1 (Indyk (2006); Li (2008)). Given $A \in \mathbb{R}^{n \times d}$, $p \in (0, 2]$, and an accuracy parameter $\varepsilon > 0$, there exists an algorithm that creates a data structure that uses $O\left(\frac{1}{\varepsilon^2} \log n\right)$ bits of space and outputs a $(1 + \varepsilon)$-approximation to $\|Ax\|_p$ for a query $x \in \mathbb{R}^d$, with high probability, in time $O\left(\frac{d}{\varepsilon^2} \log n\right)$.

Theorem D.1 essentially creates a matrix $R \in \mathbb{R}^{m \times n}$ of random variables sampled from a $p$-stable distribution (Zolotarev, 1986) and then stores the matrix $RA$. Once the query $x$ arrives, the data structure then outputs a $(1 + \varepsilon)$-approximation to $\|Ax\|_p$ by computing a predetermined function on $RAx$. The restriction $p \in (0, 2]$ is due to the fact that $p$-stable distributions only exist for $p \in (0, 2]$. From Theorem D.1 and Theorem 1.2, we have the following:

Theorem D.2. Given $A \in \mathbb{R}^{n \times d}$, $p \in (0, 2]$, and an accuracy parameter $\varepsilon > 0$, there exists an algorithm that creates a data structure that uses $O\left(\frac{\sqrt{Q}}{\varepsilon^2} \log^2(nQ)\right)$ bits of space and outputs a $(1 + \varepsilon)$-approximation to $\|Ax^{(i)}\|_p$ with $i \in [Q]$ for $Q$ adaptive queries $x^{(1)}, \ldots, x^{(Q)} \in \mathbb{R}^d$, with high probability, in time $O\left(\frac{d}{\varepsilon^2} \log^2(nQ) + \log^3(nQ)\right)$.
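The non-adaptive building block behind Theorem D.1 can be illustrated for the special case $p = 1$, where the $1$-stable distribution is the standard Cauchy. The sketch below is a simplified illustration, not the space-optimal construction cited in the theorem: it stores $RA$ in floating point, uses the median of $|RAx|$ as the "predetermined function", and the function names are hypothetical. It relies on the fact that each coordinate of $RAx$ is Cauchy with scale $\|Ax\|_1$, and that the median of the absolute value of a Cauchy variable equals its scale.

```python
import numpy as np

def build_l1_sketch(A, m, rng):
    """Store RA, where R is an m x n matrix of i.i.d. standard Cauchy entries."""
    R = rng.standard_cauchy((m, A.shape[0]))
    return R @ A                      # m x d, independent of any future query

def estimate_l1(RA, x):
    """Estimate ||Ax||_1 from the sketch alone, in O(md) time.

    Each coordinate of RAx is Cauchy with scale ||Ax||_1, so the sample
    median of the absolute values concentrates around ||Ax||_1.
    """
    return float(np.median(np.abs(RA @ x)))
```

A single such sketch is only safe for queries chosen independently of $R$; for $Q$ adaptive queries, many independent copies would be combined through the private-median framework of Section C.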

D.2 APPLICATION: LINEAR REGRESSION

In the linear regression problem, we are given a fixed matrix $A \in \mathbb{R}^{n \times d}$ and we would like to handle $Q$ adaptive queries $b^{(1)}, \ldots, b^{(Q)}$, for an approximation parameter $\varepsilon > 0$, by outputting a $(1 + \varepsilon)$-approximation to $\min_{x \in \mathbb{R}^d} \|Ax - b^{(i)}\|_2$ for each query $b^{(i)} \in \mathbb{R}^n$ with $i \in [Q]$.

Theorem D.3 ([...] (2013)). Given $A \in \mathbb{R}^{n \times d}$, $b \in \mathbb{R}^n$, and an accuracy parameter $\varepsilon > 0$, there exists an algorithm that creates a data structure that uses $O\left(\frac{d^2}{\varepsilon^2} \log^2(nQ)\right)$ bits of space and outputs a $(1 + \varepsilon)$-approximation to $\min_{x \in \mathbb{R}^d} \|Ax - b\|_2$ with high probability.

However, this may fail for multiple interactions with the data structure. For example, suppose the adversary learns the kernel of the sketching matrix $S$ used by the data structure. Then the adversary could query some vector $b^{(i)}$ in the kernel of $S$ so that $Sb^{(i)}$ is the all zeros vector, so that the output is the all zeros vector of dimension $d$, which could be arbitrarily bad compared to the actual minimizer. Thus the naïve approach is to maintain $Q$ subspace embeddings, one for each query, resulting in a data structure with space $O\left(\frac{Qd}{\varepsilon^2}\right)$. By comparison, Theorem D.3 and Theorem 1.2 yield the following:

Theorem D.4. Given $A \in \mathbb{R}^{n \times d}$ and an accuracy parameter $\varepsilon > 0$, there exists an algorithm that creates a data structure that uses $O\left(\frac{\sqrt{Q} d^2}{\varepsilon^2} \log^3(nQ)\right)$ bits of space and, with high probability, outputs $(1 + \varepsilon)$-approximations to $\min_{x \in \mathbb{R}^d} \|Ax - b^{(i)}\|_2$ for $Q$ adaptive queries $b^{(1)}, \ldots, b^{(Q)}$.

D.3 APPLICATION: HALF-SPACE QUERIES

Given a set $P$ of $n$ points in $\mathbb{R}^d$, the range query or search problem asks us to pre-process $P$ so that given a region $R$, chosen from a predetermined family, one can quickly count or return the points in $P \cap R$. This is an extremely well-studied class of problems in computational geometry (Toth et al., 2017) and the case where the regions $R$ are hyperplanes (also called half-spaces) is of special interest since many algebraic constraints can be "lifted" to be hyperplanes in a higher dimension. Unfortunately, exact versions of the problem are known to have the "curse of dimensionality" and suffer from exponential dependence on $d$ in the query time (Brönnimann et al., 1993; Chazelle, 2000).

Nonetheless, Chazelle et al. (2008) gave a data structure capable of answering hyperplane queries approximately with polynomial query time. Their notion of approximation is as follows: given a set of points $P$ in the unit $\ell_2$ ball, a hyperplane $R$, and $\varepsilon > 0$, we return the number of points that are on a given side of the hyperplane $R$ up to additive error equal to the number of points in $P$ which lie within distance $\varepsilon$ of the boundary of $R$. We will refer to this query as an $\varepsilon$-approximate hyperplane query. Chazelle et al. (2008) proved the following theorem.

Theorem D.5 (Chazelle et al. (2008)). Given a set of points $P$ that lie in the unit $\ell_2$ ball, there exists a data structure that pre-processes $P$ using space $O\left(d n^{O(\varepsilon^{-2})}\right)$ such that any $\varepsilon$-approximate hyperplane range query is answered correctly with high probability. The query time is $O\left(d / \varepsilon^2\right)$.

The data structure of Chazelle et al. (2008) is randomized and in particular employs randomized dimensionality reduction. Thus, it is feasible that queries might fail for multiple adaptive interactions with the data structure. By utilizing our framework of Section C and Theorem 1.2, we can obtain the following robust guarantee.

Theorem D.6. Given a set of points $P$ that lie in the unit $\ell_2$ ball, there exists a data structure which pre-processes $P$ using space $O\left(\sqrt{Q}\, d n^{O(\varepsilon^{-2})}\right)$ such that $Q$ adaptive $\varepsilon$-approximate hyperplane range queries are answered correctly with high probability. The query time is $O\left(d / \varepsilon^2\right)$.

D.4 APPLICATION: POINT QUERIES ON TURNSTILE STREAMS

In the problem of point queries on turnstile streams, there exists a stream of $m$ updates. Each update specifies a coordinate $i \in [n]$ of an underlying frequency vector $f \in \mathbb{R}^n$ and changes $f_i$ by some amount $\Delta_i \in [-\Delta, \Delta]$, where $\Delta = \mathrm{poly}(n)$. Given any constant accuracy parameter $\varepsilon > 0$ and any time $t \in [m]$, we define $f^{(t)}$ to be the frequency vector implicitly defined after the first $t$ updates. Then the point query problem is to output $f^{(t)}_i$ for various choices of $t \in [m]$ and $i \in [n]$ within an additive error of $\varepsilon \|f^{(t)}\|_1$.

Theorem D.7 (Alman & Yu (2020)). There exists an algorithm that uses space $O\left(\log^2 n\right)$ bits, worst-case update time $O\left(\log^{0.582} n\right)$, and query time $O\left(\log^{1.582} n\right)$, that supports point queries with $\varepsilon = 0.1$ with high probability.

An important quality of Theorem D.7 is that it significantly improves the update time over previous data structures, e.g., Charikar et al. (2004), at a cost in query time. By applying Theorem 1.2, we can avoid a blow-up in query time while still enjoying the update time improvements:

Theorem D.8. There exists an algorithm that uses space $O\left(\sqrt{Q} \log^3(nQ)\right)$ bits, has worst-case update time $O\left(\sqrt{Q} \log^{1.582}(nQ)\right)$ and query time $O\left(\log^3(nQ)\right)$, and supports $Q$ adaptive point queries with $\varepsilon = 0.1$ and with high probability.

E ADAPTIVE DISTANCE ESTIMATION

In the adaptive distance estimation problem, there exists a set $X = \{x^{(1)}, \ldots, x^{(n)}\}$ of $n$ points in $\mathbb{R}^d$. Given an accuracy parameter $\varepsilon > 0$, the goal is to output a $(1 + \varepsilon)$-approximation to $\|x^{(i)} - q\|_p$ for each query $q$ across all points $x^{(i)} \in X$, while minimizing the space, query time, or pre-processing time for the corresponding data structures. The trivial solution stores all $n$ points and computes all $n$ distances to each query point and thus can handle an unlimited number of queries. Since each point has dimension $d$, the trivial solution uses space and query time $O(nd)$.

In this section, we first apply our framework to show a data structure that can handle $Q$ queries of approximate distances from a specified point in $X$, using query time $O\left(\frac{n+d}{\varepsilon^2}\right)$, pre-processing time $O\left(\frac{nd\sqrt{Q}}{\varepsilon^2}\right)$, and space $O\left(\frac{(n+d)\sqrt{Q}}{\varepsilon^2}\right)$. Hence for $d \gg n\sqrt{Q}$, our data structure already improves on the work of Cherapanamjeri & Nelson (2020). However, in this setting, each of the $Q$ queries returns only the approximate distance between a query point and a single point in $X$. By comparison, Cherapanamjeri & Nelson (2020) outputs approximate distances to all points in $X$ and moreover, follow-up work by Cherapanamjeri & Nelson (2022) improved the pre-processing time to $O\left(\frac{nd}{\varepsilon^2}\right)$. Therefore, we address these two shortcomings of our framework by giving a data structure that (1) handles the case where we return the approximate distances of all points in $X$ from $Q$ adaptive query points and (2) achieves pre-processing time $O\left(\frac{nd}{\varepsilon^2}\right)$.

Algorithm 7 Adaptive Distance Estimation
[...]
9: $d_i \leftarrow \mathrm{PrivMed}(\{d_{i,j}\}_{j \in [k]})$, where PrivMed is $(1, 0)$-DP.
10: return $d_i$

For completeness, we now show correctness of our algorithm across all $Q$ adaptive queries, though we remark that the proof can simply be black-boxed into Theorem 1.2.

Theorem E.1. With high probability, we have
\[ (1 - \varepsilon)\|x_{i_q} - y_q\|_2 \le d_i \le (1 + \varepsilon)\|x_{i_q} - y_q\|_2 \]
for all $q \in [Q]$.

Proof. Fix query $(y_q, i_q)$ with $q \in [Q]$ and $i_q \in [n]$. Let $S$ be a set of $k$ indices sampled (with replacement) from $[r]$. By Theorem A.6 or Theorem A.7, we have for each $j \in [k]$,
\[ \Pr\left[ (1 - \varepsilon)\|x_{i_q} - y_q\|_2 \le \|\Pi_{S_j}(x_{i_q} - y_q)\|_2 \le (1 + \varepsilon)\|x_{i_q} - y_q\|_2 \right] \ge \frac{3}{4}. \]
Let $I_j$ be an indicator variable so that $I_j = 1$ if $(1 - \varepsilon)\|x_{i_q} - y_q\|_2 \le \|\Pi_{S_j}(x_{i_q} - y_q)\|_2 \le (1 + \varepsilon)\|x_{i_q} - y_q\|_2$ and $I_j = 0$ otherwise. As in Lemma C.1, by amplification via sampling, the answer to each query is $\left(O\left(\frac{1}{\sqrt{Q} \log(nQ)}\right), 0\right)$-differentially private. Thus, by the advanced composition of differential privacy, i.e., Theorem A.4, the mechanism permits $Q$ adaptive queries and is $\left(O(1), \frac{1}{\mathrm{poly}(nQ)}\right)$-differentially private. By the generalization properties of differential privacy, i.e., Theorem A.5, we have
\[ \Pr\left[ \left| \frac{1}{k} \sum_{j \in [k]} I_j - \mathbb{E}[I] \right| \ge \frac{1}{10} \right] < \frac{1}{\mathrm{poly}(Q, n)}, \]
for sufficiently small $O(1)$. Thus we have
\[ \Pr\left[ \frac{1}{k} \sum_{j \in [k]} I_j > 0.6 \right] > 1 - \frac{1}{\mathrm{poly}(Q, n)}, \]
which implies that $(1 - \varepsilon)\|x_{i_q} - y_q\|_2 \le d_i \le (1 + \varepsilon)\|x_{i_q} - y_q\|_2$. Therefore, by a union bound across $Q$ adaptive queries $(y_q, x_{i_q})$ with $q \in [Q]$, we have that $(1 - \varepsilon)\|x_{i_q} - y_q\|_2 \le d_i \le (1 + \varepsilon)\|x_{i_q} - y_q\|_2$ for all $q \in [Q]$ with high probability.

Theorem E.2. There exists an algorithm that answers $Q$ adaptive distance estimation queries within a factor of $(1 + \varepsilon)$. For O

E.1 FASTER PRE-PROCESSING TIME FOR ADAPTIVE DISTANCE ESTIMATION

In this section, we present an improved algorithm for Adaptive Distance Estimation, which allows the release of distances to all $n$ points in the dataset for a single query, matching the query time of Cherapanamjeri & Nelson (2020) with an improved space complexity of $O(\varepsilon^{-2} \sqrt{Q}\, n)$. Our results utilize a class of structured randomized linear transformations based on Hadamard matrices recursively defined below:
\[ H_1 = [1], \qquad H_d = \begin{bmatrix} H_{d/2} & H_{d/2} \\ H_{d/2} & -H_{d/2} \end{bmatrix}. \]
The associated class of randomized linear transformations is now defined below:
\[ \{D^j\}_{j \in [m]} \subset \mathbb{R}^{d \times d} \ \text{s.t.}\ D^j_{k,l} \overset{iid}{\sim} \begin{cases} \mathcal{N}(0, 1) & \text{if } k = l \\ 0 & \text{otherwise} \end{cases}, \qquad \forall z \in \mathbb{R}^d : h(z) = \begin{bmatrix} H_d D^1 \\ H_d D^2 \\ \vdots \\ H_d D^m \end{bmatrix} \cdot z. \tag{SRHT} \]
Note that for any vector $z$, $h(z)$ may be computed in time $O(md \log d)$ due to the recursive definition of the Hadamard transform. We now let $\phi$ and $\Phi$ denote the pdf and cdf of a standard normal random variable, $\mathrm{Quant}_\alpha(\{a_i\}_{i \in [l]})$ the $\alpha$th quantile of a multi-set of real numbers $\{a_i\}_{i \in [l]}$ for any $l \in \mathbb{N}$, and define $\psi_r$ as follows:
\[ \forall r > 0,\ a \in \mathbb{R} : \psi_r(a) := \min(|a|, r). \]
Through the remainder of the section, we condition on the event defined in the following lemma:

Lemma E.3 (Claims 5.1 and 5.2 of Cherapanamjeri & Nelson (2022)). For any $\delta \in \left(0, \frac{1}{2}\right)$, with probability at least $1 - \delta$:
\[ \forall z \ \text{s.t.}\ \|z\| = 1 : \quad 2 \le \mathrm{Quant}_{\alpha - \beta/4}\left(\{h(z)_i\}_{i \in [md]}\right) \le \mathrm{Quant}_{\alpha + \beta/4}\left(\{h(z)_i\}_{i \in [md]}\right) \le 4, \]
\[ \forall z \ \text{s.t.}\ \|z\| = 1,\ r \ge 4 \log(1/\varepsilon) : \quad 1 - \frac{\varepsilon}{2} \le \frac{1}{md} \cdot \sqrt{\frac{\pi}{2}} \cdot \sum_{i \in [md]} \psi_r(h_i(z)) \le 1 + \frac{\varepsilon}{2}, \]
as long as $m \ge C \varepsilon^{-2} \log(2/\delta) \log^5(d/\varepsilon)$ for some absolute constant $C > 0$.

Lemma E.4. Suppose $h$ satisfies the conclusion of Lemma E.3. Then, there is an algorithm, RetNorm, which satisfies for all $x \in \mathbb{R}^d$:
\[ \mathbb{P}_S \left\{ (1 - \varepsilon) \cdot \|x\| \le \mathrm{RetNorm}(h(x)_S) \le (1 + \varepsilon) \cdot \|x\| \right\} \ge 1 - \delta \]
for $S = \{i_j\}_{j \in [k]}$ with $i_j \overset{iid}{\sim} \mathrm{Unif}([md])$ when $k \ge C \varepsilon^{-2} \log(2/\varepsilon) \log(2/\delta)$ for some $C > 0$. Furthermore, RetNorm runs in time $O(k)$.

With these primitives, we will construct our data structure for adaptive distance estimation. Our construction is formally described in Algorithm 8.
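The $O(md \log d)$ evaluation of $h(z)$ follows from the butterfly structure of the Hadamard recursion. The following NumPy sketch is illustrative only: the function names are hypothetical, the Gaussian diagonals are passed in as the rows of an $m \times d$ array, and `norm_estimate` simply evaluates the truncated average from the second statement of Lemma E.3 over all $md$ coordinates (it is not the RetNorm subroutine, which reads only a sampled subset).

```python
import numpy as np

def fwht(v):
    """Unnormalized Walsh-Hadamard transform H_d v in O(d log d) time.

    Follows the recursion H_d = [[H, H], [H, -H]]; d must be a power of two.
    """
    v = np.array(v, dtype=float)               # contiguous private copy
    d = v.shape[0]
    if d == 0 or d & (d - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < d:
        blocks = v.reshape(-1, 2, h)           # view: pairs of half-blocks of size h
        top = blocks[:, 0, :].copy()
        bottom = blocks[:, 1, :].copy()
        blocks[:, 0, :] = top + bottom         # butterfly step, written into v
        blocks[:, 1, :] = top - bottom
        h *= 2
    return v

def srht(z, diagonals):
    """Stack H_d D^j z for j = 1..m, where row j of `diagonals` is diag(D^j)."""
    return np.concatenate([fwht(g * z) for g in diagonals])

def norm_estimate(hz, r):
    """sqrt(pi/2) times the average of psi_r(h_i(z)) = min(|h_i(z)|, r)."""
    return float(np.sqrt(np.pi / 2.0) * np.mean(np.minimum(np.abs(hz), r)))
```

Each coordinate of $H_d D^j z$ is a signed Gaussian combination of the entries of $z$ and is therefore distributed as $\mathcal{N}(0, \|z\|^2)$, whose mean absolute value is $\sqrt{2/\pi}\,\|z\|$; this is where the $\sqrt{\pi/2}$ normalization comes from.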

Algorithm 8 Adaptive Distance Estimation with SRHTs

1: $m \leftarrow C \varepsilon^{-2} \log^6(2dn/\varepsilon)$
2: Let $h$ be an SRHT as defined in (SRHT) ▷ Revealed to analyst
3: $r \leftarrow C \sqrt{Q} \log^3(nd)$, $k \leftarrow C \varepsilon^{-2} \log(2/\varepsilon) \log(2nd)$
4: for $i \in [n]$ do
5:     Compute $y_i = h(x_i)$
6:     for $j \in [r]$ do
7:         Let $S_{i,j}$ be a set of $k$ indices sampled with replacement from $[md]$
8: $l \leftarrow C \log(nd)$
9: for $j \in 1 : Q$ do ▷ Adaptive queries
10:     Receive query $q_j$
11:     $v_j \leftarrow h(q_j)$
12:     for $i \in [n]$ do
13:         Let $\{t_{i,j,p}\}_{p \in [l]}$ be a set of $l$ indices sampled (with replacement) from $[r]$
14:         for $p \in [l]$ do
15:             $d_{i,j,p} \leftarrow \mathrm{RetNorm}\left((v_j - y_i)_{S_{i,t_{i,j,p}}}\right)$
16:         $d_{i,j} \leftarrow \mathrm{PrivMed}(\{d_{i,j,p}\}_{p \in [l]})$, where PrivMed is $(O(1), 0)$-DP.
17:     return $\{d_{i,j}\}_{i \in [n]}$

The proof of correctness of Algorithm 8 will follow along similar lines to that of Algorithm 6, with a more refined analysis of the privacy loss incurred due to the adaptivity of the data analyst. In particular, each input query results in $n$ different queries made to a differentially private mechanism PrivMed, leading to a total of $nQ$ queries. A naïve application of Theorem 1.2 would thus result in a data structure with space complexity scaling as $O(n^{3/2} \sqrt{Q})$ as opposed to the desired $O(n \sqrt{Q})$ and query complexity $O(\varepsilon^{-2} nd)$. The key insight yielding the improved result is that the privacy loss incurred by a single query is effectively amortized across $n$ independent differentially private algorithms, each capable of answering $Q$ adaptively chosen queries correctly with high probability.

To start, we first condition on the event in Lemma E.3 and assume public access to the correspondingly defined SRHT $h$. We now use $R$ to denote the randomness used to instantiate the multisets $S_{i,j}$ in Algorithm 8 and decompose it as follows: $R = \{R_i\}_{i \in [n]}$ with $R_i = \{R_{i,j}\}_{j \in [r]}$, where $R_{i,j}$ corresponds to the randomness used to generate the set $S_{i,j}$ and the random elements $t_{i,p}$. As in the proof of Theorem 1.2, we define a transcript $\mathcal{T} = \{\mathcal{T}_j\}_{j \in [Q]}$ with $\mathcal{T}_j = (q_j, \{d_{i,j}\}_{i \in [n]})$ denoting the $j$th query and the responses returned by Algorithm 8 as a single transaction.

Lemma E.5. For all $i \in [n]$, $j \in [Q]$, $\mathcal{T}_j$ is $\left(o\left(\frac{1}{\sqrt{Q} \log(nQ)}\right), 0\right)$-differentially private with respect to $R_i$.

Proof. The proof is identical to that of Lemma C.1 with the observation that each transaction $\mathcal{T}_j$ only results in a single query to a differentially private mechanism operating on $R_i$.

Lemma E.6. For all $i \in [n]$, $\mathcal{T}$ is $\left(o(1), \frac{1}{\mathrm{poly}(nQ)}\right)$-differentially private with respect to $R_i$.

Proof. The proof is identical to Lemma C.2 and follows from Theorem A.4 and Lemma E.5.

We now prove the correctness of our improved procedure for adaptive distance estimation.

Proof of Theorem 1.3: We condition on the event in the conclusion of Lemma E.3 and start by bounding the failure probability of a single query. The bound for the whole sequence of adaptively chosen queries follows by a union bound. Now, fixing $i \in [n]$ and $j \in [Q]$, note that the sub-transcript $\mathcal{T}^{(j)} = \{\mathcal{T}_p\}_{p \in [j-1]}$ is $\left(o(1), \frac{1}{\mathrm{poly}(nQ)}\right)$-differentially private with respect to $R_i$ by Lemma E.6. For $p \in [l]$, define the indicator
\[ W_p := \mathbb{1}\left\{ (1 - \varepsilon) \cdot \|q_j - x_i\| \le \mathrm{RetNorm}\left((v_j - y_i)_{S_{i,t_{i,j,p}}}\right) \le (1 + \varepsilon) \cdot \|q_j - x_i\| \right\}. \]
Additionally, defining $W := \sum_{p=1}^{l} W_p$, we get by the differential privacy of the sub-transcript $\mathcal{T}^{(j)}$, Lemma E.4 and Theorem A.5:
\[ \mathbb{P}\left\{ W \le \frac{3}{4} \cdot l \right\} \le \frac{1}{400 \cdot (nQ)^2}. \]
Consequently, we get from Theorem A.3 and another union bound:
\[ \mathbb{P}\left\{ (1 - \varepsilon) \cdot \|q_j - x_i\| \le d_{i,j} \le (1 + \varepsilon) \cdot \|q_j - x_i\| \right\} \ge 1 - \frac{1}{200 \cdot (nQ)^2}. \]
A subsequent union bound over all $i \in [n]$, $j \in [Q]$ yields:
\[ \mathbb{P}\left\{ \forall i \in [n], j \in [Q] : (1 - \varepsilon) \cdot \|q_j - x_i\| \le d_{i,j} \le (1 + \varepsilon) \cdot \|q_j - x_i\| \right\} \ge 1 - \frac{1}{200 \cdot (nQ)}. \]
A final union bound with the event in the conclusion of Lemma E.3 completes the proof.

In the adaptive kernel density estimation problem, the input is a set $X = \{x^{(1)}, \ldots, x^{(n)}\}$ of $n$ points in $\mathbb{R}^d$. Given an accuracy parameter $\varepsilon > 0$ and a threshold parameter $\tau > 0$, the goal is to output a $(1 + \varepsilon)$-approximation to the quantity $\frac{1}{n} \sum_{i \in [n]} k(x^{(i)}, q)$, for a kernel function $k$, under the promise that the output is at least $\tau$.
A standard approach is to sample O [...] However, the analysis for both these algorithms fails for the adaptive setting, where there can be dependencies between the query and the data structure. By using the data structure of Backurs et al. (2019) as a subroutine, our framework immediately implies an algorithm for adaptive kernel density estimation that uses $O\left(\frac{\sqrt{Q}}{\tau \varepsilon^2}\right)$ space and $O\left(\frac{d \log Q}{\sqrt{\tau} \varepsilon^2}\right)$ query time to answer each of $Q$ adaptive queries.

Algorithm 9 Adaptive Kernel Density Estimation

Input: Number $Q$ of queries, accuracy $\varepsilon$, threshold $\tau$
1: $r \leftarrow O\left(\sqrt{Q} \log^2 Q\right)$
2: for $i \in [r]$ do ▷ Pre-processing
3:     Let $T_i$ be a KDE data structure
4: for each query $y_q \in \mathbb{R}^d$ with $q \in [Q]$ do ▷ Adaptive queries
5:     Let $S$ be a set of $k$ indices sampled (with replacement) from $[r]$
6:     for $i \in [k]$ do
7:         Let $D_i$ be the output of $T_{S_i}$ on query $y_q$
8:     return $d_q = \mathrm{PrivMed}(\{D_i\}_{i \in [k]})$, where PrivMed is $(1, 0)$-DP.

For completeness, we now show adversarial robustness of our algorithm across $Q$ adaptive queries. Again we remark that the proof can simply be black-boxed into Theorem 1.2, though we include the specific kernel density details in the following proof as a warm-up for the following section.

Lemma F.2. Algorithm 9 answers $Q$ adaptive kernel density estimation queries within a factor of $(1 + \varepsilon)$, provided each query has value at least $\tau$.

Proof. Fix query $y_q \in \mathbb{R}^d$ with $q \in [Q]$. Let $S$ be a set of $k$ indices sampled (with replacement) from $[r]$. Then by Theorem F.1, we have that for each $j \in [k]$,
\[ \Pr\left[ |D_{S_j}(y) - \mathrm{KDE}(X, y)| \le \varepsilon \cdot \mathrm{KDE}(X, y) \right] \ge \frac{3}{4}. \]
Let $I_j$ be an indicator variable so that $I_j = 1$ if $|D_{S_j}(y) - \mathrm{KDE}(X, y)| \le \varepsilon \cdot \mathrm{KDE}(X, y)$ and $I_j = 0$ otherwise, so that we have $\Pr[I_j = 1] \ge \frac{3}{4}$ or equivalently, $\mathbb{E}[I_j] \ge \frac{3}{4}$. Let $I = \frac{1}{k} \sum_{j \in [k]} I_j$ so that $\mathbb{E}[I] = \frac{1}{k} \sum_{j \in [k]} \mathbb{E}[I_j] \ge \frac{3}{4}$. Following the proof of Theorem 1.2, we then have
\[ \Pr\left[ \frac{1}{k} \sum_{i \in [k]} I_i > 0.6 \right] > 0.99, \]
so that $|d_q - \mathrm{KDE}(X, y_q)| \le \varepsilon \cdot \mathrm{KDE}(X, y_q)$ across $Q$ queries $y_q$ with $q \in [Q]$.

In this section, we go beyond the limits of our framework and analyze the case where there may be an unbounded number of adversarial queries.

Theorem 1.4. Suppose the kernel function $k$ is $L$-Lipschitz in the second variable for some $L > 0$, i.e., $|k(x, y) - k(x, z)| \le L\|y - z\|_2$ for all $x, y, z \in \mathbb{R}^d$. Moreover, suppose that for all $\|x - y\|_2 \ge \rho$, we have $k(x, y) \le \frac{\tau}{3}$. Then given an algorithm that produces a kernel density estimation data structure $D$ that is $L$-Lipschitz over a set $X$ of points with diameter at most $\Delta$ and outputs a $(1 + \varepsilon)$-approximation to KDE queries with value at least $\tau$ with probability at least $1 - \delta$ using space $S(n, \varepsilon, \tau, \log \delta)$ and query time $T(n, \varepsilon, \tau, \log \delta)$, there exists a KDE data structure that with probability at least $0.99$, outputs a $(1 + \varepsilon)$-approximation to any number of KDE queries with value at least $\tau$ using space $S\left(n, O(\varepsilon), O(\tau), O\left(d \log \frac{(\Delta + \rho)L}{\varepsilon\tau}\right)\right)$ and query time $T\left(n, O(\varepsilon), O(\tau), O\left(d \log \frac{(\Delta + \rho)L}{\varepsilon\tau}\right)\right)$.

Proof. Given a set $X \subseteq \mathbb{R}^d$ of $n$ points with diameter $\Delta$, let $N$ be an $\frac{\varepsilon\tau}{L}$-net over a ball of radius $\Delta + \rho$ that contains $X$. More formally, let $B$ be a ball of radius $(\Delta + \rho)$ that contains $X$, and for every $y \in B$, there exists a point $z \in N$ such that $\|y - z\|_2 \le \frac{\varepsilon\tau}{L}$. We can construct the net greedily so that $|N| \le \left(\frac{2(\Delta + \rho)L}{\varepsilon\tau}\right)^d$.

We implement a data structure $D$ that answers each (non-adaptive) kernel density estimation query with multiplicative approximation $\left(1 + \frac{\varepsilon}{3}\right)$ for any kernel density estimation query with value at least $\frac{\tau}{2}$, with probability at least $1 - \delta$, where $\delta \le \frac{1}{100|N|}$. Then by a union bound, $D$ correctly answers each kernel density estimation query in $N$ with probability at least $0.99$.

Let $q \in \mathbb{R}^d$ be an arbitrary query such that $\mathrm{KDE}(X, q) \ge \tau$. By assumption, we have that $\|q - x\|_2 \le \rho$ for some $x \in X$ and thus $q \in B$. By the definition of $N$, there exists some $y \in N$ such that $\|q - y\|_2 \le \frac{\varepsilon\tau}{3L}$. Then since $k$ is $L$-Lipschitz in the second variable, we have
\[ |\mathrm{KDE}(X, q) - \mathrm{KDE}(X, y)| = \left| \frac{1}{n} \sum_{x \in X} k(x, q) - \frac{1}{n} \sum_{x \in X} k(x, y) \right| \le \frac{L}{n} \|q - y\|_2 \le \frac{\varepsilon\tau}{3n}. \]
Hence, $\mathrm{KDE}(X, q) \ge \tau$ implies that $\mathrm{KDE}(X, y) \ge \frac{\tau}{2}$. Let $K_y$ be the output of the data structure $D$ on query $y$. Then by correctness of $D$ on $N$ for any query with threshold at least $\frac{\tau}{2}$, we have
\[ |K_y - \mathrm{KDE}(X, y)| \le \frac{\varepsilon}{3} \mathrm{KDE}(X, y). \]
Let $K_q$ be the output of the data structure $D$ on query $q$. Since the algorithm itself is $L$-Lipschitz, we have $|K_q - K_y| \le L\|q - y\|_2 \le \frac{\varepsilon\tau}{3}$. Therefore by the triangle inequality, we have that
\[ |K_q - \mathrm{KDE}(X, q)| \le |K_q - K_y| + |K_y - \mathrm{KDE}(X, y)| + |\mathrm{KDE}(X, y) - \mathrm{KDE}(X, q)| \le \frac{\varepsilon\tau}{3} + \frac{\varepsilon}{3} \mathrm{KDE}(X, y) + \frac{\varepsilon\tau}{3n}. \]
Since $\mathrm{KDE}(X, y) \le \mathrm{KDE}(X, q) + \frac{\varepsilon\tau}{3n}$, it follows that
\[ |K_q - \mathrm{KDE}(X, q)| \le \frac{\varepsilon\tau}{3} + \frac{\varepsilon}{3} \mathrm{KDE}(X, q) + \frac{\varepsilon^2\tau}{n} + \frac{\varepsilon\tau}{3n} \le \varepsilon\, \mathrm{KDE}(X, q), \]
for $n \ge 6$.

In particular, sampling-based algorithms for kernels that are Lipschitz are also Lipschitz. Thus to apply Theorem 1.4, it suffices to identify kernels that are $L$-Lipschitz and use the data structure of Theorem F.1. To that end, we note that the kernels k(x, y) = [...] with the map $\Pi$.

The performance of all other algorithms is indistinguishable in Figure 1a. Thus, we only zoom into the performances of our algorithm, Baseline 1, and Baseline 2, shown in Figure 1b. For these three algorithms, we plot a histogram of answers outputted by the respective algorithms across all iterations. We see that the algorithm of Cherapanamjeri & Nelson (2020), shown in the blue shaded histogram, is the most accurate as it has the smallest deviations from the true answer of 1.0. Our algorithm, shown in green, is noisier than Baseline 1 since it has a wider range of variability. This may be due to the fact that we use a differentially private median algorithm, which naturally incurs additional noise. Baseline 2 is also noisier than Baseline 1 and comparable to our algorithm. This may be due to the fact that the algorithm of Cherapanamjeri & Nelson (2022) requires very fine-tuned constants in its theoretical bounds, which naturally deviate in practice. Lastly, Figure 1c shows the cumulative runtime of all three algorithms across all iterations. Our algorithm, shown in green, is the fastest while Baseline 2 is the slowest. This is explained by the fact that Baseline 2 calculates many more Hadamard transforms than our algorithm does.



¹ Available at https://github.com/FALCONN-LIB/FFHT
² Available at https://github.com/google/differential-privacy



al., 2012; Bogunovic et al., 2017; Naor & Yogev, 2019; Avdiukhin et al., 2019). Motivated by both practical needs and a lack of theoretical understanding, there has been a recent flurry of theoretical studies of adversarial robustness. The streaming model of computation has especially received significant attention Ben-Eliezer et al. (2021); Hassidim et al. (2020); Woodruff & Zhou (2021); Kaplan et al. (2021); Braverman et al. (2021); Chakrabarti et al. (2022); Ajtai et al. (2022); Chakrabarti et al. (2022); Ben-Eliezer et al. (2022); Assadi et al. (2022); Attias et al. (2023); Dinur et al. (2023); Woodruff et al.

al. (2020); Beimel et al. (2022); Attias et al. (2023) suffices to achieve robust algorithms across these different settings.

(n,Q), using space $O\left(S\sqrt{Q}\log(nQ)\right)$ and query time $O\left(T\log(nQ) + \log^3(nQ)\right)$. Theorem 1.2 applies the framework of Hassidim et al. (2020); Beimel et al. (2022); Attias et al. (2023) to the centralized setting, where queries are made only after the data structure is created.


Hassidim et al. (2020); Beimel et al. (2022); Attias et al. (2023). Our full algorithm maintains Γ = O √ T log(nT ) independent copies of the key subroutine for T = O nnz(A) ε 2 K

Compute the approximate row leverage scores of $A$. ▷ Lemma A.11
2: Build and output the data structure $D_{LS}$

Algorithm 4 Dynamic algorithm for maintaining regression cost under adaptive updates
Input: A sequence of target labels $\{b^{(i)}\}_{i=1}^{m}$ and a fixed design matrix $A \in \mathbb{R}^{n\times d}$, an approximation factor $\varepsilon \in (0, 1)$, the leverage score sampling data structure $D_{LS}$ for $A$.
Output: Estimates of the regression cost $F(A, b^{(i)})$ under adaptively chosen updates to $b$.
1: for every epoch of T = O nnz(A) nT ) independent instances of the data structure in Section 3.2 via Algorithm 1.
Run PrivMed on the $\Gamma$ instances with privacy parameter $\varepsilon' = O\left(\frac{1}{\sqrt{T}\log(nT)}\right)$ with failure probability $\delta = \frac{1}{\mathrm{poly}(m,T)}$.
4:

Private median, e.g., Hassidim et al. (2020)). Given a database $S \in X^*$, there exists an $(\varepsilon, 0)$-differentially private algorithm PrivMed that outputs an element $x \in X$ such that with probability at least $1-\delta$, there are at least $\frac{|S|}{2} - k$ elements in $S$ that are at least $x$, and at least $\frac{|S|}{2} - k$ elements in $S$ that are at most $x$, for $k = O\left(\frac{1}{\varepsilon}\log\frac{|X|}{\delta}\right)$.

Theorem A.4 (Advanced composition, e.g., Dwork et al. (2010)). Let $\varepsilon, \delta' \in (0, 1]$ and let $\delta \in [0, 1]$. Any mechanism that permits $k$ adaptive interactions with mechanisms that preserve $(\varepsilon, \delta)$-differential privacy guarantees $(\varepsilon', k\delta + \delta')$-differential privacy, where $\varepsilon' = \sqrt{2k\ln\frac{1}{\delta'}}\cdot\varepsilon + 2k\varepsilon^2$.

Theorem A.5 (Generalization of DP, e.g., Dwork et al. (2015); Bassily et al. (

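Moreover, $\Pi x$ takes $O\left(\frac{d}{\varepsilon^2}\right)$ time to compute.

Theorem A.7 (Fast JL). Given $\varepsilon > 0$, there exists a family of random maps $\Pi_{m,d} \in \mathbb{R}^{m\times d}$ with $m = O\left(\frac{\log d}{\varepsilon^2}\right)$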

Lemma A.11 (Leverage score computation and sampling data structure; see Woodruff (2014); Chepurko et al. (2022)). Let A ∈ R n×d . There exists an algorithm that given A outputs a vector of row leverage score overestimates with high probability and in run-time O (nnz(A) + poly(d)).

log(nT ) , 0 -differentially private with respect to the algorithms A 1 , . . . , A Γ , since the private median algorithm PrivMed is O 1 √

et al. (2020); Beimel et al. (2022); Attias et al. (

a set of k indices sampled (with replacement) from [r] 6:

et al. (2020) and in the dynamic model Beimel et al. (2022); Attias et al. (2023).

Lemma C.1. For a fixed iteration, $T_i$ is $\left(O\left(\frac{1}{\sqrt{Q}\log(nQ)}\right), 0\right)$-differentially private with respect to $R$.
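To see why a per-query privacy parameter of order $\frac{1}{\sqrt{Q}\log(nQ)}$ is the right scale, one can plug it into the advanced composition bound of Theorem A.4 over $Q$ adaptive queries. The sketch below evaluates that bound numerically. The function name is hypothetical, and the concrete choices (hidden constants set to 1, $\log Q$ in place of $\log(nQ)$, $\delta' = 1/Q^2$) are illustrative assumptions rather than values taken from the paper.

```python
import math


def advanced_composition(eps, delta, k, delta_prime):
    """Total privacy after k adaptive (eps, delta)-DP interactions (Theorem A.4)."""
    eps_total = math.sqrt(2 * k * math.log(1 / delta_prime)) * eps + 2 * k * eps ** 2
    delta_total = k * delta + delta_prime
    return eps_total, delta_total


# Assumed illustrative parameters: constants inside O(.) are taken to be 1.
Q = 10_000
eps_per_query = 1.0 / (math.sqrt(Q) * math.log(Q))
eps_total, delta_total = advanced_composition(eps_per_query, 0.0, Q, 1.0 / Q ** 2)
# eps_total stays bounded by a constant even though Q queries were answered.
```

With these choices the total privacy parameter stays below 1, in line with the $O(1)$ total budget the analysis needs before invoking the generalization property of differential privacy.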

Cherapanamjeri & Nelson (2020) first improved the query time to $O\left(\frac{n+d}{\varepsilon^2}\right)$ at the cost of using $O\left(\frac{(n+d)d}{\varepsilon^2}\right)$ space and $O\left(\frac{nd^2}{\varepsilon^2}\right)$ pre-processing time. Like the trivial solution, the algorithm of Cherapanamjeri & Nelson (2020) also permits an arbitrary number of queries.

y q ∥ 2 and $I_j = 0$ otherwise, so that we have $\Pr[I_j = 1] \ge \frac{3}{4}$, or equivalently, $\mathbb{E}[I_j] \ge \frac{3}{4}$. Let $I = \frac{1}{k}\sum_{j\in[k]} I_j$ so that by linearity of expectation, $\mathbb{E}[I] = \frac{1}{k}\sum_{j\in[k]} \mathbb{E}[I_j] \ge \frac{3}{4}$.

To address adaptive queries, we first note that PrivMed is $(1, 0)$-differentially private on the outputs of the $r$ Fast JL transforms. Since we sample $k = O(\log(nQ))$ groups from the $r = O\left(\sqrt{Q}\log^2(nQ)\right)$ groups with replacement, then by amplification via sampling, i.e., Theorem A.2, PrivMed is O 1 √

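+ d log d log(nQ) query time, it stores $O\left(\frac{n\sqrt{Q}\log^3(nQ)}{\varepsilon^2}\right)$ words of space. For $O\left(\frac{d}{\varepsilon^2}\log(nQ)\right)$ query time, it stores $O\left(\frac{n\sqrt{Q}\log^2(nQ)}{\varepsilon^2}\right)$ words of space.

Proof. By Theorem A.7, each fast JL transform uses $O\left(\frac{\log d}{\varepsilon^2} + d\log d\right)$ runtime and stores $m = O\left(\frac{\log d}{\varepsilon^2}\right)$ rows. On the other hand, by Theorem A.6, each JL transform uses $O\left(\frac{d}{\varepsilon^2} + d\log d\right)$ runtime and stores $m = O\left(\frac{\log d}{\varepsilon^2}\right)$ rows. By comparison, Cherapanamjeri & Nelson (2020) uses $O\left(\frac{nd\log n}{\varepsilon^2}\right)$ words of space and $O\left(\frac{d}{\varepsilon^2}\right)$ query time.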

will additionally require the following technical result from Cherapanamjeri & Nelson (2022), where for any vector $v \in \mathbb{R}^d$ and multiset $S = \{i_j\}_{j\in[k]}$ with $i_j \in [d]$, $v_S$ denotes the vector $[v_{i_1}, \ldots, v_{i_k}]$:

Lemma E.4 (Theorem 1.4 of Cherapanamjeri & Nelson (2022)). Assume $h : \mathbb{R}^d \to \mathbb{R}^{md}$ (SRHT)

points and then use $O\left(\frac{d}{\tau\varepsilon^2}\right)$ query time to output the empirical kernel density for a specific query. Backurs et al. (2019) give an algorithm for kernel density estimation that uses $O\left(\frac{1}{\tau\varepsilon^2}\right)$ space and $O\left(\frac{d}{\sqrt{\tau}\varepsilon^2}\right)$ query time, improving over the standard sampling approach.

Theorem F.1 (Backurs et al. (2019)). Given $\varepsilon, \tau > 0$, there exists a data structure $D$ that uses $O\left(\frac{1}{\tau\varepsilon^2}\right)$ space and $O\left(\frac{d}{\varepsilon^2\sqrt{\tau}}\right)$ query time and outputs a $(1+\varepsilon)$-approximation $D(y)$ to a kernel density estimation query $y$ that has value at least $\tau$, i.e., $\Pr\left[|D(y) - \mathrm{KDE}(X, y)| \le \varepsilon\cdot\mathrm{KDE}(X, y)\right] \ge \frac{3}{4}$.

To handle adaptive queries, we first note that PrivMed is $(1, 0)$-differentially private on the outputs of the $r$ kernel density estimation data structures. We sample $k = O(\log Q)$ indices from the $r = O\left(\sqrt{Q}\log^2 Q\right)$ data structures with replacement. Thus by amplification via sampling, i.e., Theorem A.2, PrivMed is $\left(O\left(\frac{1}{\sqrt{Q}\log Q}\right), 0\right)$-differentially private. By the advanced composition of differential privacy, i.e., Theorem A.4, our algorithm can answer $Q$ adaptive queries with $\left(O(1), \frac{1}{\mathrm{poly}(Q)}\right)$-differential privacy. By the generalization properties of differential privacy, i.e., Theorem A.5, we have Pr for sufficiently small constant $O(1)$ in the private median algorithm PrivMed. Therefore,

There exists an algorithm that uses $O\left(\frac{\sqrt{Q}\log^2 Q}{\tau\varepsilon^2}\right)$ space and answers $Q$ adaptive kernel density estimation queries within a factor of $(1+\varepsilon)$, provided each query has value at least $\tau$. Each query uses $O\left(\frac{d\log(nQ)}{\varepsilon^2\sqrt{\tau}}\right)$ runtime.

By comparison, random sampling, e.g., Charikar & Siminelakis (2017), uses $\frac{Q}{\tau\varepsilon^2}$ samples to answer $Q$ queries and each query uses $O\left(\frac{d}{\tau\varepsilon^2}\right)$ runtime, whereas using $Q$ copies of the data structure by Backurs et al. (2019) uses $O\left(\frac{Q}{\tau\varepsilon^2}\right)$ space and $O\left(\frac{d}{\varepsilon^2\sqrt{\tau}}\right)$ runtime.

F.1 UNLIMITED ADAPTIVE QUERIES FOR KERNEL DENSITY ESTIMATION

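$S\left(n, O(\varepsilon), O(\tau), O\left(d\log\frac{(\Delta+\rho)L}{\varepsilon\tau}\right)\right)$ and query time $T\left(n, O(\varepsilon), O(\tau), O\left(d\log\frac{(\Delta+\rho)L}{\varepsilon\tau}\right)\right)$.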
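The last argument of $S$ and $T$ above is the value of $\log\frac{1}{\delta}$ needed for a union bound over a net $N$ of the query region, where $\delta \le \frac{1}{100|N|}$ and $|N|$ grows like $\left(\frac{c(\Delta+\rho)L}{\varepsilon\tau}\right)^d$. The following sketch computes this quantity for concrete parameters. The function name and the constant `c` are assumptions made for illustration; only the asymptotic form $O\left(d\log\frac{(\Delta+\rho)L}{\varepsilon\tau}\right)$ is taken from the statement.

```python
import math


def net_log_failure(d, diameter, rho, lipschitz, eps, tau, c=6.0):
    """Return (log |N|, log(1/delta)) for a net-based union bound.

    Assumptions: |N| <= (c * (diameter + rho) * lipschitz / (eps * tau)) ** d,
    with c an illustrative constant, and delta = 1 / (100 * |N|).
    """
    log_net_size = d * math.log(c * (diameter + rho) * lipschitz / (eps * tau))
    log_inv_delta = math.log(100.0) + log_net_size
    return log_net_size, log_inv_delta


# Example: the overhead is linear in d and only logarithmic in 1/(eps * tau).
log_net, log_inv_delta = net_log_failure(
    d=64, diameter=10.0, rho=1.0, lipschitz=1.0, eps=0.1, tau=0.01
)
```

Working in logarithms avoids overflow, since $|N|$ itself is exponential in $d$.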

$\frac{C}{C+\|x-y\|_2}$ for $C > 0$ and $k(x, y) = Ce^{-\|x-y\|_2}$ are both Lipschitz for some function of $C$. In particular, we have
$$|k(x, y) - k(x, z)| = \left|\frac{C}{C+\|x-y\|_2} - \frac{C}{C+\|x-z\|_2}\right| = \frac{C\,\big|\|x-z\|_2 - \|x-y\|_2\big|}{(C+\|x-y\|_2)(C+\|x-z\|_2)} \le \frac{\|y-z\|_2}{C},$$
so $k(x, y) = \frac{C}{C+\|x-y\|_2}$ is $\frac{1}{C}$-Lipschitz. Similarly, since $e^{-x}$ is $1$-Lipschitz for $x \ge 0$, then
$$|k(x, y) - k(x, z)| = \left|Ce^{-\|x-y\|_2} - Ce^{-\|x-z\|_2}\right| \le C\,\big|\|x-z\|_2 - \|x-y\|_2\big| \le C\|y-z\|_2,$$
so $k(x, y) = Ce^{-\|x-y\|_2}$ is $C$-Lipschitz.
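The two Lipschitz constants derived above can be sanity-checked numerically. The sketch below samples random triples $(x, y, z)$ and records the largest observed ratio $|k(x,y) - k(x,z)| / \|y-z\|_2$, which should not exceed $\frac{1}{C}$ for the first kernel and $C$ for the second. The function names, the dimension, and the Gaussian sampling are illustrative assumptions; a random check like this supports the bounds but does not prove them.

```python
import math
import random


def rational_kernel(x, y, c):
    """k(x, y) = C / (C + ||x - y||_2)."""
    return c / (c + math.dist(x, y))


def exp_kernel(x, y, c):
    """k(x, y) = C * exp(-||x - y||_2)."""
    return c * math.exp(-math.dist(x, y))


def max_lipschitz_ratio(kernel, c, trials, rng, dim=3):
    """Largest observed |k(x,y) - k(x,z)| / ||y - z||_2 over random triples."""
    worst = 0.0
    for _ in range(trials):
        x, y, z = ([rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3))
        gap = math.dist(y, z)
        if gap > 0.0:
            worst = max(worst, abs(kernel(x, y, c) - kernel(x, z, c)) / gap)
    return worst


rng = random.Random(0)
C = 2.0
ratio_rational = max_lipschitz_ratio(rational_kernel, C, 2000, rng)  # bound: 1 / C
ratio_exp = max_lipschitz_ratio(exp_kernel, C, 2000, rng)            # bound: C
```

Both observed ratios stay below their respective bounds, up to floating-point error.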

Fig. 1: Figures for our experiments.

al., 2013; Szegedy et al., 2014; Goodfellow et al., 2015; Carlini & Wagner, 2017;



For linear regression, we can again compute a subspace embedding $M = SA \in \mathbb{R}^{m\times d}$ and answer a query $b^{(i)}$ by approximately solving $\min_{x\in\mathbb{R}^d} \|SAx - Sb^{(i)}\|_2$, where $S \in \mathbb{R}^{m\times n}$ is a sketching matrix (Clarkson & Woodruff, 2013).

(nQ), $k \leftarrow O(\log(nQ))$
2: Let $\Pi_1, \ldots, \Pi_r \in \mathbb{R}^{m\times d}$ be JL transformation matrices (see Theorem A.6 or Theorem A.7)
3: for $j \in [r]$ do

RetNorm((v j -y i ) Si,t i,j,p )

3 concludes the proof. The runtime guarantees follow from the fact that for all $z \in \mathbb{R}^d$, $h(z)$ is computable in time $O(md\log d)$, together with the runtime guarantees of RetNorm. □

F ADAPTIVE KERNEL DENSITY ESTIMATION

Kernel density estimation is an important problem in learning theory and statistics that has recently attracted significant interest, e.g., (Charikar & Siminelakis, 2017; Backurs et al., 2018; Charikar et al., 2020; Bakshi et al., 2022

ACKNOWLEDGEMENTS

Sandeep Silwal is supported by an NSF Graduate Research Fellowship under Grant No. 1745302, and NSF TRIPODS program (award DMS-2022448), NSF award CCF-2006798, and Simons Investigator Award (via Piotr Indyk). This work was done in part while David P. Woodruff was at Google Research and supported in part by a Simons Investigator Award and by the National Science Foundation under Grant No. CCF-1815840. Fred Zhang is supported by ONR grant N00014-18-1-2562. This work was done in part while Samson Zhou was at Carnegie Mellon University and supported in part by a Simons Investigator Award and by the National Science Foundation under Grant No. CCF-1815840.

A PRELIMINARIES

Notations. In this paper, we use $[n]$ for a positive integer $n > 0$ to denote the set $\{1, \ldots, n\}$. We use $\mathrm{poly}(n)$ to denote a fixed polynomial in $n$. We say an event occurs with high probability if it occurs with probability $1 - \frac{1}{\mathrm{poly}(n)}$. For real numbers $a, b$ and positive $\varepsilon$, we say $a = (1 \pm \varepsilon)b$ if $(1-\varepsilon)b \le a \le (1+\varepsilon)b$. Let $e_i \in \mathbb{R}^n$ be the $i$-th standard basis vector. Let $X^+$ denote the Moore-Penrose pseudo-inverse of matrix $X$. Let $\|X\|$ denote the operator norm of $X$. Let $\kappa(X) = \|X^+\|\,\|X\|$ denote the condition number of $X$.

A.1 DIFFERENTIAL PRIVACY

Many of our technical results leverage tools from differential privacy (DP). We recall its definition and several key statements.

Definition A.1 (Differential privacy, Dwork et al. (2006)). Given $\varepsilon > 0$ and $\delta \in (0, 1)$, a randomized algorithm $A : X^* \to Y$ is $(\varepsilon, \delta)$-differentially private if, for every pair of neighboring datasets $S$ and $S'$ and for all $E \subseteq Y$,
$$\Pr[A(S) \in E] \le e^{\varepsilon}\cdot\Pr[A(S') \in E] + \delta.$$

Theorem A.2 (Amplification via sampling, e.g., Bun et al. (2015)). Let $A$ be an $(\varepsilon, \delta)$-differentially private algorithm for $\varepsilon \le 1$, $\delta \in (0, 1)$. Given a database $S$ of size $n$, let $A'$ be the algorithm that constructs a database $T \subset S$ by subsampling (with replacement) $s \le \frac{n}{2}$ rows of $S$ and outputs $A(T)$.

G EMPIRICAL EVALUATION

We empirically demonstrate the space and query time efficiency of our framework of Section C. We consider the problem of $\ell_2$ norm estimation where queries $q_1, q_2, \ldots$ are generated in an adaptive fashion and our goal is to output an estimate of $\|q_i\|_2$ for all $i$. This setting is a special case of adaptive distance estimation and captures the essence of our adversarial robustness framework. In addition, this same setting was investigated empirically in prior work (Cherapanamjeri & Nelson, 2020).

Experimental Setup. Consider the setting of Algorithm 6: it creates $r$ copies of an underlying randomized data structure and, upon a query, it subsamples $k$ of them and outputs an answer aggregated via the private median. In our setting, the underlying algorithm will be the fast Johnson-Lindenstrauss (JL) transform, which is defined as follows: it is the matrix $PHD : \mathbb{R}^d \to \mathbb{R}^m$, where $D$ is a diagonal matrix with uniformly random $\pm 1$ entries, $H$ is the Hadamard transform, and $P$ is a sampling matrix that uniformly samples $m$ rows of $HD$. Our algorithm will initialize $r$ copies of this matrix, where the sampling matrix $P$ and diagonal $D$ will be the randomness that is "hidden" from the adversary. Upon query $q$, we sample $k$ different Fast JL data structures, input $q$ to all of them, and proceed as in Algorithm 6. Note that this setting exactly mimics the theoretical guarantees of Section C and is exactly Algorithm 7 of Section E. In our experiments, $d = 4096$, $m = 250$, $r = 200$, and $k = 5$. These are exactly the parameters chosen in prior work (Cherapanamjeri & Nelson, 2020). We will have 5000 adaptive queries $q_i$, which are described shortly. Our experiments are done on a 2021 M1 Macbook Pro with 32 gigabytes of RAM. We implemented all algorithms in Python 3.5 using Numpy. The Hadamard transform code is from Andoni et al. (2015)¹ and we use Google's differential privacy library² for the private median implementation.

Baselines. We will consider three baselines.
JL will denote a standard (Gaussian) JL map from dimension 4096 to 250. Baseline 1 will denote the algorithm of Cherapanamjeri & Nelson (2020). At a high level, it instantiates many independent copies of the standard Gaussian JL map and only feeds an incoming query into a select number of subsampled data structures. Note that our experimental setting mimics exactly that of Cherapanamjeri & Nelson (2020), where the same parameters $r$ (number of different underlying data structures) and $k$ (number of subsampled data structures to use for a query) were used. This ensures that both our algorithm and theirs have access to the same number of different JL maps and thus allows us to compare the two approaches on an equal footing. The last baseline, denoted as Baseline 2, is the main algorithm of Cherapanamjeri & Nelson (2022), which is the optimized version of Cherapanamjeri & Nelson (2020). At a high level, their algorithm proceeds similarly to that of Cherapanamjeri & Nelson (2020), except that it employs Hadamard transforms (after multiplying the query entry-wise by random Gaussians) rather than Gaussian JL maps. Furthermore, instead of subsampling, their algorithm feeds an incoming query into all the different copies of the Hadamard transform and subsamples the coordinates of the concatenated output for norm estimation. We again set the parameters of their algorithm to match those of our algorithm and Baseline 1 by using $r$ copies of their Hadamard transform and subsampling $mk$ total coordinates. We refer to the respective papers for full details of their algorithms.

Summary of adaptive queries. Our input queries are the same adaptive queries used in Cherapanamjeri & Nelson (2020). To summarize, let $\Pi$ denote the map used in the JL benchmark stated above. The $i$-th query for $1 \le i \le 5000$ will be of the form $q_i = \sum_{j=1}^{i} (-1)^{W_j} z_j$, which we then normalize to have unit norm. The $z_j$ are standard Gaussian vectors. $W_j$ is the indicator variable for the event $\|\Pi(z_j - e_1)\|_2 \le \|\Pi(z_j + e_1)\|_2$, where $e_1$ is the first standard basis vector. Intuitively, the queries become increasingly correlated with the matrix $\Pi$ since we successively "augment" the queries in a biased fashion. See Section 5 of Cherapanamjeri & Nelson (2020) for a more detailed discussion of the adaptive inputs.

Results. Our results are shown in Figure 1. In Figure 1a, we plot the norm estimated by each of the algorithms in each of the queries across iterations. We see that the naïve JL map increasingly deviates from the true value of 1.0. This is intuitive as the adaptive queries are increasingly correlated

