PRIVATE DATA STREAM ANALYSIS FOR UNIVERSAL SYMMETRIC NORM ESTIMATION

Abstract

We study how to release summary statistics on a data stream subject to the constraint of differential privacy. In particular, we focus on releasing the family of symmetric norms, which are invariant under sign-flips and coordinate-wise permutations on an input data stream and include L_p norms, k-support norms, top-k norms, and the box norm as special cases. Although it may be possible to design and analyze a separate mechanism for each symmetric norm, we propose a general parametrizable framework that differentially privately releases a number of sufficient statistics from which the approximation of all symmetric norms can be simultaneously computed. Our framework partitions the coordinates of the underlying frequency vector into different levels based on their magnitude, releasing approximate frequencies for the "heavy" coordinates in important levels and approximate level sizes for the "light" coordinates in important levels. Surprisingly, our mechanism allows for the release of an arbitrary number of symmetric norm approximations without any overhead or additional loss in privacy. Moreover, our mechanism permits (1 + α)-approximation to each of the symmetric norms and can be implemented using sublinear space in the streaming model for many regimes of the accuracy and privacy parameters.

1. INTRODUCTION

The family of L_p norms represents important statistics on an underlying dataset, where the L_p norm of an n-dimensional frequency vector x is defined as the number of nonzero coordinates of x for p = 0 and L_p(x) = (x_1^p + . . . + x_n^p)^{1/p} for p > 0. Thus, the L_0 norm counts the number of distinct elements in the dataset and, e.g., is used to detect denial of service or port scan attacks in network monitoring (Akella et al., 2003; Estan et al., 2003), to understand the magnitude of quantities such as search engine queries or internet graph connectivity in data mining (Palmer et al., 2001), to manage workload in database design (Finkelstein et al., 1988), and to select a minimum-cost query plan in query optimization (Selinger et al., 1979). The L_1 norm computes the total number of elements in the dataset and, e.g., is used for data mining (Cormode et al., 2005) and hypothesis testing (Indyk & McGregor, 2008), while the L_2 norm, e.g., is used for training random forests in machine learning (Breiman, 2001), computing the Gini index in statistics (Lorenz, 1905; Gini, 1912), and network anomaly detection in traffic monitoring (Krishnamurthy et al., 2003; Thorup & Zhang, 2004). Consequently, L_p estimation has been extensively studied in the data stream model (Alon et al., 1999; Indyk & Woodruff, 2005; Indyk, 2006; Li, 2008; Kane et al., 2011; Andoni, 2017; Braverman et al., 2018b; Ganguly & Woodruff, 2018; Woodruff & Zhou, 2020; 2021). The simplest streaming model is perhaps the insertion-only model, in which a sequence of m updates increments coordinates of an n-dimensional frequency vector x, and the goal is to compute or approximate some statistic of x in space that is sublinear in both m and n. In many cases, the underlying dataset contains sensitive information that should not be leaked.
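As a concrete (non-private) reference point, all of the L_p norms above can be evaluated exactly from the frequency vector of a stream. The short Python sketch below builds the frequency vector from a toy insertion-only stream and computes L_0, L_1, and L_2; the stream contents and dimension n = 4 are illustrative only.

```python
import math
from collections import Counter

def lp_norm(freq, p):
    """L_p norm of a frequency vector; p = 0 counts the nonzero coordinates."""
    if p == 0:
        return sum(1 for v in freq if v != 0)
    return sum(abs(v) ** p for v in freq) ** (1.0 / p)

# Build the frequency vector x from a stream of m = 6 coordinate updates.
stream = [2, 0, 2, 3, 0, 2]                 # each update increments one coordinate
counts = Counter(stream)
x = [counts.get(i, 0) for i in range(4)]    # x = [2, 0, 3, 1]

print(lp_norm(x, 0))    # number of distinct elements: 3
print(lp_norm(x, 1))    # total number of elements: 6
print(lp_norm(x, 2))    # Euclidean length: sqrt(14)
```

A streaming algorithm, by contrast, must approximate these quantities without storing x explicitly; the exact computation here is only a correctness baseline.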
Hence, an active line of work has focused on estimating L_p norms for various values of p while preserving differential privacy (Mir et al., 2011; Blocki et al., 2012; Smith et al., 2020; Bu et al., 2021; Wang et al., 2021).

Definition 1.1 (Differential privacy (Dwork et al., 2006)). Given ε > 0 and δ ∈ (0, 1), a randomized algorithm A : U* → Y is (ε, δ)-differentially private if, for all neighboring streams S and S' and for all E ⊆ Y, Pr[A(S) ∈ E] ≤ e^ε · Pr[A(S') ∈ E] + δ, where streams S and S' with updates u_1, . . . , u_m and u'_1, . . . , u'_m, respectively, are neighboring if there exists a single index i ∈ [m] such that u_i ≠ u'_i.

For example, (Blocki et al., 2012) showed that the Johnson-Lindenstrauss transformation preserves differential privacy (DP), thereby showing that one of the main techniques in the streaming model for L_2 estimation already guarantees DP. Similarly, (Smith et al., 2020) showed that the Flajolet-Martin sketch, one of the main approaches for L_0 estimation in the streaming model, also preserves DP. Notably, algorithmic designs for L_p estimation in the streaming model differ greatly and require individual analyses to ensure DP, which can be quite difficult due to the complexity of the various techniques. This is especially pronounced in the work of (Wang et al., 2021), who studied the p-stable sketch that estimates the L_p norm for p ∈ (0, 2] (Indyk, 2006). (Wang et al., 2021) showed that for p ∈ (0, 1], the p-stable sketch preserves DP, but was unable to show DP for p ∈ (1, 2], even though the general algorithmic approach remains the same. Thus the natural question is whether differential privacy can be guaranteed for an approach that simultaneously estimates the L_p norm in the streaming model for all p. More generally, the family of L_p norms are all symmetric norms, which are invariant under sign-flips and coordinate-wise permutations on an input data stream.
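To make Definition 1.1 concrete, the sketch below shows the standard Laplace mechanism, which achieves (ε, 0)-DP for any real-valued query whose value changes by at most a fixed sensitivity between neighboring streams (e.g., a single count when one update changes). This is a textbook illustration of the definition, not the mechanism developed in this paper.

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, eps, rng=random):
    """Release true_value + Lap(sensitivity / eps).

    Satisfies (eps, 0)-DP in the sense of Definition 1.1 (with delta = 0)
    for queries whose value differs by at most `sensitivity` on any pair
    of neighboring streams.
    """
    scale = sensitivity / eps
    # Inverse-CDF sampling of Laplace noise with scale b = sensitivity / eps.
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_value + noise

# Example: privately release a count of 100 with sensitivity 1 and eps = 1.
print(laplace_mechanism(100, 1, 1.0))
```

Smaller ε forces a larger noise scale, which is the basic accuracy-privacy tradeoff referenced throughout this section.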
Symmetric norms thus also include other important families of norms, such as the k-support norms and the top-k norms. In this paper, we show not only that there exists a differentially private algorithm for the estimation of symmetric norms in the streaming model, but also that there exists an algorithm that privately releases a set of statistics from which estimates of all (properly parametrized) symmetric norms can be simultaneously computed. To illustrate the difference, suppose we wanted to release approximations of the L_p norm of the stream for k different values of p. To guarantee (ε, δ)-DP for the set of k statistics, we would need, by advanced composition, to demand (O(ε/√k), O(δ/k))-DP from each of the k instances of a single differentially private L_p-estimation algorithm, corresponding to the k different values of p. Due to accuracy-privacy tradeoffs, the quality of the estimation degrades severely as k increases. By comparison, our algorithm releases a single set C of private statistics. By post-processing, we can then estimate the L_p norms for k different values of p while only requiring (ε, δ)-DP from C. Hence, our algorithm can simultaneously handle an arbitrarily large number of estimations of symmetric norms without compromising the quality of approximation.

Theorem 1.2. There exists an (ε, δ)-differentially private algorithm that outputs a set C, from which the (1 + α)-approximation to any norm with maximum modulus of concentration at most M can be computed, with probability at least 1 − δ. The algorithm uses M^2 · poly(1/α, 1/ε, log n, log(1/δ)) bits of space.

The maximum modulus of concentration of a norm measures the worst-case ratio of the maximum value to the median value of the norm on the L_2-unit sphere over any restriction of the coordinates, and it intuitively quantifies the complexity of computing the norm. For example, the L_1 norm is generally "easy" to compute and has maximum modulus of concentration O(log n).
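The degradation under composition can be seen with a quick back-of-the-envelope calculation. The sketch below applies the standard advanced-composition scaling (the constants used are indicative, not the tight bound) to show how the per-query ε budget shrinks as the number of released norms k grows, whereas a single released set C retains the full (ε, δ) budget for any number of post-processed estimates.

```python
import math

def per_query_eps(total_eps, total_delta, k):
    """Per-query epsilon so that k-fold advanced composition stays within
    roughly (total_eps, total_delta)-DP overall.

    This realizes the O(eps / sqrt(k)) scaling discussed above, using the
    common sqrt(2 k ln(1/delta)) denominator as an indicative constant.
    """
    if k == 1:
        return total_eps
    return total_eps / math.sqrt(2.0 * k * math.log(1.0 / total_delta))

eps, delta = 1.0, 1e-6
for k in (1, 10, 100, 1000):
    print(f"k={k:5d}  per-query eps ~ {per_query_eps(eps, delta, k):.4f}")
```

Even at k = 100 the per-query budget is already an order of magnitude below ε, which is why per-norm mechanisms become provably inaccurate at scale.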
We emphasize that, prior to our work, there was no algorithm that could handle private symmetric norm estimation, much less simultaneously for all parametrized symmetric norms. Although there are specific analyses for various norm estimation algorithms, e.g., see the discussion on related work in the supplementary material, these algorithms require a specific predetermined norm as their input. Thus a separate private algorithm must be run for each estimation, which increases the overall space. Moreover, for a large number of queries, the privacy parameter will need to be much smaller due to the composition of privacy, and thus to ensure privacy, the utility of each algorithm is provably poor. Our algorithm sidesteps both the space and accuracy problems and is the first and only work to do so.

Applications. We briefly describe a number of specific symmetric norms that are handled by Theorem 1.2 and commonly used across various applications in machine learning. We first note the following parameterization of the previously discussed L_p norms.



Footnote: L_p for p ∈ (0, 1) does not satisfy the triangle inequality and therefore is not a norm, but it is still well-defined and well-motivated and can be computed.

