PRIVATE DATA STREAM ANALYSIS FOR UNIVERSAL SYMMETRIC NORM ESTIMATION

Abstract

We study how to release summary statistics on a data stream subject to the constraint of differential privacy. In particular, we focus on releasing the family of symmetric norms, which are invariant under sign-flips and coordinate-wise permutations on an input data stream and include $L_p$ norms, $k$-support norms, top-$k$ norms, and the box norm as special cases. Although it may be possible to design and analyze a separate mechanism for each symmetric norm, we propose a general parametrizable framework that differentially privately releases a number of sufficient statistics from which approximations of all symmetric norms can be computed simultaneously. Our framework partitions the coordinates of the underlying frequency vector into levels based on their magnitude; in the important levels, it releases approximate frequencies for the "heavy" coordinates and approximate level sizes for the "light" coordinates. Surprisingly, our mechanism allows for the release of an arbitrary number of symmetric norm approximations without any overhead or additional loss in privacy. Moreover, our mechanism yields a $(1+\alpha)$-approximation to each of the symmetric norms and can be implemented using sublinear space in the streaming model for many regimes of the accuracy and privacy parameters.

1. INTRODUCTION

The family of $L_p$ norms represents important statistics on an underlying dataset, where the $L_p$ norm of an $n$-dimensional frequency vector $x$ is defined as the number of nonzero coordinates of $x$ for $p = 0$ and as $L_p(x) = (x_1^p + \dots + x_n^p)^{1/p}$ for $p > 0$. Thus, the $L_0$ norm counts the number of distinct elements in the dataset and is used, e.g., to detect denial of service or port scan attacks in network monitoring (Akella et al., 2003; Estan et al., 2003), to understand the magnitude of quantities such as search engine queries or internet graph connectivity in data mining (Palmer et al., 2001), to manage workload in database design (Finkelstein et al., 1988), and to select a minimum-cost query plan in query optimization (Selinger et al., 1979). The $L_1$ norm is the total number of elements in the dataset and is used, e.g., for data mining (Cormode et al., 2005) and hypothesis testing (Indyk & McGregor, 2008), while the $L_2$ norm is used, e.g., for training random forests in machine learning (Breiman, 2001), computing the Gini index in statistics (Lorenz, 1905; Gini, 1912), and network anomaly detection in traffic monitoring (Krishnamurthy et al., 2003; Thorup & Zhang, 2004). Consequently, $L_p$ estimation has been extensively studied in the data stream model (Alon et al., 1999; Indyk & Woodruff, 2005; Indyk, 2006; Li, 2008; Kane et al., 2011; Andoni, 2017; Braverman et al., 2018b; Ganguly & Woodruff, 2018; Woodruff & Zhou, 2020; 2021). The simplest streaming model is perhaps the insertion-only model, in which a sequence of $m$ updates increments coordinates of an $n$-dimensional frequency vector $x$ and the goal is to compute or approximate some statistic of $x$ in space that is sublinear in both $m$ and $n$. In many cases, the underlying dataset contains sensitive information that should not be leaked.
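
To make the definitions above concrete, the following minimal Python sketch builds the exact frequency vector of a small insertion-only stream and evaluates $L_0$, $L_1$, and $L_2$ on it. It is only an illustrative baseline: it stores the full frequency vector (so it is not sublinear in space) and provides no privacy guarantee. The helper names `frequency_vector` and `lp_norm` and the toy stream are assumptions made for this example and do not come from the mechanism studied here.

```python
from collections import Counter


def frequency_vector(stream):
    """Exact frequency vector of an insertion-only stream.

    Each update increments the coordinate of the item it names.
    Uses space linear in the number of distinct items and is NOT private.
    """
    return Counter(stream)


def lp_norm(freq, p):
    """L_p norm of a frequency vector given as a mapping item -> count.

    For p = 0 this is the number of nonzero coordinates; for p > 0 it is
    (sum_i x_i^p)^(1/p).
    """
    values = [abs(v) for v in freq.values() if v != 0]
    if p == 0:
        return len(values)
    return sum(v ** p for v in values) ** (1.0 / p)


# Toy stream (hypothetical data): "a" appears 3 times, "b" twice, "c" once.
stream = ["a", "b", "a", "c", "a", "b"]
x = frequency_vector(stream)

print(lp_norm(x, 0))  # number of distinct elements
print(lp_norm(x, 1))  # total number of elements in the stream
print(lp_norm(x, 2))  # Euclidean norm of the frequency vector
```

On this toy stream the frequency vector has coordinates 3, 2, and 1, so $L_0$ is the number of distinct items, $L_1$ equals the stream length $m = 6$, and $L_2 = \sqrt{14}$.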
Hence, an active line of work has focused on estimating $L_p$ norms for various values of $p$ while preserving differential privacy (Mir et al., 2011; Blocki et al., 2012; Smith et al., 2020; Bu et al., 2021; Wang et al., 2021).

Definition 1.1 (Differential privacy; Dwork et al., 2006). Given $\varepsilon > 0$ and $\delta \in (0, 1)$, a randomized algorithm $\mathcal{A} : U^* \to Y$ is $(\varepsilon, \delta)$-differentially private if, for every pair of neighboring streams $S$ and $S'$

