DENSITY SKETCHES FOR SAMPLING AND ESTIMATION
Anonymous authors
Paper under double-blind review

Abstract

There has been an exponential increase in the data generated worldwide. Insights into this data, led by machine learning (ML), have given rise to exciting applications such as recommendation engines and conversational agents. Often, data for these applications is generated at a rate faster than ML pipelines can consume it. In this paper, we propose Density Sketches (DS), a cheap and practical approach to reducing data redundancy in a streaming fashion. DS creates a succinct online summary of the data distribution. While DS does not store samples from the stream, we can sample unseen data on the fly from DS for downstream learning tasks. In this sense, DS can replace actual data in many machine learning pipelines, analogous to generative models. Importantly, unlike generative models, which do not have statistical guarantees, the sampling distribution of DS asymptotically converges to the underlying unknown density. Additionally, DS is a one-pass algorithm that can be computed on data streams in compute- and memory-constrained environments, including edge devices.

1. INTRODUCTION

With the advent of big data, the rate of data generation is exploding. For instance, Google receives around 3.8 million search queries per minute, amounting to over 5 billion data points, or terabytes of data, generated daily. Any processing over this data, such as training a recommendation model, suffers from this data explosion: by the time existing data is consumed, newer data is already available, and much of it must be discarded. How to reduce data storage is therefore a critical research direction. In this paper, we present Density Sketches (DS): an efficient, online data structure for reducing redundancy in data. Data often comes from an underlying unknown distribution, and one of the challenges in data reduction is preserving this distribution. DS approximately stores the data distribution in the form of a sketch. Using a DS, we can answer point-wise density estimation queries. Additionally, we can sample synthetic data from the sketch for use in downstream machine learning tasks. This paper shows that data sampled from DS asymptotically converges to the underlying unknown distribution. We can also view density sketches through the lens of coresets. Specifically, DS is a compressed version of grid coresets. Grid coresets are the oldest form of coresets and give lower additive errors than modern coresets. However, grid coresets are generally prohibitive, as their size is exponential in the dimension d. DS approximates grid coresets with memory usage that depends on the actual variety in the data rather than growing exponentially in d. Moreover, DS provides a streaming construction for this coreset. In this paper, we focus on the density estimation and sampling aspects of DS. Sampling from a distribution described by data requires estimating the underlying distribution. Popular methods to infer the distribution and sample from it belong to the following three categories: 1.
Parametric density estimation (Friedman et al., 2001); 2. Non-parametric estimation: histograms and kernel density estimators (KDE) (Scott, 2015); 3. Learning-based approaches such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and related methods (Goodfellow et al., 2014; 2016). Generally, parametric estimation is not suitable for modeling most real data, as it can lead to significant, unavoidable bias from the choice of the model (Scott, 2015). Learning the distribution, e.g., via neural networks, is one solution to this problem. Although learning-based methods have recently found remarkable success, they do not have theoretical guarantees on the distribution of generated samples. Histograms and KDEs, on the other hand, are theoretically well understood. These statistical estimators of density are known to uniformly converge to the underlying true distribution almost surely. This paper focuses on such estimators with theoretical guarantees. Storing histograms and sampling from them is expensive because of the exponential number of partitions (also known as bins). Histograms also suffer from the bin-edge problem: a slight variation in the data can lead to significant differences in the estimated densities. KDEs solve the bin-edge problem and give a smoother estimate of density. While sampling from a KDE is efficient, a KDE is expensive to store: it requires storing the entire data. Coresets for KDE are a good solution to the storage problem, but constructing coresets is typically quite expensive. In this work, we propose Density Sketches (DS), a compressed sketch of density constructed in an efficient streaming manner. DS does not store actual samples of the data, yet we can still efficiently produce samples from a KDE for specific kernels, which, in turn, approximates f(x).
Being a compressed sketch, DS exposes an accuracy-storage trade-off, which we analyze in Theorem 1.

2. PROBLEM STATEMENT AND RELATED WORK

Problem Statement: Formally, we want to create a data structure with the following properties: (1) It sketches density information. (2) The sketch size is much smaller than the data size and does not scale linearly with it. (3) The construction is streaming and efficient. (4) We do not store any samples in the data structure (for privacy reasons). (5) The sampling distribution, say f̂_S(x), obtained by sampling from the data structure approximates the true underlying distribution f(x). The problem we aim to solve can be considered a data reduction problem and has been widely pursued in the literature. Existing approaches can be broadly classified into two categories. (1) Sampling-based / coresets: Approaches such as clustering/importance sampling (Charikar & Siminelakis, 2017; Cortes & Scott, 2016; Chen et al., 2012) and coresets for KDE (Phillips & Tai, 2020; 2018) fall under this category. These approaches aim to find a small set of possibly weighted samples for a specific objective function, such that the result obtained by applying the function to this small set is within a small approximation error of the result obtained on the complete dataset. The issue with these approaches is efficiency: most of these algorithms require complicated computation over the entire data. Streaming algorithms were recently proposed for KDE coresets (Karnin & Liberty, 2019). However, even these algorithms need to perform O(m) computationally expensive operations per sample (where m is the compactor size) over large chunks of size m, making them unsuitable for our purposes. (2) Dimensionality reduction: These approaches aim to reduce the width of the data matrix. Approaches such as Principal Component Analysis (PCA) are computationally expensive and require iterative computation over the entire dataset. Random projections provide an efficient streaming algorithm for dimensionality reduction.
However, the compressed data still grows linearly with the original data size. As we can see, existing approaches fall short of the requirements in our problem statement.

3.1. HISTOGRAMS AND KERNEL DENSITY ESTIMATION

Histograms and KDE (Scott, 2015; Scott & Sain, 2004) are popular methods to estimate the density of a distribution given a finite i.i.d. sample of n points in R^d drawn from the true density, say f(x). Given a kernel function k and data D, the KDE at a point x is defined as

f̂_K(x) = KDE(x) = (1/n) Σ_{i=1}^{n} k(x, x_i), where x_i ∈ D.

Kernel functions are positive, symmetric, and may be normalized to integrate to 1. The Gaussian, Epanechnikov, and uniform kernels (Friedman et al., 2001) are among the most widely used. The kernel function is also parameterized by a smoothing parameter B, which determines the standard deviation for the Gaussian kernel. For the uniform and Epanechnikov kernels, B is the window width around x where the kernel is non-zero. As B increases, the bias of the KDE increases and its variance decreases.

3.2. COUNT SKETCH

The count sketch (CS) (Cormode & Muthukrishnan, 2009; Charikar et al., 2002), along with its variants, is one of the most popular probabilistic data structures for the heavy hitter problem. Given a stream of key-value pairs (a_t, c_t), a_t ∈ U, CS stores compressed total counts for each key in a small K × R array of integers and can be queried to retrieve the total count C(a_t). CS offers a probabilistic solution in memory logarithmic in the total number of unique keys, with a standard memory-accuracy trade-off. Let m be the number of distinct keys and C be the vector of counts indexed by key. For the count-median sketch (Charikar et al., 2002), the (ε, δ) guarantee

P(|Ĉ(a) − C(a)| > ε‖C‖₂) ≤ δ

is achieved using O((1/ε²) log(1/δ) (log m + log |U|)) space (Chakrabarti, 2020). As the guarantee shows, the approximation accuracy for a particular key depends on how its count compares to ‖C‖₂. Specifically, CS gives an excellent approximation for keys with the highest counts in a setting where most other keys have very low counts. More discussion on CS can be found in Appendix F.
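To make the CS update/query mechanics concrete, here is a minimal count-median sketch in Python. It is an illustrative sketch, not the paper's implementation: the class name and the tuple-hashing scheme for the bucket and sign hashes are our own simplifications.

```python
import random

class CountMedianSketch:
    """Minimal count-median sketch: K rows of R signed counters.

    Each row k has a bucket hash g_k and a sign hash s_k; queries take
    the median of the K per-row unbiased estimates.
    """
    def __init__(self, K=5, R=1024, seed=0):
        rng = random.Random(seed)
        self.K, self.R = K, R
        self.table = [[0] * R for _ in range(K)]
        # Per-row salts standing in for independent hash functions.
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(K)]

    def _g(self, key, row):          # bucket hash g: U -> {0..R-1}
        return hash((key, self.salts[row][0])) % self.R

    def _s(self, key, row):          # sign hash s: U -> {-1, +1}
        return 1 if hash((key, self.salts[row][1])) & 1 else -1

    def insert(self, key, count=1):
        for k in range(self.K):
            self.table[k][self._g(key, k)] += self._s(key, k) * count

    def query(self, key):
        ests = sorted(self.table[k][self._g(key, k)] * self._s(key, k)
                      for k in range(self.K))
        return ests[self.K // 2]     # median of the K row estimates
```

As in the text, a heavy key (one whose count dominates ‖C‖₂) is recovered almost exactly, while light keys suffer relatively larger noise from collisions.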

3.3. LOCALITY SENSITIVE HASHING

Locality-sensitive hashing (LSH) (Darrell et al., 2005) is a popular approach to solving approximate near-neighbor problems. If a function h : U → {0, ..., r − 1}, for some r, is randomly drawn from an LSH family L, the probability of collision of the hash values of two distinct elements a_1 and a_2 is

P_{h∈L}(h(a_1) == h(a_2)) ∝ Sim(a_1, a_2),

where Sim(a_1, a_2) is a similarity metric corresponding to the LSH family. The probability of collision is referred to as the kernel of the LSH family, generally denoted by φ(., .). Most kernels are positive, bounded, symmetric, and reflective. We can use p independent LSH functions h_1, h_2, ..., h_p to obtain an LSH function h^(p)(a) = (h_1(a), h_2(a), ..., h_p(a)). The function h^(p) has kernel ψ(., .) = φ(., .)^p. We call p the power of the LSH function. Popular LSH functions for U = R^d are L2-LSH, L1-LSH, and SRP (signed random projection). More details on LSH functions can be found in (Darrell et al., 2005).

3.4. UNIFORM SAMPLING FROM CONVEX POLYTOPES

Uniform sampling from convex spaces is a well-studied problem (Bélisle et al., 1993; Chen et al., 2017). For general convex polytopes, it is achieved by finding a point inside the polytope using convex feasibility algorithms and then running an MCMC walk inside the polytope to generate a point with uniform probability. For regular convex polytopes like hypercubes and parallelepipeds, uniform sampling is much simpler. Sampling a point uniformly at random in a d-dimensional hypercube of width 1 is equivalent to uniformly sampling d real values in the interval [0, 1]. For sampling within a d-dimensional parallelepiped, we first locate a (d − 1)-dimensional hyperplane parallel to each face at a distance drawn uniformly from [0, B], where B is the width of the parallelepiped in that direction. The sampled point is the intersection of these (d − 1)-dimensional hyperplanes.
Table 1: Partitioning schemes bin : R^d → N^d and sampling of s ∈ R^d from a bin ID b ∈ N^d.

Scheme             | Parameters                    | bin(x)                             | Sampling s from b
Regular histogram  | B ∈ R                         | bin(x)_i = ⌊x_i / B⌋               | r_i ~ U(0,1), r ∈ R^d; s = B(b + r)
Aligned histogram  | B ∈ R^d                       | bin(x)_i = ⌊x_i / B_i⌋             | r_i ~ U(0,1), r ∈ R^d; s = B ∘ (b + r)
L1/L2-LSH          | W ∈ R^{d×d}, B ∈ R^d, t ∈ R^d | bin(x)_i = ⌊(⟨x, W_i⟩ + t_i)/B_i⌋  | r_i ~ U(0,1), r ∈ R^d; y = B ∘ (b + r); solve Ws = y − t for s
SRP                | W ∈ R^{k×d}                   | bin(x)_i = sign(⟨x, W_i⟩)          | MCMC with constraints sign(⟨s, W_i⟩) = b_i, within a bounding box
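The first row of Table 1 (regular histogram) can be sketched in a few lines; the function names below are our own, and the sampling step is exactly the hypercube case of Section 3.4: s = B(b + r) with r_i ~ U(0,1).

```python
import math
import random

def bin_id(x, B):
    """Regular-histogram partition of Table 1: bin(x)_i = floor(x_i / B)."""
    return tuple(math.floor(xi / B) for xi in x)

def sample_from_bin(b, B, rng=random):
    """Uniform point in the width-B hypercube bin b: s = B * (b + r)."""
    return [B * (bi + rng.random()) for bi in b]
```

A quick round trip: a point sampled from bin b always falls back into bin b, which is what makes bin-level sampling equivalent to sampling from the histogram density.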

4. DENSITY SKETCHES

In DS, we aim to build a compressed non-parametric density estimator in an efficient streaming fashion. As KDEs give a better approximation of the underlying function f(x) than histograms, we want DS to act as a compressed KDE object. To achieve this, we use a useful connection between KDEs and histograms built with LSH-based partition functions.

4.1. HISTOGRAM WITH LSH-BASED PARTITION AND KERNEL DENSITY ESTIMATES

Any LSH function on R^d partitions the space into bins. Specifically, with a power-d L1/L2-LSH function, these partitions are polytopes in R^d. Similarly, a power-k SRP gives conical partitions with hyperplane boundaries. We can employ a histogram-based estimation strategy on top of these randomly drawn partitions. The density estimate using such a histogram is

f̂_H(x) ∝ (1/n) Σ_{i=1}^{n} I(x_i ∈ bin(x)), where x_i ∈ D

and I is the indicator function. This estimate of the density has an expected value (over random partitions) equal to the KDE estimate, say f̂_φ(x), with the corresponding LSH kernel φ(., .):

E_p(f̂_H(x)) = (1/n) Σ_{i=1}^{n} P(x_i ∈ bin(x)) = (1/n) Σ_{i=1}^{n} φ(x_i, x) = f̂_φ(x).

The expectation is over random partitions. This connection between randomized histograms and KDE was first observed in (Coleman & Shrivastava, 2020). To better approximate the KDE, we can combine results from multiple histograms with independent LSH functions. For example, if we use m independent histograms, say H_1, H_2, ..., H_m, the density estimate can be written as

f̂^(m)_H(x) = (1/m) Σ_{i=1}^{m} f̂_{H_i}(x).

We can sample a data point from this ensemble by first choosing a histogram at random and then sampling a point from that histogram. One can check that the sampling distribution thus obtained is f̂^(m)_H(x).
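The averaging over m random partitions can be illustrated in one dimension, where a randomly shifted histogram is a simple L1-LSH: the collision probability for offset t ~ U(0, B) is the triangular kernel max(0, 1 − |x − y|/B). This is an illustrative sketch under that 1-D assumption; the function name is hypothetical.

```python
import math
import random

def lsh_histogram_density(data, x, B=0.5, m=50, seed=0):
    """Average of m randomly shifted 1-D histograms.

    Its expectation over the random shifts equals a KDE with the
    triangular collision kernel of the shift-LSH, illustrating
    E_p(f_H(x)) = f_phi(x) from Section 4.1.
    """
    rng = random.Random(seed)
    n = len(data)
    est = 0.0
    for _ in range(m):
        t = rng.uniform(0, B)                       # random offset: 1-D L1-LSH
        bx = math.floor((x + t) / B)                # bin of the query point
        count = sum(1 for xi in data if math.floor((xi + t) / B) == bx)
        est += count / (n * B)                      # histogram density estimate
    return est / m
```

Points near a cluster of the data receive a higher estimate than isolated points, as a KDE would give.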

4.2. CONSTRUCTING DENSITY SKETCHES

Now that we have reduced the problem of KDE approximation to histograms, we show how to obtain a compressed representation of a histogram in a streaming fashion and how to generate samples from this representation. First, let us establish some notation. Notation: (1) Data D consists of n i.i.d. samples of dimension d drawn from the true distribution f(x) : S ⊂ R^d → R. (2) bin(x): ID of the partition in which point x falls. In the case of p-power LSH functions, bin(x) : R^d → N^p, and each bin can be identified with a unique tuple of p integers. In a regular histogram, we have a tuple of d integers; for example, in a regular histogram of width B, bin(x)_i = ⌊x_i / B⌋. bin is generally parameterized by a bandwidth parameter B, which measures the size of the partition. Some partitioning schemes and sampling algorithms are listed in Table 1. (3) A CS M with range R and K repetitions, as described in Section 3. (4) H: an augmented min-heap of size H used with M. Hence, for a given partitioning scheme (bin, B), DS is parameterized by (K, R, H) and comprises two data structures: M(K, R) and H(H). A histogram has an exponential (in d) number of partitions, so in high dimensions it is impractical to store histograms explicitly. However, most high-dimensional real data is clustered and thus induces highly sparse histograms. This alone does not help, as post-pruning a histogram still requires building and enumerating it. Nevertheless, the sparsity of the histogram makes it a good candidate for heavy hitter methods. We use a CS, M, to store a compressed version of the histogram. Unfortunately, sampling with just M does not have an efficient solution, so we additionally maintain a set of heavy partitions for sampling in the min-heap H. We discuss sampling in later subsections. Sketching M: As shown in figure 4 and Algorithm 1, we process the data in a streaming fashion. For each data point, say x, we find the partition b = bin(x).
We increment the count of b by 1 by inserting (b, 1) into M. Along with each insertion, we also update H. If H is not at capacity, we insert b into the heap along with its updated count Ĉ(b). If the heap is at capacity, we compare b's updated count against the minimum of H; if b's count is greater, we pop the minimum element from the heap and insert (b, Ĉ(b)).
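The one-pass construction above can be sketched as follows. For brevity a plain dict stands in for the count sketch M (swapping in a real CS only changes how counts are stored and estimated), the count refresh for in-heap bins is done by rebuilding the heap, and the function name is our own.

```python
import heapq
import math

def build_density_sketch(stream, B, H):
    """One pass over the stream: count bin occupancies and keep the H
    heaviest bins in a min-heap, as in Algorithm 1.

    `counts` is a dict stand-in for the count sketch M; `heap` holds
    (count, bin) pairs for the top-H partitions.
    """
    counts = {}                     # stand-in for M: bin id -> count estimate
    heap, in_heap = [], set()       # min-heap over counts of stored bins
    for x in stream:
        b = tuple(math.floor(xi / B) for xi in x)   # regular-histogram bin
        counts[b] = counts.get(b, 0) + 1
        c = counts[b]
        if b in in_heap:
            # refresh the stale count of b inside the heap (O(H) for brevity)
            heap = [(counts[bb], bb) for (_, bb) in heap]
            heapq.heapify(heap)
        elif len(heap) < H:
            heapq.heappush(heap, (c, b))
            in_heap.add(b)
        elif c > heap[0][0]:
            # evict the current minimum and admit the heavier bin b
            _, evicted = heapq.heapreplace(heap, (c, b))
            in_heap.discard(evicted)
            in_heap.add(b)
    return counts, heap
```

With an augmented heap (bin-to-position index), the refresh step becomes O(log H) per update instead of O(H).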

4.3. fC (x): ESTIMATE OF DENSITY AT A POINT

We can use M to query the density estimate at a particular point. The querying algorithm is presented in Algorithm 2 and illustrated in figure 2. Reusing notation from Section 3.1, the density predicted by the histogram can be written as

f̂_H(x) = C(bin(x)) / (n V(bin(x))).

When using the sketch, instead of the actual count C(bin(x)), we use its estimate Ĉ(bin(x)) from M. The density predicted using the count sketch is then

f̂_C(x) = Ĉ(bin(x)) / (n V(bin(x))).

We know from the CS literature that Ĉ(bin(x)) is closely distributed around C(bin(x)), so we can expect f̂_C(x) to be close to f̂_H(x) and hence to f(x). Note that although f̂_C(x) is a good estimate of the density at a point x, the function f̂_C(.) is not a density function, as it does not integrate to 1.

4.4. f̂*_C(x): ESTIMATE OF DENSITY FUNCTION

To obtain a density function from the sketch, we normalize f̂_C(x) over the support. We can write f̂*_C(x) ∝ Ĉ(bin(x)), so that

f̂*_C(x) = Ĉ(bin(x)) / ∫ Ĉ(bin(x)) dx = Ĉ(bin(x)) / (V(bin(x)) Σ_{b∈bins(S)} Ĉ(b)) = Ĉ(bin(x)) / (V(bin(x)) n̂).

It is easy to check that the integral can be written as the sum over all bins in the support. As is clear from the equations for f̂*_C(x) and f̂_C(x), n = Σ_{b∈bins(S)} C(b) is replaced by n̂ = Σ_{b∈bins(S)} Ĉ(b) to obtain a density function. One can check that n̂ is an estimate of n obtained using the count-sketch estimate for each bin.

4.5. f̂_S(x): SAMPLING FROM DENSITY SKETCHES

M is a good enough representation for querying the density at a point. However, it is not the best data structure for generating samples efficiently. One naive way of sampling from the sketch is to randomly select a point in the support of f(x) and then perform rejection sampling using the estimate f̂_C(x). However, given the enormous volume of the support in high dimensions, this method is bound to be immensely inefficient.

Algorithm 3 (sampling from DS): P(b) = H[b]/n̂_h if b ∈ H; P(b) = 0 if b ∉ H; draw b ~ P; return y = UniformRandomPoint(b).
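The point-density query f̂_C(x) of Section 4.3 is a one-line computation once the bin counts are available. A minimal sketch for a regular histogram, with a plain dict standing in for the count-sketch query Ĉ (the function name is our own):

```python
import math

def density_at(x, counts, n, B):
    """f_C(x) = C_hat(bin(x)) / (n * V(bin(x))).

    For a regular histogram of width B in d dimensions, V(bin) = B**d.
    `counts` maps bin IDs to (estimated) counts, standing in for M.
    """
    d = len(x)
    b = tuple(math.floor(xi / B) for xi in x)
    return counts.get(b, 0) / (n * B ** d)
```

Bins never seen in the stream return density 0, which is where f̂_C(.) fails to integrate to 1 once count-sketch noise enters the stored counts.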
Another way is to choose a partition with probability proportional to the count of elements in that partition and then sample a random point from the chosen partition. It is easy to check that the probability of sampling a point x in this manner is precisely f̂_H(x) if we use exact counts, and f̂*_C(x) if we use approximate counts from the CS. However, given that the number of bins is exponential in the dimension, sampling a bin proportional to its count requires prohibitive memory and computation; this is why we needed a CS in the first place. Here, we further approximate the distribution by storing only the top H partitions, which contain most of the data points, and discarding the rest. As mentioned above, we can efficiently maintain the top H partitions with an augmented heap H. We then sample a partition present in this heap with probability proportional to its count and sample a random data point from this partition (Algorithm 3). The probability of sampling a data point whose bin is not present in the augmented heap is then zero. The distribution of this sampling algorithm is

f̂_S(x) = I(bin(x) ∈ H) Ĉ(bin(x)) / (n̂_h V(bin(x))),

where n̂_h = Σ_{b∈H} Ĉ(b) is the count-sketch estimate of the total number of elements captured by all partitions present in the heap, and I(.) is the indicator function, taking values 0 or 1 according to the boolean statement inside it. Let ρ_h = n̂_h / n̂ be the capture ratio of the heap. It is easy to see that as the capture ratio tends to 1, f̂_S(x) tends to f̂*_C(x). Note that f̂_S(x) is a density function.
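The heap-based sampler of Algorithm 3 can be sketched as follows, for the regular-histogram partitioning of Table 1. This is an illustrative sketch: the heap content is passed as a {bin: count} dict, and the function name is hypothetical.

```python
import math
import random

def sample_from_sketch(heap_counts, B, rng=random):
    """Algorithm 3: pick a heavy bin with probability proportional to its
    estimated count, then return a uniform point inside that bin.

    `heap_counts` is the {bin id: C_hat(bin)} content of the heap H.
    """
    bins = list(heap_counts)
    n_h = sum(heap_counts[b] for b in bins)          # estimated captured mass
    weights = [heap_counts[b] / n_h for b in bins]
    b = rng.choices(bins, weights=weights, k=1)[0]   # b ~ P(b) = C_hat(b)/n_h
    return [B * (bi + rng.random()) for bi in b]     # uniform point in bin b
```

Bins outside the heap are never chosen, which is exactly the truncation that the term 12(1 − ρ_h)² in Theorem 1 accounts for.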

5. ANALYSIS

Histograms and kernel density estimators are well-studied non-parametric estimators of density. Both are capable of approximating a large class of functions (Scott, 2015). For example, under Lipschitz continuity of f(x), the pointwise MSE of f̂_H(x) converges to 0 at a rate of O(n^{-2/3}); better rates can be obtained for functions with continuous derivatives. In our analysis, we make assumptions along the lines of those in (Scott, 2015); specifically, the existence and boundedness of all function-dependent terms that appear in the theorems below. We refer the reader to (Scott, 2015) for an in-depth discussion of the assumptions. We restrict our analysis to convergence in probability for all estimators discussed in this paper, which is standard (Scott, 2015). In this section, we consider the regular histogram partitioning scheme and show that our sampling distribution f̂_S(x) approximates the underlying distribution f(x) and converges to it. A similar analysis holds for random partitioning schemes / KDE and is omitted here. Mean integrated square error (MISE): The MISE is a widely used tool to analyze the performance of a density estimator:

MISE(f̂) = E ∫ (f̂(x) − f(x))² dx.

A density estimator whose MISE asymptotically tends to zero is a consistent estimator of the true density and converges to it in probability. We use this tool to make statements about the convergence of our estimators. By Fubini's theorem, MISE is equal to IMSE (integrated mean square error):

MISE(f̂) = IMSE(f̂) = ∫ E(f̂(x) − f(x))² dx.

We now present our main result of the paper, Theorem 1 (Main Theorem: f̂_S(x) to f(x)).
The probability density function of sampling, f̂_S(x), using a DS over a regular histogram of width B, with parameters (K, R, H), created from n i.i.d. samples of the original density function f(x), has IMSE bounded by

IMSE(f̂_S) ≤ 12(1 − ρ_h)² + 3(1 + ε)² [ 1/(nB^d) − G(f)/n + o(1/n) + (n_nzp − 1)/(K R n B^d) ] + 3(1 + ε)² (B² d / 4) G(‖∇f‖₂) + 3ε [ 1 + 2G(f) + B√d ∫_{x∈S} f(x) ‖∇f(x)‖₂ dx ]

with probability (1 − δ), where δ = n_nzp / (ε² n K R), n_nzp is the number of non-empty bins in the histogram, ρ_h is the estimated capture ratio described in Section 4.5, and G(g) is the roughness, defined as ∫ g(x)² dx. The dependence of IMSE on properties of f(x), such as roughness, is standard (Scott, 2015) and cannot be avoided.

Interpretation: The estimator f̂_S(x) of f(x) is obtained by a series of approximations f(x) → f̂_H(x) → f̂_C(x) → f̂*_C(x) → f̂_S(x). Hence, to interpret this result, we break it down into multiple theorems so the reader can easily see which approximation step contributes which terms; we provide these details in the Appendix. We note a few things from the theorem above:
• As in the standard analysis of histograms, the curse of dimensionality also manifests in our theorem. B should go to zero, and n should increase faster than the rate at which B^d / n_nzp decreases (condition 1). Compared to standard histograms, this requires n to grow faster. Under these conditions on B and n, it is clear how the second and third terms go to zero.
• The magnitude of the fourth term is controlled via ε. The statement holds for any δ and ε related via δ = n_nzp / (ε² n K R). We can choose arbitrarily small ε and δ with large enough n / n_nzp, or by providing more resources and making KR large enough. For a fixed resource KR, this term goes to zero asymptotically as n grows faster than n_nzp, which is a sub-condition of condition 1.
• The term 12(1 − ρ_h)² shows the effect of the truncation that occurs from using only heavy partitions. This term is data dependent: the IMSE does not depend directly on H (the number of stored partitions) but on ρ_h. If we can capture the entire data in the heap (i.e., set H = n_nzp), the term adds no penalty to the IMSE. Thus H, via ρ_h, controls the accuracy-memory trade-off of DS.
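As a numerical sanity check on the variance analysis, the IMSE of a plain histogram can be estimated by Monte Carlo. A minimal sketch for the Uniform(0,1) density, where the bias term vanishes and IV ≈ 1/(nh) − G(f)/n dominates (the function name and trial counts are our own choices):

```python
import random

def imse_uniform(n, h, trials=200, seed=0):
    """Monte Carlo IMSE of a width-h histogram estimating f = Uniform(0,1).

    The integral over x is computed exactly per trial, since the estimate
    is piecewise constant: sum of per-bin squared errors times bin width h.
    """
    rng = random.Random(seed)
    nbins = round(1 / h)
    total = 0.0
    for _ in range(trials):
        counts = [0] * nbins
        for _ in range(n):
            counts[min(int(rng.random() / h), nbins - 1)] += 1
        # squared error of f_hat = count/(n*h) against f(x) = 1, per bin
        total += sum((c / (n * h) - 1.0) ** 2 for c in counts) * h
    return total / trials
```

Increasing n at fixed h shrinks the estimated IMSE roughly as 1/(nh), matching the leading integrated-variance term in the analysis.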

6. DISCUSSION

Curse of dimensionality: As DS is built over histograms, it inherits their curse of dimensionality: the number of samples needed increases exponentially with the dimension. With increased data collection, the unavailability of large amounts of data is fast becoming a non-issue, and we emphasize that DS's advantages are best seen when data is humongous: DS can absorb tons of data and give better density estimates and samples without increasing memory usage. Also, most real data in high dimensions is clustered or lies on a low-dimensional manifold. DS throws away empty bins and only stores the histogram's populated bins, so it can deal with the curse of dimensionality better than histograms. DS on the original data space: Some data types, like images, do not reside in a space where the usual distances or cosine similarities imply conceptual similarity. On such data types, DS will not perform well. One workaround is to learn a transformation and create sketches of the transformed data. While this will give better performance in practice, we might lose theoretical guarantees for certain transformations.

7. EXPERIMENTS

Visualization of samples from density sketches: In the first set of experiments, we provide a sanity check for DS by visualizing data generated from it. (1) In the first experiment, we draw samples from a DS built on MNIST (with H=5000). We should note that MNIST, with 784 dimensions and 60K samples, is not an ideal dataset for DS. In fact, with L2-LSH partitions the data is so scattered that every sample is contained in its own bin. If we make the bin width finer, we sample data points very close to a random sample from the original data. So even in the worst conditions, DS converges to a random sample, which we know is a good representation of the data. Figure 4(d) shows results again on MNIST; however, here we create conical partitions (using multiple signed random projections). While the L2-LSH partitions use power-784 L2-LSH functions, in this experiment we use a smaller number of SRP functions (10-25, increasing toward the lower rows of the image), thus promoting clustering. As expected, this coarse partitioning does show significant clustering; hence, the images drawn from the partitions look like averages of multiple samples from the original data. The results support that DS can create samples that resemble the original data. Evaluation of samples on classification tasks: For most datasets, it is not possible to inspect samples visually. Hence, we evaluate the quality of samples from DS by using them to train classification models. In these experiments, the data loader of the training algorithm is replaced with a sampler from DS, which returns a training batch when requested by the algorithm. All the experiments are performed on Tesla P-100 GPUs with 16GB memory. Datasets: We choose big datasets from the LIBLINEAR website (Chang & Lin, 2011) that satisfy two constraints: 1) data dimension less than 100, and 2) more than 1,000,000 samples per class. Large datasets are the main application domain for DS.
Thus, we use the Higgs (10M samples, 28 dimensions) and Susy (5M samples, 18 dimensions) datasets for our experiments. Baselines: We compare DS against random samples of the same size and the Liberty coresets proposed by (Karnin & Liberty, 2019). For Liberty coresets, we use m = 100, as the process is very slow for larger m. Dimensionality reduction via random projections is another streaming baseline, but on these datasets we cannot obtain significant compression from it. Effect of hyperparameters: A larger B implies that a single partition captures more space than needed, and a smaller B implies that the heap captures less of the data, so a sweet spot for B is expected. In figure 5(e), we fix (B=0.01, H=100000, KR=1000000) and vary K from 1 to 64. We can see that for a reasonable memory budget, the results are stable across K. For the experiments in figure 5(a-d), we fix (B=0.01, K=5, R=250000) and vary H, which gives DS of different sizes. We plot test accuracy and loss for DS, random samples, and Liberty coresets at different memory budgets. The width of each band is 2× the standard deviation over three independent runs. As can be seen, for the "Higgs" dataset, the accuracy achieved on the original data of size 2.5GB can be closely matched by a DS of size 50MB, i.e., around 50x compression. We see similar results for "Susy" (100x compression from 0.8GB). The results show that DS is much more informative than random samples and Liberty coresets. For more details on the experiments (data processing, memory measurements, etc.), refer to Appendix F.

Estimation of statistical properties of dataset:

We also perform experiments on the covariance estimation task. The observations are similar to the classification experiments: DS performs better than a random sample of the same size. The results are presented in Appendix F due to space constraints.

8. CONCLUSION

In this paper, we presented Density Sketches, a streaming algorithm that constructs a summary of the density distribution of data. We showed that new samples generated from this sketch asymptotically converge to the underlying distribution; thus, DS comes with theoretical guarantees. Additionally, the cheap nature of online updates makes Density Sketches an attractive alternative to constructing coresets for the data. In terms of coresets, DS can be viewed as a compressed form of randomized grid coresets, one of the oldest forms of coresets.

A.2 THEOREM 2: f̂_H TO f

While estimating the true distribution f(x) : R^d → R, the integrated mean square error (IMSE) for the estimator f̂_H(x), using a regular histogram of width h and n samples, is

IMSE(f̂_H) ≤ 1/(nh^d) − G(f)/n + o(1/n) + (h² d / 4) G(‖∇f‖₂).

Specifically, the integrated variance (IV) and integrated square bias (ISB) are

IV(f̂_H) = 1/(nh^d) − G(f)/n − o(1/n) and ISB(f̂_H) ≤ (h² d / 4) G(‖∇f‖₂),

where G(φ) is the roughness of the function φ, defined as G(φ) = ∫ φ²(x) dx.

Proof. Let x ∈ S, where S is the support of the distribution. Let bin(x) denote the bin of point x, let bins(S) enumerate all the bins that intersect the support S of f(x), and let V(bin(x)) be the volume of the bin in which x lies. Equivalently, we use V(b) to denote the volume of bin b; for a standard histogram, V(b) = h^d. Then

f̂_H(x) = (1/(n V(bin(x)))) Σ_{i=1}^{n} I(x_i ∈ bin(x)).

First, consider the integrated variance:

IV = ∫_{x∈S} Var(f̂_H(x)) dx = Σ_{b∈bins(S)} ∫_{x∈b} Var(f̂_H(x)) dx.     (2)

For a particular bin b, the variance is constant at all values of x inside it. Also, for a particular x in bin b, independence of the samples gives

Var(f̂_H(x)) = (1/(n V(bin(x))²)) Var(I(x_i ∈ bin(x))).     (3)

Also, Var(I(x_i ∈ b)) = p_b(1 − p_b), where p_b is the probability of x_i lying in bin b.
That is, p_b = ∫_{x∈b} f(x) dx. Using this in equation 2,

IV = Σ_{b∈bins(S)} V(b) · (1/(n V(b)²)) p_b (1 − p_b).

Simplifying,

IV = Σ_{b∈bins(S)} (1/(n V(b))) p_b (1 − p_b).

For a standard histogram, V(b) is the same across bins, so

IV = (1/(n V(b))) ( Σ_{b∈bins(S)} p_b − Σ_{b∈bins(S)} p_b² )     (6)
   = (1/(n V(b))) ( 1 − Σ_{b∈bins(S)} p_b² ).

Using the mean value theorem, we can write p_b = V(b) f(ξ_b) for some point ξ_b ∈ b, so

Σ_{b∈bins} p_b² = Σ_{b∈bins} V(b)² f(ξ_b)² = V(b) Σ_{b∈bins} V(b) f(ξ_b)².

Using the Riemann integral approximation as the bin size reduces,

Σ_{b∈bins} V(b) f(ξ_b)² = ∫_{x∈S} f²(x) dx + o(1).

∫_{x∈S} f²(x) dx is also known as the roughness of the function; denote it by G(f). Hence

IV = (1/(n V(b))) ( 1 − V(b)(G(f) + o(1)) ) = 1/(n V(b)) − G(f)/n − o(1/n).     (11)

Putting V(b) = h^d,

IV = 1/(n h^d) − G(f)/n − o(1/n).

Keeping only the leading term in the above expression, IV = O(1/(n h^d)).

Now let us look at the ISB for this estimator:

ISB(f̂_H) = ∫_{x∈S} ( E(f̂_H(x)) − f(x) )² dx.

The expected value of the estimator is

E(f̂_H(x)) = (1/V(bin(x))) ∫_{t∈bin(x)} f(t) dt.     (15)

Recall that x ∈ R^d. Using the second-order multivariate Taylor expansion of f(t) around x, we get

f(t) = f(x) + ⟨t − x, ∇f(x)⟩ + (1/2)(t − x)ᵀ H(f(x)) (t − x),

where H(f(x)) is the Hessian of f at x. Without loss of generality, consider the bin at the origin, bin(x) = [0, h]^d; call it b_0. Then

∫_{t∈b_0} f(t) dt = h^d f(x) + h^d ⟨h/2 − x, ∇f(x)⟩ + O(h^{d+2}).     (17)

Using eq. 17 in eq. 15, we get

E(f̂_H(x)) = f(x) + ⟨h/2 − x, ∇f(x)⟩ + O(h²).

Hence, keeping only the leading term, we have

Bias(f̂_H(x)) = ⟨h/2 − x, ∇f(x)⟩.

Using the Cauchy-Schwarz inequality, we get

∫_{x∈b_0} Bias(f̂_H(x))² dx ≤ ∫_{x∈b_0} ‖h/2 − x‖₂² ‖∇f(x)‖₂² dx.

As [h/2, h/2, ..., h/2] is the midpoint of the bin, the maximum norm of x − h/2 is h√d/2, so

∫_{x∈b_0} Bias(f̂_H(x))² dx ≤ (h² d / 4) ∫_{x∈b_0} ‖∇f(x)‖₂² dx.

Summing over bins,

ISB(f̂_H) = Σ_{b∈bins} ∫_{x∈b} Bias(f̂_H(x))² dx ≤ (h² d / 4) ∫_{x∈S} ‖∇f(x)‖₂² dx = (h² d / 4) G(‖∇f‖₂).     (23)

A.3 THEOREM 3: f̂_C TO f̂_H

Proof. Consider a count sketch with range R and just one repetition (i.e., K = 1).
Let it be parameterized by the randomly drawn hash functions g : bins(S) -→ {0, 1, 2, ..., R -1} and s : bins(S) -→ {-1, +1}. Let C(bin(x)) n i=1 (I(x i ∈ bin(x)) is the count of elements that lie inside the bin(x) ISB( fH ) = b∈bins x∈b0 Bias( fH (x)) 2 dx ≤ h 2 d 4 x∈S ∇f (x) 2 2 dx (23) ISB( fH ) ≤ h 2 d 4 G( ∇f 2 ) The estimate of density at point x can then be written as fC (x) = 1 nV (bin(x)) C(bin(x)) + n i=1 I x i / ∈ bin(x) ∧ g(bin(x i )) = g(bin(x)) s(bin(x i ))s(bin(x)) (25) We can rewrite this as , fC (x) = fH (x)+ 1 nV (bin(x)) n i=1 I x i / ∈ bin(x) ∧ g(bin(x i )) = g(bin(x)) s(bin(x i ))s(bin(x)) As E(s(b)) = 0, it can be clearly seen that. E( fC (x)) = E( fH (x)) Hence, it follows that ISB( fC (x)) = ISB( fH (x)) It can be checked that each of the terms in the summation for right hand side of equation 26 including the terms in fH (x) are independent to each other . i.e. covariance between them is zero. Hence we can write the variance of our estimator as, Var( fC (x)) = Var( fH (x))+ (29) 1 nV 2 (bin(x)) Var I x i / ∈ bin(x) ∧ g(bin(x i ))=g(bin(x)) s(bin(x i ))s(bin(x)) Var( fC (x)) = Var( fH (x))+ (31) 1 nV 2 (bin(x)) E I x i / ∈ bin(x) ∧ g(bin(x i ))=g(bin(x)) 2 (32) As square of indicator is just the indicator, Var( fC (x)) = Var( fH (x))+ (33) 1 nV 2 (bin(x)) E I x i / ∈ bin(x) ∧ g(bin(x i ))=g(bin(x)) (34) V ar( fC (x)) = V ar( fH (x)) + 1 nV 2 (bin(x)) (1 -p bin(x) ) 1 R ) Hence, IV is IV( fC (x)) = IV( fH (x)) + x∈S 1 nV 2 (bin(x)) (1 -p bin(x) ) 1 R ) (36) IV( fC (x)) = IV( fH (x)) + b∈bins(S) x∈b 1 nV 2 (b) (1 -p b ) 1 R ) (37) IV( fC (x)) = IV( fH (x)) + b∈bins(S) 1 nV (b) (1 -p b ) 1 R ) (38) Assuming standard partitions. V (b) = h d for all b IV( fC (x)) = IV( fH (x)) + 1 nh d (n nzp -1) R With mean recovery, with K repetitions, the analysis can be easily extended to get IV as IV( fC (x)) = IV( fH (x)) + 1 nh d (n nzp -1) KR The ISB remains same in this case. 
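The single-repetition estimator in equation 25 can be made concrete with a toy implementation. The following is a minimal Python sketch of our own (class and function names are ours; 1-d data with bins of width $h$ is assumed, not the paper's code): it lazily draws the hash functions $g$ and $s$, sketches binned samples, and queries the density as $\hat{C}(\mathrm{bin}(x)) / (n\,V(\mathrm{bin}(x)))$.

```python
import math
import random

class CountSketchRow:
    """One repetition (K = 1) of a CountSketch over histogram bins,
    mirroring the hash functions g (bucket) and s (sign) above."""
    def __init__(self, R, seed=0):
        self.R = R
        self.table = [0.0] * R
        self._g = {}            # bin -> bucket in {0, ..., R-1}
        self._s = {}            # bin -> sign in {-1, +1}
        self._rng = random.Random(seed)

    def _hashes(self, b):
        if b not in self._g:    # lazily draw g(b) and s(b)
            self._g[b] = self._rng.randrange(self.R)
            self._s[b] = self._rng.choice((-1, 1))
        return self._g[b], self._s[b]

    def insert(self, b):
        g, s = self._hashes(b)
        self.table[g] += s

    def query(self, b):         # unbiased estimate of the bin count C(b)
        g, s = self._hashes(b)
        return s * self.table[g]

def density_estimate(cs, x, n, h):
    """f_C(x) = C_hat(bin(x)) / (n * V(bin(x))), with V(b) = h in 1-d."""
    return cs.query(math.floor(x / h)) / (n * h)

# sketch n uniform samples on [0, 1) into width-h bins
n, h = 10000, 0.1
data_rng = random.Random(42)
cs = CountSketchRow(R=1 << 20)
for _ in range(n):
    cs.insert(math.floor(data_rng.random() / h))
est = density_estimate(cs, 0.55, n, h)   # true density is 1.0
```

With only ten non-empty bins and a large range $R$, collisions are rare and the estimate essentially matches the exact histogram count; shrinking $R$ makes the extra variance term $(n_{\mathrm{nzp}}-1)/(Rnh^d)$ from the proof above visible.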
A.4 THEOREM 4: $\hat{f}^*_C$ TO $\hat{f}_C$

While estimating a true distribution $f(x): \mathbb{R}^d \to \mathbb{R}$, the IMSE for the estimator $\hat{f}^*_C(x)$, using a regular histogram with width $h$, $n$ samples, and a CountSketch with parameters ($R$: range, $K$: repetitions), is related to that of the estimator $\hat{f}_C(x)$ as follows:

$$\mathrm{IMSE}(\hat{f}_C) - \epsilon(N + 2M) \le \mathrm{IMSE}(\hat{f}^*_C) \le \mathrm{IMSE}(\hat{f}_C) + \epsilon(N + 2M).$$

Specifically,

$$\mathrm{IV}(\hat{f}_C) - 2\epsilon M \le \mathrm{IV}(\hat{f}^*_C) \le \mathrm{IV}(\hat{f}_C) + 2\epsilon M \quad\text{and}\quad \mathrm{ISB}(\hat{f}_C) - \epsilon N \le \mathrm{ISB}(\hat{f}^*_C) \le \mathrm{ISB}(\hat{f}_C) + \epsilon N,$$

where

$$M \le \mathrm{IV}(\hat{f}_C) + 2\Big(G(f) + \frac{h^2 d}{4} G(\|\nabla f\|_2) + h\sqrt{d}\int_{x\in S} f(x)\,\|\nabla f(x)\|_2\,dx\Big), \qquad N = 1 + \mathrm{ISB}(\hat{f}_C),$$

with probability $(1-\delta)$, where $\delta = \frac{n_{\mathrm{nzp}}}{\epsilon^2 nR}$.

Proof. Let $\hat{n} = \sum_b \hat{C}(b)$ be the total count recovered from the sketch:

$$\hat{n} = \sum_{b,i}\Big[\mathbb{I}(x_i\in b) + \mathbb{I}\big(x_i\notin b \wedge g(\mathrm{bin}(x_i)) = g(b)\big)\, s(\mathrm{bin}(x_i))\, s(b)\Big]. \tag{42}$$

Note that $\mathbb{E}(\hat{n}) = n$. For the variance, observe that most of the terms in the summation have covariance 0, except the terms $\mathrm{Cov}(\mathbb{I}(x_i\in b_1), \mathbb{I}(x_i\in b_2))$, which are negatively correlated. Hence

$$\mathrm{Var}(\hat{n}) = \sum_{b,i}\Big[\mathrm{Var}\big(\mathbb{I}(x_i\in b)\big) + \mathrm{Var}\big(\mathbb{I}(x_i\notin b \wedge g(\mathrm{bin}(x_i)) = g(b))\, s(\mathrm{bin}(x_i))\, s(b)\big)\Big] + 2\sum_{i}\sum_{b_1\ne b_2}\mathrm{Cov}\big(\mathbb{I}(x_i\in b_1), \mathbb{I}(x_i\in b_2)\big).$$

We know that

$$\mathrm{Var}\big(\mathbb{I}(x_i\in b)\big) = p_b(1-p_b), \qquad \mathrm{Var}\big(\mathbb{I}(x_i\notin b \wedge g(\mathrm{bin}(x_i)) = g(b))\, s(\mathrm{bin}(x_i))\, s(b)\big) = \mathbb{E}\Big(\mathbb{I}\big(x_i\notin b \wedge g(\mathrm{bin}(x_i)) = g(b)\big)^2\Big) = \frac{1-p_b}{R},$$

$$\mathrm{Cov}\big(\mathbb{I}(x_i\in b_1), \mathbb{I}(x_i\in b_2)\big) = -p_{b_1} p_{b_2}.$$

Pointwise variance and IV. With probability $1-\delta$, $\hat{n}$ is within $\epsilon$ multiplicative error of $n$, so $\hat{f}^*_C(x) = \frac{n}{\hat{n}}\hat{f}_C(x)$ lies between $\frac{\hat{f}_C(x)}{1+\epsilon}$ and $\frac{\hat{f}_C(x)}{1-\epsilon}$. Using similar arguments,

$$\frac{\mathbb{E}(\hat{f}_C^2(x))}{(1+\epsilon)^2} - \frac{\mathbb{E}^2(\hat{f}_C(x))}{(1-\epsilon)^2} \le \mathrm{Var}(\hat{f}^*_C(x)) \le \frac{\mathbb{E}(\hat{f}_C^2(x))}{(1-\epsilon)^2} - \frac{\mathbb{E}^2(\hat{f}_C(x))}{(1+\epsilon)^2}. \tag{63}$$

Again making first-order Taylor expansions of the denominators and ignoring $\epsilon^2$ terms,

$$\mathrm{Var}(\hat{f}_C(x)) - 2\epsilon\big(\mathbb{E}(\hat{f}_C^2(x)) + \mathbb{E}^2(\hat{f}_C(x))\big) \le \mathrm{Var}(\hat{f}^*_C(x)) \le \mathrm{Var}(\hat{f}_C(x)) + 2\epsilon\big(\mathbb{E}(\hat{f}_C^2(x)) + \mathbb{E}^2(\hat{f}_C(x))\big). \tag{64}$$

Since $\mathrm{Var}(\hat{f}_C(x)) = \mathbb{E}(\hat{f}_C^2(x)) - \mathbb{E}^2(\hat{f}_C(x))$,

$$\mathrm{Var}(\hat{f}_C(x)) - 2\epsilon\big(\mathrm{Var}(\hat{f}_C(x)) + 2\mathbb{E}^2(\hat{f}_C(x))\big) \le \mathrm{Var}(\hat{f}^*_C(x)) \le \mathrm{Var}(\hat{f}_C(x)) + 2\epsilon\big(\mathrm{Var}(\hat{f}_C(x)) + 2\mathbb{E}^2(\hat{f}_C(x))\big). \tag{65}$$

Integrating over $S$,

$$\mathrm{IV}(\hat{f}_C) - 2\epsilon\Big(\mathrm{IV}(\hat{f}_C) + 2\int_S \mathbb{E}^2(\hat{f}_C(x))\,dx\Big) \le \mathrm{IV}(\hat{f}^*_C) \le \mathrm{IV}(\hat{f}_C) + 2\epsilon\Big(\mathrm{IV}(\hat{f}_C) + 2\int_S \mathbb{E}^2(\hat{f}_C(x))\,dx\Big). \tag{66}$$

Let us now bound $\int_S \mathbb{E}^2(\hat{f}_C(x))\,dx$. Since $\mathbb{E}(\hat{f}_C(x)) = \mathbb{E}(\hat{f}_H(x))$,

$$\int_S \mathbb{E}^2(\hat{f}_C(x))\,dx = \int_S \mathbb{E}^2(\hat{f}_H(x))\,dx. \tag{67}$$

From equation 18,

$$\mathbb{E}(\hat{f}_H(x))^2 = f(x)^2 + \Big\langle \frac{h}{2}-x, \nabla f(x)\Big\rangle^2 + 2 f(x)\Big\langle \frac{h}{2}-x, \nabla f(x)\Big\rangle,$$

so

$$\int_S \mathbb{E}^2(\hat{f}_H(x))\,dx \le G(f) + \frac{h^2 d}{4} G(\|\nabla f\|_2) + h\sqrt{d}\int_S f(x)\,\|\nabla f(x)\|_2\,dx.$$

Hence

$$\mathrm{IV}(\hat{f}_C) - 2\epsilon M \le \mathrm{IV}(\hat{f}^*_C) \le \mathrm{IV}(\hat{f}_C) + 2\epsilon M, \quad\text{where } M \le \mathrm{IV}(\hat{f}_C) + 2\Big(G(f) + \frac{h^2 d}{4} G(\|\nabla f\|_2) + h\sqrt{d}\int_S f(x)\,\|\nabla f(x)\|_2\,dx\Big).$$

A.5 LEMMA 1

Estimators $\hat{f}_S(x)$ and $\hat{f}^*_C(x)$, obtained from the Density Sketch with parameters (R, K, H) using a histogram of width $h$ built over $n$ i.i.d. samples drawn from the true distribution, satisfy

$$\int |\hat{f}^*_C(x) - \hat{f}_S(x)|\,dx = 2(1-\rho_h),$$

where $\rho_h$ is the capture ratio as defined in Section 3.

Proof.

$$\int |\hat{f}^*_C(x) - \hat{f}_S(x)|\,dx = \sum_{b\in\mathrm{bins}}\int_{x\in b} |\hat{f}^*_C(x) - \hat{f}_S(x)|\,dx \tag{71}$$

$$= \sum_{b\in\mathrm{bins}(H)}\int_{x\in b}|\hat{f}^*_C(x) - \hat{f}_S(x)|\,dx + \sum_{b\notin\mathrm{bins}(H)}\int_{x\in b}|\hat{f}^*_C(x) - \hat{f}_S(x)|\,dx. \tag{72}$$

We know that for $x\in b$ with $b\notin\mathrm{bins}(H)$, $\hat{f}_S(x) = 0$.
Hence,

$$\int|\hat{f}^*_C(x) - \hat{f}_S(x)|\,dx = \sum_{b\in\mathrm{bins}(H)}\int_{x\in b}|\hat{f}^*_C(x) - \hat{f}_S(x)|\,dx + \sum_{b\notin\mathrm{bins}(H)}\int_{x\in b}\hat{f}^*_C(x)\,dx. \tag{73}$$

$\int_{x\in b}\hat{f}^*_C(x)\,dx$ is the probability of a data point lying in that bucket according to $\hat{f}^*_C(x)$, i.e. $\hat{c}_b/\hat{n}$:

$$\int|\hat{f}^*_C(x) - \hat{f}_S(x)|\,dx = \sum_{b\in\mathrm{bins}(H)}\int_{x\in b}|\hat{f}^*_C(x) - \hat{f}_S(x)|\,dx + \sum_{b\notin\mathrm{bins}(H)}\frac{\hat{c}_b}{\hat{n}}. \tag{74}$$

For points $x\in b$ with $b\in\mathrm{bins}(H)$, $\hat{f}^*_C(x)\cdot\hat{n} = \hat{f}_S(x)\cdot\hat{n}_h$; hence $\hat{f}_S(x) = \frac{\hat{n}}{\hat{n}_h}\hat{f}^*_C(x)$ and

$$\int|\hat{f}^*_C(x) - \hat{f}_S(x)|\,dx = \sum_{b\in\mathrm{bins}(H)}\int_{x\in b}\hat{f}^*_C(x)\Big(\frac{\hat{n}}{\hat{n}_h}-1\Big)\,dx + \sum_{b\notin\mathrm{bins}(H)}\frac{\hat{c}_b}{\hat{n}} \tag{75}$$

$$= \Big(\frac{\hat{n}}{\hat{n}_h}-1\Big)\sum_{b\in\mathrm{bins}(H)}\frac{\hat{c}_b}{\hat{n}} + \sum_{b\notin\mathrm{bins}(H)}\frac{\hat{c}_b}{\hat{n}} \tag{77}$$

$$= \Big(\frac{\hat{n}}{\hat{n}_h}-1\Big)\frac{\hat{n}_h}{\hat{n}} + \frac{\hat{n}-\hat{n}_h}{\hat{n}} = \Big(1-\frac{\hat{n}_h}{\hat{n}}\Big) + \frac{\hat{n}-\hat{n}_h}{\hat{n}} = 2\Big(1-\frac{\hat{n}_h}{\hat{n}}\Big) = 2(1-\rho_h). \tag{78}$$

A.6 THEOREM 5

The IMSE of the estimator $\hat{f}_S(x)$, obtained from the Density Sketch with parameters (R, K, H) using a histogram of width $h$ built over $n$ i.i.d. samples drawn from the true distribution $f(x)$, is

$$\mathrm{IMSE}(\hat{f}_S) \le 12(1-\rho_h)^2 + 3\,\mathrm{IMSE}(\hat{f}^*_C),$$

where $\rho_h$ is the capture ratio as defined in Section 3.

Proof. We give a very loose relation between $\hat{f}_S$ and $f$. We can write

$$\int\big(\hat{f}_S(x) - f(x)\big)^2\,dx = \int\Big(\big(\hat{f}_S(x) - \hat{f}^*_C(x)\big) + \big(\hat{f}^*_C(x) - f(x)\big)\Big)^2\,dx \tag{82}$$

$$\le 3\int\big(\hat{f}_S(x) - \hat{f}^*_C(x)\big)^2\,dx + 3\int\big(\hat{f}^*_C(x) - f(x)\big)^2\,dx \tag{83}$$

$$\le 3\Big(\int\big|\hat{f}_S(x) - \hat{f}^*_C(x)\big|\,dx\Big)^2 + 3\int\big(\hat{f}^*_C(x) - f(x)\big)^2\,dx \tag{84}$$

$$\le 12(1-\rho_h)^2 + 3\int\big(\hat{f}^*_C(x) - f(x)\big)^2\,dx. \tag{85}$$

Hence $\mathrm{IMSE}(\hat{f}_S) \le 12(1-\rho_h)^2 + 3\,\mathrm{IMSE}(\hat{f}^*_C)$.

B THEOREM 1 (MAIN THEOREM): COMBINING ALL OTHER THEOREMS

This theorem directly relates the distribution $\hat{f}_S(x)$ to the true distribution $f(x)$.
We will combine the following statements:

$$\mathrm{IMSE}(\hat{f}_H) \le \frac{1}{nh^d} + \frac{G(f)}{n} + o\Big(\frac{1}{n}\Big) + \frac{h^2 d}{4} G(\|\nabla f\|_2) \tag{87}$$

$$\mathrm{IMSE}(\hat{f}_C) = \mathrm{IMSE}(\hat{f}_H) + \frac{n_{\mathrm{nzp}}-1}{KRnh^d} \tag{88}$$

$$|\mathrm{IMSE}(\hat{f}^*_C) - \mathrm{IMSE}(\hat{f}_C)| \le \epsilon(N + 2M) \text{ with probability } (1-\delta),\ \delta = \frac{n_{\mathrm{nzp}}}{\epsilon^2 nR} \tag{89}$$

$$\mathrm{IMSE}(\hat{f}_S) \le 12(1-\rho_h)^2 + 3\,\mathrm{IMSE}(\hat{f}^*_C) \tag{90}$$

where

$$M \le \mathrm{IV}(\hat{f}_C) + 2\Big(G(f) + \frac{h^2 d}{4}G(\|\nabla f\|_2) + h\sqrt{d}\int_{x\in S} f(x)\,\|\nabla f(x)\|_2\,dx\Big), \qquad N = 1 + \mathrm{ISB}(\hat{f}_C).$$

Let us now combine them:

$$\mathrm{IMSE}(\hat{f}_S) \le 12(1-\rho_h)^2 + 3\,\mathrm{IMSE}(\hat{f}^*_C) \tag{91}$$

$$\le 12(1-\rho_h)^2 + 3\Big(\mathrm{IMSE}(\hat{f}_C) + \epsilon(N+2M)\Big) \tag{92}$$

$$\le 12(1-\rho_h)^2 + 3\Big(\mathrm{IMSE}(\hat{f}_H) + \frac{n_{\mathrm{nzp}}-1}{KRnh^d} + \epsilon(N+2M)\Big) \tag{93}$$

$$\le 12(1-\rho_h)^2 + 3\Big(\frac{1}{nh^d} + \frac{G(f)}{n} + o\Big(\frac{1}{n}\Big) + \frac{h^2 d}{4}G(\|\nabla f\|_2) + \frac{n_{\mathrm{nzp}}-1}{KRnh^d} + \epsilon(N+2M)\Big).$$

Further,

$$N \le 1 + \frac{h^2 d}{4} G(\|\nabla f\|_2)$$

and

$$M \le \mathrm{IV}(\hat{f}_H) + \frac{n_{\mathrm{nzp}}-1}{KRnh^d} + 2\Big(G(f) + \frac{h^2 d}{4}G(\|\nabla f\|_2) + h\sqrt{d}\int_S f(x)\,\|\nabla f(x)\|_2\,dx\Big)$$

$$\le \frac{1}{nh^d} + \frac{G(f)}{n} + o\Big(\frac{1}{n}\Big) + \frac{n_{\mathrm{nzp}}-1}{KRnh^d} + 2G(f) + \frac{h^2 d}{2}G(\|\nabla f\|_2) + 2h\sqrt{d}\int_S f(x)\,\|\nabla f(x)\|_2\,dx.$$

Substituting these bounds and collecting terms,

$$\mathrm{IMSE}(\hat{f}_S) \le 12(1-\rho_h)^2 + 3(1+2\epsilon)\Big(\frac{1}{nh^d} + \frac{G(f)}{n} + o\Big(\frac{1}{n}\Big) + \frac{n_{\mathrm{nzp}}-1}{KRnh^d}\Big) + 3(1+5\epsilon)\,\frac{h^2 d}{4}G(\|\nabla f\|_2) + 3\epsilon\Big(1 + 4G(f) + 4h\sqrt{d}\int_S f(x)\,\|\nabla f(x)\|_2\,dx\Big).$$

C OTHER BASELINES

Coresets: We considered a comparison with sophisticated data summaries such as coresets. Briefly, a coreset is a collection of (possibly weighted) points that can be used to estimate functions over the dataset. To use coresets to generate a synthetic dataset, we would need to estimate the KDE. Unfortunately, coresets for the KDE suffer from practical issues such as a large memory cost to construct the point set. Despite recent progress toward coresets in the streaming setting Phillips & Tai (2020), coresets remain difficult to implement for real-world KDE problems Charikar & Siminelakis (2017). Clustering and Importance Sampling: Another reasonable strategy is to represent the dataset as a collection of weighted cluster centers, which may be used to compute the KDE and sample synthetic points. Unfortunately, algorithms such as k-means clustering are inappropriate for large streaming datasets and do not have the same mergeability properties as our sketch. Furthermore, such techniques are unlikely to substantially improve over random sampling when the samples are spread sufficiently well over the support of the distribution. An alternative approach is to select points from the dataset based on importance sampling Charikar & Siminelakis (2017), geometric properties Cortes & Scott (2016), and other sampling techniques Chen et al. (2012). However, recent experiments show that for many real-world datasets, random samples have competitive performance compared to point sets obtained via importance sampling and cluster-based approaches Coleman & Shrivastava (2020). Dimensionality Reduction: One can also apply sketching algorithms to compress a dataset by reducing the dimension of each data point via feature hashing, random projections, or similar methods Achlioptas (2003). However, this is unlikely to perform well in our evaluation since our datasets are already relatively low-dimensional.
Such algorithms also fail to address the streaming setting, where N can grow very large, because the size of the compressed representation is linear in N . Finally, most dimensionality reduction algorithms do not easily permit the generation of more synthetic data in the original metric space.

D DIFFERENTIALLY PRIVATE DENSITY SKETCHES

In order to make the density sketch differentially private, we add noise to the distribution stored by the density sketch. This is achieved by adding noise to the underlying count sketch array (a $K \times R$ matrix of integers). Let the function mapping the histogram of the data to the density sketch (before heap construction) be denoted $f: \mathbb{N}^{|X|} \to \mathbb{Z}^{KR}$, where $X$ is the set of all partitions. We will first define a discrete analog of Laplacian noise. Definition 1 (Double geometric distribution). The double geometric distribution parameterized by $p \in (0,1)$ is defined on the support of all integers as $P(z \mid p) = \frac{p}{2-p}(1-p)^{|z|}$. Algorithm to make Density Sketches private: i.i.d. noise drawn from the double geometric distribution is added to each cell of the sketch ($K \times R$ matrix). We will prove that this noise addition makes the function $M = f + \text{noise}$ differentially private. Heap construction can be considered a post-processing operation on the density sketch matrix; hence, the sampling distribution is then also differentially private.
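As a concrete illustration, here is a small Python sketch of the noise addition (our own code and naming, not the paper's implementation): the double geometric variable is sampled as the difference of two i.i.d. geometric variables, which yields exactly the pmf of Definition 1, and one noise draw is added per sketch cell.

```python
import math
import random

def sample_double_geometric(p, rng):
    """Z = X - Y with X, Y i.i.d. Geometric(p) on {0, 1, 2, ...} has
    P(Z = z) = p (1 - p)^|z| / (2 - p): exactly Definition 1."""
    def failures_before_success():
        k = 0
        while rng.random() >= p:
            k += 1
        return k
    return failures_before_success() - failures_before_success()

def privatize_sketch(sketch, epsilon, K, rng):
    """Add i.i.d. double geometric noise to every cell of the K x R
    count sketch array, with p = 1 - exp(-epsilon / K) as in Theorem 2."""
    p = 1.0 - math.exp(-epsilon / K)
    return [[c + sample_double_geometric(p, rng) for c in row]
            for row in sketch]

rng = random.Random(0)
K, R = 4, 8
sketch = [[10] * R for _ in range(K)]       # toy K x R count sketch
noisy = privatize_sketch(sketch, epsilon=1.0, K=K, rng=rng)
```

The noise is symmetric around zero, so the noisy sketch remains an unbiased version of the original counts.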



The histogram divides the support $S \subset \mathbb{R}^d$ of the data into multiple partitions. It then uses the counts in every partition to predict the density, $\hat{f}_H(x)$, at a point $x$. Formally, the density predicted at a point $x \in S$ is given by $\hat{f}_H(x) = \frac{C(\mathrm{bin}(x))}{n\,V(\mathrm{bin}(x))}$, where $\mathrm{bin}(x)$ identifies the partition of $x$, and $C(b)$ and $V(b)$ measure the number of samples in partition $b$ and the volume of partition $b$, respectively. $\hat{f}_H(x)$ integrates to 1, and hence $\hat{f}_H(x)$ is also an estimate of the underlying density function $f(x)$. Regular histograms use hyper-cube partitions of width $B$ aligned with the data axes. As $B$ increases, the bias of the estimate increases and its variance decreases. Histograms suffer from bin-edge problems, where a slight change in data across a bin's edge can change predictions significantly. Kernel Density Estimation (KDE): KDE provides a smoother estimate of $f(x)$, which resolves the bin-edge problem of histograms. For a positive semi-definite kernel function $k(x, y): \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$
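The histogram estimator above can be sketched in a few lines. This is a minimal 1-d toy of our own (names and data are ours): it bins samples, evaluates $\hat{f}_H(x) = C(\mathrm{bin}(x))/(n\,V(\mathrm{bin}(x)))$, and checks that the estimate integrates to 1.

```python
import math
from collections import Counter

def build_histogram(samples, B):
    """Count samples per width-B bin; bin(x) = floor(x / B)."""
    return Counter(math.floor(x / B) for x in samples)

def density(counts, n, B, x):
    """f_H(x) = C(bin(x)) / (n * V(bin(x))), with V(b) = B in 1-d."""
    return counts[math.floor(x / B)] / (n * B)

samples = [0.05, 0.15, 0.18, 0.42, 0.45, 0.48, 0.51]
counts = build_histogram(samples, B=0.1)
n = len(samples)
# the estimate integrates to 1: sum over bins of (count / (n*B)) * B
total = sum(c / (n * 0.1) * 0.1 for c in counts.values())
```

Moving the sample at 0.48 to 0.52 shifts one count across a bin edge and changes the estimate at 0.45 discontinuously, which is the bin-edge problem mentioned above.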

Figure 1: Countsketch, sketching, and query

Figure 2: Overview of the sketching algorithm

The histogram has an exponential (in $d$) number of partitions; hence, in high dimensions, it is impractical to store histograms. However, most high-dimensional real data is clustered and thus has highly sparse histograms. This does not by itself help with histograms, as post-pruning of histograms still requires us to build and enumerate them. Nevertheless, the sparsity of the histogram makes it a good candidate for heavy-hitter techniques. We use a count sketch (CS), M, to store a compressed version of the histogram. Unfortunately, sampling with just M does not have an efficient solution, so we maintain a set of heavy partitions for sampling in the min-heap H. We discuss sampling in later subsections.

Algorithm 1: Constructing the density sketch of f(x)
Result: Density Sketch (DS)
Input: f(x): R^d → R: true distribution; x_1, ..., x_n ∼ f(x): samples drawn from f(x); bin(x): S → N^d: partition function; M: CS with range R and repetitions K; H(H): min-heap to store the top H partitions
for i ← 1 to n do
    b = bin(x_i)
    M.insert(b, 1)
    c = M.query(b)
    H.update(b, c)
end

Algorithm 2: query f̂_C(y), y ∈ R^d
Result: f̂_C(y)
b = bin(y)
c = M.query(b)
return c / (n V(b))

Algorithm 3: sample y ∈ R^d such that y ∼ f̂_S(x)
Result: y: a sample from f̂_S
P: categorical distribution over bins s.t.
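The three procedures can be rendered as a toy 1-d Python class. This is our own simplification, not the paper's code: an exact dictionary stands in for the count sketch M, and the top-H heavy partitions are recomputed with `heapq` on demand rather than maintained incrementally in a min-heap.

```python
import heapq
import math
import random

class DensitySketch:
    """Toy 1-d rendering of Algorithms 1-3. An exact dict stands in
    for the count sketch M (our simplification for brevity)."""
    def __init__(self, B, H):
        self.B = B          # partition (bandwidth) width
        self.H = H          # number of heavy partitions kept
        self.M = {}         # bin -> count (count sketch stand-in)
        self.n = 0

    def _bin(self, x):
        return math.floor(x / self.B)

    def insert(self, x):    # Algorithm 1 loop body
        b = self._bin(x)
        self.M[b] = self.M.get(b, 0) + 1
        self.n += 1

    def query(self, y):     # Algorithm 2: f_C(y) = c / (n * V(b))
        c = self.M.get(self._bin(y), 0)
        return c / (self.n * self.B)

    def sample(self, rng):  # Algorithm 3: categorical over heavy bins
        heavy = heapq.nlargest(self.H, self.M.items(), key=lambda kv: kv[1])
        total = sum(c for _, c in heavy)
        r = rng.random() * total
        for b, c in heavy:
            r -= c
            if r <= 0:
                # uniform within the chosen partition [b*B, (b+1)*B)
                return (b + rng.random()) * self.B
        return (heavy[-1][0] + rng.random()) * self.B

rng = random.Random(0)
ds = DensitySketch(B=0.1, H=3)
for _ in range(1000):
    ds.insert(rng.gauss(0.5, 0.05))
y = ds.sample(rng)          # lands in a heavy partition near 0.5
```

Samples are drawn from the categorical distribution over heavy partitions and then uniformly within the chosen partition, which matches the piecewise-constant sampling distribution f̂_S.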

Figure 4: Visualization of samples drawn from Density Sketch. (a-b) DS captures density information. (c) Higher-power LSH functions lead to a fine DS behaving like a random sample. (d) Coarser partitions lead to samples that resemble an "average" of the samples in the data.

Figure 5: (a-d) Performance of density sketches, Liberty coresets, and random samples on a downstream classification task. The bar widths refer to 2× std. DS performs consistently better at all memory sizes. (e) Performance of DS is stable across K, R configurations for a decent budget KR = 1,000,000. (f) An optimal B exists for DS performance; larger and smaller values of B lead to performance degradation.

dimensionality reduction (it can be d× at best, where d is the dimension). So we cannot compare against this method. For a more detailed discussion of baselines, refer to Appendix C. Results: The DS has the following parameters: the partition function (bandwidth B; we use L2-LSH partitions), the sketch parameters K and R, and the heap parameter H in all our experiments. The memory of the DS used for sampling is affected only by the heap parameter (see Appendix F for details on memory computation). In Figure 5(f), we use the configuration (K=4, R=250,000, H=100,000) and vary B. It is clear from the figure that B=0.001 works best for these datasets; lower and higher values of B affect the performance adversely. A larger B implies that we capture more space than needed in a single partition, and a smaller B implies that we capture less data in the heap, so a sweet spot for B is expected. In Figure 5(e), we fix (B=0.01, H=100,000, KR=1,000,000) and vary K from 1 to 64. We can see that, for a reasonable memory budget, the results are stable with varying K. For the experiments in Figure 5(a-d), we fix (B=0.01, K=5, R=250,000) and vary H, which gives us DS of different sizes. We plot test accuracy and losses for DS, random samples, and Liberty coresets for different amounts of memory used. The width of the band signifies 2× the std-dev of performance over three independent runs. As can be seen, for the "Higgs" dataset, the model accuracy achieved on the original data of size 2.5GB can be closely reached using a DS of size 50MB, i.e., around 50× compression. We see similar results for the "Susy" dataset (100× compression, 0.8GB) as well. The results show that DS is much more informative than random samples and Liberty coresets. For more details on running the experiments (data processing, memory measurements, etc.), refer to Appendix F.

THEOREM 3: $\hat{f}_C$ TO $\hat{f}_H$. While estimating a true distribution $f(x): \mathbb{R}^d \to \mathbb{R}$, the integrated mean square error (IMSE) for the estimator $\hat{f}_C(x)$, using a regular histogram with width $h$, $n$ samples, and a count sketch with range $R$, $K$ repetitions, and mean recovery, is

$$\mathrm{IMSE}(\hat{f}_C) = \mathrm{IMSE}(\hat{f}_H) + \frac{n_{\mathrm{nzp}}-1}{KRnh^d},$$

where $n_{\mathrm{nzp}}$ is the number of non-zero partitions. Specifically, we have

$$\mathrm{IV}(\hat{f}_C) = \mathrm{IV}(\hat{f}_H) + \frac{n_{\mathrm{nzp}}-1}{KRnh^d} \quad\text{and}\quad \mathrm{ISB}(\hat{f}_C) = \mathrm{ISB}(\hat{f}_H),$$

where $n_{\mathrm{nzp}}$ is the number of non-zero bins/partitions.

Let $\hat{n} = \sum_b \hat{C}(b)$ and $n = \sum_b C(b)$. $\hat{n}$ and its relation to $n$: let us first analyse $\hat{n}$ and how it is related to $n$.

bin(x) for different partitioning schemes




Plugging these values into the previous equation gives a bound on $\mathrm{Var}(\hat{n})$; applying Chebyshev's inequality, we have that with probability $(1-\delta)$, where $\delta = \frac{n_{\mathrm{nzp}}}{\epsilon^2 nR}$, $\hat{n}$ is within $\epsilon$ multiplicative error of $n$. Relation of pointwise bias and ISB: with probability $1-\delta$, $\hat{f}^*_C(x)$ lies within a $(1\pm\epsilon)$ multiplicative factor of $\hat{f}_C(x)$. As expectations respect inequalities, and integrating the expressions again respects inequalities, a first-order Taylor expansion of $\frac{1}{1+\epsilon}$ (ignoring $\epsilon^2$ terms) yields the bias bounds stated in Theorem 4.

(Note that the heap construction algorithm also needs to be modified in practical settings to ensure that it carries the differential privacy properties, but this is achievable.) Theorem 2 (Differential privacy). The density sketch constructed with the addition of double geometric noise with $p = 1 - e^{-\epsilon/K}$, where $K$ is the number of repetitions in the sketch, is $(\epsilon, 0)$-differentially private.

Proof. Consider the $\ell_1$ metric for computing the distance between datasets, and any arbitrary pair $x, y$ satisfying $\|x - y\|_1 = 1$. In the histogram view of the data, it is easy to check that a distance of 1 can exist if and only if there is one additional row in either $x$ or $y$ and all other data points are the same. Without loss of generality, we can write $x = y \cup \{d\}$, where $d$ is the extra data point. As the constructed count sketch does not depend on the order of insertion, we can say that the count sketch for $x$, i.e. $f(x)$, is obtained from the count sketch for $y$ by sketching the additional data point into it. Also, because of the count sketch's mergeability property, we can write $f(x) = f(y) + f(\{d\})$. As sketching a single entry changes exactly one element of each row of the count sketch by 1, $\|f(\{d\})\|_1 = K$.
Hence the sensitivity of the function $f$ is $\Delta f = K$. We use the double geometric distribution as defined above for the noise. Now let us consider the privacy achieved with this noise. Let $M$ be the final randomized algorithm composed of computing $f$ and adding noise. We are interested in the ratio $\frac{P(M(x)=z)}{P(M(y)=z)}$ for $x, y$ such that $\|x - y\|_1 = 1$. Since each cell receives independent double geometric noise,

$$\frac{P(M(x)=z)}{P(M(y)=z)} = \prod_{k,r} (1-p)^{|z_{kr} - f(x)_{kr}| - |z_{kr} - f(y)_{kr}|} \le (1-p)^{-\|f(x)-f(y)\|_1},$$

where the inequality holds because the $\ell_1$ norm is a distance metric (triangle inequality). If we put $p = 1 - e^{-\epsilon/\Delta f}$, then $(1-p)^{-\|f(x)-f(y)\|_1} = e^{\epsilon\|f(x)-f(y)\|_1/\Delta f} \le e^{\epsilon}$. Hence $M(x)$ is $(\epsilon, 0)$-differentially private. Therefore, the count sketch produced by the sketching algorithm with added double geometric noise is $(\epsilon, 0)$-differentially private when we have $p = 1 - e^{-\epsilon/K}$.

Why is the heap differentially private? If the data is bounded in $\mathbb{R}^d$ ($d$ is the dimension of the data), then it is easy to check that there is a cell in $\mathbb{R}^d$ which contains all the data; it follows that the number of partitions inside this cell is finite. So we can consider heap construction as iteratively going through each partition and noting down its count; once we do that, we sort all the partitions according to the counts and keep the top $H$ elements. In this sense, we can consider heap construction as post-processing over the count sketch. From Proposition 2.1 of Dwork & Roth, we know that post-processing maintains differential privacy; hence the heap we create is $(\epsilon, 0)$-differentially private.
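The $(\epsilon, 0)$ bound can also be checked numerically. The snippet below is our own sanity check (not part of the paper): it computes the worst-case single-cell likelihood ratio of the double geometric pmf under a unit shift and raises it to the $K$ cells that change between neighboring datasets, recovering $e^{\epsilon}$.

```python
import math

def dg_pmf(z, p):
    """Double geometric pmf from Definition 1: p (1-p)^|z| / (2-p)."""
    return p * (1.0 - p) ** abs(z) / (2.0 - p)

K, eps = 4, 1.0
p = 1.0 - math.exp(-eps / K)          # Theorem 2's noise parameter
# neighboring datasets shift one cell in each of the K rows by 1,
# so the worst-case per-cell ratio is pmf(z) / pmf(z - 1)
worst = 0.0
for z in range(-50, 51):
    worst = max(worst, dg_pmf(z, p) / dg_pmf(z - 1, p))
max_total = worst ** K                # K independently noised cells
```

The per-cell worst case is $(1-p)^{-1} = e^{\epsilon/K}$, so the product over the $K$ changed cells is exactly $e^{\epsilon}$, matching the proof.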

