DENSITY SKETCHES FOR SAMPLING AND ESTIMATION

Anonymous authors
Paper under double-blind review

Abstract

There has been an exponential increase in the data generated worldwide. Insights into this data, led by machine learning (ML), have given rise to exciting applications such as recommendation engines and conversational agents. Often, data for these applications is generated faster than ML pipelines can consume it. In this paper, we propose Density Sketches (DS), a cheap and practical approach to reducing data redundancy in a streaming fashion. DS creates a succinct online summary of the data distribution. While DS does not store samples from the stream, we can sample unseen data on the fly from DS for downstream learning tasks. In this sense, DS can replace actual data in many machine learning pipelines, analogous to generative models. Importantly, unlike generative models, which lack statistical guarantees, the sampling distribution of DS asymptotically converges to the underlying unknown density. Additionally, DS is a one-pass algorithm that can be computed over data streams in compute- and memory-constrained environments, including edge devices.

1. INTRODUCTION

With the advent of big data, the rate of data generation is exploding. For instance, Google receives around 3.8 million search queries per minute, amounting to over 5 billion data points, or terabytes of data, generated daily. Any processing of this data, such as training a recommendation model, suffers from this data explosion: by the time existing data is consumed, newer data is already available. In such cases, we need to discard a lot of data, and reducing data storage is a critical research direction.

In this paper, we present Density Sketches (DS): an efficient, online data structure for reducing redundancy in data. Data often comes from an underlying unknown distribution, and one of the challenges in data reduction is preserving this distribution. DS approximately stores the data distribution in the form of a sketch. Using DS, we can answer point-wise density estimation queries. Additionally, we can sample synthetic data from the sketch for use in downstream machine learning tasks. This paper shows that data sampled from DS asymptotically converges to the underlying unknown distribution.

We can also view density sketches through the lens of coresets. Specifically, DS is a compressed version of grid coresets. Grid coresets are the oldest form of coresets, giving lower additive errors than modern coresets. However, grid coresets are generally prohibitive, as their size is exponential in the dimension d. DS enables us to approximate grid coresets with memory usage that depends on the actual variety in the data instead of being exponential in d. DS also provides a streaming construction for this coreset. In this paper, we focus on the density estimation and sampling aspects of DS.

Sampling from a distribution described by data requires estimating the underlying distribution. Popular methods to infer the distribution and sample from it fall into three categories:
1. Parametric density estimation (Friedman et al., 2001).
2. Non-parametric estimation: histograms and kernel density estimators (KDE) (Scott, 2015).
3. Learning-based approaches such as variational autoencoders (VAEs), generative adversarial networks (GANs), and related methods (Goodfellow et al., 2014; 2016).

Parametric estimation is generally unsuitable for modeling most real data, as it can incur significant, unavoidable bias from the choice of model (Scott, 2015). Learning the distribution, e.g., via neural networks, is one solution to this problem. Although learning-based methods have recently found remarkable success, they offer no theoretical guarantees on the distribution of generated samples. Histograms and KDEs, on the other hand, are theoretically well understood: these statistical estimators of density are known to converge uniformly to the underlying true distribution almost surely. This paper focuses on such estimators with theoretical guarantees. Storing histograms and sampling from them is expensive because of the exponential number of partitions (also known as bins). Histograms also suffer from the bin-edge problem: a slight variation in data can lead to significant differences in the estimated densities. KDEs solve the bin-edge problem and give a smoother estimate of density. While sampling from a KDE is efficient, a KDE is expensive to store: it requires storing the entire dataset. Coresets for KDE are a good solution to the storage problem, but coreset construction is typically quite expensive. In this work, we propose Density Sketches (DS), a compressed sketch of density constructed in an efficient streaming manner. DS does not store actual samples of the data, yet we can still efficiently produce samples from a KDE for specific kernels, which, in turn, approximates f(x).
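To make the KDE sampling step concrete: a KDE is a uniform mixture of kernels centred at the data points, so drawing a sample reduces to picking a centre at random and adding kernel-distributed noise. The sketch below illustrates this for a Gaussian kernel; the function name and bandwidth are illustrative, and this is the standard KDE sampling idea, not the paper's DS construction (which replaces stored points with a sketch).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_kde(points, bandwidth, n_samples):
    """Draw samples from a Gaussian KDE built on `points`.

    The KDE is a uniform mixture of Gaussians centred at the data
    points, so sampling = pick a centre uniformly, then add
    N(0, bandwidth^2) noise in each coordinate.
    """
    points = np.asarray(points, dtype=float)
    idx = rng.integers(0, len(points), size=n_samples)          # mixture components
    noise = rng.normal(0.0, bandwidth, size=(n_samples, points.shape[1]))
    return points[idx] + noise

# Usage: 2-D data, bandwidth h = 0.5
data = rng.normal(size=(1000, 2))
samples = sample_from_kde(data, bandwidth=0.5, n_samples=100)
```

Note that this direct approach requires keeping all of `data`, which is exactly the storage cost DS is designed to avoid.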
Being a compressed sketch, DS offers a tunable accuracy-storage trade-off, which we analyze in Theorem 1.

2. PROBLEM STATEMENT AND RELATED WORK

Problem Statement: Formally, we want to create a data structure with the following properties: (1) it sketches density information; (2) the sketch size is much smaller than the data size and does not scale linearly with it; (3) the construction is streaming and efficient; (4) it stores no actual samples (for privacy reasons); and (5) the sampling distribution, say f_S(x), obtained by sampling from the data structure approximates the true underlying distribution f(x).

The problem we aim to solve is a data reduction problem and has been widely pursued in the literature. Existing approaches can be broadly classified into two categories. (1) Sampling based / coresets: Approaches such as clustering/importance sampling (Charikar & Siminelakis, 2017; Cortes & Scott, 2016; Chen et al., 2012) and coresets for KDE (Phillips & Tai, 2020; 2018) fall under this category. These approaches aim to find a small set of possibly weighted samples, for a specific objective function, such that applying the function to this small set yields a result within a small approximation error of applying it to the complete dataset. The issue with these approaches is efficiency: most of these algorithms require complicated computation over the entire data. Streaming algorithms were recently proposed for KDE coresets (Karnin & Liberty, 2019). However, even these algorithms perform O(m) computationally expensive operations per sample over chunks of size m (the compactor size), making them unsuitable for our purposes. (2) Dimensionality reduction: These approaches aim to reduce the width of the data matrix. Methods such as Principal Component Analysis (PCA) are computationally expensive and require iterative computation over the entire dataset. Random projections are an efficient streaming algorithm for dimensionality reduction. However, the compressed data they produce still grows linearly with the original data size. As we can see, existing approaches fall short of the requirements in our problem statement.
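The linear-growth limitation of random projections can be seen in a minimal sketch (dimensions and variable names are illustrative): each streaming sample is compressed independently through a fixed projection matrix, but one reduced vector must still be retained per sample.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 1000, 50                                       # original and reduced dimensions
P = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))   # fixed Gaussian projection

def project(x):
    """Reduce one streaming sample from d to k dimensions.

    Each sample is compressed independently in O(k*d) time, so the
    method is streaming-friendly -- but one k-dimensional vector is
    kept per sample, so storage still grows linearly with the stream.
    """
    return P @ x

# Simulate a short stream of 10 samples.
compressed = [project(rng.normal(size=d)) for _ in range(10)]
```

By contrast, a density sketch aims to keep a summary whose size is decoupled from the stream length.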



3.1 HISTOGRAMS AND KERNEL DENSITY ESTIMATION

Histograms and KDE (Scott, 2015; Scott & Sain, 2004) are popular methods to estimate the density of a distribution given a finite i.i.d. sample of n points in R^d drawn from the true density, say f(x).

Histogram: A histogram divides the support S ⊂ R^d of the data into multiple partitions. It then uses the counts in every partition to predict the density, f_H(x), at a point x. Formally, the density predicted at a point x ∈ S is given by

f_H(x) = C(bin(x)) / (n · V(bin(x))),

where bin(x) identifies the partition of x, and C(b) and V(b) denote the number of samples in partition b and the volume of partition b, respectively. f_H(x) integrates to 1 and is hence also an estimate of the underlying density function f(x). Regular histograms use hyper-cube partitions of width B aligned with the data axes. As B increases, the bias of the estimate increases and its variance decreases. Histograms suffer from the bin-edge problem, where a slight change in data across a bin's edge can change predictions significantly. Kernel Density Estimation (KDE): KDE provides a smoother estimate of f(x), which resolves the bin-edge problem of histograms. For a positive semi-definite kernel function k(x, y) : R^d × R^d → R
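The histogram estimator f_H(x) = C(bin(x)) / (n · V(bin(x))) can be sketched directly; the code below (function name and bin width are illustrative) uses axis-aligned hyper-cube bins of side `width`, so V(b) = width^d.

```python
import numpy as np

def histogram_density(data, x, width):
    """Estimate f_H(x) with axis-aligned hyper-cube bins of side `width`.

    bin(x) is identified by integer grid coordinates floor(x / width);
    C(b) is the count of samples falling in x's bin, and V(b) = width**d.
    """
    data = np.asarray(data, dtype=float)
    n, d = data.shape
    bins = np.floor(data / width).astype(int)             # grid id per sample
    target = np.floor(np.asarray(x, dtype=float) / width).astype(int)
    count = int((bins == target).all(axis=1).sum())       # C(bin(x))
    return count / (n * width**d)                         # C / (n * V)

rng = np.random.default_rng(2)
data = rng.uniform(0, 1, size=(10000, 2))
# The uniform density on [0,1]^2 is 1, so the estimate should be close to 1.
est = histogram_density(data, x=[0.5, 0.5], width=0.1)
```

Note the bin-edge problem in this sketch: a point at 0.4999 and one at 0.5001 fall in different bins, even though they are nearly identical.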

