DIFFERENTIALLY PRIVATE L2-HEAVY HITTERS IN THE SLIDING WINDOW MODEL

Abstract

The data management of large companies often prioritizes more recent data, as a source of more accurate predictions than outdated data. For example, the Facebook data policy retains user search histories for 6 months, while the Google data retention policy states that browser information may be stored for up to 9 months. These policies are captured by the sliding window model, in which only the most recent W statistics form the underlying dataset. In this paper, we consider the problem of privately releasing the L2-heavy hitters in the sliding window model, which include the Lp-heavy hitters for p ≤ 2 and in some sense are the strongest guarantees achievable using polylogarithmic space, but which cannot be handled by existing techniques due to the sub-additivity of the L2 norm. Moreover, existing non-private sliding window algorithms use the smooth histogram framework, which has high sensitivity. To overcome these barriers, we introduce the first differentially private algorithm for L2-heavy hitters in the sliding window model by initiating a number of L2-heavy hitter algorithms across the stream with significantly lower thresholds. Similarly, we augment these algorithms with an approximate frequency tracking algorithm with significantly higher accuracy. We then use smooth sensitivity and statistical distance arguments to show that we can add noise proportional to an estimate of the L2 norm. To the best of our knowledge, our techniques are the first to privately release statistics related to a sub-additive function in the sliding window model, and they may be of independent interest for future differentially private algorithm design in the sliding window model.

1. INTRODUCTION

Differential privacy (Dwork, 2006; Dwork et al., 2016) has emerged as the standard for privacy in both the research and industrial communities. For example, Google Chrome uses RAPPOR (Erlingsson et al., 2014) to collect user statistics such as the default homepage or the default search engine of the browser, Samsung proposed a similar mechanism to collect numerical answers such as the time of usage and battery volume (Nguyên et al., 2016), and Apple uses a differentially private method (Greenberg, 2016) to generate spelling predictions. The age of collected data can significantly impact its relevance for predicting future patterns, as the behavior of groups or individuals may change substantially over time due to cyclical, temporary, or permanent changes. Indeed, recent data is often a more accurate predictor than older data across multiple sources of big data, such as stock markets or Census data, a fact that is often reflected in the data management of large companies. For example, the Facebook data policy (Facebook) retains user search histories for 6 months, the Apple differential privacy policy (Upadhyay, 2019) states that collected data is retained for 3 months, the Google data retention policy (Google) states that browser information may be stored for up to 9 months, and more generally, large data collection agencies often perform analysis and release statistics on time-bounded data. However, since large data collection agencies often manage highly sensitive data, these statistics must be released in a way that does not compromise privacy. Thus, in this paper, we study the (event-level) differentially private release of statistics of time-bounded data using space sublinear in the size of the data.

Definition 1.1 (Differential privacy (Dwork et al., 2016)). Given ε > 0 and δ ∈ (0, 1), a randomized algorithm A operating on datastreams is (ε, δ)-differentially private if, for every pair of
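To make the underlying streaming primitive concrete, the following is a minimal, non-private sketch of the classical CountSketch estimator (Charikar, Chen, and Farach-Colton), the standard small-space building block for L2-heavy hitters. It is purely illustrative: it handles neither the sliding window nor privacy, and all parameter choices (5 rows, 256 columns) are arbitrary assumptions rather than settings from this paper.

```python
import random

class CountSketch:
    """Illustrative (non-private) CountSketch frequency estimator.

    Each row hashes items to buckets with a random sign; the median of
    the signed per-row counters estimates an item's frequency with error
    proportional to the L2 norm of the residual stream. This is the kind
    of L2 guarantee the paper's algorithms build on, not the paper's
    private sliding-window algorithm itself.
    """

    def __init__(self, rows=5, cols=256, seed=0):
        rng = random.Random(seed)
        self.rows, self.cols = rows, cols
        # Independent seeds per row: one for the bucket hash, one for the sign.
        self.seeds = [(rng.randrange(2**31), rng.randrange(2**31))
                      for _ in range(rows)]
        self.table = [[0] * cols for _ in range(rows)]

    def _bucket_sign(self, r, x):
        hseed, sseed = self.seeds[r]
        bucket = hash((hseed, x)) % self.cols
        sign = 1 if hash((sseed, x)) % 2 == 0 else -1
        return bucket, sign

    def update(self, x):
        # Process one stream element: add its random sign to one bucket per row.
        for r in range(self.rows):
            b, s = self._bucket_sign(r, x)
            self.table[r][b] += s

    def estimate(self, x):
        # Median of the signed per-row counters for x.
        vals = []
        for r in range(self.rows):
            b, s = self._bucket_sign(r, x)
            vals.append(s * self.table[r][b])
        vals.sort()
        return vals[len(vals) // 2]
```

Feeding the sketch a stream in which one item dominates, e.g. one item occurring 1000 times against a tail of 500 distinct singletons, returns an estimate for the dominant item close to 1000 using only rows × cols counters, which is how a heavy-hitter threshold test can be run in sublinear space.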

