ALMOST LINEAR CONSTANT-FACTOR SKETCHING FOR ℓ 1 AND LOGISTIC REGRESSION

Abstract

We improve upon previous oblivious sketching and turnstile streaming results for $\ell_1$ and logistic regression, giving a much smaller sketching dimension that achieves an $O(1)$-approximation and yields an efficient optimization problem in the sketch space. Namely, for any constant $c > 0$ we achieve a sketching dimension of $\tilde{O}(d^{1+c})$ for $\ell_1$ regression and $\tilde{O}(\mu d^{1+c})$ for logistic regression, where $\mu$ is a standard measure that captures the complexity of compressing the data. For $\ell_1$ regression our sketching dimension is near-linear and improves upon previous work, which either required an $\Omega(\log d)$-approximation with this sketching dimension or required a larger $\mathrm{poly}(d)$ number of rows. Similarly, for logistic regression previous work had worse $\mathrm{poly}(\mu d)$ factors in its sketching dimension. We also give a tradeoff that yields a $(1+\varepsilon)$-approximation in input sparsity time by increasing the total size to $(d \log(n)/\varepsilon)^{O(1/\varepsilon)}$ for $\ell_1$ and to $(\mu d \log(n)/\varepsilon)^{O(1/\varepsilon)}$ for logistic regression. Finally, we show that our sketch can be extended to approximate a regularized version of logistic regression where the data-dependent regularizer corresponds to the variance of the individual logistic losses.

1. INTRODUCTION

We consider logistic regression in distributed and streaming environments. A key tool for solving these problems is a distribution over random oblivious linear maps $S \in \mathbb{R}^{r \times n}$ with the property that, for a given $n \times d$ matrix $X$ (where we assume the labels for the rows of $X$ have been multiplied into $X$), one can efficiently and approximately solve the logistic regression problem given only $SX$. The fact that $S$ does not depend on $X$ is what is referred to as $S$ being oblivious, which is important in distributed and streaming tasks since one can choose $S$ without first needing to read the input data. The fact that $S$ is a linear map is also important for such tasks, since given $SX^{(1)}$ and $SX^{(2)}$, one can add these to obtain $S(X^{(1)} + X^{(2)})$, which allows for positive or negative updates to entries of the input in a stream, or across multiple servers in the arbitrary partition model of communication; see, e.g., (Woodruff, 2014) for a discussion of data stream and communication models. An important goal is to minimize the sketching dimension $r$ of the sketching matrix $S$, as this translates into the memory required by a streaming algorithm and the communication cost of a distributed algorithm. At the same time, one would like the approximation factor obtained via this approach to be as small as possible. Specifically, we develop and improve oblivious sketching for the most important robust linear regression variant, namely $\ell_1$ regression, and for logistic regression, which is a generalized linear model of high importance for binary classification and estimation of Bernoulli probabilities. Sketching supports very fast updates, which is desirable for performing robust and generalized regression in high-velocity data processing applications, for instance in physical experiments and other resource-constrained settings, cf. (Munteanu et al., 2021; Munteanu, 2023). We focus on the case where the number $n$ of data points is very large, i.e., $n \gg d$.
In this case, applying a standard algorithm directly is not a viable option since it is either too slow or even becomes impossible when it requires more memory than we can afford. Following the sketch & solve paradigm (Woodruff, 2014) , our goal is in a first step to reduce the size of the data without losing too much information. Then, in a second step, we approximate the problem efficiently on the reduced data.
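The role of linearity described above can be checked on a toy example. The following is a minimal sketch, assuming a single-level bucket-sum map (each input row is added into one randomly chosen bucket); the helper names `make_hash`, `sketch`, and `add` are hypothetical and this is not the multi-level sketch analyzed in the paper. It only illustrates that two sketches computed with the same oblivious map can be merged by addition.

```python
import random

def make_hash(n, num_buckets, seed):
    """Assign one bucket to every row index before any data is seen (oblivious)."""
    rng = random.Random(seed)
    return [rng.randrange(num_buckets) for _ in range(n)]

def sketch(X, bucket_of, num_buckets):
    """Compute S X for the 0/1 matrix S that adds row i of X into bucket bucket_of[i]."""
    d = len(X[0])
    SX = [[0.0] * d for _ in range(num_buckets)]
    for i, row in enumerate(X):
        b = bucket_of[i]
        for j, v in enumerate(row):
            SX[b][j] += v
    return SX

def add(A, B):
    """Entrywise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

n, d, r = 8, 2, 3
bucket_of = make_hash(n, r, seed=0)

# Two integer-valued update matrices, e.g. held by two servers or arriving
# as positive and negative updates in a turnstile stream.
rng = random.Random(1)
X1 = [[rng.randint(-5, 5) for _ in range(d)] for _ in range(n)]
X2 = [[rng.randint(-5, 5) for _ in range(d)] for _ in range(n)]

merged = add(sketch(X1, bucket_of, r), sketch(X2, bucket_of, r))
direct = sketch(add(X1, X2), bucket_of, r)
linear = (merged == direct)  # S X1 + S X2 equals S (X1 + X2)
```

Because the bucket assignment is fixed in advance and the map only sums rows, merging never requires access to the original data.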

Sketch & solve principle:

1. Calculate a small sketch $SX$ of the data $X$.
2. Solve the problem $\tilde\beta = \operatorname{argmin}_\beta f(SX\beta)$ using a standard optimization algorithm.

The theoretical analysis proves that the sketch in the first step is calculated in such a way that the solution obtained in the second step is a good approximation to the original problem, i.e., that $f(X\tilde\beta) \le C \cdot \min_\beta f(X\beta)$ holds for a small constant factor $C \ge 1$.
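The two steps can be tried out on a one-dimensional toy instance of $\ell_1$ regression. The following is a minimal sketch under several assumptions: $d = 1$, a single-level bucket-sum sketch instead of the multi-level construction of the paper, synthetic data, and hypothetical helper names (`l1_cost`, `solve_l1_1d`, `bucket_sum_sketch`). For $d = 1$ the problem $\min_\beta \sum_i |x_i\beta - y_i|$ is solved exactly by a weighted median of the ratios $y_i/x_i$ with weights $|x_i|$. The code computes the approximation ratio but makes no claim about its value.

```python
import random

def l1_cost(xs, ys, beta):
    """Objective sum_i |x_i * beta - y_i| on the given (possibly sketched) data."""
    return sum(abs(x * beta - y) for x, y in zip(xs, ys))

def solve_l1_1d(xs, ys):
    """Exact minimizer for d = 1: weighted median of y_i / x_i with weights |x_i|."""
    pts = sorted((y / x, abs(x)) for x, y in zip(xs, ys) if x != 0)
    if not pts:
        return 0.0
    half = sum(w for _, w in pts) / 2.0
    acc = 0.0
    for ratio, w in pts:
        acc += w
        if acc >= half:
            return ratio
    return pts[-1][0]

def bucket_sum_sketch(xs, ys, num_buckets, rng):
    """Step 1: hash every row (x_i, y_i) to one bucket and sum within buckets."""
    sx = [0.0] * num_buckets
    sy = [0.0] * num_buckets
    for x, y in zip(xs, ys):
        b = rng.randrange(num_buckets)  # chosen independently of the data values
        sx[b] += x
        sy[b] += y
    return sx, sy

rng = random.Random(0)
n = 2000
xs = [rng.uniform(0.5, 1.5) for _ in range(n)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]  # synthetic data, an assumption

sx, sy = bucket_sum_sketch(xs, ys, num_buckets=50, rng=rng)
beta_sketch = solve_l1_1d(sx, sy)   # Step 2: solve in the sketch space
beta_opt = solve_l1_1d(xs, ys)      # reference solution on the full data

# Approximation ratio C of the sketch solution, evaluated on the original data.
ratio = l1_cost(xs, ys, beta_sketch) / l1_cost(xs, ys, beta_opt)
```

By optimality of `beta_opt`, the ratio is at least 1; how close it is to 1 depends on the data and the number of buckets.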

1.1. OUR CONTRIBUTIONS

For logistic regression our goal is to achieve an $O(1)$-approximation with an efficient estimator in the sketch space and the smallest possible sketching dimension in terms of $\mu$ and $d$, where
$$\mu = \mu(X) = \sup_{\beta \neq 0} \frac{\sum_{x_i\beta > 0} |x_i\beta|}{\sum_{x_i\beta < 0} |x_i\beta|}$$
is a data-dependent parameter that captures the complexity of compressing the data for logistic regression, see Definition 2.1. As a byproduct of our algorithms, we also obtain algorithms for $\ell_1$ regression. We note that the parameter $\mu$ is necessary only for logistic regression, i.e., for sketching $\ell_1$ regression, we set $\mu = 1$. We summarize our contributions as follows:

1) We significantly improve the sketch of Munteanu et al. (2021). More precisely, we show with minor modifications in their algorithm but major modifications in the analysis that the size of the sketch can be reduced from roughly $\tilde{O}(\mu^7 d^5)$ to $\tilde{O}(\mu d^{1+c})$ for any $c > 0$, while preserving an $O(1)$ approximation to either the logistic or $\ell_1$ loss.
2) We show that increasing the sketching dimension to $(\mu d \log(n)/\varepsilon)^{O(1/\varepsilon)}$ is sufficient to obtain a $1 + \varepsilon$ approximation guarantee.
3) We show that our sketch can also approximate variance-based regularized logistic regression within an $O(1)$ factor if the dependence on $n$ in the sketching dimension is increased to $n^{0.5+c}$ for any $c > 0$. We also give an example corroborating that the CountMin-sketch that we use needs at least $\Omega(n^{0.5})$ rows to achieve an approximation guarantee below $\log_2(\mu)$.
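The ratio inside the supremum that defines $\mu$ can be evaluated for any fixed $\beta$, which gives a lower bound on $\mu(X)$. The following is a minimal sketch; the function name `mu_ratio` and the tiny example matrix are assumptions made for illustration, and the optional exponent `p` follows the generalization in Definition 2.1. Computing the supremum itself is a separate optimization problem that is not attempted here.

```python
import math

def mu_ratio(X, beta, p=1):
    """Ratio of positive to negative mass of X @ beta; a lower bound on mu_p(X)."""
    pos = neg = 0.0
    for row in X:
        v = sum(a * b for a, b in zip(row, beta))
        if v > 0:
            pos += abs(v) ** p
        elif v < 0:
            neg += abs(v) ** p
    if neg == 0.0:
        # No negative mass: the ratio is unbounded (or undefined if pos is 0 too).
        return math.inf if pos > 0 else math.nan
    return pos / neg

# Hypothetical 4 x 2 data matrix (labels already multiplied into the rows).
X = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -1.0]]

balanced = mu_ratio(X, [1.0, 0.0])    # positive mass 1, negative mass 1
skewed = mu_ratio(X, [0.0, 1.0])      # positive mass 2, negative mass 1
flipped = mu_ratio(X, [0.0, -1.0])    # negating beta swaps the two sums
```

Since replacing $\beta$ by $-\beta$ swaps numerator and denominator, the two ratios multiply to 1 whenever both are finite, so the supremum is always at least 1.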

1.2. RELATED WORK

Data oblivious sketching Data oblivious sketches have been developed for many problems in computer science, see (Phillips, 2017; Munteanu, 2023) for surveys. The seminal work of Sarlós (2006) opened up the toolbox of sketching for numerical linear algebra and machine learning problems, such as linear regression and low rank approximation, cf. (Woodruff, 2014) . We note that oblivious sketching is very important to obtain data stream algorithms in the turnstile model (Muthukrishnan, 2005) and there is evidence that linear sketches are optimal for such algorithms under certain conditions (Li et al., 2014; Ai et al., 2016) . The classic works on ℓ 2 regression have been generalized to other ℓ p norms (Sohler & Woodruff, 2011; Woodruff & Zhang, 2013) by combining sketching as a fast but inaccurate preconditioner and subsequent sampling to achieve the desired (1 + ε)-approximation bounds. Those works have been generalized further to so-called M -estimators, i.e., Huber (Clarkson & Woodruff, 2015a) or Tukey regression loss (Clarkson et al., 2019) , that share nice properties such as symmetry and homogeneity leveraged in previous works on ℓ p norms. ℓ 1 regression Specifically for ℓ 1 , the first sketching algorithms used random variables drawn from 1-stable (Cauchy) distributions to estimate the norm (Indyk, 2006) . It is possible to get concentration and a (1 ± ε)-approximation in near-linear space by using a median estimator. However, in a regression setting this estimator leads to a non-convex optimization problem in the sketch space. Since we want to preserve convexity to facilitate efficient optimization in the sketch space, we focus on sketches that work with an ℓ 1 estimator for solving the ℓ 1 regression problem in the sketch space in order to obtain a constant approximation for the original ℓ 1 problem. 
With this restriction, it is possible to obtain a contraction bound with high probability so as to union bound over a net, but similar results are not available for the dilation. Indeed, subspace embeddings for the $\ell_1$ norm have $\Theta(d)$ dilation (Woodruff & Zhang, 2013; Li et al., 2021; Wang & Woodruff, 2022). A $1 + \varepsilon$ dilation is only known to be possible when mapping to $\exp(O(1/\varepsilon))$ dimensions (Brinkman & Charikar, 2005), even for single vectors as in (Indyk, 2006). We thus focus on obtaining an $O(1)$ approximation in this paper. Previous work had either larger $O(\log(d))$ distortion or larger $\mathrm{poly}(d)$ factors (Indyk, 2006; Sohler & Woodruff, 2011). There exists a $(1 + \varepsilon)$-approximation algorithm (Sohler & Woodruff, 2011) for turnstile data streams, running two sketches in parallel: one for preconditioning and another that performs $\ell_1$-row-sampling from the sketch (Andoni et al., 2009). However, it has a worse $\mathrm{poly}(d \log(n)/\varepsilon)$ update time and sketching dimension, see (Sohler & Woodruff, 2011, Theorem 13). An advantage of our sketch is that it uses only random $\{0, 1\}$-entries, which have better computational and implicit storage properties (Alon et al., 1986; 1999; Rusu & Dobra, 2007). More importantly, our approach works simultaneously for both $\ell_1$ and logistic regression. For the latter no near-linear sketching dimension was known to be possible since sketches for $\ell_1$ cannot preserve the sign of coordinates, which is crucial for any multiplicative error on the asymmetric logistic loss.

Generalized linear models (GLMs) It is important to extend the works on linear regression to more sophisticated and expressive statistical learning problems, such as generalized linear models (McCullagh & Nelder, 1989). Unfortunately, taking this step led to impossibility results.
Namely, approximating the regression problems on a succinct sketch for strictly monotonic functions such as logistic loss (Munteanu et al., 2018) or heavily imbalanced asymmetric functions such as Poisson regression loss (Molina et al., 2018) allows one to design a low-communication protocol for the INDEXING problem that contradicts its $\Omega(n)$ bit randomized one-way communication complexity (Kremer et al., 1999). This implies an $\Omega(n)$ sketching dimension for these problems. To circumvent this worst-case limitation for logistic regression, Munteanu et al. (2018) introduced a natural data-dependent parameter $\mu$ that can be used to bound the complexity of compressing data for logistic and probit regression (Munteanu et al., 2022). This also led to the very first oblivious sketch for logistic regression (Munteanu et al., 2021), with a polylogarithmic number of rows for mild data. We improve this by giving the only near-linear sketching dimension in $d$ and $\mu$ for logistic regression. The previous best sketching dimension, obtained by Lewis weight sampling (Mai et al., 2021), was $O(\mu^2 d)$, and crucially their sketch is not oblivious, so it cannot be implemented in a turnstile data stream with positive and negative updates to the entries of the input point set. For lower bounds, an $\Omega(d)$ dependence is immediate since mapping to fewer than $d$ dimensions contracts non-zero vectors in the null-space of the sketching matrix to zero. An $\Omega(\mu)$ lower bound is immediate from Munteanu et al. (2018) and was recently generalized by Woodruff & Yasuda (2023) to more natural settings.

Variance-based regularization Regularization techniques have been proposed in the literature for many purposes, such as reducing the effective dimension of statistical problems or limiting their expressivity to avoid overfitting. Regularization was also proposed to relax the logistic regression problem.
In an extreme setting where the regularizer dominates the objective function, the contributions of data points do not differ significantly. The problem then becomes easy to approximate by uniform subsampling (Samadian et al., 2020) . To address the bias-variance tradeoff in machine learning problems in a more meaningful way and to provably reduce the generalization error of models, Maurer & Pontil (2009) proposed to add a data-dependent variance-based regularization. Since this results in a non-convex optimization problem even for convex objectives, Duchi & Namkoong (2019) ; Yan et al. (2020) used optimization tricks to reformulate a convex variant with additional parameters that can be integrated into standard hyperparameter tuning. Interestingly, this data-dependent regularization -in contrast to standard regularization -does not relax the sketching problem but makes it more complicated, requiring in the case of logistic regression a combination of ℓ 1 and ℓ 2 geometries to be preserved. We show that our sketch can deal with both simultaneously.
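For contrast with the convex estimator pursued in this paper, the median-based $\ell_1$ norm estimator with 1-stable (Cauchy) random variables mentioned in the related work on $\ell_1$ regression can be illustrated as follows. This is a minimal sketch for a single fixed vector; the function names, the number of rows, and the example vector are assumptions. It relies on the fact that a sum $\sum_i C_i z_i$ of independent standard Cauchy variables $C_i$ is distributed as $\|z\|_1$ times a standard Cauchy variable, whose absolute value has median 1.

```python
import math
import random
import statistics

def cauchy_sketch(z, num_rows, seed=0):
    """Return S z where S has i.i.d. standard Cauchy entries (num_rows rows)."""
    rng = random.Random(seed)
    sketch = []
    for _ in range(num_rows):
        # tan(pi * (U - 1/2)) with U uniform on [0, 1) is standard Cauchy.
        row_sum = sum(math.tan(math.pi * (rng.random() - 0.5)) * zi for zi in z)
        sketch.append(row_sum)
    return sketch

def median_estimate(sketch):
    """Median of absolute sketch entries; estimates the l1 norm of z."""
    return statistics.median(abs(v) for v in sketch)

z = [3.0, -1.0, 2.0, -4.0]            # hypothetical vector with l1 norm 10
true_norm = sum(abs(v) for v in z)
estimate = median_estimate(cauchy_sketch(z, num_rows=2001, seed=0))
```

The median is not a convex function of the sketch entries, which is why using this estimator inside a regression objective leads to a non-convex problem in the sketch space.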

1.3. OUR TECHNIQUES

Our main motivation is to reduce the large dependence on the parameters of the oblivious sketching algorithm of Munteanu et al. (2021). Their sketch consists of $O(\log n)$ levels that take subsamples at an exponentially decreasing rate and apply a CountMin-sketch to each subsample to compress it to size roughly $\tilde{O}(d^5 (\mu/\varepsilon)^7)$, which gives $(1 - \varepsilon)$ contraction but only $O(1)$ dilation bounds. Our new methods significantly improve over their sketching dimension for obtaining $(1 - \varepsilon)$ contraction bounds. The large dependence on $\mu$ came from adapting the analysis of Clarkson & Woodruff (2015a) to work for the asymmetric logistic loss function. This required rescaling $\varepsilon' = \varepsilon/\mu$ to translate the estimation error of $\varepsilon' \|z\|_1$ to an error of $\varepsilon \|z^+\|_1$, where the latter quantity sums only over the positive entries of $z$. We avoid this by noting that we can oversample the elements by a factor of $\mu$ to capture sufficiently many elements to approximate the required $\varepsilon \|z^+\|_1$ error directly. However, the analysis of the so-called heavy hitters requires $\mu$ elements to be perfectly isolated when hashing them into buckets, which requires $\mu^2$ buckets to succeed with good probability. To obtain a linear dependence on $\mu$, we sacrifice some sparsity in our sketching matrix. Instead of hashing to a single bucket at each level, we hash each element multiple times. The best known trade-off between sketching dimension and sparsity is due to Cohen (2016) for the Count-sketch. We adapt the technique to our CountMin-sketch: we hash each element to roughly $O(\varepsilon^{-1} \log(\mu d/\varepsilon))$ buckets and resolve collisions by summing the elements instead of taking a random sign combination. This brings the dependence on $\mu$ down to quasi-linear. The dependence on $d$ and $\varepsilon$ also benefits from this technique, but at this point our analysis still requires a $d^2$ dependence. This comes from needing to separate the heavy hitters from other large coordinates for all vectors in a net of exponential size in $d$.
To bring the dependence on $d$ down to near-linear we densify our sketch to roughly $O(\varepsilon^{-3} \mu d \log(n))$ non-zeros per column, which separates the heavy hitters almost entirely and yields our result.

We also improve the dilation bounds from a factor of $\ge 8$ to a 2-approximation: previous analyses were conducted by bounding the expected contribution of weight classes $W_q = \{i \mid 2^{-q-1} < z_i \le 2^{-q}\}$ to the different levels in our sketch. A simple bound of $O(\log n)$ was improved to $O(1)$ by a Ky-Fan norm argument, which cuts off elements that have a low redundant contribution. We reverse the perspective and ask, for each level $h$, which weight classes are well represented. This allows us to conduct a more fine-grained analysis: we define intervals $Q_h = [q(2), q(3)]$ and $Q_h \subseteq Q'_h = [q(1), q(4)]$ and quantify their size depending on the number of buckets, such that weight classes $q \notin Q'_h$ do not contribute at all, $q \in Q'_h$ make a non-negligible contribution, and $q \in Q_h$ are additionally well-approximated. It is thus desirable to choose the number of buckets in such a way that $|Q'_h|/|Q_h| \approx 1$. Moreover, we ask how the intervals in consecutive levels overlap. It turns out that slightly increasing the number of buckets in each level to $\tilde{O}(\mu d^2)$ allows us to show that each $q$ appears in at most two consecutive levels (in expectation), which yields a 2-approximation. Indeed, the argument can be continued by raising the size of the sketch to the $k$-th power for any $k \in \mathbb{N}$, resulting in an expected contribution in at most $(1 + 1/k)$ levels, which yields a $(1 + \varepsilon)$-approximation using $(\mu d \log(n)/\varepsilon)^{O(1/\varepsilon)}$ rows. An exponential dependence on $1/\varepsilon$ is best known for sketching-based estimators of the $\ell_1$ norm (Indyk, 2006; Li et al., 2021) that embed into lower-dimensional $\ell_1$, and our sketch can be used as such an estimator for $\ell_1$ as a special case.
Finally, as a corollary and important application of our results, we obtain similar oblivious sketching bounds for a variance-regularized version of logistic regression, see Section 1.2. It combines aspects of the ℓ 1 geometry of the sum of logistic losses with the ℓ 2 geometry that appears in the sum of squared logistic losses. The analysis is very similar to the standard logistic regression loss but requires redefining the weight classes in terms of squared values z 2 i and converting between the two norms, which introduces roughly another O( √ n)-factor. Previous work on data reduction methods for generalized linear models that work for different ℓ p losses were either based on sampling (Munteanu et al., 2022) or worked only for symmetric functions such as norms (Clarkson & Woodruff, 2015a) . However, an oblivious sketch for our loss function requires preserving the signs of elements which is not possible with previous sketching methods. Relying on the CountMin-sketch as in (Munteanu et al., 2021) thus seems necessary. For this choice we show that an additional Θ( √ n)-factor is unavoidable and hereby we corroborate the tightness of our analysis.
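The remark in this section that perfectly isolating $\mu$ heavy elements by single hashing needs roughly $\mu^2$ buckets is an instance of the birthday bound and can be checked with a short calculation. The following is a minimal sketch assuming fully uniform and independent hashing; the function name `prob_all_isolated` and the chosen values are illustrative assumptions. The probability that $m$ items land in $m$ distinct buckets out of $N$ is $\prod_{i=0}^{m-1} (1 - i/N)$.

```python
def prob_all_isolated(m, num_buckets):
    """Probability that m uniformly hashed items occupy m distinct buckets."""
    p = 1.0
    for i in range(m):
        # The (i+1)-th item must avoid the i buckets that are already occupied.
        p *= max(0.0, 1.0 - i / num_buckets)
    return p

m = 10
linear_buckets = prob_all_isolated(m, m)              # N = m
quadratic_buckets = prob_all_isolated(m, m * m)       # N = m^2
larger_buckets = prob_all_isolated(m, 10 * m * m)     # N = 10 m^2
```

With a number of buckets linear in $m$ a collision is almost certain, while $N$ proportional to $m^2$ gives a constant success probability. Hashing each element several times, as done in the densified sketch, is the route taken in the paper to avoid this quadratic cost.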

2. PRELIMINARIES AND MAIN RESULTS

For $\ell_1$ regression, we consider as inputs a data matrix $X \in \mathbb{R}^{n \times d}$ and a target vector $Y \in \mathbb{R}^n$. The task is to find $\tilde\beta \in \operatorname{argmin}_{\beta \in \mathbb{R}^d} \|X\beta - Y\|_1$. We note that, up to constants, this corresponds to minimizing the negative log-likelihood of a standard linear model $Y = X\beta + \eta$ with a Laplace noise distribution $\eta_i \sim L(0, 1)$ for all $i \in [n]$. Our goal will be to design an oblivious linear sketching matrix $S$ such that the sketch $X' = S[X, Y]$ is significantly reduced in its number of rows and solving the compressed $\ell_1$ regression problem in the sketch space yields an $O(1)$ approximation to the same problem on the original large data. Up to slight modifications, this sketch will also allow us to approximate logistic regression within a constant factor.

For logistic regression, assume that we are given a data set $Z = \{z_1, \ldots, z_n\}$ with $z_i \in \mathbb{R}^d$ for all $i \in [n]$, together with a set of labels $Y = \{y_1, \ldots, y_n\}$ with $y_i \in \{-1, 1\}$ for all $i \in [n]$. In logistic regression the negative log-likelihood (McCullagh & Nelder, 1989) is of the form
$$L(\beta \mid Z, Y) = \sum_{i=1}^n \ln(1 + \exp(-y_i z_i \beta)),$$
which, from a learning and optimization perspective, is the objective function that we would like to minimize. For $r \in \mathbb{R}$ we set $\ell(r) = \ln(1 + \exp(r))$ to simplify notation. Then we have that $L(\beta \mid Z, Y) = \sum_{i=1}^n \ell(-y_i z_i \beta)$.

We also include a variance-based regularization as proposed in (Maurer & Pontil, 2009; Duchi & Namkoong, 2019; Yan et al., 2020) to decrease the generalization error. We view our data set as $n$ realizations of a random variable $(z, y)$, where each $(z_i, y_i)$ is drawn i.i.d. from an unknown distribution $D$. Then the expected value of the negative log-likelihood (on the empirical sample) for any fixed $\beta$ equals $\mathbb{E}(\ell(-yz\beta)) = \frac{1}{n} L(\beta \mid Z, Y)$. The variance is given by
$$\operatorname{Var}(\ell(-yz\beta)) = \mathbb{E}(\ell(-yz\beta)^2) - \mathbb{E}(\ell(-yz\beta))^2 = \frac{1}{n} \sum_{i=1}^n \ell(-y_i z_i \beta)^2 - \left(\frac{1}{n} \sum_{i=1}^n \ell(-y_i z_i \beta)\right)^2.$$
We also introduce a regularization hyperparameter $\lambda \in \mathbb{R}_{\ge 0}$.
Then our objective is to minimize $\mathbb{E}(\ell(-yz\beta)) + \frac{\lambda}{2} \operatorname{Var}(\ell(-yz\beta))$. As $z_i$ and $y_i$ always appear together, we set $x_i = -y_i z_i$. Further we set $X \in \mathbb{R}^{n \times d}$ to be the matrix with row vectors $x_i$ for $i \in [n]$. For technical reasons we include a weight vector $w \in \mathbb{R}^n_{\ge 0}$ in the objective. Then our goal is to find $\beta \in \mathbb{R}^d$ minimizing
$$f_w(X\beta) = \frac{1}{n} \sum_{i=1}^n w_i \ell(x_i \beta) + \frac{\lambda}{2n} \sum_{i=1}^n w_i \ell(x_i \beta)^2 - \frac{\lambda}{2} \left(\frac{1}{n} \sum_{i=1}^n w_i \ell(x_i \beta)\right)^2.$$
The unweighted case corresponds to choosing $w$ to be the vector containing only 1's, in which case we set $f(X\beta) = f_w(X\beta)$. We also note that $f(X\beta) \ge \frac{1}{n} \sum_{i=1}^n \ell(x_i \beta)$ since the variance term is non-negative. Next, observe that $\min_{\beta \in \mathbb{R}^d} f(X\beta) \le f(0) = \ell(0) = \ln(2)$. In our analysis we investigate the functions $\ell(r)$ and $\ell(r)^2$. Further we split $f$ into three functions
$$f_1(X\beta) = \frac{1}{n} \sum_{i=1}^n \ell(x_i \beta), \quad f_2(X\beta) = \frac{\lambda}{2n} \sum_{i=1}^n \ell(x_i \beta)^2, \quad f_3(X\beta) = \frac{\lambda}{2} \left(\frac{1}{n} \sum_{i=1}^n \ell(x_i \beta)\right)^2 = \frac{\lambda}{2} f_1(X\beta)^2,$$
so that $f = f_1 + f_2 - f_3$.

In contrast to the $\ell_1$ regression problem, a data reduction for $f$ or even $f_1$ where the sketch size is $r \ll n$ cannot be obtained in general. There are examples where no sketch of size $r = o(n/\log n)$ exists, even for an arbitrarily large but finite error bound (Munteanu et al., 2018). If we require the sketch to be a subset of the input, the bound can be strengthened to $\Omega(n)$ (Tolochinsky et al., 2022). Those impossibility results rely on the monotonicity of the loss function and thus extend to the function $f$ studied in this paper. To get around these strong limitations, Munteanu et al. (2018) introduced a parameter $\mu$ as a natural notion for parameterizing the complexity of compressing the input matrix $X$ for logistic regression. It was recently adapted for $p$-generalized probit regression (Munteanu et al., 2022). We work with a similar generalization given in the following definition.

Definition 2.1. Let $X \in \mathbb{R}^{n \times d}$ be any matrix and let $p \in [1, \infty)$. We define
$$\mu_p(X) = \sup_{\beta \in \mathbb{R}^d \setminus \{0\}} \frac{\sum_{x_i\beta > 0} |x_i \beta|^p}{\sum_{x_i\beta < 0} |x_i \beta|^p}.$$
We say that $X$ is $\mu$-complex if $\max\{\mu_1(X), \mu_2(X)\} \le \mu$.

Our goal is to construct a slightly relaxed version of a sketch that suffices to obtain a good approximation by optimizing in the sketch space:

Definition 2.2. Let $(X, w)$ be a dataset, $V \subset \mathbb{R}^d$ a subset, $a > 1$ and $\varepsilon, \delta > 0$. A weak weighted $(V, a, \varepsilon)$-sketch $C = (X', w')$ for $f$ is a matrix $X' \in \mathbb{R}^{r \times d}$ together with a weight vector $w' \in \mathbb{R}^r_{>0}$ such that it holds simultaneously that: for all $\beta \in V$ we have $f_{w'}(X'\beta) \ge (1 - \varepsilon) f_w(X\beta)$, and for $\beta^* \in V$ minimizing $f_w(X\beta)$ it holds that $f_{w'}(X'\beta^*) \le a f_w(X\beta^*)$. Further, for any $\beta \in \mathbb{R}^d \setminus V$ it holds that $f_{w'}(X'\beta) > \min_{\beta \in V} f_{w'}(X'\beta)$.

We note that a sketch satisfying the definition for $V = \mathbb{R}^d$ is known in the literature as a lopsided embedding (Sohler & Woodruff, 2011; Clarkson & Woodruff, 2015b; Feng et al., 2021). We denote by $\mathrm{nnz}(X)$ the number of non-zero entries of $X$. Our main results are the following. For $\ell_1$ regression, where the objective is $\|X\beta - Y\|_1$, we have

Theorem 1. Let $X \in \mathbb{R}^{n \times d}$ and let $Y \in \mathbb{R}^n$. Let $\varepsilon, \delta > 0$ and let $a > 1$. Then there is a distribution over sketching matrices $S \in \mathbb{R}^{r \times n}$ and a corresponding weight vector $w \in \mathbb{R}^r$, for which $X' = S[X, Y]$ can be computed in $T$ time in a single pass over a turnstile data stream such that $(X', w)$ is a weak weighted $(\mathbb{R}^d, \alpha, \varepsilon)$-sketch for $\ell_1$ regression with failure probability at most $P$, where
1. $r = O(d^{1+c} \ln(n)^{3+5c})$ for any constant $1 \ge c > 0$, $T = O(d \ln(n)\, \mathrm{nnz}(X))$, and $\alpha = 1 + \frac{1}{c}$ and $P$ are constant,
2. $r = O\left(\frac{d^4 \ln(n)^5}{\delta \varepsilon^7}\right) + \frac{32 d \ln(n)^3}{\varepsilon^5} \cdot \left(\frac{64 d \ln(n)^5}{\varepsilon^6 \delta}\right)^{1 + \varepsilon^{-1}}$, $T = O(\mathrm{nnz}(X))$, $\alpha = (1 + a\varepsilon)$, and $P = \delta + \frac{1}{a}$.

For logistic regression, where the objective function is only $f_1(X\beta)$, we have

Theorem 2. Let $1 \ge c > 0$ be any constant. Let $X \in \mathbb{R}^{n \times d}$ be a $\mu$-complex matrix for bounded $\mu \in O((d \log^3(n))^c)$. Let $\varepsilon, \delta > 0$ and let $a > 1$.
Then there is a distribution over sketching matrices $S \in \mathbb{R}^{r \times n}$ and a corresponding weight vector $w \in \mathbb{R}^r$, for which $X' = SX$ can be computed in $T$ time in a single pass over a turnstile data stream such that $(X', w)$ is a weak weighted $(\mathbb{R}^d, \alpha, \varepsilon)$-sketch for $f_1$ with failure probability at most $P$, where
1. $r = O(\mu d^{1+c} \ln(n)^{2+4c})$, $T = O(\mu d \ln(n)\, \mathrm{nnz}(X))$, and $\alpha = 1 + \frac{1}{c}$ and $P$ are constant,
2. $r = O\left(\frac{d^4 \ln(n)^5 \mu^2}{\delta \varepsilon^7}\right) + \frac{32 d \mu \ln(n)^2}{\varepsilon^5} \cdot \left(\frac{64 d \ln(n)^4}{\varepsilon^7 \delta}\right)^{1 + \varepsilon^{-1}}$, $T = O(\mathrm{nnz}(X))$, $\alpha = (1 + a\varepsilon)$, and $P = \delta + \frac{1}{a}$.

Note that setting $a = \delta^{-1}$ and substituting $\varepsilon$ with $\varepsilon\delta$ in the second item yields a $(1 + \varepsilon)$ approximation with probability at least $1 - 2\delta$. For the variance-based regularization, where we consider the full objective function $f(X\beta)$, we have

Theorem 3. Let $X \in \mathbb{R}^{n \times d}$ be a $\mu$-complex matrix for bounded $\mu < n$. Let $\varepsilon, \delta > 0$, let $a > 1$ and set $V = \{X\beta \mid f_1(X\beta) \le \ln(2)(1 - \varepsilon)\}$. Then there is a distribution over sketching matrices $S \in \mathbb{R}^{r \times n}$ and a corresponding weight vector $w \in \mathbb{R}^r$, for which $X' = SX$ can be computed in $T$ time in a single pass over a turnstile data stream such that $(X', w)$ is a weak weighted $(V, \alpha, \varepsilon)$-sketch for $f$ with failure probability at most $P$, where
• $r = O\left(\frac{n^{0.5+c} \mu d^2 \ln^3(n)}{\varepsilon^5} \cdot \max\{d, \ln(n), \varepsilon^{-1}, \delta^{-1}, \mu\} + \frac{d^5 \mu^2 \ln(n)^5 \sqrt{n}}{\delta \varepsilon^7}\right)$ for an arbitrary constant $1 \ge c > 0$, $T = O(\mathrm{nnz}(X))$, $\alpha = 1 + \frac{a}{c}$, and $P = \delta + \frac{1}{a}$.

We note that for generality of our results we specify a tradeoff between our $d^{1+c}$ dependence and an arbitrarily large constant approximation error $\alpha = 1 + 1/c$. We stress that specific parameterizations yield strictly improved results over previous work. For instance, we improve the $\ge 8$-approximation of (Munteanu et al., 2021) within $\tilde{O}(\mu^7 d^5)$ to a 2-approximation within $\tilde{O}(\mu d^2)$ by choosing $c = 1$.
We further improve several lopsided $\ell_1 \to \ell_1$ embedding results to a factor of 2, where previous approximations gave only $O(d \log d)$ (Sohler & Woodruff, 2011), $O(\log d)$ (Woodruff, 2021), or $\ge 8$ (Clarkson & Woodruff, 2015a), or gave only non-convex estimators in the sketch space (Backurs et al., 2016), along with larger superlinear dependencies on $d$.

Technical description of the sketch Our contributions lie mainly in the improved and refined theoretical analyses. The sketching matrices of Theorems 1-3 are the same as in (Munteanu et al., 2021) up to small but important algorithmic modifications specified in the textual description below, and in pseudo-code, see Algorithm 1 in the appendix. The sketching matrix consists of $O(\log n)$ levels. In each level we take a subsample of all rows $i \in [n]$ at a different rate and hash the sampled items uniformly to a small number of buckets. All items that are mapped to the same bucket are summed up. This corresponds to a CountMin sketch (Cormode & Muthukrishnan, 2005) applied to the subsample taken at each level. More specifically, we will use the following parameters:

• $h_m$: the number of levels,
• $N_h$: the number of buckets at level $h$,
• $p_h$: the probability that any element $x_i$ is sampled at level $h$.

As we read the input, we sample each element $x_i$ for each level $h \le h_m$ with probability $p_h$. The sampling probabilities are exponentially decreasing, i.e., $p_h \propto 1/b^h$ for some $b \in \mathbb{R}$ with $b > 1$. The weight of any bucket at level $h$ is set to $1/p_h$. At level $h_m$, we have $p_{h_m} \propto \frac{1}{n}$. It thus corresponds to a small uniform subsample and the number of buckets is equal to the number of rows that are sampled, i.e., $N_u := N_{h_m} \approx n p_{h_m} =: n p_u$. At level 0 we sample all rows, i.e., $p_0 = 1$, and the number of buckets is either the same as for the levels $h \in (0, h_m)$ or less. Consequently, level 0 is a standard CountMin sketch of the entire data. All levels $h \in (0, h_m)$ have the same number of buckets $N_h = N$.
For obtaining subquadratic dependence on µ, d in item 1 of Theorems 1 and 2, the sketch at level 0 is densified, which means that each element is hashed to a number of s > 1 buckets. The idea of the sketching algorithm is that for each fixed β ∈ R d we partition the coordinates of Xβ into weight classes depending on their contribution to the objective function. Each level approximates well a certain range of weight classes if their total contribution is large enough. For example the highest level h m will cover all the elements in small weight classes and the lowest level 0 will capture the so-called heavy hitters that appear rarely but have a significant contribution to the objective. Another algorithmic change is randomizing the size of the sketch at level 0, which is crucial for obtaining a (1 + ε)-approximation. For the exact details we refer to the analysis and Assumption A.1.
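The level structure described above can be sketched in a few lines. The following is a simplified illustration, not the algorithm of the paper: it assumes the same number of buckets at every level, hashes each sampled row to a single bucket, and omits the densification of level 0, the randomized size of level 0, and the special treatment of the top level as a uniform subsample. The function name `build_sketch` and all parameter values are hypothetical.

```python
import random

def build_sketch(X, num_levels, num_buckets, base=2.0, seed=0):
    """Return (rows, weights) of a simplified multi-level bucket-sum sketch.

    Level h samples every input row with probability p_h = base**(-h),
    adds each sampled row into one uniformly chosen bucket, and assigns
    weight 1 / p_h to all buckets of that level.
    """
    rng = random.Random(seed)  # random choices never look at the data values
    d = len(X[0])
    rows, weights = [], []
    for h in range(num_levels):
        p_h = base ** (-h)
        level = [[0.0] * d for _ in range(num_buckets)]
        for x in X:
            if rng.random() < p_h:  # always true at level 0, where p_0 = 1
                b = rng.randrange(num_buckets)
                for j in range(d):
                    level[b][j] += x[j]
        rows.extend(level)
        weights.extend([1.0 / p_h] * num_buckets)
    return rows, weights

# Hypothetical small input with integer-valued entries.
X = [[1.0, -2.0], [3.0, 0.0], [-1.0, 4.0], [2.0, 2.0], [0.0, -5.0]]
rows, weights = build_sketch(X, num_levels=3, num_buckets=4, base=2.0, seed=0)

# Level 0 contains every row exactly once, so its buckets sum to the column sums of X.
level0_sums = [sum(r[j] for r in rows[:4]) for j in range(2)]
```

In a streaming implementation the per-row random choices would be replaced by hash functions of the row index, so that later updates to the same row reach the same buckets.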

3. HIGH LEVEL DESCRIPTION OF OUR NOVEL ANALYSIS

Several details of the analysis, such as assumptions on the parameters, technical lemmas, and proofs, are deferred to the appendix. We start by splitting the functions $f_1$ and $f_2$ into multiple parts:

Lemma 3.1. It holds that
$$n f_1(X\beta) = \sum_{x_i\beta > 0} |x_i \beta| + \sum_{i=1}^n \ell(-|x_i \beta|)$$
and similarly we have that
$$n f_2(X\beta) = \frac{\lambda}{2} \left( \sum_{x_i\beta > 0} |x_i \beta|^2 + 2 \sum_{x_i\beta > 0} \ell(-|x_i \beta|) \cdot |x_i \beta| + \sum_{i=1}^n \ell(-|x_i \beta|)^2 \right).$$

This can be used in the following way: if all $x_i \beta$ make only small contributions then uniform sampling performs well. This is not the case for all parts of $f$, but it holds for some 'small' parts of $f$ that appear in the splitting introduced in Lemma 3.1. Next we deal with the remaining 'large' parts of $f$. We will first analyze the approximation for a single $\beta$. To this end fix $\beta \in \mathbb{R}^d$ and set $z = X\beta$. Our goal is to approximate $\|z^+\|_1 := \sum_{i : z_i > 0} z_i$, where $z^+ \in \mathbb{R}^n_{\ge 0}$ is the vector that we get by setting all negative coordinates of $z$ to 0. We assume w.l.o.g. that $\|z\|_1 = 1$. We can do this since $v \mapsto \|v^+\|_1$ is positively homogeneous. In order to prove that $\|(Sz)^+\|_1$ approximates $\|z^+\|_1$ well, we define weight classes: given $q \in \mathbb{N}$ we set $W^+_q = \{i \in [n] \mid z_i \in (2^{-q-1}, 2^{-q}]\}$. Our analysis applies with slight adaptations to $\ell_1$ regression, preserving $\|z\|_1$ for the residual vector $z = X\beta - Y$. The analysis is entirely in the appendix due to the page limitations. We give a high level description for preserving $\|z^+\|_1$ needed for logistic loss.
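The splitting in Lemma 3.1 rests on the pointwise identity $\ell(r) = \max\{r, 0\} + \ell(-|r|)$, and the weight classes are determined by the binary scale of the positive entries of the normalized vector $z$. Both can be checked numerically. The following is a minimal sketch; the helper names and the example vector are assumptions made for illustration.

```python
import math

def ell(r):
    """Logistic loss ln(1 + exp(r)), evaluated directly (fine for moderate r)."""
    return math.log(1.0 + math.exp(r))

def split_sum(z):
    """Right-hand side of the first identity: positive part plus bounded remainder."""
    return sum(v for v in z if v > 0) + sum(ell(-abs(v)) for v in z)

def split_sum_squares(z):
    """Three-term expansion of sum_i ell(z_i)^2 corresponding to the second identity."""
    pos = [v for v in z if v > 0]
    return (sum(v * v for v in pos)
            + 2.0 * sum(ell(-abs(v)) * abs(v) for v in pos)
            + sum(ell(-abs(v)) ** 2 for v in z))

def weight_classes(z):
    """Map q to the indices i with z_i in (2^(-q-1), 2^(-q)], after scaling to ||z||_1 = 1."""
    norm = sum(abs(v) for v in z)
    classes = {}
    for i, v in enumerate(z):
        v = v / norm
        if v > 0:
            q = math.floor(-math.log2(v))
            classes.setdefault(q, []).append(i)
    return classes

z = [4.0, -2.0, 1.0, -1.0]                 # hypothetical vector z = X beta
lhs = sum(ell(v) for v in z)               # equals n * f_1
lhs_sq = sum(ell(v) ** 2 for v in z)       # equals (2n / lambda) * f_2 for lambda > 0
classes = weight_classes(z)                # normalized positive entries are 1/2 and 1/8
```

The remainder terms $\ell(-|z_i|)$ are bounded by $\ln(2)$ for every entry, which is the reason uniform sampling suffices for these 'small' parts.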

Contraction bounds We set $q_m = \log_2\left(\frac{n(\mu+1)}{\varepsilon}\right) = O(\ln(n))$, since $n \ge \max\{\mu, \varepsilon^{-1}\}$. We say that $W^+_q$ is important if $\|W^+_q\|_1 \ge \varepsilon' := \frac{\varepsilon}{\mu q_m}$ and set $Q^* = \{q \le q_m \mid W^+_q \text{ is important}\}$. The idea is that the remaining weight classes can only have small contributions to $\|z^+\|_1$, so it suffices to analyze $Q^*$. To prove the contraction bound for $z$, i.e., that $\|(Sz)^+\|_1 \ge (1 - c\varepsilon)\|z^+\|_1$ holds for an absolute constant $c$, it suffices to show that the contributions of important weight classes are preserved. For a bucket $B$ we set $G(B) := \sum_{j \in B} z_j$ and $G^+(B) = \max\{G(B), 0\}$. In fact, we show that for each level $h$, there exists an 'inner' interval $Q_h = [q_h(2), q_h(3)]$ such that if $W^+_q$ for $q \in Q_h$ is important, then there exists a subset $W^*_q \subseteq W^+_q$ such that each element of $W^*_q$ is sampled at level $h$ and such that $\sum_{i \in W^*_q} G(B_i) \ge (1 - \varepsilon)\|W^+_q\|_1 \cdot p_h$, where $B_i$ is the bucket at level $h$ containing $z_i$. Since the weight of all buckets at level $h$ is equal to $p_h^{-1}$, we have that the contribution of $W^*_q$ is indeed at least $(1 - \varepsilon)\|W^+_q\|_1$. The choice of our parameters then guarantees that $\bigcup_h Q_h = \mathbb{N}$ and thus for any important weight class there is at least one level where it is well represented. Finally, we construct a net of size $|\mathcal{N}_k| = \exp(O(d \log(n)))$. We ensure that the contraction bound holds for each fixed net point $z \in \mathcal{N}_k$ with failure probability at most $\frac{\delta}{|\mathcal{N}_k|}$, which, among other parameters, will dominate the size of our sketch. By a union bound, the contraction result holds for the entire net with probability at least $1 - \delta$. The net is sufficiently fine, such that we can conclude the contraction bound by relating all other points $z = X\beta \in \mathbb{R}^n$ to their closest net point.

Dilation bounds We will also show that the expected contribution of any weight class is at most $2\|W^+_q\|_1$ or even less. To this end we increase the number of buckets $N$ and apply a random shift at level 0, i.e., we choose the number of buckets at level 0 randomly.
We investigate again each level separately and prove that for each level $h$ there exists an 'outer' interval $Q'_h = [q_h(1), q_h(4)]$ such that for any $q \notin Q'_h$ the weight class $W_q$ makes no contribution at level $h$ at all. More specifically, we show that no element of $W_q$ appears at level $h$ for $q < q_h(1)$ and that for any bucket $B$ at level $h$ that contains only elements of $\bigcup_{q > q_h(4)} W_q$ it holds that $G(B) \le 0$. Then we show that if $N$ is large enough, it holds that for each $q \in \mathbb{N}$ there are at most two levels $h$ such that $q \in Q'_h$ and that the expected contribution of any weight class at any level is bounded by $\|W^+_q\|_1$. We conclude that the expected contribution of any weight class is at most $2\|W^+_q\|_1$. Increasing $N$ increases the size of the 'inner' interval $[q_h(2), q_h(3)] =: Q_h \subset Q'_h$ while the size of $Q'_h$ remains (almost) unchanged, such that $|Q'_h|/|Q_h|$ approaches 1. As a consequence, this also decreases the number of indices $q \in \mathbb{N}$ that appear in two intervals of the form $Q'_h$. More precisely, we show that for each $k \in \mathbb{N}$ we can increase $N$ in such a way that only a $1/k$ fraction of the weight classes appear in two of those intervals. Note that all weight classes that appear only in a single $Q'_h$ have an expected contribution of $\|W^+_q\|_1$. Recall that all indices are considered on level 0. This is handled by applying a random shift, implicitly setting $q_0(3)$ randomly in an appropriate way such that the expected contribution of any weight class $W^+_q$ is bounded by at most $(1 + 1/k)\|W^+_q\|_1$.

Extension to variance-based regularized logistic regression We show that our algorithm also approximates the variance well under the assumption that roughly $f_1(X\beta) \le \ln(2)$. We stress that this assumption does not rule out the existence of good approximations. Indeed, even the minimizer is contained, as observed in the preliminaries, since we have that $\min_{\beta \in \mathbb{R}^d} f(X\beta) \le f(0) = f_1(0) = \ln(2)$.
Focusing on a single $z = X\beta$, we need to show that $\sum_{i : z_i > 0} z_i^2$ is approximated well, which is done very similarly to the analysis for $\sum_{i : z_i > 0} z_i$ sketched above, but with several adaptations to account for the squared loss function. We note that the increased sketching dimension in terms of $\sqrt{n}$ comes from the inter-norm inequality $\|x\|_1 \le \sqrt{n}\|x\|_2$. Lemma E.14 in the appendix shows that this dependence cannot be avoided using the CountMin-sketch. It does not rule out other methods that may allow a lower sketching dimension. We stress that other known standard sketches do not work for asymmetric functions since they confuse the signs of contributions, leading to unbounded errors for our objective function or plain logistic regression, see (Munteanu et al., 2021).
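The weight-class decomposition used throughout this overview can be illustrated with a minimal sketch. It assumes that $W_q^+$ collects the positive entries with $z_i \in (2^{-q-1}, 2^{-q}]$ after normalizing $\|z\|_1 = 1$ (mirroring the definition of $W_q$ given in Appendix D), and uses the threshold $\varepsilon' = \varepsilon/(\mu q_m)$ from above. The function name and return format are illustrative only and not taken from the authors' code.

```python
import math

def weight_classes(z, mu, eps):
    """Group positive entries of z into dyadic weight classes W_q^+.

    Assumption (illustrative): index i belongs to class q if the normalized
    entry z_i lies in (2^-(q+1), 2^-q].
    Returns (classes, mass, important) where `important` lists the class
    indices q <= q_m whose l1 mass is at least eps / (mu * q_m).
    """
    n = len(z)
    total = sum(abs(v) for v in z)
    z = [v / total for v in z]              # normalize so that ||z||_1 = 1
    q_m = math.log2(n * (mu + 1) / eps)
    classes = {}
    for i, v in enumerate(z):
        if v <= 0:
            continue                        # only positive entries form W_q^+
        q = math.floor(-math.log2(v))       # v in (2^-(q+1), 2^-q]
        classes.setdefault(q, []).append(i)
    mass = {q: sum(z[i] for i in idx) for q, idx in classes.items()}
    eps_prime = eps / (mu * q_m)
    important = sorted(q for q, m in mass.items()
                       if q <= q_m and m >= eps_prime)
    return classes, mass, important
```

For example, `weight_classes([0.5, 0.25, -0.25], mu=1, eps=0.5)` places the first entry in class 1 and the second in class 2; the negative entry is ignored because only positive entries contribute to $\|z^+\|_1$.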

4. EXPERIMENTS

We implemented our new sketching algorithm in the framework of Munteanu et al. (2021). Pseudocode can be found in Appendix G. The crucial difference is that at level 0 of our sketch, each element gets mapped to multiple buckets instead of only one. Sketch (old) denotes the sketch used in (Munteanu et al., 2021) and is highlighted in red in the plots. Sketch$_s$ is the sketch where each entry is mapped to $s \in \{2, 5, 10\}$ buckets at level 0. Each sketch was run with 40 repetitions for various target sizes.

Real-world benchmark data was downloaded automatically by the Python scripts: the Covertype data consists of 581,012 cartographic observations of different forests with 54 features. The Webspam data consists of 350,000 unigrams with 127 features from web pages. The Kddcup data consists of 494,021 network connections with 41 features.

On the real-world data we see in Figure 1 slightly improved performance over the previous sketch for Covertype and Webspam. On the Kddcup data we see a slightly weaker performance. We also see that increasing the sparsity parameter $s$ too much results in worse performance, which is especially true for $\lambda > 0$ (partly in the appendix). This indicates that the variance term is large for Kddcup and the $\sqrt{n}$ dependence dominates the sketching size necessary to decrease the error in the squared variance-regularization term.

We created a synthetic data set that has multiple heavy hitters for the sake of showing the benefits of the new sketch. A detailed description of our construction can be found in Appendix G, along with intuition for why it is complicated for the old sketch, while our new sketch can handle it much better. The data set consists of $n = 40{,}000$ points and the dimension is $d = 100$. We see in Figure 1 (bottom left) that increasing the number of buckets each element is hashed to improves the approximation ratio from between 7 and 8 for the old sketch to 3 or even 2 for the modified sketches. The sketching times are only slightly increased for larger values of $s$, allowing for fast processing time (bottom middle). We omitted the time plots for the other data sets and parameterizations since the general picture is consistently as expected: Sketch$_s$ is almost $s$ times slower than Sketch (old).

We added another comparison between our sketch and the Cauchy sketch for an $\ell_1$ regression problem (bottom right). We see that the new sketch, using any degree of sparsity $s \in \{2, 5, 10\}$, outperforms the Cauchy sketch by a large margin in terms of approximation factor (while being a lot faster to apply than the dense matrix multiplication). More plots and discussion can be found in Appendix G. We also discuss stochastic gradient descent (SGD) together with supporting experiments in Appendix G. While SGD performs well on real-world data (though not better than sketching), it suffers from arbitrarily bad errors when applied to our synthetic data.

Figure 1: Comparison of median approximation ratios of the old sketch vs. the new sketch with various settings for the sparsity $s \in \{2, 5, 10\}$ as well as for the regularization parameter $\lambda \in \{0, 0.1\}$ for real-world benchmark data (rows 1-2). Comparison of median approximation ratios and sketching times (row 3, left, middle) for our synthetic data. Comparison to the Cauchy sketch (row 3, right).

5. CONCLUSION

We obtain significantly improved bounds on the number of rows that are sufficient for obliviously sketching logistic regression on $\mu$-complex data up to an $O(1)$ factor. Our bounds are almost linear in terms of the dimension $d$ and a data-dependent complexity parameter $\mu$ that bounds the complexity of data reduction techniques for logistic regression and related loss functions. Our results are achieved by modifying the sketching approach of Munteanu et al. (2021), which allows a change of perspective and facilitates a fine-grained analysis of the contributions of single levels in the sketch. As a result, we also develop the first oblivious sketch for obtaining a $(1 + \varepsilon)$-approximation, albeit with an exponential dependence on $1/\varepsilon$, which is likely to be required for our estimator due to corresponding hardness results on sketching $\ell_1$ norms. We also extend the analysis to work for a variance-based regularized version of logistic regression which combines the $\ell_1$ and $\ell_2$ related loss functions and is of great practical relevance for reducing generalization error in statistical learning. It remains a challenging open question whether we can further reduce the upper bounds to or below $O(\mu d)$ or increase the lower bounds from $\Omega(\mu + d)$ to or above $\Omega(\mu d)$. It would be interesting to study our sketching techniques under assumptions such as sparsity that allow for a sketching dimension below linear (Mai et al., 2023).

A OMITTED DETAILS FROM SECTION 2

For technical reasons we make the following assumption:

Assumption A.1. We assume that:

$$h_m = \min\left\{ i \in \mathbb{N} \;\middle|\; \frac{M_i}{bN} \le 12\ln(n) \right\} \tag{1}$$

$$N \ge 32\, m_1^{1+c} q_m^{1+c} h_m^{c}\, \mu / \varepsilon^{6} \tag{2}$$

$$b = \frac{N\varepsilon^{5}}{32\, m_1 q_m \mu} \ge \frac{18\mu}{\varepsilon} \tag{3}$$

$$m_1 = \ln(\delta^{-1}) + O(d\ln(n)), \qquad p_u \ge \frac{64\mu m_1}{\varepsilon^{2} n}. \tag{4}$$

Here $c$ is some constant used in the proof of Theorem 2. Since we want our sketch to have fewer than $n$ rows we will also assume that $n \ge \varepsilon^{-1}, \mu, d, \delta^{-1}$. We also assume that $\varepsilon \le 1/4$.

We will further use the following probability tools:

Proposition A.1 (Bernstein's Inequality (Bernstein, 1924)). Let $X_1, \ldots, X_n$ be independent zero-mean random variables. Suppose that $|X_i| \le M$ holds almost surely for all $i$. Then, for all positive $t$ it holds that
$$P\left(\sum_{i=1}^{n} X_i \ge t\right) \le \exp\left(-\frac{\frac{1}{2}t^{2}}{\sum_{i=1}^{n} E[X_i^{2}] + \frac{1}{3}Mt}\right).$$

Proposition A.2 (Chernoff bound (Chernoff, 1952)). Let $X = \sum_{i=1}^{n} X_i$, where $X_i = 1$ with probability $p_i$ and $X_i = 0$ with probability $1 - p_i$, and all $X_i$ are independent. Let $\mu = E(X) = \sum_{i=1}^{n} p_i$. Then for all $\delta \in [0, 1]$ it holds that
$$P(|X - \mu| \ge \delta\mu) \le 2\exp(-\delta^{2}\mu/3)$$
and for any $\delta > 1$ it holds that
$$P(|X - \mu| \ge \delta\mu) \le 2\exp(-\delta\mu/3).$$

Lemma A.3. Let $y$ be a binomially distributed random variable with parameters $n, p$. Let $n' \in \mathbb{N}$. Then if $n' \ge pn$ we have that
$$P(|y - pn| > n') \le 2\exp(-n'/3).$$
Else if $n' = \varepsilon pn$ we have that
$$P(|y - pn| > n') \le 2\exp(-\varepsilon n'/3).$$

Proof of Lemma A.3. Note that $E(y) = pn$. Using the Chernoff bound we get that if $n' \ge pn$, then $P(|y - pn| > n') \le 2\exp(-(n'/np)\,np/3) = 2\exp(-n'/3)$. If $n' = \varepsilon pn$ the Chernoff bound implies that $P(|y - pn| > n') \le 2\exp(-\varepsilon^{2} np/3) = 2\exp(-\varepsilon n'/3)$.

Lemma 3.1 in the main body splits the objective into 'large' and 'small' parts which we handle separately.

Proof of Lemma 3.1. Note that for $r \in \mathbb{R}$ it holds that
$$\ell(r) = \ln(1 + e^{r}) = \ln\left((e^{-r} + 1)\,e^{r}\right) = \ln(e^{-r} + 1) + \ln(e^{r}) = \ell(-r) + r.$$

Now the first equation follows immediately by $\ell(x_i\beta) = x_i\beta + \ell(-x_i\beta) = |x_i\beta| + \ell(-|x_i\beta|)$ for $x_i\beta > 0$ and $\ell(x_i\beta) = \ell(-|x_i\beta|)$ for $x_i\beta \le 0$. Further we have that
$$(x_i\beta + \ell(-x_i\beta))^{2} = (x_i\beta)^{2} + 2\ell(-x_i\beta)\,x_i\beta + \ell(-x_i\beta)^{2}.$$
Thus the second equality follows by substituting $\ell(x_i\beta)^{2}$ with $|x_i\beta|^{2} + 2\ell(-|x_i\beta|)|x_i\beta| + \ell(-|x_i\beta|)^{2} = (x_i\beta)^{2} + 2\ell(-x_i\beta)\,x_i\beta + \ell(-x_i\beta)^{2}$ for $x_i\beta > 0$ and with $\ell(-|x_i\beta|)^{2}$ for $x_i\beta \le 0$.
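The identity $\ell(r) = \ell(-r) + r$ and the resulting split into a 'large' part $\max\{r, 0\}$ and a 'small' part $\ell(-|r|)$ can be checked numerically. The following is a minimal sketch; the helper names `ell` and `split_loss` are illustrative, and the stable evaluation via `math.log1p` is an implementation choice rather than something prescribed by the paper.

```python
import math

def ell(r):
    """Logistic loss ln(1 + e^r), evaluated in a numerically stable way."""
    return max(r, 0.0) + math.log1p(math.exp(-abs(r)))

def split_loss(r):
    """Split ell(r) into (large part, small part) = (max(r, 0), ell(-|r|))."""
    return max(r, 0.0), ell(-abs(r))
```

For any real `r`, the two parts returned by `split_loss(r)` sum to `ell(r)`, and the small part never exceeds $\ln(2)$, which is the fact exploited in Appendix B.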

B ESTIMATING THE SMALL PARTS OF f

We can bound the 'small' parts using the following lemma:

Lemma B.1. For arbitrary $i \in [n]$ it holds that $\ell(-|x_i\beta|) < 1$ and also $2\ell(-|x_i\beta|)|x_i\beta| + \ell(-|x_i\beta|)^{2} \le 3$.

Proof. First observe that $\ell(-|x_i\beta|) \le \ell(0) = \ln(2) < 1$, proving the first part of the lemma. Next note that
$$\ell(-|x_i\beta|) = \ln(1 + \exp(-|x_i\beta|)) = \int_{1}^{1+\exp(-|x_i\beta|)} \frac{1}{t}\,dt \le \int_{1}^{1+\exp(-|x_i\beta|)} 1\,dt = \exp(-|x_i\beta|).$$
Using that $\ln(t) \le |t|$ for all $t > 0$ we conclude that $\ell(-|x_i\beta|)|x_i\beta| \le \exp(\ln(|x_i\beta|) - |x_i\beta|) \le e^{0} = 1$. Now combining everything we get that $2\ell(-|x_i\beta|)|x_i\beta| + \ell(-|x_i\beta|)^{2} \le 2 + 1^{2} \le 3$.

Next we note that the optimal value of $f(X\beta)$ is bounded from below:

Lemma B.2. (Munteanu et al., 2021) For all $\beta \in \mathbb{R}^d$ it holds that
$$n f(X\beta) \ge n f_1(X\beta) \ge \frac{n}{2\mu}(1 + \ln(\mu)) = \Omega\left(\frac{n}{\mu}(1 + \ln(\mu))\right).$$

We use the previous two lemmas to show that our sketch approximates the given parts of $f$ well enough with high probability. To this end, we set $g_1(t) = \ell(-|t|)$, $g_2(t) = 2\ell(-|t|)|t| + \ell(-|t|)^{2}$ and $g(t) = g_1(t) + \lambda g_2(t)$.

Lemma B.3. For any given $\beta \in \mathbb{R}^d$, with failure probability at most $2\exp(-m_1)$ the event $\mathcal{E}_0$ holds that
$$\left|\sum_{i=1}^{n'} w_i\, g(x'_i\beta) - \sum_{i=1}^{n} g(x_i\beta)\right| \le \varepsilon \cdot \max\left\{\sum_{i=1}^{n} g(x_i\beta), \frac{n}{2\mu}\right\} \le \varepsilon f(X\beta).$$

Proof of Lemma B.3. The total weight of all buckets in a level less than $h_m$ is at most $\sum_{h=1}^{h_{\max}} b^{-h} = b^{-1} \cdot \frac{1 - b^{-h_{\max}}}{1 - b^{-1}} \le \frac{2}{b} \le \frac{\varepsilon}{6\mu}$. Now let $k \in \{1, 2\}$. For $i \in [n]$, consider the random variable $X_i = g_k(z_i)$ if $z_i$ is at level $h_m$, and $X_i = 0$ otherwise. Then we have $E = E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} p_u\, g_k(z_i) = p_u \sum_{i=1}^{n} g_k(x_i\beta)$. Further we have $X_i \le 3$ by Lemma B.1. It holds that $E\left(\sum_{i=1}^{n} X_i^{2}\right) = \sum_{i=1}^{n} p_u\, g_k(z_i)^{2} \le p_u \sum_{i=1}^{n} 3 g_k(x_i\beta) = 3E$. We set $L = p_u \cdot \max\left\{\sum_{i=1}^{n} g_k(x_i\beta), \frac{n}{2\mu}\right\} \ge E$. By Assumption A.1 we have that $p_u \ge \frac{64\mu m_1}{\varepsilon^{2} n}$. Thus, using Bernstein's inequality we get that
$$P\left(\left|\sum_{i=1}^{n} X_i - E\right| \ge \frac{\varepsilon}{2} \cdot L\right) \le \exp\left(-\frac{\varepsilon^{2} L^{2}/8}{3E + E}\right) \le \exp\left(-\frac{\varepsilon^{2} L}{32}\right) \le \exp\left(-\frac{\varepsilon^{2} p_u n/\mu}{64}\right) \le \exp(-m_1).$$
Using the union bound for $k = 1$ and $k = 2$ yields that
$$P\left(\left|\sum_{i=1}^{n'} w_i\, g(x'_i\beta) - \sum_{i=1}^{n} g(x_i\beta)\right| > \varepsilon \cdot \max\left\{\sum_{i=1}^{n} g(x_i\beta), \frac{n}{2\mu}\right\}\right) \le 2\exp(-m_1).$$
By Lemma B.2 we have $f(X\beta) \ge \frac{n}{2\mu}$. It also holds that $f(X\beta) \ge \sum_{i=1}^{n} g(x_i\beta)$. We thus conclude that $\varepsilon \cdot \max\left\{\sum_{i=1}^{n} g(x_i\beta), \frac{n}{2\mu}\right\} \le \varepsilon f(X\beta)$.

C ESTIMATING THE LARGE PARTS $\|z\|_1$ AND $\|z^+\|_1$

Lemma C.1. It holds that $\sum_{q \in Q^*} \|W_q^+\|_1 \ge (1 - 2\varepsilon)\|z^+\|_1$.

Proof of Lemma C.1. First note that $\sum_{z_i < 2^{-q_m}} z_i \le n \cdot \frac{\varepsilon}{(\mu+1)n} = \varepsilon/(\mu + 1)$. Second note that $\sum_{q \le q_m,\, q \notin Q^*} \|W_q^+\|_1 \le q_m \cdot \frac{\varepsilon}{(\mu+1)q_m} \le \varepsilon/(\mu + 1)$. By the $\mu$-condition we have that $\|z^-\|_1 \le \mu\|z^+\|_1$ and thus we get that $1 = \|z^-\|_1 + \|z^+\|_1 \le \mu\|z^+\|_1 + \|z^+\|_1$. Consequently, $\|z^+\|_1 \ge \frac{1}{\mu+1}$ and $\sum_{q \in Q^*} \|W_q^+\|_1 \ge \|z^+\|_1 - \frac{2\varepsilon}{\mu+1} \ge (1 - 2\varepsilon)\|z^+\|_1$.

C.1 ANALYSIS FOR A SINGLE LEVEL

Fix $h \in [0, h_m]$. First consider the number of elements at a fixed level $h$. We can view it as a binomial random variable with parameters $n$ and $p_h$ since the probability for any row to appear at level $h$ is $p_h$. Since we fix $h$ in this subsection, we set $M = M_h = p_h n$, $p = p_h = \frac{M}{n}$ and $N = N_h$. We set $U \subset [n]$ to be the set of elements that are sampled at level $h$. We also set $\mu_z = \frac{\sum_{z_i < 0} |z_i|}{\sum_{z_i > 0} |z_i|} \le \mu$. This and the following subsection are dedicated to proving the existence of bounds $q_h(1)$, $q_h(2)$, $q_h(3)$ and $q_h(4)$ as described in the high level overview, Section 3. More precisely we show the following:

Lemma C.2. With probability at least $1 - \frac{\delta}{h_m}$ the weight classes $W_q$ for $q \ge q^{(M,N)}(4) := \log_2(\gamma_2^{-1}) := \log_2\left(\frac{2N\ln(N h_{\max}/\delta)}{p\varepsilon^{2}}\right)$ and $q \le q^{(M,N)}(1) := \log_2\left(\frac{\mu_z\delta}{p h_{\max}}\right)$ have zero contribution to $\sum_B G^+(B)$, i.e., for any bucket $B$ we have $\sum_{z_i \in B \setminus I_r} z_i \le 0$ where $I_r = \{i \in [n] \mid z_i \in W_q,\ q \in [q^{(M,N)}(1), q^{(M,N)}(4)]\}$.
Further, with failure probability at most $\exp(-\Omega(m_1))$ there exists, for each $\log_2\left(\frac{8 q_m \mu_z m_1}{\varepsilon^{3} p}\right) =: q^{(M,N)}(2) \le q \le q^{(M,N)}(3) := \log_2\left(\frac{N\varepsilon^{2}}{4p}\right)$, a set $W_q^*$ such that $\sum_{i \in W_q^*} G(B_i) \ge (1 - \varepsilon)^{2}\|W_q^+\|_1 \cdot \frac{M}{n}$.

It thus holds that
$$q^{(M,N)}(2) - q^{(M,N)}(1) = \log_2\left(\frac{8 q_m m_1 h_m}{\varepsilon^{3}\delta}\right),$$
$$q^{(M,N)}(3) - q^{(M,N)}(2) = \log_2\left(\frac{N\varepsilon^{5}}{32\, m_1 \mu q_m}\right) =: \log_2(b),$$
$$q^{(M,N)}(4) - q^{(M,N)}(3) = \log_2\left(\frac{8\ln(N h_m/\delta)}{\varepsilon^{4}}\right).$$
If $N = M$ then we set $q^{(M,N)}(3) = q^{(M,N)}(4) = \infty$. If $M = n$ then we set $q^{(M,N)}(1) = q^{(M,N)}(2) = 0$. We set $q_h(i) = q^{(M_h,N_h)}(i)$ for $i \in \{1, 2, 3, 4\}$ and $Q_h = [q_h(2), q_h(3)]$ to be the well-approximated weight classes, and $R_h = [q_h(1), q_h(4)]$ to be the relevant weight classes at level $h$. We further define the following threshold and set:
$$\gamma_1 := \frac{p}{3m_1}, \qquad Y_1 := \{i \in [n] \mid |z_i| \ge \gamma_1\}.$$
Here $Y_1$ is the 'set of large elements'. We set $\mathcal{B}_h$ to be the set of all buckets at level $h$. Recall that $m_1 \in \mathbb{R}$ is a lower bound on the negative logarithm of the failure probability, which we will need later when union bounding over all failure probabilities. Also recall that $G(B) = \sum_{i \in B} z_i$ is the sum of all rows in a bucket $B$.

The following lemma yields the inner bounds, i.e., bounds for $q_h(2)$ and $q_h(3)$, which are the weight class indices that are well represented by $U$. The first two items show that there are at most $\varepsilon N$ buckets at level $h$ that either contain a large element or have a large sum of small contributions. The third item shows that if $W_q$ has sufficiently many elements, then there exists a large subset $W_q^*$ where each element is in a bucket with no other large entry such that $\|W_q^*\|_1$ is close to $\|W_q^+\|_1 \cdot \frac{M}{n}$. The fourth item shows that $\sum_{z_i \in W_q^*} G(B_i)$ is close to $\|W_q^*\|_1$.

Lemma C.3. The following hold: 1) $|Y_1 \cap U| \le \varepsilon N/2$ with failure probability at most $\exp(-m_1)$; 2) Let $\mathcal{B} = \{B \in \mathcal{B}_h \mid \sum_{i \in B \setminus Y_1} |z_i| \le \frac{4p}{\varepsilon N}\}$.
Then $|\mathcal{B}| \ge (1 - \frac{\varepsilon}{2})N$ with failure probability at most $\exp(-m_1)$; 3) Assume that $q \ge \log_2\left(\frac{8 q_m \mu_z m_1}{\varepsilon^{3} p}\right)$ and that $W_q^+$ is important or $|W_q| \ge 8 m_1 \varepsilon^{-2} \cdot p^{-1}$. Then with failure probability at most $\exp(-m_1)$ there exists $W_q^* \subset W_q^+ \cap \mathcal{B}$ such that $\|W_q^*\|_1 \ge (1 - \varepsilon)^{2}\|W_q^+\|_1 \cdot p$ and each element of $W_q^*$ is in a bucket in $\mathcal{B}$ containing no other element of $Y_1$; 4) If $q \le \log_2\left(\frac{N\varepsilon^{2}}{4p}\right)$ and $W_q^*$ as in 3) exists, then with failure probability at most $\exp(-m_1)$ it holds that $\sum_{i \in W_q^*} G(B_i) \ge (1 - \varepsilon)\|W_q^*\|_1$.

Proof. 1) Note that $|Y_1| \le \gamma_1^{-1}$ since $\|z\|_1 = 1$ and that we can view $|Y_1 \cap U|$ as a binomial random variable with parameters $|Y_1|$ and $p = \frac{M}{n}$. Thus, the expected number of elements of $Y_1$ at level $h$ is bounded by $|Y_1| \cdot \frac{M}{n} \le \frac{p}{\gamma_1} = 3m_1 \le \frac{\varepsilon N}{4}$ since $N \ge 12 m_1$ (see Assumption A.1). Thus, we get by Lemma A.3 that
$$P\left(|Y_1 \cap U| \ge \frac{\varepsilon N}{2}\right) \le P\left(|Y_1 \cap U| - |Y_1| \cdot p \ge \frac{\varepsilon N}{4}\right) \le P\left(|Y_1 \cap U| - |Y_1| \cdot p \ge 3m_1\right) \le \exp(-3m_1/3) \le \exp(-m_1).$$

2) For $i \in T = [n] \setminus Y_1$ we set $X_i = |z_i|$ if $i \in U$ and $X_i = 0$ otherwise. Since $\sum_{i \in T} |z_i| \le \|z\|_1 = 1$ we have that $E\left(\sum_{i \in T} X_i\right) = p \cdot \sum_{i \in T} |z_i| \le p$. Since all 'large elements' are in $Y_1$ we have that $X_i < \gamma_1$ for all $i \in T$ and thus
$$E\left(\sum_{i \in T} X_i^{2}\right) = \sum_{i \in T} p|z_i|^{2} \le \sum_{i \in T} p\gamma_1|z_i| = p\gamma_1 \sum_{i \in T} |z_i| \le p\gamma_1.$$
Using Bernstein's inequality we get
$$P\left(\sum_{i \in T} X_i \ge 2p\right) \le \exp\left(-\frac{p^{2}/2}{p\gamma_1 + p\gamma_1/3}\right) \le \exp\left(-\frac{p}{3\gamma_1}\right) = \exp(-m_1).$$
This implies that $\sum_{i \in T} X_i \le 2p$ with failure probability at most $\exp(-m_1)$. Now if $\sum_{i \in T} X_i \le 2p$ then there can be at most $\frac{\varepsilon N}{2}$ buckets $B$ with $G(B \setminus Y_1) \ge \frac{4p}{\varepsilon N}$.

3) First note that if $W_q^+$ with $q \ge \log_2\left(\frac{8 q_m \mu_z m_1}{\varepsilon^{3} p}\right)$ is important then $2^{-q} \cdot |W_q^+| \ge \|W_q^+\|_1 \ge \frac{\varepsilon}{q_m \mu_z}$, which implies that $|W_q^+| \ge \frac{2^{q}\varepsilon}{q_m \mu_z} \ge 8 m_1 \varepsilon^{-2} \cdot p^{-1}$. Assume that all entries of $Y_1 \setminus W_q^+$ have been assigned and let $\mathcal{B}' \subset \mathcal{B}$ be the buckets of $\mathcal{B}$ with no elements from $Y_1 \setminus W_q^+$. By 1) and 2) there are at least $(1 - \varepsilon)N$ buckets in $\mathcal{B}'$.
For $z_i \in W_q^+$ consider the random variable that takes the value $Z_i = z_i$ if $i \in \bigcup_{B \in \mathcal{B}'} B$ and $Z_i = 0$ otherwise. Set $Z = \sum_{z_i \in W_q^+} Z_i$. We have $Z_i = z_i$ if element $i$ is sampled at level $h$ and sent to a bucket in $\mathcal{B}'$, which happens with probability at least $p \cdot \frac{(1-\varepsilon)N}{N} = (1 - \varepsilon)p$. We thus have for the expected value of $Z$ that
$$E(Z) \ge (1 - \varepsilon)p \cdot \|W_q^+\|_1 \ge (1 - \varepsilon)p \cdot 2^{-q-1} \cdot |W_q^+| \ge (1 - \varepsilon) \cdot 2^{-q-1} \cdot 8 m_1 \varepsilon^{-2} \ge 2^{-q} \cdot 3 m_1 \varepsilon^{-2}.$$
Further, the maximum value of any $Z_i$ is $2^{-q}$ and the probability that $Z_i = z_i$ is upper bounded by $p$. Consequently, the variance of $Z$ is bounded by
$$\sum_{z_i \in W_q^+} E(Z_i^{2}) \le \sum_{z_i \in W_q^+} p z_i^{2} \le 2^{-q} \sum_{z_i \in W_q^+} p z_i = 2^{-q} E(Z).$$
Using Bernstein's inequality we get that
$$P\left(Z < (1 - \varepsilon)^{2} p \cdot \|W_q^+\|_1\right) \le P\left(E(Z) - Z > \varepsilon E(Z)\right) \le \exp\left(-\frac{\varepsilon^{2} E(Z)^{2}/2}{2^{-q}E(Z) + 2^{-q}\varepsilon E(Z)/3}\right) \le \exp\left(-\frac{\varepsilon^{2} E(Z)}{3 \cdot 2^{-q}}\right) \le \exp(-m_1).$$
We set $W_q^* = \{z_i \in W_q^+ \mid Z_i = z_i\}$.

4) By 2) and 3) we have that any entry $z_i \in W_q^*$ is in a bucket $B$ with $\sum_{j \in B \setminus \{i\}} |z_j| \le \frac{4p}{\varepsilon N}$. Thus, we have for $z_i \ge \frac{4p}{\varepsilon^{2} N}$ that $\sum_{j \in B_i} z_j \ge z_i - \frac{4p}{\varepsilon N} \ge (1 - \varepsilon)z_i$. Now we conclude $\sum_{i \in W_q^*} G(B_i) \ge \sum_{i \in W_q^*} (1 - \varepsilon)z_i = (1 - \varepsilon)\|W_q^*\|_1$.

Note that if all buckets contain only a single element then we can remove the condition $q \le \log_2\left(\frac{N\varepsilon^{2}}{4p}\right)$. Hence, we can set $q^{(M,N)}(3) = q^{(M,N)}(4) = \infty$ if $N = M$ (respectively, $h = h_m$).

For the outer bounds, i.e., the borders of the interval of weight classes that can have a non-negligible contribution to $U$, we need the following parameters defining the set of small elements:
$$\gamma_2 := \frac{p\varepsilon^{2}}{3N\ln(N h_{\max}/\delta)}, \qquad Y_2 = \{i \in [n] \mid |z_i| \le \gamma_2\}.$$
We further set $E$ to be the expected value of an entry chosen uniformly at random from $Y_2$.

Lemma C.4. The following hold: 1) If $E \le -\varepsilon/n$, then for any bucket $B$ that contains only elements of $Y_2$, we have $G(B) = \sum_{i \in B} z_i \le 0$ with failure probability at most $\frac{\delta}{N h_{\max}}$. 2) $U$ contains no element $i$ with $z_i \ge \frac{p h_{\max}}{\delta}$ with failure probability at most $\frac{\mu_z\delta}{h_{\max}}$.

Proof.
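1) First consider a single bucket $B$ containing only elements of $Y_2$. For $i \in [n]$, let $X_i$ be a random variable that attains the value $X_i = z_i$ if $i \in B$ and $X_i = 0$ otherwise. The expected value of $G(B) = \sum_{i \in [n]} X_i$ is $E' := n \cdot \frac{p}{N} \cdot E \le -\frac{p\varepsilon}{N}$. Further, we have that
$$E\left(\sum_{i \in [n]} X_i^{2}\right) = \sum_{i \in Y_2} \frac{p}{N} \cdot z_i^{2} \le \gamma_2 \cdot \sum_{i \in Y_2} \frac{p}{N} \cdot |z_i| \le \frac{\gamma_2 p}{N}$$
since all $X_i$ are bounded by $\gamma_2$ by assumption. Thus, applying Bernstein's inequality yields
$$P(G(B) > 0) \le P\left(\sum_{i \in [n]} X_i - E' \ge |E'|\right) \le \exp\left(-\frac{|E'|^{2}/2}{\frac{\gamma_2 p}{N} + \gamma_2|E'|/3}\right) \le \exp\left(-\frac{\varepsilon \cdot p/N}{2\gamma_2\left(p/(N|E'|) + 1/3\right)}\right)$$
$$\le \exp\left(-\frac{\varepsilon \cdot p/N}{2\gamma_2(\varepsilon^{-1} + 1/3)}\right) \le \exp\left(-\frac{\varepsilon^{2} \cdot p/N}{3\gamma_2}\right) \le \exp\left(-\ln\left(\frac{N h_{\max}}{\delta}\right)\right) = \frac{\delta}{N h_{\max}}.$$

2) Recall that $\sum_{z_i > 0} z_i \le 1/\mu_z$. Thus, there are at most $\frac{n\delta}{\mu_z M h_{\max}}$ entries with $z_i \ge \frac{M h_{\max}}{n\delta}$. The expected number of those entries in $U$ is thus at most $\frac{n\delta}{\mu_z M h_{\max}} \cdot \frac{M}{n} \le \frac{\delta}{h_{\max}}$, which also upper bounds the probability of at least one entry with $z_i \ge \frac{M h_{\max}}{n\delta}$ being contained in $U$.

Putting both lemmas together we get all bounds $q_h(i)$ except $q_0(2)$, which will be handled in the next subsection.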
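The four thresholds of Lemma C.2 and the three gaps between them can be evaluated directly, which makes it easy to see that the outer margins do not depend on $p$ and depend on $N$ only through a logarithmic term, while the inner interval grows with $N$. The following is a minimal sketch under the formulas stated in Lemma C.2; the function name and argument order are illustrative, and $\mu_z$ is used in place of $\mu$ throughout.

```python
import math

def level_thresholds(p, N, eps, delta, m1, q_m, mu_z, h_max):
    """Return (q(1), q(2), q(3), q(4)) for one level, following Lemma C.2."""
    q1 = math.log2(mu_z * delta / (p * h_max))
    q2 = math.log2(8 * q_m * mu_z * m1 / (eps ** 3 * p))
    q3 = math.log2(N * eps ** 2 / (4 * p))
    q4 = math.log2(2 * N * math.log(N * h_max / delta) / (p * eps ** 2))
    return q1, q2, q3, q4
```

Evaluating the function for two different values of `p` with all other arguments fixed shifts all four thresholds by the same amount, so the interval lengths are unchanged; this is what allows consecutive levels to be chained via $q_{h-1}(3) = q_h(2)$.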

C.2 HEAVY HITTERS

In this subsection we will analyze the level containing all entries. Our goal is to show that we can indeed set $q_0(2) = 0$ in Lemma C.2. Let $U$ be as before and assume that $M = n$. Let $Q_H = \{q \in Q_0 \mid |W_q| \ge 8 m_1 \varepsilon^{-2}\}$ where $Q_0 = \{q \le \log_2\left(\frac{8 q_m \mu m_1}{\varepsilon^{3}}\right)\}$. We set $H = \bigcup_{q \in Q_H} W_q$ to be the class of heavy hitters. We let $u \in \mathbb{R}^n_{\ge 0}$ denote the vector whose coordinates $u_i$ are the $\ell_1$-leverage scores, i.e., $u_i = \max_{\beta \in \mathbb{R}^d} \frac{|x_i\beta|}{\sum_{j \in [n]} |x_j\beta|}$.

Lemma C.5. (Munteanu et al., 2021) If $u_i$ is the $k$-th largest coordinate of $u$, then for $z$ in the subspace spanned by the columns of $X$ it holds that $|z_i| \le \frac{d}{k}\|z\|_1$. Further, it holds that $\sum_{i=1}^{n} u_i \le d$.

Lemma C.6. Let $Y_3 = \{i \mid u_i \ge \gamma_3\}$ and $N_1 = |Y_3|$ where $\gamma_3 = \frac{\varepsilon^{3}}{8 q_m \mu m_1}$. Further, for $j \in Y_3$ let $C_j = \{B \mid \sum_{i \in B \setminus \{j\}} u_i \ge \varepsilon\gamma_3\}$. Then for all $j \in Y_3$ we have that $|C_j|$ is bounded by $N_2 = d(\varepsilon\gamma_3)^{-1}$. Further if $N \ge N_1 N_2 \kappa^{-1}$ for $\kappa \in (0, 1/2)$, then with probability $1 - 2\kappa$, each member $j$ of $Y_3$ is in a bucket in $\mathcal{B}_0 \setminus C_j$.

Proof. Since by Lemma C.5 it holds that $\sum_{i=1}^{n} u_i \le d$, there can be at most $d\,\frac{1}{\varepsilon\gamma_3} = N_2$ buckets $B$ with $\sum_{i \in B} u_i \ge \varepsilon\gamma_3$. In particular this implies that $|C_j| \le N_2$. The probability of any element $j \in Y_3$ getting assigned to a bucket in $C_j$ is at most $\frac{N_2}{N} \le \frac{\kappa}{N_1}$. Using the union bound, the probability that some element $j \in Y_3$ is assigned to a bucket in $C_j$ is at most $\kappa$.

We apply Lemma C.6 with $\kappa = \delta$. We denote by $\mathcal{E}_1$ the event that all coordinates $j \in Y_3$ are in a bucket in $\mathcal{B}_0 \setminus C_j$. By Lemma C.6, $\mathcal{E}_1$ holds with probability at least $1 - \delta$ for an appropriate
$$N = N'_0 = N_1 N_2 \delta^{-1} = \frac{64\, d^{2} q_m^{2} \mu^{2} m_1^{2}}{\delta\varepsilon^{7}} = O\left(\frac{d^{2} q_m^{2} \mu^{2} m_1^{2}}{\delta\varepsilon^{7}}\right).$$
For any entry $z_i \in H$ we have $z_i \ge \gamma_3$ and thus by Lemma C.5, we have $i \in Y_3$. It remains to show that the remaining entries in the buckets containing a heavy hitter only have a small contribution.

Lemma C.7. Assume $\mathcal{E}_1$ holds. Then for any $z_i \in H$ we have $G(B_i) \ge (1 - \varepsilon)z_i$.

Proof. Let $z_i \in H$. Note that by Lemma C.5, $i \in Y_3$. By $\mathcal{E}_1$ we have that $\sum_{j \in B_i \setminus \{i\}} u_j \le \varepsilon\gamma_3 \le \varepsilon z_i$. We conclude that
$$G(B_i) \ge z_i - \sum_{j \in B_i \setminus \{i\}} |z_j| \ge z_i - \sum_{j \in B_i \setminus \{i\}} u_j \ge z_i - \varepsilon z_i = (1 - \varepsilon)z_i.$$
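The closed form of $N'_0$ can be traced step by step. The following worked computation is a sketch that assumes the bound $N_1 = |Y_3| \le d/\gamma_3$, which follows from $\sum_{i} u_i \le d$ (Lemma C.5) since every $i \in Y_3$ has $u_i \ge \gamma_3$; this intermediate step is not spelled out in the text above.

```latex
\begin{align*}
N_1 = |Y_3| &\le \frac{d}{\gamma_3} = \frac{8\, d\, q_m \mu m_1}{\varepsilon^{3}}, \\
N_2 = \frac{d}{\varepsilon\gamma_3} &= \frac{8\, d\, q_m \mu m_1}{\varepsilon^{4}}, \\
\frac{N_1 N_2}{\delta} &\le \frac{64\, d^{2} q_m^{2} \mu^{2} m_1^{2}}{\delta\,\varepsilon^{7}} = N'_0 .
\end{align*}
```

The exponent $7$ of $\varepsilon$ thus comes from $\varepsilon^{3} \cdot \varepsilon^{4}$, and the quadratic dependence on $d$, $q_m$, $\mu$ and $m_1$ from multiplying the two counts.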

C.3 CONTRACTION BOUNDS FOR A SINGLE POINT

We set $U_h$ to be the rows $z_i$ sampled at level $h$. Combining previous subsections we get the following lemma:

Lemma C.8. Assume that $\mathcal{E}_1$ holds. Denote by $z'_i$ the $i$-th row of $SX\beta$ for $i \in [n']$. Then with failure probability at most $(2h_m + 2q_m)e^{-m_1}$ it holds that
$$\sum_{i \in [n'],\, z'_i \ge 0} w_i z'_i \ge (1 - 4\varepsilon)\|(X\beta)^+\|_1.$$

Proof. By Lemma C.3 and Lemma C.7 we have that for each important weight class $W_q^+$ there exists a subset $W_q^* \subseteq U_h$ with $\sum_{i \in W_q^*} G(B_i) \ge (1 - \varepsilon)^{2}\|W_q^+\|_1 p_h$. For $q \in Q_H$ we can set $W_q^* = W_q^+$. Then using Lemma C.1 we get
$$\sum_{i \in [n'],\, z'_i \ge 0} w_i z'_i \ge \sum_{q \in Q^*} p_h^{-1} \sum_{i \in W_q^*} G(B_i) \ge \sum_{q \in Q^*} (1 - \varepsilon)^{2}\|W_q^+\|_1 \ge (1 - 2\varepsilon)(1 - \varepsilon)^{2}\|(X\beta)^+\|_1 \ge (1 - 4\varepsilon)\|(X\beta)^+\|_1.$$

C.4 DILATION BOUNDS

Given $\beta \in \mathbb{R}^d$ and $z = X\beta$ set $Z_0 = Z_0(\beta) \subset Z = \{z_1, \ldots, z_n\}$ to be the set of the $(1 - \varepsilon)n$ largest entries ordered by absolute value. In other words, we remove the $\varepsilon n$ smallest entries. Similarly we set $Z_1 = Z_1(\beta) \subset Z$ to be the set of the $(1 - 2\varepsilon)n$ largest entries. Again we assume that $\|z\|_1 = 1$. Our next goal is to show that if $f(z)$ is small then $\sum_{z_i \in Z_0} z_i$ remains negative even if we remove the smallest entries. Here small means negative with large absolute value. This shows that the assumption of Lemma C.4 1) is fulfilled.

Lemma C.9. If $f(X\beta) < (1 - 2\varepsilon)f(0)$ then it holds that
$$\sum_{z_i \in Z_0,\, z_i \le 0} |z_i| \ge (1 + \varepsilon)\sum_{z_i \ge 0} |z_i|.$$

Proof. Let $X_1$ denote the matrix $X$ where the rows not corresponding to an entry of $Z_1$ are removed. We denote by $\tilde{f}$ the function $f$ restricted to $|Z_1|$ entries, i.e., $\tilde{f}(X\beta) = \sum_{x_i \in X_1} \ell(x_i\beta)$. Since $\ell$ is always larger than 0, removing $2\varepsilon n$ entries can only reduce $f$. We thus have that $\tilde{f}(0) = (1 - 2\varepsilon)f(0) \ge f(X\beta) = f(Z) \ge \tilde{f}(Z_1)$. Now consider the function $\phi(r) = \tilde{f}(r \cdot X\beta)$. Note that the derivative of $\phi$ at zero is given by $\phi'(0) = \sum_{x_i \in X_1} \frac{e^{0}}{e^{0}+1} \cdot x_i\beta = \frac{1}{2} \cdot \sum_{z_i \in Z_1} z_i$. Since $\tilde{f}$ is convex, $\phi$ is also convex. In particular this means that $\tilde{f}(X\beta) < \tilde{f}(0)$ implies $\phi'(0) < 0$.
Thus it must hold that $\sum_{z_i \in Z_1} z_i < 0$, or equivalently, $\sum_{z_i \in Z_1,\, z_i < 0} |z_i| > \sum_{z_i \in Z_1,\, z_i > 0} |z_i|$. Since all entries in $Z_0 \setminus Z_1$ are, in absolute value, less than or equal to any entry in $Z_1$, we have that
$$\sum_{z_i \in Z_0,\, z_i < 0} |z_i| \ge \frac{1}{1 - \varepsilon}\sum_{z_i \in Z_1,\, z_i < 0} |z_i| \ge (1 + \varepsilon)\sum_{z_i > 0} |z_i|.$$

The following lemma gives us an upper bound on the expected value of $G^+(Z)$.

Lemma C.10. If for all $i \le h_m - 1$ it holds that $q^{(M_i,N_i)}(4) < q^{(M_{i+k},N_{i+k})}(1)$ and $N_0 \ge N'_0$, then the expected contribution of any weight class $W_q^+$ is at most $k \cdot \|W_q^+\|_1$.

Proof. Consider any weight class $W_q^+$. For any level $h$ it follows by Lemma C.2 that if $q \notin [q^{(M_h,N_h)}(1), q^{(M_h,N_h)}(4)]$ then $W_q^+$ has zero contribution at level $h$, i.e., either there are no elements of $W_q^+$ at level $h$, or we have $W_q^+ \subset Y_2$ and for any bucket $B$ of level $h$ it holds that $\sum_{i \in Y_2 \cap B} z_i \le 0$. At any level the expected contribution of $W_q^+$ is bounded by $p_h^{-1} \cdot \sum_{i \in W_q^+} p_h z_i = \|W_q^+\|_1$. This upper bound would be tight if all entries of $Z$ were positive. Hence, the expected contribution of $W_q^+$ is upper bounded by $\|W_q^+\|_1$ times the number of levels $h$ with $q \in [q^{(M_h,N_h)}(1), q^{(M_h,N_h)}(4)]$. Since $q^{(M_h,N_h)}(1)$ and $q^{(M_h,N_h)}(4)$ are monotonically increasing in $h$, it follows that if $q^{(M_i,N_i)}(4) < q^{(M_{i+k},N_{i+k})}(1)$ then any $q$ can be contained in at most $k$ intervals of the form $[q^{(M_h,N_h)}(1), q^{(M_h,N_h)}(4)]$, concluding the lemma. See Figure 2 for an illustration.

Figure 2: The thresholds $q_h(1), q_h(2), q_h(3), q_h(4)$ for the levels $h = 0, 1, 2, \ldots, h_m$, with $q_0(2) = 0$ and $q_{h_m}(3) = q_m$. In the depicted example, $W_q$ is relevant at levels 0, 1, 2 and well represented at level 1. If the green and the blue block do not touch, i.e., if $q_{h-1}(4) < q_{h+1}(1)$, then the expected contribution of any weight class is at most twice its original contribution.

Lemma C.10 can be used to show that the expected contribution of any weight class to $G^+(Z)$ is at most twice its total weight:

Lemma C.11.
If we choose $N_i = N := \max\left\{N'_0, \frac{2048\, m_1^{2} \mu \ln(N h_m/\delta)\, q_m^{2} h_m}{\varepsilon^{12}\delta}\right\}$ for all $i \in [h_m]$, and $M_i$ solving the equation $q^{(M_{i-1},N)}(3) = q^{(M_i,N)}(2)$, then the expected contribution of any weight class $W_q^+$ is at most $2\|W_q^+\|_1$.

Proof. We set $q_i(j) = q^{(M_i,N_i)}(j)$. We first show that $q_{i+2}(1) - q_i(4)$ can be expressed using the terms $q_{i+1}(3) - q_{i+1}(2)$, $(q_{i+2}(2) - q_{i+2}(1))$ and $(q_i(4) - q_i(3))$, which are the same for each $i$ if the number of buckets at each level is identical, i.e., $N_j = N_i$ for all $i, j \le h_m$. Observe that
$$\begin{aligned}
q_{i+2}(1) - q_i(4) &= q_{i+2}(2) + q_{i+2}(1) - q_{i+2}(2) - (q_i(3) + q_i(4) - q_i(3)) \\
&= q_{i+2}(2) - q_i(3) - (q_{i+2}(2) - q_{i+2}(1)) - (q_i(4) - q_i(3)) \\
&= q_{i+1}(3) - q_{i+1}(2) - (q_{i+2}(2) - q_{i+2}(1)) - (q_i(4) - q_i(3)).
\end{aligned}$$
Figure 2 illustrates those three terms. Using Lemma C.2 we can bound the sum of the two subtracted terms by
$$(q_{i+2}(2) - q_{i+2}(1)) + (q_i(4) - q_i(3)) = \log_2\left(\frac{8 q_m m_1 h_m}{\varepsilon^{3}\delta}\right) + \log_2\left(\frac{8\ln(N h_m/\delta)}{\varepsilon^{4}}\right) = \log_2\left(\frac{64\, m_1 \ln(N h_m/\delta)\, q_m h_m}{\varepsilon^{7}\delta}\right).$$
By Lemma C.2 we have that $q_{i+1}(3) - q_{i+1}(2) \ge \log_2\left(\frac{N\varepsilon^{5}}{32\, m_1 \mu q_m}\right)$. Thus, combining both equations we get that
$$q_{i+2}(1) - q_i(4) \ge \log_2\left(\frac{N\varepsilon^{5}}{32\, m_1 \mu q_m}\right) - \log_2\left(\frac{64\, m_1 \ln(N h_m/\delta)\, q_m h_m}{\varepsilon^{7}\delta}\right) = \log_2\left(\frac{N\varepsilon^{12}\delta}{2048\, m_1^{2} \mu \ln(N h_m/\delta)\, q_m^{2} h_m}\right).$$
If $N \ge \frac{2048\, m_1^{2} \mu \ln(N h_m/\delta)\, q_m^{2} h_m}{\varepsilon^{12}\delta}$ then we have $q_{i+2}(1) - q_i(4) \ge 0$ and thus by Lemma C.10, the expected contribution of any weight class $W_q^+$ is at most $2\|W_q^+\|_1$.

If $N < \frac{2048\, m_1^{2} \mu \ln(N h_m/\delta)\, q_m^{2} h_m}{\varepsilon^{12}\delta}$ we have the following adaptation of the lemma:

Lemma C.12. If for some $k \in \mathbb{N}$ we choose $N_i = N \ge \frac{32\, m_1 \mu q_m}{\varepsilon^{5}} \cdot \left(\frac{64\, m_1 \ln(N h_m/\delta)\, q_m h_m}{\varepsilon^{7}\delta}\right)^{1/(k-1)}$ for all $i \in [h_m]$, and $M_i$ solving the equation $q^{(M_{i-1},N)}(3) = q^{(M_i,N)}(2)$, then the expected contribution of any weight class $W_q^+$ is at most $k\|W_q^+\|_1$.

Proof. We generalize the proof of Lemma C.11.
We can substitute $q_{i+k}(1) - q_i(4)$ as follows:
$$\begin{aligned}
q_{i+k}(1) - q_i(4) &= q_{i+k}(1) - q_{i+k}(2) + q_i(3) - q_i(4) + q_{i+k}(2) - q_i(3) \\
&= q_{i+k}(2) - q_i(3) - (q_{i+k}(2) - q_{i+k}(1)) - (q_i(4) - q_i(3)) \\
&= q_{i+k-1}(3) - q_{i+1}(2) - (q_{i+k}(2) - q_{i+k}(1)) - (q_i(4) - q_i(3)) \\
&= \sum_{j=1}^{k-1}\left(q_{i+j}(3) - q_{i+j}(2)\right) - (q_{i+k}(2) - q_{i+k}(1)) - (q_i(4) - q_i(3)).
\end{aligned}$$
The difference to the proof of Lemma C.11 is the telescoping sum. We have that
$$\sum_{j=1}^{k-1}\left(q_{i+j}(3) - q_{i+j}(2)\right) = (k - 1) \cdot \log_2\left(\frac{N\varepsilon^{5}}{32\, m_1 \mu q_m}\right) = \log_2\left(\left(\frac{N\varepsilon^{5}}{32\, m_1 \mu q_m}\right)^{k-1}\right).$$
Thus if $N \ge \frac{32\, m_1 \mu q_m}{\varepsilon^{5}} \cdot \left(\frac{64\, m_1 \ln(N h_m/\delta)\, q_m h_m}{\varepsilon^{7}\delta}\right)^{1/(k-1)}$ we have that $\sum_{j=1}^{k-1}\left(q_{i+j}(3) - q_{i+j}(2)\right) \ge \log_2\left(\frac{64\, m_1 \ln(N h_m/\delta)\, q_m h_m}{\varepsilon^{7}\delta}\right)$. Further note that $(q_{i+k}(2) - q_{i+k}(1)) + (q_i(4) - q_i(3)) = \log_2\left(\frac{64\, m_1 \ln(N h_m/\delta)\, q_m h_m}{\varepsilon^{7}\delta}\right)$ as before. We conclude that $q_{i+k}(1) - q_i(4) > 0$. Consequently, applying Lemma C.10 finishes the proof.

Next we want to show how we can reduce the expected contribution of all weight classes below $2\|W_q^+\|_1$. To this end we first increase the number of buckets at each level so as to get
$$\log_2\left(\frac{N\varepsilon^{5}}{32\, m_1 \mu q_m}\right) \ge k\log_2\left(\frac{64\, m_1 \ln(N h_m/\delta)\, q_m h_m}{\varepsilon^{7}\delta}\right).$$
Note that the expected contribution of any important weight class $W_q^+$ is at least $\|W_q^+\|_1$. Moreover, the above choice ensures that all but a $k$-th fraction of weight classes have an expected contribution of exactly $\|W_q^+\|_1$, and only the remaining $k$-th fraction has a larger expected contribution that crucially is still bounded by $2\|W_q^+\|_1$. Then the last step is to add a random shift so that the probability of each weight class $W_q^+$ having an expected contribution of $2\|W_q^+\|_1$ is at most $\frac{1}{k}$. To simplify notation we set $N'_1 = \frac{32\, m_1 \mu q_m}{\varepsilon^{5}}$ and $N'_2 = \frac{64\, m_1 \ln(n)\, q_m h_m}{\varepsilon^{7}\delta}$ and assume that $n \ge N k h_m/\delta$.

Lemma C.13. Let $\gamma = \frac{1}{k} < 1$ for some $k \in \mathbb{N}$. Assume that $N_0$ is chosen uniformly at random from $N^{(1)}, \ldots, N^{(1/\gamma)}$ where $N^{(i)} = N'_0 \cdot N_2'^{\,i}$. Further let $N_i = N = N'_1 \cdot N_2'^{\,k+1}$ for any $i > 0$.
Then the expected contribution of any weight class $W_q^+$ is at most $(1 + \gamma)\|W_q^+\|_1$.

Proof. First note that
$$\log_2\left(\frac{N\varepsilon^{5}}{32\, m_1 \mu q_m}\right) - k\log_2\left(\frac{64\, m_1 \ln(n)\, q_m h_m}{\varepsilon^{7}\delta}\right) = \log_2(N/N'_1) - \log_2(N_2'^{\,k}) \ge 0.$$
This shows that the ratio of weight classes that are relevant on two levels to the weight classes that are relevant on only one level is $1 : k$. By choosing $N_0$ at random we introduce a shift by $i\log_2(N'_2)$, which is the maximal length of a block $[q_{i-1}(1), q_i(4)]$. Hence, for each $q \in \mathbb{N}$ there can be only one $i$ such that $q$ is relevant in two levels. This implies that the expected contribution of $W_q^+$ is at most
$$\frac{k-1}{k} \cdot \|W_q^+\|_1 + \frac{1}{k} \cdot 2\|W_q^+\|_1 = \left(1 + \frac{1}{k}\right)\|W_q^+\|_1.$$
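The counting argument behind Lemmas C.10 to C.12 can be visualized with a toy model. The sketch below assumes an idealized setting in which the inner intervals $[q_h(2), q_h(3)]$ all have the same length and tile the $q$-axis, and the outer intervals extend them by fixed left and right margins; the integer lengths and function names are illustrative and not taken from the paper.

```python
def outer_intervals(num_levels, inner_len, left_margin, right_margin):
    """Outer intervals [q_h(1), q_h(4)] when the inner intervals tile the axis."""
    intervals = []
    for h in range(num_levels):
        q2 = h * inner_len            # q_h(2), which equals q_{h-1}(3)
        q3 = (h + 1) * inner_len      # q_h(3)
        intervals.append((q2 - left_margin, q3 + right_margin))
    return intervals

def max_relevant_levels(intervals, grid):
    """Largest number of outer intervals that contain a single index q."""
    return max(sum(lo <= q <= hi for lo, hi in intervals) for q in grid)
```

With an inner length of 10 and margins 3 and 4, the inner length exceeds the sum of the margins, so $q_i(4) < q_{i+2}(1)$ and no index lies in more than two outer intervals. Shrinking the inner length to 5 violates this condition, and some indices become relevant at three levels, matching the factor $k$ in Lemma C.10.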

C.5 NET ARGUMENT

To get a weak weighted sketch we need the contraction bounds not just for a single solution but for all $\beta \in \mathbb{R}^d$. For now we ignore the variance regularization and focus only on $f_1$, i.e., on plain logistic regression. We first show that if the distance of two vectors $v, v' \in \mathbb{R}^n$ is small then $|f_1(v) - f_1(v')|$ is also small.

Lemma C.14. For any $v, v' \in \mathbb{R}^n$ with $\|v - v'\|_1 \le \varepsilon$ it holds that $|f_1(v) - f_1(v')| \le \varepsilon$.

Proof. Since $\ell'(v) = \frac{e^{v}}{e^{v}+1} \le 1$ we get that
$$|f_1(v) - f_1(v')| \le \sum_{i=1}^{n} |\ell(v_i) - \ell(v'_i)| \le \sum_{i=1}^{n} |v_i - v'_i| = \|v - v'\|_1,$$
which proves the lemma.

Lemma C.15. Assume that for $\beta \in \mathbb{R}^d$ it holds that $|f_1(X'\beta) - f_1(X\beta)| \le \varepsilon$. Then for any $\beta' \in \mathbb{R}^d$ with $\|X\beta - X\beta'\|_1 \le \varepsilon/(b^{h_m} h_m)$ it holds that $|f_1(X\beta') - f_1(X'\beta')| \le 3\varepsilon$.

Proof. It holds that $\|X'(\beta - \beta')\|_1 = \|SX(\beta - \beta')\|_1 \le b^{h_m} h_m \|X(\beta - \beta')\|_1 \le \varepsilon$ since for each $i \in [n]$ there are at most $h_m$ rows $j$ such that $S_{ji} \ne 0$ and each entry of $S$ is bounded by $b^{h_m}$. Thus, using the triangle inequality and applying Lemma C.14 yields
$$|f_1(X\beta') - f_1(X'\beta')| \le |f_1(X'\beta') - f_1(X'\beta)| + |f_1(X'\beta) - f_1(X\beta)| + |f_1(X\beta) - f_1(X\beta')| \le \varepsilon + \varepsilon + \varepsilon \le 3\varepsilon.$$

Lemma C.16. There exists a net $\mathcal{N} \subset \mathbb{R}^d$ of size $|\mathcal{N}| = \exp(O(d\ln(n)))$ such that for any point $y \in \mathbb{R}^d$ with $\|Xy\|_1 \le n\mu$ there exists a point $y' \in \mathcal{N}$ such that $\|Xy' - Xy\|_1 \le \frac{\varepsilon}{\mu b^{h_{\max}} h_m}$.

Proof. We set $\mathcal{N} = \left\{\beta = v \cdot \frac{\varepsilon}{d\, b^{h_m} h_m} \;\middle|\; v \in \mathbb{Z}^d \text{ with } \|v\|_\infty \le \frac{d n \mu\, b^{h_{\max}} h_m}{\varepsilon}\right\}$. Then for any $y \in \mathbb{R}^d$ with $\|Xy\|_1 \le n\mu$ the point $Xy' = \left\lfloor \frac{d\, b^{h_m} h_m}{\varepsilon} \cdot Xy \right\rfloor \cdot \frac{\varepsilon}{d\, b^{h_m} h_m}$ is in $\mathcal{N}$ and it holds that $\|Xy - Xy'\|_1 \le d \cdot \frac{\varepsilon}{d\, b^{h_m} h_m} = \frac{\varepsilon}{b^{h_m} h_m}$. Further we have $|\mathcal{N}| \le \left(\frac{d^{2} n \mu\, b^{2h_m} h_m^{2}}{\varepsilon^{2}}\right)^{d} = \exp(O(d\ln(n)))$.

Combining Lemma C.15 and Lemma C.16 we get:

Lemma C.17. There exists a net $\mathcal{N} \subset \mathbb{R}^d$ with $|\mathcal{N}| = \exp(O(d\ln(n)))$ such that if $|f_1(X'\beta) - f_1(X\beta)| \le \varepsilon$ holds for all $\beta \in \mathcal{N}$, then for any $\beta' \in \mathbb{R}^d$ with $\|X\beta'\|_1 \le n\mu$ it holds that $|f_1(X'\beta') - f_1(X\beta')| \le 3\varepsilon$.
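The Lipschitz property of Lemma C.14, on which the whole net argument rests, is easy to check numerically. The following minimal sketch assumes that $f_1$ is evaluated as the plain (unnormalized) sum of logistic losses; with a $1/n$ normalization the bound only becomes stronger. The function names are illustrative.

```python
import math

def sum_logistic(v):
    """Sum of logistic losses ln(1 + e^{v_i}), computed in a stable way."""
    return sum(max(x, 0.0) + math.log1p(math.exp(-abs(x))) for x in v)

def lipschitz_gap(v, w):
    """Return (|f1(v) - f1(w)|, ||v - w||_1) so the two can be compared."""
    loss_gap = abs(sum_logistic(v) - sum_logistic(w))
    l1_dist = sum(abs(a - b) for a, b in zip(v, w))
    return loss_gap, l1_dist
```

For any pair of vectors of equal length, the first returned value should not exceed the second, because the derivative of $\ell$ is bounded by 1 in absolute value.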

C.6 CONSTANT FACTOR APPROXIMATION CHANGES

To prove the first part of Theorem 2 we need one more tweak in level 0. Since we only aim to achieve a constant factor approximation with constant probability we can assume that ε and δ are constant.

Heavy hitters -alternative version

There is another way of handling heavy hitters. Using it we can reduce the sketch size at the cost of running time. The idea is that each row gets sampled multiple times. More precisely, we replace level 0 by the following sketch: at level 0 we map each element to $s = 8 m_1 q_m \mu/\varepsilon^{2}$ rows. Technically we are getting rid of heavy hitters this way. To compensate for the fact that each element appears multiple times, we set the weight of buckets of level 0 to $w_0 = 1/s$.
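The modified level 0 can be illustrated on a single vector $z = X\beta$; by linearity, the same bucket assignment applied to the rows of $X$ gives the sketched matrix. The following is a minimal sketch, not the authors' implementation: it assumes that the $s$ target buckets of each element are chosen uniformly at random without repetition, and the function names, the seeding, and the use of Python's `random` module are illustrative choices.

```python
import random

def level0_sketch(z, num_buckets, s, seed=0):
    """Add every coordinate of z to s distinct random buckets.

    Returns the list of bucket sums G(B) and the common bucket weight 1/s.
    Assumption (illustrative): the s buckets per element are distinct.
    """
    rng = random.Random(seed)
    buckets = [0.0] * num_buckets
    for value in z:
        for b in rng.sample(range(num_buckets), s):
            buckets[b] += value
    return buckets, 1.0 / s

def positive_part_estimate(buckets, weight):
    """Weighted sum of the positive bucket totals, i.e. weight * sum_B G^+(B)."""
    return weight * sum(max(g, 0.0) for g in buckets)
```

When all entries of `z` are nonnegative, the estimate equals $\|z\|_1$ exactly, since every entry is counted $s$ times with weight $1/s$. With mixed signs, cancellations inside buckets can only occur between entries that collide, which is the effect the contraction and dilation bounds control.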

C.7 PROOF OF THEOREM 2

We are now ready to prove Theorem 2:

Proof of Theorem 2. If $\beta = 0$ is a $(1 - 2\varepsilon)$-approximation, then we get the dilation bounds for free since $f_{1w}(X'\beta) = \ln(2) = f_1(X\beta)$. Otherwise let $\beta^*$ be the minimizer of $f_1(X\beta)$. Note that $\beta^*$ satisfies the assumption of Lemma C.9.

1) We fix constants $\varepsilon = 1/8$ and $\delta = 1/8$. We use the alternative approach for handling heavy hitters and define $M_i$ and $N_i$ as in Lemma C.12 for some constant $k = 1 + \frac{1}{c}$ and set $h_m = \min\{i \mid M_i \le N\}$. By Lemma C.12 the expected contribution of any weight class is at most $k\|W_q^+\|_1$. Thus using Markov's inequality we can bound $f_{1w}(SX\beta^*) \le a k f_1(X\beta^*)$ with probability at least $1 - \frac{1}{a}$ for any $a \in \mathbb{N}$. In other words, the approximation factor is constant with constant probability. By our choice of $M_i$ and $N_i$, the contraction bounds hold for any $X\beta$ with failure probability at most $(2h_m + 2q_m + 2)e^{-m_1}$ by combining Lemma C.8 and Lemma B.3. Setting $m_1 = O(d\ln(n))$ and using Lemma C.17 we get that the contraction bounds hold for all $\beta \in \mathbb{R}^d$ with $\|X\beta\|_1 \le n\mu$. We note that the contraction bounds can be extended to any $\beta \in \mathbb{R}^d$ since $f_1(X\beta) \approx \|X\beta\|_1$ if $\|X\beta\|_1 > n\mu$. We refer to (Munteanu et al., 2021) for details. Further note that $q_i(2) < q_i(3)$, and thus $h_m \le \log_2(2^{q_m}) = O(\ln(n))$. The number of buckets at each level is
$$N = \frac{32\, m_1 \mu q_m}{(1/8)^{5}} \cdot \left(\frac{64\, m_1 \ln(N h_m/\delta)\, q_m h_m}{(1/8)^{7}}\right)^{c}.$$
We specify the number $r$ of rows of $SX$, which is $r = h_m N$. Since $h_m, q_m = O(\ln(n))$ and $m_1 = O(d\ln(n))$ we get that $r = O(\mu d^{1+c}\ln(n)^{2+4c})$. The running time of our algorithm is $O(\mu d\ln(n)\,\mathrm{nnz}(X))$ since each row $x_i$ gets assigned to $O(\mu d\ln(n))$ buckets.

[...] and $T = O(\mathrm{nnz}(X))$ we can get an approximation factor of $\alpha = 1 + (1 + \varepsilon)a$ and failure probability of $P = \delta + \frac{1}{a}$. There are only a few differences compared to the proof of the first part: instead of Lemma C.12 we use Lemma C.11. Hence we need the number of buckets to be
$$N = \max\left\{N'_0, \frac{2048\, m_1^{2} \mu^{2} \ln(N h_m/\delta)\, q_m^{2} h_m}{\varepsilon^{12}\delta}\right\} = \frac{2048\, m_1^{2} \mu^{2} \ln(N h_m/\delta)\, q_m^{2} h_m}{\varepsilon^{12}\delta}.$$
Consequently we have that r = h m N = O( µ 2 d 4 ln(n) 7 ε 12 δ ). Since every row gets assigned to O(1) buckets the running time is O(nnz(X)). Now assume that the contraction bound holds for β * . Then Y = f 1w (SXβ * ) -(1 -ε)f 1 (Xβ * ) is a positive random variable with expected value at most (1 + ε)f 1 (Xβ * ), and thus using Markov's inequality gives us that Y > a(1 + ε)f 1 (Xβ * ) holds with probability at most 1 a . Hence it follows that f 1w (SXβ * ) ≤ f 1 (Xβ * ) + a(1 + ε)f 1 (Xβ * ) with failure probability at most 1 a . 2) The proof is again similar to 2'). The only difference is that we use Lemma C.13 instead of Lemma C.11. Hence the number of buckets at each level is bounded by N = max{N ′ 0 , N ′ 1 • N ′1+ε -1 2 }. Thus r = h m N = O( d 2 hmq 2 m µ 2 m 2 1 δε 7 + 32dµ ln(n) 2 ε 5 • ( 64d ln(n) 4 ε 7 δ ) 1+ε -1 ). D APPROXIMATING ∥Xβ -Y ∥ 1 The sketching algorithm is the same as before and also the analysis is very similar to the previous part. We start with a fixed point z = (X, -Y )β ′ , where β ′ = (β, 1) ∈ R d and analyze Sz. Again we assume that ∥z∥ 1 = 1. Instead of weight classes W + q we use weight classes W q = {i ∈ [n] | |z i | ∈ (2 -q-1 , 2 -q ]}. Since we are only dealing with absolute values, which are symmetric, we no longer need to parameterize by µ. We can continue to use the same definitions for q h (1), q h (2) and q h (3) when setting µ in those bounds to be 1. We will only slightly change q h (3) since we will need another trick to prove the second outer bound q ′ h (4).
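To make the weight classes $W_q$ concrete, the following is a minimal sketch that groups the coordinates of a vector into dyadic classes after normalizing by its $\ell_1$ norm. The function name `weight_classes` and the dictionary output format are illustrative choices and not part of the paper; zero entries are skipped since they belong to no class.

```python
import math

def weight_classes(z):
    """Group indices of z into dyadic weight classes.

    Index i is placed in class q when |z_i| / ||z||_1 lies in
    (2^(-q-1), 2^(-q)]. Zero entries are skipped.
    Returns a dict mapping q to the list of indices in W_q.
    """
    norm1 = sum(abs(v) for v in z)
    classes = {}
    if norm1 == 0:
        return classes
    for i, v in enumerate(z):
        a = abs(v) / norm1
        if a == 0:
            continue
        # frexp gives a = m * 2^e with 0.5 <= m < 1, avoiding log rounding.
        m, e = math.frexp(a)
        # a == 2^(e-1) exactly when m == 0.5; the interval is closed on the right.
        q = 1 - e if m == 0.5 else -e
        classes.setdefault(q, []).append(i)
    return classes
```

For example, `weight_classes([4, -2, 1, 1, 0])` places index 0 in class 1, index 1 in class 2, and indices 2 and 3 in class 3, since the normalized magnitudes are 1/2, 1/4, 1/8 and 1/8.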

D.1 DILATION BOUNDS FOR ℓ 1

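For approximating $\ell_1$ we need a different approach for $q_h(4)$ when bounding the contribution of small entries at each level. The idea is to use a Ky-Fan norm argument to remove the smallest contributions from the $\ell_1$-norm. At a fixed level $h$ we put $\mathcal{B}_h$ to be the set of buckets at level $h$ and $\mathcal{B}'_h$ to be the set of buckets with the $p2^{q_h(3)} \le \frac{\varepsilon N}{h_m}$ largest entries with respect to $|G(B)|$, where $q_h(3) := \ln\left(\min\left\{\frac{\varepsilon N}{2h_m}, \frac{N\varepsilon^2}{4p}\right\}\right)$. We further define $K(h) = \sum_{B \in \mathcal{B}'_h} |G(B)|$. Since $\bigcup_{q \in [q_h(2), q'_h(3)]} W^*_q$ contains at most $p2^{q'_h(3)}$ elements, we have that $K(h) \ge \|W^*_q\|_1$. We set $q'_h(4) = \ln\left(\frac{3N h_m \ln(N h_m/\delta)}{p\varepsilon}\right)$. Set $Y_2 = Y_2(h) = \left\{i \in [n] \mid |z_i| \le \gamma_2 := \frac{cp}{N\ln(N h_m/\delta)}\right\}$.

Lemma D.1. With failure probability at most $\frac{\delta}{h_m N}$ it holds that for any bucket $B$ at level $h$ we have that

$$\sum_{i \in B \cap Y_2} |z_i| \le \max\left\{2 \cdot \frac{p \cdot \|Y_2\|_1}{N},\ \frac{p}{N}\|Y_2\|_1 + \frac{\varepsilon}{h_m}\right\}.$$

Proof. Fix a bucket $B$ at level $h$. For $i \in Y_2$ let $X_i = z_i$ if $i \in B$ and $X_i = 0$ otherwise. Then we have $E := \mathbb{E}\left(\sum_{i \in Y_2} X_i\right) = \frac{p \cdot \|Y_2\|_1}{N}$. Further we have $\mathbb{E}\left(\sum_{i \in Y_2} X_i^2\right) = \sum_{i \in Y_2} \frac{p}{N} \cdot z_i^2 \le \frac{\gamma_2 p}{N} \cdot \sum_{i \in Y_2} |z_i| = \gamma_2 E$. We set $\lambda = \max\left\{E, \frac{\varepsilon}{N h_m}\right\}$. Then using Bernstein's inequality we get that

$$P\left(\sum_{i \in Y_2} X_i \ge E + \lambda\right) \le \exp\left(\frac{-\lambda^2/2}{\gamma_2 E + \gamma_2 E/3}\right) \le \exp\left(\frac{-\lambda^2/2}{\gamma_2\lambda + \gamma_2\lambda/3}\right) \le \exp\left(\frac{-\lambda}{3\gamma_2}\right) \le \exp\left(\frac{-p\varepsilon}{3N h_m\gamma_2}\right) \le \exp(-\ln(N h_m/\delta)) \le \frac{\delta}{h_m N}.$$

Lemma D.2. With failure probability at most $\delta$ it holds that

$$\sum_{h \le h_m}\ \sum_{i \in Y_2(h) \cap \bigcup_{B \in \mathcal{B}'_h} B} z_i \le \varepsilon.$$

Proof. Using the union bound over the event from Lemma D.1 over all $N h_m$ buckets, and using that $|\mathcal{B}'_h| \le \varepsilon N/2h_m$ and $\max\left\{2 \cdot \|Y_2\|_1, \|Y_2\|_1 + \frac{\varepsilon}{h_m}\right\} \le 2$, we get that

$$\sum_{i \in Y_2(h) \cap \bigcup_{B \in \mathcal{B}'_h} B} z_i \le \frac{\varepsilon N}{2h_m} \cdot \frac{2p}{N} \le \frac{\varepsilon}{h_m}$$

holds for every level $h$ with failure probability at most $\delta$. Summing up over all levels we get $\sum_{h \le h_m} \sum_{i \in Y_2(h) \cap \bigcup_{B \in \mathcal{B}'_h} B} z_i \le \varepsilon$.

We have the following lemmas using similar proofs as in the previous section:

Lemma D.3. If for some $k \in \mathbb{N}$ we choose

$$N_i = N \ge \frac{32 m_1 q_m h_m}{\varepsilon^5} \cdot \left(\frac{64 m_1 \ln(N h_m/\delta)\, q_m h_m^2}{\varepsilon^6\delta}\right)^{1/(k-1)}$$

for all $i \in [h_m]$ and $M_i$ solving the equation $q_{(M_{i-1},N)}(3) = q_{(M_i,N)}(2)$, then the expected contribution of any weight class $W_q$ is at most $(k+\varepsilon)\|W_q\|_1$.

Here the additional $\varepsilon$ comes from Lemma D.1. We set $N''_0 = N'_0$, $N''_1 = \frac{32 m_1 q_m h_m}{\varepsilon^5}$ and $N''_2 = \frac{64 m_1 \ln(N h_m/\delta)\, q_m h_m^2}{\varepsilon^6\delta}$ and assume that $n \ge N^k h_m/\delta$.

Lemma D.4. Let $\gamma = \frac{1}{k} < 1$ for some $k \in \mathbb{N}$. Assume that $N_0$ is chosen uniformly at random from $N^{(1)}, \ldots, N^{(1/\gamma)}$ where $N^{(i)} = N''_0 \cdot N_2''^{\,i}$. Further let $N_i = N = N''_1 \cdot N_2''^{\,k+1}$ for any $i > 0$. Then the expected contribution of any weight class $W_q$ is at most $(1+\gamma)\|W_q\|_1$.

D.2 NET ARGUMENT

For $\beta \in \mathbb{R}^{d+1}$ we set $g_1(\beta) = \|(X, -Y)\beta\|_1$ and $g_2(\beta) = \|(SX, -SY)\beta\|_1$.

Lemma D.5. Assume that for $\beta \in \mathbb{R}^{d+1}$ it holds that $|g_1(\beta) - g_2(\beta)| \le \varepsilon$. Then for any $\beta' \in \mathbb{R}^d$ with $\|X\beta - X\beta'\|_1 \le \varepsilon/(b^{h_m} h_m)$ it holds that $|g_1(\beta') - g_2(\beta')| \le 3\varepsilon$.

Proof. It holds that $\|X'(\beta - \beta')\|_1 = \|SX(\beta - \beta')\|_1 \le b^{h_m} h_m \|X(\beta - \beta')\|_1 \le \varepsilon$ since for each $i \in [n]$ there are at most $h_m$ columns $j$ such that $S_{ij} \ne 0$ and each entry of $S$ is bounded by $b^{h_m}$. Also note that $\|g_i(v) - g_i(v')\|_1 \le \|v - v'\|_1$ holds for any two vectors $v, v' \in \mathbb{R}^{d+1}$. Thus, using the triangle inequality yields

$$|g_1(\beta') - g_2(\beta')| \le |g_2(\beta') - g_2(\beta)| + |g_2(\beta) - g_1(\beta)| + |g_1(\beta) - g_1(\beta')| \le \varepsilon + \varepsilon + \varepsilon \le 3\varepsilon.$$

There exists a net $\mathcal{N} \subset \mathbb{R}^d$ with $|\mathcal{N}| = \exp(O(d\ln(n)))$ such that if $|g_1(\beta) - g_2(\beta)| \le \varepsilon g_1(\beta)$ holds for any $\beta \in \mathcal{N}$ then for any $\beta' \in \mathbb{R}^{d+1}$ it holds that $|g_1(\beta') - g_2(\beta')| \le 3\varepsilon g_1(\beta')$.

Proof. We set $\mathcal{N} = \left\{\beta = v \cdot \frac{\varepsilon}{d b^{h_m} h_m} \;\middle|\; v \in \mathbb{Z}^d \text{ with } \|v\|_\infty \le \frac{d b^{h_m} h_m}{\varepsilon}\right\}$. Then it holds that for any $\beta \in \mathbb{R}^{d+1}$ with $g_1(\beta) = 1$ the point $(X, -y)\beta' = \left\lfloor \frac{d b^{h_m} h_m}{\varepsilon} \cdot (X, -y)\beta \right\rfloor \cdot \frac{\varepsilon}{d b^{h_m} h_m}$ is in $\mathcal{N}$ and it holds that $\|(X, -y)\beta'\|_1 \le d \cdot \frac{\varepsilon}{d b^{h_m} h_m} = \frac{\varepsilon}{b^{h_m} h_m}$. Using Lemma D.5 it holds that $|g_1(\beta') - g_2(\beta')| \le 3\varepsilon \le 3\varepsilon g_1(\beta)$. Further we have $|\mathcal{N}| \le \left(\frac{d b^{h_m} h_m}{\varepsilon}\right)^{2d} = \exp(O(d\ln(n)))$.

Now for any $r \in \mathbb{R}$ and $\beta \in \mathbb{R}^{d+1}$ with $g_1(\beta) = 1$, positive homogeneity of the norm gives $g_j(r\beta) = |r|\, g_j(\beta)$ for $j \in \{1, 2\}$, and hence $|g_1(r\beta) - g_2(r\beta)| = |r| \cdot |g_1(\beta) - g_2(\beta)| \le 3\varepsilon|r| = 3\varepsilon g_1(r\beta)$.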
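The Ky-Fan style truncation from Section D.1 keeps only the buckets with the largest $|G(B)|$ at a level and sums their magnitudes, which is the quantity $K(h)$. A minimal sketch of that computation, assuming the bucket values $G(B)$ are already available as a list of numbers; the function name `ky_fan_l1` and its arguments are illustrative and not taken from the paper.

```python
def ky_fan_l1(bucket_values, k):
    """Sum of the k largest magnitudes among the bucket values.

    With k equal to the number of buckets this is the plain l1 norm;
    smaller k drops the smallest contributions.
    """
    if k <= 0:
        return 0.0
    magnitudes = sorted((abs(v) for v in bucket_values), reverse=True)
    return float(sum(magnitudes[:k]))
```

For instance, `ky_fan_l1([3, -5, 1, 0.5], 2)` keeps the two buckets with magnitudes 5 and 3 and discards the rest.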

E SKETCHING VARIANCE-BASED REGULARIZED LOGISTIC REGRESSION

In this section we show that our algorithm also approximates the variance well under the assumption that roughly $f_1(X\beta) \le \ln(2)$. We stress that this assumption does not rule out the existence of good approximations. Indeed, even the minimizer is contained as observed in the preliminaries, since we have that $\min_{\beta \in \mathbb{R}^d} f(X\beta) \le f(0) = f_1(0) = \ln(2)$. Again we focus on a single $z = X\beta$ first. What remains to show is that $\sum_{i: z_i > 0} z_i^2$ is approximated well. We set $H(z) = \sum_{i=1}^n z_i^2$, $H^+(z) = \sum_{i: z_i > 0} z_i^2$ and $h(y) = \frac{y^2}{H^+(z)}$. By $\mu$-complexity we get that $H^+(z) \ge \frac{H(z)}{\mu}$. We define $W^2_q = \{i \in [n] \mid h(z_i) \in (2^{-q-1}, 2^{-q}]\}$ and $W^1_q = \{i \in [n] \mid \frac{z_i}{\|z\|_1} \in (2^{-q-1}, 2^{-q}]\}$. As the argument is almost the same as in the section before, we will only note the differences.

since all $X_i$ are bounded by $\gamma_2$ by assumption. Applying Bernstein's inequality thus yields

$$P(G(B) > 0) \le P\left(\sum_{i \in [n]} X_i - E' \ge \varepsilon|E'|\right) \le \exp\left(\frac{-\varepsilon^2|E'|^2/2}{\gamma_2 \cdot M/(nN) + \varepsilon\gamma_2|E'|/3}\right) \le \exp\left(\frac{-\varepsilon^3 \cdot M/(nN)/2}{\gamma_2(M/(nN E') + \varepsilon/3)}\right)$$
$$= \exp\left(\frac{-\varepsilon^3 \cdot M/(nN)/2}{\gamma_2\varepsilon^{-1}((A' - A_1) + 1/3)}\right) \le \exp\left(\frac{-\varepsilon^4 \cdot M/(nN)}{2\gamma_2}\right) \le \exp\left(-\ln\frac{N h_{\max}}{\delta}\right) = \frac{\delta}{N h_{\max}}.$$

Note that $\varepsilon E' \le \varepsilon \cdot \frac{M}{nN} \cdot (A' - A_1) \le \varepsilon \cdot \frac{M}{nN} \cdot A$ and thus $\mathbb{E}\left(\sum_{i \in [n]} X_i^2\right) + \varepsilon E' \le \frac{M}{nN} \cdot (-A' + A_1 + \varepsilon A) \le \frac{M}{nN} \cdot (-A_2)$. Our main lemma thus changes to:

Lemma E.4. With probability at least $1 - \frac{\delta}{h_m}$ the weight classes $W^2_q$ for $q \ge q_{(M,N)}(4) := \log_2(\gamma_2^{-1}) := \log_2\left(\frac{2Nn\ln(N h_m/\delta)}{M\varepsilon^4}\right)$ and $q \le q_{(M,N)}(1) := \log_2\left(\frac{n\delta}{M h_m}\right)$ have zero contribution to $\sum_B G^+(B)$, i.e., for any bucket $B$ we have $\sum_{z_i \in B \setminus I_r} z_i \le 0$ where $I_r = \{i \in [n] \mid z_i \in W_q,\ q \in [q_{(M,N)}(1), q_{(M,N)}(4)]\}$. Further, with failure probability at most $\exp(-m_1)$, for each $\log_2\left(\frac{8 q_m \mu m_1 n}{\varepsilon^3 M}\right) =: q_{(M,N)}(2) \le q \le q_{(M,N)}(3) := \log_2\left(\frac{Nn\varepsilon^2}{4M m_1\sqrt{n}}\right)$ there exists $W^*_q$ such that $\sum_{i \in W^*_q} G(B_i) \ge (1-\varepsilon)^2 \|W^2_q\|_2 \cdot \frac{M}{n}$.

Thus it holds that:

$$q_{(M,N)}(2) - q_{(M,N)}(1) = \log_2\left(\frac{8 q_m m_1 h_m}{\varepsilon^3\delta}\right),$$
$$q_{(M,N)}(3) - q_{(M,N)}(2) = \log_2\left(\frac{N\varepsilon^5}{32 m_1 \mu q_m\sqrt{n}}\right) =: \log_2(b),$$
$$q_{(M,N)}(4) - q_{(M,N)}(3) = \log_2\left(\frac{8\ln(N h_m/\delta)\sqrt{n}}{\varepsilon^6}\right).$$

If $N = M$ then we set $q_{(M,N)}(3) = q_{(M,N)}(4) = \infty$. If $M = n$ then we set $q_{(M,N)}(1) = q_{(M,N)}(2) = 0$. We set $q_h(i) = q_{(M_h,N_h)}(i)$ for $i \in \{1, 2, 3, 4\}$ and $Q_h = [q_h(2), q_h(3)]$ to be the well-approximated weight classes and $R_h = [q_h(1), q_h(4)]$ to be the relevant weight classes at level $h$. Note that $q_{(M,N)}(1)$ and $q_{(M,N)}(2)$ stay the same as before.

Heavy hitters. The important changes to note here are that we need to replace Lemma C.5 with an appropriate lemma for the $\ell_2$-leverage scores and there is an additional factor of $\frac{1}{\sqrt{n}}$ in $\gamma_4$. We further redefine $u_p$ to be the $\ell_2$-leverage scores.

Lemma E.5. (Clarkson & Woodruff, 2015a) If $u_i$ is the $k$-th largest $\ell_2$-leverage score, then for $z$ in the subspace spanned by the columns of $A$ it holds that $z_i^2 \le \frac{d}{k}\sum_{j=1}^n z_j^2$. Further it holds that $\sum_{i=1}^n u_i = d$.

The lemma follows as in the case of $\ell_1$ leverage scores by using an orthonormal basis. We then apply Lemma C.6 and Lemma E.5 as before: set $N_1 = d\gamma_3^{-1}$ and $N_2 = d\gamma_3^{-1} \cdot \gamma_4^{-1}$, where $\gamma_3 = \frac{\varepsilon^3}{8 q_m \mu m_1}$ and $\gamma_4 = \frac{2\varepsilon}{\sqrt{n}\, m_1}$. Further let $Y_3$ (resp. $Y_4$) be the set of coordinates with the $N_1$ (resp. $N_2$) largest leverage scores. We denote by $E_2$ the event that all coordinates in $Y_3$ are in a bucket with no other member of $Y_4$. By Lemma C.6, $E_2$ holds with probability at least $1 - \delta$ for an appropriate

G ADDITIONAL MATERIAL

Comment on stochastic gradient descent (SGD) and supporting experiments. SGD does not work in the turnstile data stream setting (where positive and negative updates are allowed), and is inherently sequential. In contrast, oblivious linear sketching allows for simple handling of turnstile data streams and distributed or parallel computations, which we motivate in our paper. Our bounds are multiplicative error guarantees that are relative to the optimal loss $f(X\beta_{OPT})$. Known regret or generalization results for SGD bound the probability of misclassification $P(y_i x_i\beta < 0) < \varepsilon$, which allows ignoring a few highly important (and expensive) points, and give an additive error on the loss $|f(X\beta_{SGD}) - f(X\beta_{OPT})| \le B$. Here $B$ depends on properties of $f$, $X$, and on the distance of an initial guess $\beta_0$ to the optimal solution $\|\beta_0 - \beta_{OPT}\|$. Thus $B$ cannot be charged uniformly against the optimal loss, and instead can be arbitrarily large.

The reason SGD and online gradient descent do not work in our setting is that they miss (in most iterations) highly important points when there are only few of them (this is also the issue with uniform sampling). This was pointed out in previous related work, e.g., (Munteanu et al., 2018, Section C) and (Munteanu et al., 2021, Section 6), who constructed synthetic data with only 2 out of $n$ such important points and additionally demonstrated empirically how badly SGD can perform even on mild data with $\mu = 1$.

Below we add SGD to our empirical results on real-world data. SGD performs quite well, though not better than sketching. The reported performance of SGD is the median approximation ratio over 21 independent repetitions of one full pass over the data, to be a baseline comparable to the sketches. We note that plotting the iteration-wise error would make the results for SGD look much worse.

For our synthetic data (described in detail below), instead of 2 out of $n$ heavy points (as in previous work), we have $\Theta(d)$ out of $n$ heavy points. Since SGD misses those in most batches, the instance looks separable to SGD in almost all iterations, although the original instance is inseparable. This results in approximation ratios around 15 000 (note the logarithmic scale on the vertical axis). In contrast, our sketch and the previous sketch give small constant approximations.

Figure 4: Median approximation ratios for plain logistic regression ($\lambda = 0$). SGD is compared to the old sketch as well as the new sketch with various settings for the sparsity $s \in \{2, 5, 10\}$ on different real-world benchmark data, and on our synthetic data.

Added experiments for $\ell_1$-regression. We implemented the Cauchy sketch (Indyk, 2006; Sohler & Woodruff, 2011; Woodruff, 2021) that simply consists of i.i.d. standard Cauchy entries. The sketching matrix is then multiplied by the data matrix and the sketched $\ell_1$ regression problem is solved. The plots show the median approximation ratio over 21 repetitions for each target size of the sketch. We see that the new sketch, using any degree of sparsity $s \in \{2, 5, 10\}$, outperforms the Cauchy sketch by a large margin in terms of approximation factor (while being a lot faster to apply than the dense matrix multiplication).

Our synthetic data consists of the following points:

• There are $(n - n/10 - 2d)$ points of the form $(-1, -1, \ldots, -1)$. The optimization of $\beta$ focuses on these points.
• $(n/10)$ points are of the form $(1, 1, \ldots, 1)$. These points are only added to obtain clean plots. If they are omitted, then the gap between the optimal $\beta$ on the original data and the optimal $\beta$ on a bad sketch or uniform sample becomes worse, and "wiggly".
• $d$ points are of the form $(-n, -n, \ldots, -n)$. They are oriented in the same direction as the first set of points. They will mostly be ignored in the optimization since there are only a few of them, but they can cancel the following heavy hitters.
• For each $i \in [d]$, we add one vector of the form $(n \cdot e_i)$, where $e_i$ is the $i$th standard basis vector. These points are the heavy hitters pointing away from most other points. For any good relative error sketch, it is crucial to preserve all of them.
• We add $n$ times the point $(0, 0, \ldots, 0)$. These points are needed to ensure that the instance works in a labeled setting (to be a natural data set for logistic regression).
• The labels of all points unequal to the all zero vector are set to 1. All zero vectors are assigned the label $-1$.

The idea behind this instance is as follows: if the sketch maps any point of the form $(n \cdot e_i)$ into the same bucket as a $(-n, -n, \ldots, -n)$-vector, then the instance will become almost separable, so the sketch will have a cheap solution, meaning that there exists some $\beta$ such that the logistic loss on the sketch is low. However, on the original instance, the logistic loss of the same $\beta$ will be large due to the loss associated with $(n \cdot e_i)\beta$. This implies a large approximation ratio. For small sketch sizes, the old sketch has a relatively high probability for this bad event to happen, when hashing each point into a single bucket. Our new sketch will likely preserve all points of the form $(n \cdot e_i)$ on most of the multiple sub levels. This preserves the cost even if our sketch size is small (almost linear).

Pseudocode of our sketching algorithm. Algorithm 1 implements the first step of the sketch & solve paradigm for approximating logistic or $\ell_1$ regression. The changes in comparison with (Munteanu et al., 2021) are highlighted in red.

Algorithm 1 Oblivious sketching algorithm for logistic regression.
11:     add $x_i$ to the $B_i$-th row of $X'_h$;
12:   add $x_i$ to uniform sampling level $h_m$ with probability $p_{h_m} = \frac{1}{b^{h_m}}$;
13: Set $X' = (X'_0, X'_1, \ldots, X'_{h_m})$;
14: Set $w = (w_0, w_1, \ldots, w_{h_m})$;
15: return $C = (X', w)$;
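The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading of the pseudocode under stated assumptions, not the authors' implementation (which is linked in the footnotes): the name `oblivious_sketch` and its signature are hypothetical, each row is sampled into the levels $1, \ldots, h_m - 1$ independently per level, and the uniform sampling level simply keeps every sampled row rather than enforcing a fixed size $N_u$.

```python
import random

def oblivious_sketch(X, n_prime, s, h_m, b, rng=None):
    """Return (rows, weights) of a weighted sketch of the rows of X.

    Assumes h_m >= 1. Levels 0..h_m-1 each have N = s * n_prime buckets;
    level h_m is a uniform sample of variable size (assumption).
    """
    rng = rng or random.Random()
    d = len(X[0])
    n_buckets = s * n_prime
    levels = [[[0.0] * d for _ in range(n_buckets)] for _ in range(h_m)]
    uniform_rows = []

    def add_to(bucket_row, x):
        for j, v in enumerate(x):
            bucket_row[j] += v

    for x in X:
        # Level 0 (densified): one random bucket in each of the s blocks.
        for l in range(s):
            add_to(levels[0][l * n_prime + rng.randrange(n_prime)], x)
        # Levels 1..h_m-1: keep x with probability b^-h, then hash it.
        for h in range(1, h_m):
            if rng.random() < b ** (-h):
                add_to(levels[h][rng.randrange(n_buckets)], x)
        # Level h_m: uniform sampling with probability b^-h_m.
        if rng.random() < b ** (-h_m):
            uniform_rows.append(list(x))

    rows, weights = [], []
    for h in range(h_m):
        # Level 0 weights are divided by s to undo the s-fold insertion.
        w = (b ** h) / s if h == 0 else float(b ** h)
        rows.extend(levels[h])
        weights.extend([w] * n_buckets)
    rows.extend(uniform_rows)
    weights.extend([float(b ** h_m)] * len(uniform_rows))
    return rows, weights
```

Since every row is inserted exactly $s$ times at level 0 with weight $1/s$, the weighted column sums of the level-0 block reproduce the column sums of $X$, which gives a simple sanity check for an implementation.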



The tilde notation suppresses any $\mathrm{polylog}\left(\frac{\mu d n}{\varepsilon\delta}\right)$ even if no higher order terms appear. This allows us to focus on the main parameters and their improvement. The exact terms are specified in Theorems 1-3.

by an argument in the proof of Lemma 7 of Sohler & Woodruff (2011), cf. (Woodruff, 2021, Problem 1). The exact constant was not specified but overcounting their parameters gives at best an ≥ 8-approximation.

i.e., the coordinates of $z$ with largest $\ell_1$ leverage score

code available at https://github.com/Tim907/oblivious_sketching_varreglogreg



Figure 2: Illustration of Lemma C.10 and Lemma C.11.

2') Before proving the second part we show that with $r = O\left(\frac{\mu^2 d^4 \ln(n)^7}{\varepsilon^{12}\delta}\right)$

There exists a net $\mathcal{N} \subset \mathbb{R}^d$ with $|\mathcal{N}| = \exp(O(d\ln(n)))$ such that if $|g_1(\beta) - g_2(\beta)| \le \varepsilon g_1(\beta)$ holds for any $\beta \in \mathcal{N}$ then for any $\beta$

Figure 3: Comparison of the median approximation ratios of the old sketch versus the new sketch with various settings for the sparsity $s \in \{2, 5, 10\}$ as well as for the regularization parameter $\lambda \in \{0, 0.1, 0.5, 1\}$ for different real-world benchmark data (top, middle). Comparison of median approximation ratios (bottom left) and sketching times (bottom right) for our synthetic data.
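The synthetic instance used in the bottom panels is described point by point in Section G. A minimal generator following that description could look as follows; the function name `synthetic_instance` is hypothetical, and requiring $n$ to be divisible by 10 is an assumption made here so that $n/10$ is an integer.

```python
def synthetic_instance(n, d):
    """Build the synthetic points and labels described in Section G.

    Assumes n is divisible by 10 and n - n/10 - 2d >= 0.
    Returns (points, labels) with 2n points in total.
    """
    if n % 10 != 0 or n - n // 10 - 2 * d < 0:
        raise ValueError("need n divisible by 10 and n - n/10 - 2d >= 0")
    points = []
    # Bulk of the data: points (-1, ..., -1).
    points += [[-1.0] * d for _ in range(n - n // 10 - 2 * d)]
    # A few points (1, ..., 1), added only to obtain clean plots.
    points += [[1.0] * d for _ in range(n // 10)]
    # d points (-n, ..., -n) that can cancel the heavy hitters.
    points += [[-float(n)] * d for _ in range(d)]
    # Heavy hitters n * e_i, one per coordinate.
    for i in range(d):
        e = [0.0] * d
        e[i] = float(n)
        points.append(e)
    # n copies of the all-zero point.
    points += [[0.0] * d for _ in range(n)]
    # Nonzero points get label 1, all-zero points get label -1.
    labels = [1 if any(p) else -1 for p in points]
    return points, labels
```

With `n = 100` and `d = 3` this yields 200 points, of which 100 are all-zero with label $-1$ and exactly three are heavy hitters of the form $100 \cdot e_i$.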

Figure 5: Comparison of the median approximation ratios for ℓ 1 -regression of the Cauchy sketch versus our new sketch with various settings for the sparsity s ∈ {2, 5, 10} for different real-world benchmark data.
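The Cauchy sketch baseline in this comparison is a dense matrix of i.i.d. standard Cauchy entries multiplied with the data. A minimal pure-Python sketch of that step, using inverse transform sampling $\tan(\pi(U - 1/2))$ for uniform $U$; the function name `cauchy_sketch` is an illustrative choice, and solving the sketched $\ell_1$ regression problem (including any rescaling of the sketch) is not shown.

```python
import math
import random

def cauchy_sketch(X, r, rng=None):
    """Return S X, where S is r x n with i.i.d. standard Cauchy entries."""
    rng = rng or random.Random()
    n, d = len(X), len(X[0])
    SX = [[0.0] * d for _ in range(r)]
    for j in range(r):
        for i in range(n):
            # Inverse CDF of the standard Cauchy distribution.
            c = math.tan(math.pi * (rng.random() - 0.5))
            for t in range(d):
                SX[j][t] += c * X[i][t]
    return SX
```

The map is linear in $X$, so sketches of separate updates can be added; the cost is $r \cdot n \cdot d$ multiplications for dense data, which is the dense matrix multiplication referred to in Section G.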

Input: Data $X \in \mathbb{R}^{n \times d}$, number of rows $k = N \cdot h_m + N_u$, parameters $b > 1$, $s \ge 1$ where $N = s \cdot N'$ for some $N' \in \mathbb{N}$;
Output: weighted sketch $C = (X', w) \in \mathbb{R}^{k \times d}$ with $k$ rows;
1: for $h = 0 \ldots h_m$ do ▷ construct levels $0, \ldots, h_m$ of the sketch
2:   initialize sketch $X'_h = 0 \in \mathbb{R}^{N \times d}$ at level $h$;
3:   initialize weights $w_h = b^h \cdot \mathbf{1} \in \mathbb{R}^N$ at level $h$;
4: set $w_0 = \frac{w_0}{s}$; ▷ adapt weights on level 0 to sparsity $s$
5: for $i = 1 \ldots n$ do ▷ sketch the data
6:   for $l = 1 \ldots s$ do ▷ densify level 0
7:     draw a random number $B_i \in [N']$;
8:     add $x_i$ to the $((l-1) \cdot N' + B_i)$-th row of $X'_0$;
9:   assign $x_i$ to level $h \in [1, h_m - 1]$ with probability $p_h = \frac{1}{b^h}$;
10:     draw a random number $B_i \in [N]$;
11:     add $x_i$ to the $B_i$-th row of $X'_h$;

ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their valuable comments. We thank Tim Novak for helping with the experiments. Alexander Munteanu & Simon Omlor were supported by the German Research Foundation (DFG), Collaborative Research Center SFB 876, project C4 and by the Dortmund Data Science Center (DoDSc). David P. Woodruff was partially supported by a Simons Investigator Award and by the Office of Naval Research (ONR) grant N00014-18-1-2562.

annex

is almost the same as in the section before, we will only note the differences. We will also use the same definition of importance, i.e., a weight class $W^2_q$ is important if $H^+(W^2_q) \ge \frac{\varepsilon}{q_m\mu}$. Similar to the previous analysis we have that if $W^2_q$ is important then $|W^2_q| \ge \frac{\varepsilon 2^q}{q_m\mu}$. With those adapted definitions we proceed by adapting the main lemmas of Section C that finally yield Theorem 3.

Lemma E.1. For any $z_i \in W^2_q$ there exists $q' \le (q-1)/2 + \ln(n)/2$ such that $z_i \in W^1_{q'}$.

Proof. It is well known that $\|z\|_1 \le \sqrt{n}\|z\|_2$. We conclude that

Now taking the logarithm proves the lemma.

Contraction bounds. Recall that:

Here $Y_1$ is the set of 'large elements'. We redefine $\mu_z = \frac{\sum_{z_i > 0} z_i^2}{\sum_{z_i < 0} z_i^2}$.

Lemma E.2. The following hold:
1) $|Y_1 \cap U| \le \varepsilon N/2$ with failure probability at most $\exp(-m_1)$;

Then with failure probability at most $\exp(-m_1)$ there exists $W^*_q \subset W^2_q \cap B$ such that

and $W^*_q$ as in 3) exists, then with failure probability at most $\exp(-m_1)$

The proof is verbatim to the proof of Lemma C.3. For the 4th part we use Lemma E.1 to reduce the problem to the weight class $W^1_q$. This causes an additional term of $\frac{1}{\sqrt{n}}$ in the logarithm of $q_3(M, N)$. We also have a change in $q_4(M, N)$. More precisely we need two additional factors of $\varepsilon$ in $\gamma_2$:

with failure probability at most $\frac{\delta}{N h_{\max}}$.

Proof. Let $X_i$ be the random variable attaining value $z_i$ if $i \in B$ and 0 otherwise, for $i \in$

Further we have that

For any entry $z_p \in H$ we have $z_p \ge \gamma_3$ and thus by Lemma C.5, we have $p \in Y_3$, and for any entry $p \notin Y_4$ we have $z_p < \gamma_3 \cdot \gamma_4$. It remains to show that the remaining entries in the buckets containing a heavy hitter only have a small contribution. To this end we use Bernstein's inequality. For a coordinate $p \in [n]$ we denote by $B_p$ the bucket at level 0 that contains $p$.

Lemma E.6. Assume $E_2$ holds. Then for any

Contraction bounds for a single point.

Lemma E.7. Assume that $E_2$ holds. Denote by $z'_i$ the $i$-th row of $SX\beta$ for $i \in [n']$. Then with failure probability at most $(2h_m + 2q_m)e^{-m_1}$ it holds that

Here the constant before the $\varepsilon$ increases for the following reason: assume that for some

Dilation bounds. Here we have to cope with the additional factor of $\sqrt{n}$. Recall that if we choose $M_i$ solving the equation $q$

We now have

Further there is a change in Lemma C.10 as we have to deal with possible overhead coming from the square function. We set $R = \{i \mid 2^{-q_h(1)} > z_i > 2^{-q_h(4)}\}$ to be the set of relevant (positive) elements and

Proof. Fix a level $h$ and a bucket $B$ at level $h$.

N

Let $Z_i$ be the random variable where

Lemma C.11 and Lemma C.12 can be adapted as follows:

Lemma E.9. If we choose

for all $i \in [h_m]$ and $M_i$ solving the equation $q_{(M_{i-1},N)}(3) = q_{(M_i,N)}(2)$ then the expected contribution of any weight class $W_q$ is at most $2\|W_q\|_1$.

Proof. The proof uses a different idea than before: since $N$ is large enough, we only need 2 levels. More precisely we want to achieve $M_2 = N$. By our choice of $M_2$ this means

for all $i \in [h_m]$ and $M_i$ solving the equation $q_{(M_{i-1},N)}(3) = q_{(M_i,N)}(2)$ then the expected contribution of any weight class $W_q$ is at most $k\|W_q\|_1$.

The proof is the same as for Lemma C.12.

which proves the lemma.

Lemma E.12. Assume that for

there are at most $h_m$ columns $j$ such that $S_{ij} \ne 0$ and each entry of $S$ is bounded by $b^{h_m}$. Thus, by the triangle inequality and applying Lemma E.11 yields

Combining Lemma C.16 and Lemma E.12 we get:

Lemma E.13. There exists a net

Proof of Theorem 3. The proof of Theorem 3 works as the proof of Theorem 2, replacing the old lemmas with the new ones.

E.1 LOWER BOUND

We note that the increased sketching dimension in terms of $\sqrt{n}$ comes from the inter-norm inequality $\|x\|_1 \le \sqrt{n}\|x\|_2$ and from more subtle details of the sketch. Lemma E.14 shows that there is no way to get around a factor of $\sqrt{n}$ using the CountMin-sketch. The proof gives an example where $\sqrt{n}$ is attained even for obtaining a superconstant (in $\mu$) approximation. It does not rule out the existence of some other method that allows a lower sketching dimension. For example, Count-sketch is known to work for $\ell_1$ and $\ell_2$ norms simultaneously within polylogarithmic size (Clarkson & Woodruff, 2015a). But we stress that the standard sketches from the literature do not work for asymmetric functions since they confuse the signs of contributions, leading to unbounded errors for our objective function or even for plain logistic regression, see (Munteanu et al., 2021).

Lemma E.14. There exists a $\mu$-complex data example $X$ where our sketch with $o(\sqrt{n})$ rows fails to approximate $f$. Specifically, if $\lambda = 1$ it holds for the optimizer

Proof. Fix $\mu > 10$ and consider the following data

As the example is 1-dimensional we only need to check the ratio for $\beta = 1$ and $\beta = -1$ in order to compute $\mu$, as multiplying with a scalar does not change the ratio between the sum of all positive points and the sum of all negative points. Also note that the ratio is inverted for $\beta = -1$, thus if the ratio is positive for $\beta = 1$ we do not need to check it for $\beta = -1$. Note that for $\beta = 1$ and $z = X\beta$ it holds that $\sum_{z_i > 0} z$

Further we have that $\sum_{z_i > 0}$

Since $d = 1$ this proves that our example is $2\mu$-complex. Note that the following four facts hold for any level $h$:

• If for some $c$ we have that $p_h \le 1/b$ then with probability $1/b$ row $x_0$ is not sampled at level $h$. In particular this implies that $x_0$ is only present at level 0 with high probability, i.e. probability at least

elements then with high probability $G(B_0) \le 0$;
• If $\frac{N_h}{n} \ll p_h \ll 1$ then with high probability $G(B) < 0$ for any bucket at level $h$ since the $\frac{\mu - 1}{\mu} \cdot n \gg \frac{n}{\mu}$ negative elements cancel all positive rows;
• If $h = h_m$ then roughly $\frac{\mu - 1}{\mu} \cdot N_u$ are $-1$ and $\frac{N_u}{\mu}$ are 1.

All of these follow from the Chernoff bounds using Lemma A.3. Thus if $N_0 \ll \sqrt{n}/3$ then $X' = SX$ mimics the instance $X \setminus \{x_0\}$, i.e. the instance $X$ with point $x_0$ removed, as $x_0$ only appears at level 0 where it is canceled by the other points. More precisely $X'$ consists of roughly $n' - \frac{n'}{\mu}$ copies of the point $-1$ and $\frac{n'}{\mu}$ copies of the point 1. After multiplying with the weights we are back to roughly $n - \frac{n}{\mu}$ times the point $-1$ and $\frac{n}{\mu}$ times the point 1. To keep the presentation simple we only consider the instance $X' = -(X \setminus \{x_0\})$. The proof works the same for other sketched instances that we obtain using the above facts. Consider the function

Thus, we have $f_1(X'r) = \ell(-r) + \frac{r}{\mu}$. Using that $\ell(r) < r + 1$ for all $r > 0$ we get $nf(X'r) = nf_1(X'r) +$

Published as a conference paper at ICLR 2023

Using that $\ell(-r) \le e^{-r}$ it holds that $f(X'r) \le 2e^{-r} + \frac{r}{\mu} + \frac{(r+1)^2}{\mu}$. Taking the derivative we get $f'(X'r) \le -2e^{-r} + \frac{1}{\mu} + \frac{2(r+1)}{\mu}$, which is 0 if and only if $r = -\ln\left(\frac{2r+3}{\mu}\right) + \ln(2) = \Omega(\ln(\mu))$. This implies that for $r = \operatorname{argmin}_{r \in \mathbb{R}} f(Xr)$ we have that $r = \Omega(\ln(\mu))$. Now consider our original loss function $f(Xr)$. Here we have that $nf(Xr) = nf_1(Xr) + \sum_{i=1}^n$

In particular we have that $f(Xr) = \Omega(\ln(\mu)^2)$. However for $r^*$ minimizing $f(Xr)$ we have that $nf(Xr) \le f(0) = \ln(2) = O(1)$.

