ALMOST LINEAR CONSTANT-FACTOR SKETCHING FOR ℓ1 AND LOGISTIC REGRESSION

Abstract

We improve upon previous oblivious sketching and turnstile streaming results for ℓ1 and logistic regression, giving a much smaller sketching dimension that achieves an O(1)-approximation and yields an efficient optimization problem in the sketch space. Namely, we achieve for any constant c > 0 a sketching dimension of Õ(d^{1+c}) for ℓ1 regression and Õ(µd^{1+c}) for logistic regression, where µ is a standard measure that captures the complexity of compressing the data. For ℓ1 regression our sketching dimension is near-linear and improves upon previous work, which either required an Ω(log d)-approximation with this sketching dimension, or required a larger poly(d) number of rows. Similarly, for logistic regression previous work had worse poly(µd) factors in its sketching dimension. We also give a tradeoff that yields a (1 + ε)-approximation in input sparsity time by increasing the total size to (d log(n)/ε)^{O(1/ε)} for ℓ1 and to (µd log(n)/ε)^{O(1/ε)} for logistic regression. Finally, we show that our sketch can be extended to approximate a regularized version of logistic regression where the data-dependent regularizer corresponds to the variance of the individual logistic losses.

1. INTRODUCTION

We consider logistic regression in distributed and streaming environments. A key tool for solving these problems is a distribution over random oblivious linear maps S ∈ R^{r×n} with the property that, for a given n × d matrix X (where we assume the labels for the rows of X have been multiplied into X), one can efficiently and approximately solve the logistic regression problem given only SX. The fact that S does not depend on X is what is meant by S being oblivious, which is important in distributed and streaming tasks since one can choose S without first needing to read the input data. The fact that S is a linear map is also important for such tasks: given SX^{(1)} and SX^{(2)}, one can add these to obtain S(X^{(1)} + X^{(2)}), which allows for positive or negative updates to entries of the input in a stream, or across multiple servers in the arbitrary partition model of communication; see, e.g., (Woodruff, 2014) for a discussion of data stream and communication models. An important goal is to minimize the sketching dimension r of the sketching matrix S, as this translates into the memory required by a streaming algorithm and the communication cost of a distributed algorithm. At the same time, one would like the approximation factor obtained via this approach to be as small as possible.

Specifically, we develop and improve oblivious sketching for the most important robust variant of linear regression, namely ℓ1 regression, and for logistic regression, a generalized linear model of high importance for binary classification and for estimating Bernoulli probabilities. Sketching supports very fast updates, which is desirable for performing robust and generalized regression in high-velocity data processing applications, for instance in physical experiments and other resource-constrained settings, cf. (Munteanu et al., 2021; Munteanu, 2023). We focus on the case where the number n of data points is very large, i.e., n ≫ d.
In this case, applying a standard algorithm directly is not a viable option since it is either too slow or even becomes impossible when it requires more memory than we can afford. Following the sketch & solve paradigm (Woodruff, 2014), our goal is in a first step to reduce the size of the data without losing too much information. Then, in a second step, we approximate the problem efficiently on the reduced data.
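The two properties described above, obliviousness and linearity, can be checked directly in a few lines. The following numpy sketch uses a dense Gaussian map as a stand-in for the sparse sketches analyzed in this paper; all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 1000, 5, 50  # illustrative sizes; in the paper r depends on d, mu, etc.

# Oblivious: S is drawn without ever looking at the data.
S = rng.standard_normal((r, n)) / np.sqrt(r)

# Two additive shares of the data, e.g. held by two servers in the arbitrary
# partition model, or accumulated from positive/negative turnstile updates.
X1 = rng.standard_normal((n, d))
X2 = rng.standard_normal((n, d))

# Linearity: the sketches of the shares add up to the sketch of the full data.
assert np.allclose(S @ X1 + S @ X2, S @ (X1 + X2))
```

Each share is compressed to an r × d matrix, so the memory and communication cost is governed by the sketching dimension r rather than by n.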

Sketch & solve principle:

1. Compute a small sketch SX of the data X.
2. Solve the problem β̂ = argmin_β f(SXβ) using a standard optimization algorithm.

The theoretical analysis proves that the sketch in the first step is computed in such a way that the solution β̂ obtained in the second step is a good approximation to the original problem, i.e., that f(Xβ̂) ≤ C · min_β f(Xβ) holds for a small constant factor C ≥ 1.
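For intuition, the two steps can be instantiated with ordinary least squares and a Gaussian sketch; the paper's sketches play the analogous role for the ℓ1 and logistic losses. In this toy version the labels y are kept explicit rather than folded into X, and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 5000, 10, 200  # n >> d; the sketch has only r rows

X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Step 1: compute the small sketch (SX, Sy) with an oblivious map S.
S = rng.standard_normal((r, n)) / np.sqrt(r)
SX, Sy = S @ X, S @ y

# Step 2: solve the regression problem on the sketch with a standard solver.
beta_sketch, *_ = np.linalg.lstsq(SX, Sy, rcond=None)

# Compare the loss on the ORIGINAL data against the optimal solution.
beta_opt, *_ = np.linalg.lstsq(X, y, rcond=None)
ratio = np.linalg.norm(X @ beta_sketch - y) / np.linalg.norm(X @ beta_opt - y)
print(ratio)  # >= 1 by optimality of beta_opt; typically close to 1 here
```

The solver in step 2 only ever touches the r × d sketch, which is the source of the memory and running-time savings.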

1.1. OUR CONTRIBUTIONS

For logistic regression our goal is to achieve an O(1)-approximation with an efficient estimator in the sketch space and the smallest possible sketching dimension in terms of µ and d, where

µ = µ(X) = sup_{β≠0} (Σ_{i: x_iβ>0} |x_iβ|) / (Σ_{i: x_iβ<0} |x_iβ|)

is a data-dependent parameter that captures the complexity of compressing the data for logistic regression, see Definition 2.1. As a byproduct of our algorithms, we also obtain algorithms for ℓ1-regression. We note that the parameter µ is necessary only for logistic regression, i.e., for sketching ℓ1-regression, we set µ = 1. We summarize our contributions as follows:

1) We significantly improve the sketch of Munteanu et al. (2021). More precisely, we show with minor modifications to their algorithm but major modifications to the analysis that the size of the sketch can be reduced from roughly Õ(µ^7 d^5)¹ to Õ(µd^{1+c}) for any c > 0, while preserving an O(1)-approximation to either the logistic or the ℓ1 loss.

2) We show that increasing the sketching dimension to (µd log(n)/ε)^{O(1/ε)} is sufficient to obtain a (1 + ε)-approximation guarantee.

3) We show that our sketch can also approximate variance-based regularized logistic regression within an O(1) factor if the dependence on n in the sketching dimension is increased to n^{0.5+c} for any c > 0. We also give an example corroborating that the CountMin-sketch that we use needs at least Ω(n^{0.5}) rows to achieve an approximation guarantee below log_2(µ).
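To make the parameter µ concrete, the following hypothetical helper evaluates the defining ratio for one fixed direction β on synthetic data; µ(X) itself is the supremum of this quantity over all β ≠ 0, which is not computed here.

```python
import numpy as np

def mu_ratio(X, beta):
    """Ratio from the definition of mu(X), for one fixed beta:
    the sum of |x_i beta| over rows with x_i beta > 0, divided by
    the same sum over rows with x_i beta < 0."""
    z = X @ beta
    pos = np.abs(z[z > 0]).sum()
    neg = np.abs(z[z < 0]).sum()
    return pos / neg  # diverges if the data is separable in direction beta

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 3))  # labels assumed already folded into the rows
print(mu_ratio(X, np.array([1.0, 0.0, 0.0])))  # near 1 for symmetric data
```

Well-spread, near-symmetric data keeps the ratio small in every direction, so µ is small; nearly separable data drives µ (and hence the sketching dimension) up.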

1.2. RELATED WORK

Data oblivious sketching. Data oblivious sketches have been developed for many problems in computer science; see (Phillips, 2017; Munteanu, 2023) for surveys. The seminal work of Sarlós (2006) opened up the toolbox of sketching for numerical linear algebra and machine learning problems, such as linear regression and low-rank approximation, cf. (Woodruff, 2014). We note that oblivious sketching is very important for obtaining data stream algorithms in the turnstile model (Muthukrishnan, 2005), and there is evidence that linear sketches are optimal for such algorithms under certain conditions (Li et al., 2014; Ai et al., 2016). The classic works on ℓ2 regression have been generalized to other ℓp norms (Sohler & Woodruff, 2011; Woodruff & Zhang, 2013) by combining sketching, as a fast but inaccurate preconditioner, with subsequent sampling to achieve the desired (1 + ε)-approximation bounds. Those works have been generalized further to so-called M-estimators, e.g., the Huber (Clarkson & Woodruff, 2015a) or Tukey (Clarkson et al., 2019) regression losses, which share nice properties such as symmetry and homogeneity leveraged in previous works on ℓp norms.

ℓ1 regression. Specifically for ℓ1, the first sketching algorithms used random variables drawn from 1-stable (Cauchy) distributions to estimate the norm (Indyk, 2006). It is possible to get concentration and a (1 ± ε)-approximation in near-linear space by using a median estimator. However, in a regression setting this estimator leads to a non-convex optimization problem in the sketch space. Since we want to preserve convexity to facilitate efficient optimization in the sketch space, we focus on sketches that work with an ℓ1 estimator for solving the ℓ1 regression problem in the sketch space, in order to obtain a constant approximation for the original ℓ1 problem.
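The median-of-Cauchy estimator mentioned above can be sketched in a few lines of numpy (dimensions are illustrative). By 1-stability, each coordinate of Sx is distributed as ||x||_1 times a standard Cauchy variable, so the median of |Sx| concentrates around ||x||_1; but the same median, viewed as a function of β in a regression problem, is non-convex.

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 10000, 400

x = rng.standard_normal(n)
true_l1 = np.abs(x).sum()

# 1-stable sketch: i.i.d. standard Cauchy entries, so each coordinate of
# S @ x is distributed as ||x||_1 times a standard Cauchy variable.
S = rng.standard_cauchy((r, n))

# Median estimator: the median of |Cauchy| is 1, so the median of |S @ x|
# concentrates around ||x||_1.
est = np.median(np.abs(S @ x))
print(est / true_l1)  # concentrates around 1 as r grows
```

Replacing the median by the convex ℓ1 norm of the sketch would make optimization tractable, which is exactly the restriction adopted in this paper.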
With this restriction, it is possible to obtain a contraction bound with high probability so as to union bound over a net, but similar results are not available for the dilation. Indeed, subspace embeddings for the ℓ1 norm have



¹ The tilde notation suppresses any polylog(µdn/(εδ)) factors, even if no higher-order terms appear. This allows us to focus on the main parameters and their improvement. The exact terms are specified in Theorems 1-3.

