LOW COMPLEXITY APPROXIMATE BAYESIAN LOGISTIC REGRESSION FOR SPARSE ONLINE LEARNING

Abstract

Theoretical results show that Bayesian methods can achieve lower bounds on regret for online logistic regression. In practice, however, such techniques may be infeasible, especially for very large feature sets. Approximations must be used that, for huge sparse feature sets, diminish the theoretical advantages. They often apply stochastic gradient methods with hyper-parameters that must be tuned on some surrogate loss, defeating the theoretical advantages of Bayesian methods. The surrogate loss, defined to approximate the mixture, requires techniques such as Monte Carlo sampling, increasing the computation per example. We propose low-complexity analytical approximations for sparse online logistic and probit regression. Unlike variational inference and other methods, our methods use analytical closed forms, substantially lowering computation. Unlike dense solutions, such as Gaussian mixtures, our methods allow for sparse problems with huge feature sets without increasing complexity. With the analytical closed forms, there is also no need to apply stochastic gradient methods to surrogate losses, or to tune and balance learning and regularization hyper-parameters. Empirical results top the performance of the more computationally involved methods. Like such methods, our methods still provide per-feature and per-example uncertainty measures.

1. INTRODUCTION

We consider online (Bottou, 1998; Shalev-Shwartz et al., 2011) binary logistic regression over a series of rounds t ∈ {1, 2, . . . , T}. At round t, a sparse feature vector x_t ∈ [-1, 1]^d with d_t ≪ d nonzero values is revealed, and a prediction for the label y_t ∈ {-1, 1} must be generated. The dimension d can be huge (billions), but d_t is usually tens or hundreds. Logistic regression is used in a huge portion of existing learning problems: it can be used to predict medical risk factors, world phenomena, stock market movements, or click-through rates in online advertising. The online sparse setup is also very common in these application areas, particularly when predictions must be streamed in real time as the model keeps updating from newly seen examples.

A prediction algorithm attempts to maximize the probabilities of the observed labels. Online methods sequentially learn parameters for the d features. With stochastic gradient methods (Bottou, 2010; Duchi et al., 2011), these are weights w_{i,t} associated with feature i ∈ {1, . . . , d} at round t. Bayesian methods keep track of some distribution over the parameters, and assign an expected mixture probability to the generated prediction (Hoffman et al., 2010; Opper & Winther, 1998). The overall objective is to maximize a sequence likelihood probability, or to minimize its negative logarithm. A benchmark measure of an algorithm's performance is its regret: the excess loss it attains over an algorithm that uses some fixed comparator w* = (w_1, w_2, . . . , w_d)^T (T denoting transpose). A comparator w* that minimizes the cumulative loss can be picked to measure the regret relative to the best possible comparator in some space of parameter values. Kakade & Ng (2005); Foster et al. (2018); Shamir (2020) demonstrated that, in theory, Bayesian methods are capable of achieving regret, logarithmic in the horizon T and linear in d, that even matches regret lower bounds for d = o(T).
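To make the setup concrete, the following minimal sketch (with a hypothetical toy stream; the feature values and comparator weights are illustrative, not from the paper) represents each sparse x_t as a dict of its d_t nonzero entries, evaluates the per-round logistic loss −log P(y_t | x_t) = log(1 + exp(−y_t w·x_t)), and computes the cumulative loss of a fixed comparator w*, which is what an online learner's regret is measured against:

```python
import math

def logistic_loss(y, score):
    # Negative log-likelihood of the observed label: log(1 + exp(-y * score)).
    return math.log1p(math.exp(-y * score))

def cumulative_loss(weights, stream):
    # Loss of a FIXED comparator w* over the whole sequence; the regret of an
    # online algorithm is its own cumulative loss minus this quantity for the
    # best comparator in hindsight.
    total = 0.0
    for x, y in stream:
        # Sparse dot product: only the d_t nonzero features of x_t are touched.
        score = sum(weights.get(i, 0.0) * v for i, v in x.items())
        total += logistic_loss(y, score)
    return total

# Hypothetical toy stream: each round reveals ({feature_index: value}, label).
stream = [({0: 1.0, 2: -0.5}, 1), ({1: 0.3, 2: 0.8}, -1), ({0: -1.0, 1: 0.4}, 1)]
w_star = {0: 0.5, 1: 0.2, 2: -0.1}  # an illustrative fixed comparator
comparator_loss = cumulative_loss(w_star, stream)
```

Note that with empty weights the loss of every round is log 2, the loss of predicting probability 1/2, which is a useful sanity check.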
Classical stochastic gradient methods are usually implemented as proper learning algorithms, which determine w_t prior to observing x_t, and are inferior in the worst case (Hazan et al., 2014), although in many cases, depending on the data, they can still achieve logarithmic regret (Bach, 2010; 2014; Bach & Moulines, 2013). Recent work (Jézéquel et al., 2020) demonstrated non-Bayesian, improper, gradient-based algorithms with better regret.
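For reference, a single round of such a classical (proper) stochastic gradient update on the logistic loss can be sketched as follows. This is a generic SGD sketch, not the paper's proposed method; the learning rate is exactly the kind of hyper-parameter the abstract argues must otherwise be tuned:

```python
import math

def sgd_logistic_step(w, x, y, lr=0.1):
    """One SGD step on the logistic loss log(1 + exp(-y * w.x)).

    w : dict mapping feature index -> weight (sparse; absent means 0)
    x : dict with only the d_t nonzero entries of the feature vector
    y : label in {-1, +1}
    lr: learning rate (a hyper-parameter requiring tuning)

    Only the nonzero features are touched, so the cost per round is
    O(d_t) rather than O(d), even when d is in the billions.
    """
    score = sum(w.get(i, 0.0) * v for i, v in x.items())
    # Derivative of log(1 + exp(-y*s)) w.r.t. s is -y * sigmoid(-y*s).
    g = -y / (1.0 + math.exp(y * score))
    for i, v in x.items():
        w[i] = w.get(i, 0.0) - lr * g * v
    return w
```

Because w is updated before the next x_t is seen, this is a proper learner in the sense used above; the improper algorithms of Jézéquel et al. (2020) are allowed to use x_t when forming the prediction.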

