LOW COMPLEXITY APPROXIMATE BAYESIAN LOGISTIC REGRESSION FOR SPARSE ONLINE LEARNING

Abstract

Theoretical results show that Bayesian methods can achieve lower bounds on regret for online logistic regression. In practice, however, such techniques may be infeasible, especially for very large feature sets. Approximations must be used which, for huge sparse feature sets, diminish the theoretical advantages. These approximations often apply stochastic gradient methods with hyper-parameters that must be tuned on some surrogate loss, defeating the theoretical advantages of Bayesian methods. The surrogate loss, defined to approximate the mixture, requires techniques such as Monte Carlo sampling, increasing the computation per example. We propose low complexity analytical approximations for sparse online logistic and probit regressions. Unlike variational inference and other methods, our methods use analytical closed forms, substantially lowering computation. Unlike dense solutions such as Gaussian Mixtures, our methods handle sparse problems with huge feature sets without increasing complexity. With the analytical closed forms, there is also no need to apply stochastic gradient methods on surrogate losses, or to tune and balance learning and regularization hyper-parameters. Empirically, our methods outperform the more computationally involved ones and, like such methods, still provide per-feature and per-example uncertainty measures.

1. INTRODUCTION

We consider online (Bottou, 1998; Shalev-Shwartz et al., 2011) binary logistic regression over a series of rounds t ∈ {1, 2, . . . , T}. At round t, a sparse feature vector x_t ∈ [-1, 1]^d with d_t ≪ d nonzero values is revealed, and a prediction for the label y_t ∈ {-1, 1} must be generated. The dimension d can be huge (billions), but d_t is usually tens or hundreds. Logistic regression is used in a huge portion of existing learning problems: predicting medical risk factors, world phenomena, stock market movements, or click-through rates in online advertising. The online sparse setup is also very common in these application areas, particularly when predictions must be streamed in real time as the model keeps updating from newly seen examples. A prediction algorithm attempts to maximize the probabilities of the observed labels. Online methods sequentially learn parameters for the d features. With stochastic gradient methods (Bottou, 2010; Duchi et al., 2011), these are weights w_{i,t} associated with feature i ∈ {1, . . . , d} at round t. Bayesian methods keep track of some distribution over the parameters, and assign an expected mixture probability to the generated prediction (Hoffman et al., 2010; Opper & Winther, 1998). The overall objective is to maximize a sequence likelihood probability, or to minimize its negative logarithm. A benchmark measure of an algorithm's performance is its regret: the excess loss it attains over an algorithm that uses some fixed comparator w* = (w_1, w_2, . . . , w_d)^T (T denoting transpose). A comparator w* that minimizes the cumulative loss can be picked to measure the regret relative to the best possible comparator in some space of parameter values. Kakade & Ng (2005); Foster et al. (2018); Shamir (2020) demonstrated that, in theory, Bayesian methods are capable of achieving regret logarithmic in the horizon T and linear in d, which even matches regret lower bounds for d = o(T).
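To make the setup concrete, the following sketch runs the plain online SGD baseline on the logistic loss over sparse examples, accumulating the negative log-likelihood that regret is measured against. The function name and the (indices, values, label) representation of a sparse x_t are our own illustrative choices, not from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logistic_sgd(examples, d, lr=0.1):
    """Online SGD on the logistic loss; returns weights and cumulative log-loss.

    `examples` is a sequence of (indices, values, label) triples, where
    `indices` lists the d_t active coordinates of the sparse x_t and the
    label y_t is in {-1, +1}.
    """
    w = np.zeros(d)
    total_loss = 0.0
    for idx, vals, y in examples:
        x = np.asarray(vals)
        margin = y * np.dot(w[idx], x)           # y_t * <w_t, x_t>
        total_loss += np.log1p(np.exp(-margin))  # logistic loss -log p(y_t)
        # gradient step touches only the d_t active coordinates
        g = -y * sigmoid(-margin)
        w[idx] -= lr * g * x
    return w, total_loss
```

Note that the learning rate `lr` is exactly the kind of hyper-parameter the paper argues a parameter-free Bayesian method should not need.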
Classical stochastic gradient methods are usually implemented as proper learning algorithms that determine w_t prior to observing x_t, and are inferior in the worst case (Hazan et al., 2014), although in many cases, depending on the data, they can still achieve logarithmic regret (Bach, 2010; 2014; Bach & Moulines, 2013). Recent work (Jézéquel et al., 2020) demonstrated non-Bayesian improper gradient-based algorithms with better regret. Unfortunately, the superiority of Bayesian methods is diminished by their intractability. A theoretically optimal prior has a diagonal covariance matrix, with each component either uniformly or Normally distributed with large variance. The effects of such a prior cannot be maintained in practical online problems with a large sparse feature set, as the posterior no longer has the properties of the prior, yet must serve as the prior for subsequent examples. Gaussian approximations that rely on diagonalization of the covariance must be used. Neither the normal nor the diagonal assumption holds for the real posterior (even with a diagonal prior), so both lead to performance degradation. Diagonalization is similar to linearization in convex optimization, specifically for stochastic gradient descent (SGD) (Zinkevich, 2003): it allows handling features independently, but limits performance. The Bayesian learning literature has focused on applying such methods to predict posterior probabilities and provide model (epistemic) uncertainty measurements (Bishop, 2006; Dempster, 1968; Huelsenbeck & Ronquist, 2001; Knill & Richards, 1996). However, the uncertainty of a feature is, in fact, mostly a function of the number of examples in which it was present; a measure that can be tracked, not estimated. Methods such as Variational Bayesian (VB) Inference (Bishop, 2006; Blei et al., 2017; Drugowitsch, 2019; Drugowitsch et al., 2019; Ranganath et al., 2014) track such measurements by matching the posterior.
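As an illustration of the diagonal Gaussian approximations discussed above, the sketch below keeps an independent Gaussian per weight and performs one Laplace-style online step: the mean moves along the loss gradient scaled by the per-coordinate variance, and the precision accumulates the local curvature. This is a generic example of the family of approximations, assumed for illustration; it is not the paper's update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def diagonal_gaussian_step(mu, var, idx, vals, y):
    """One online update of an independent (diagonal) Gaussian per weight.

    Only the active coordinates in `idx` are read or written, so the
    per-example cost is O(d_t) regardless of the full dimension d.
    """
    x = np.asarray(vals)
    margin = y * np.dot(mu[idx], x)
    p = sigmoid(-margin)                    # sigma(-y * <mu, x>)
    g = -y * p * x                          # gradient of logistic loss
    h = p * (1.0 - p) * x * x               # diagonal of the Hessian
    mu[idx] -= var[idx] * g                 # variance acts as per-feature step size
    var[idx] = 1.0 / (1.0 / var[idx] + h)   # precision accumulates curvature
    return mu, var
```

The shrinking variance is also the tracked per-feature uncertainty measure: it decreases with the number of examples in which the feature appears.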
However, as demonstrated in Rissanen's (1984) seminal work, minimizing regret is identical to reducing uncertainty, as regret is merely a different representation of uncertainty. Regret can be minimized universally over the possible parameter space through a good choice of prior. Hence, to minimize uncertainty, the role of an approximation is to preserve the effect of such a prior, at least in the region of the distribution that dominates the ultimate posterior at the horizon T. This is a simpler problem than matching the posterior, and it opens possibilities for simpler approximations that can lead to results identical to those of heavy methods such as VB. VB methods are typically used offline to match a tractable posterior to the true one by upper bounding the overall loss. They are computationally involved, requiring either iterative techniques (Bishop, 2006; Blei et al., 2017; Murphy, 2012) like Expectation Maximization (EM) (Dempster et al., 1977; Moon, 1996), or Monte Carlo (MC) sampling, which replaces an analytical expectation by an empirical one over a randomly drawn set. To converge, MC can be combined with gradient descent, either requiring heavy computations or adapting stochastic gradient methods to update posteriors (Broderick et al., 2013; Knowles, 2015; Nguyen et al., 2017a; b). For online problems, the posterior of one example is the prior of the next. To minimize losses, online VB must converge to the optimal approximation at every example; otherwise, additional losses may be incurred, as the algorithm may not converge at each example while the target posterior keeps moving with subsequent examples. Moreover, combining methods that need hyper-parameter tuning defeats the parameter-free nature (Abdellaoui, 2000; Mcmahan & Streeter, 2012; Orabona & Pál, 2016) of Bayesian methods. Most of the Bayesian literature addresses the dense problem, where x_t consists of mostly nonzero entries for every t and the dimension d of the feature space is relatively small.
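To see the per-example cost that MC sampling adds, here is a minimal sketch of a sampled mixture prediction under a diagonal Gaussian posterior: the analytical expectation E_w[sigma(<w, x>)] is replaced by an empirical average over randomly drawn weights, as described above. The function name and sample count are illustrative assumptions; this is the kind of sampling step a closed-form approach avoids.

```python
import numpy as np

def mc_mixture_prob(mu, var, idx, vals, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_w[sigma(<w, x>)] under an independent
    Gaussian posterior with per-coordinate means `mu` and variances `var`.

    Costs n_samples sigmoid evaluations per prediction, versus a single
    closed-form evaluation for an analytical approximation.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(vals)
    # inactive coordinates contribute nothing to <w, x>, so only the
    # active ones need to be sampled: <w, x> ~ N(mean, std^2)
    mean = np.dot(mu[idx], x)
    std = np.sqrt(np.dot(var[idx], x * x))
    z = rng.normal(mean, std, size=n_samples)
    return float(np.mean(1.0 / (1.0 + np.exp(-z))))
```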
Techniques like Gaussian Mixtures (Herbrich et al., 2003; Montuelle et al., 2013; 2014; Rasmussen, 2000; Reynolds et al., 2000; Sung, 2004), which may use VB, usually apply matrix computations on the covariance matrix that are quadratic in d. In many practical problems, however, a very small feature subset is present in each example. For categorical features, only one of the features in the vector is present in any example. Techniques useful for the low dimensional dense problem may thus not be practical.

Paper Contributions: We provide a simple analytical Bayesian method for online sparse logistic and probit regressions with closed form updates. We also generalize the method to dense multidimensional updates, if the problem is not completely sparse. Our results are the first to study regret for Bayesian methods that are simple enough to be applied in practice. They provide an example of the connection between uncertainty and regret, and more broadly the Minimum Description Length (MDL) principle (Grunwald, 2007; Rissanen, 1978a; b; 1984; 1986; Shamir, 2015; 2020). Empirical results demonstrate the advantages of our method over computationally involved methods and over other simpler approximations, achieving both better regret and better loss on real data. As part of the algorithm, uncertainty measures are provided with no added complexity. We specifically demonstrate that it is sufficient for an approximation to focus on the location of the peak of the posterior and its curvature or value, which are most likely to dominate regret, instead of approximating the full posterior, which brings unnecessary complexity while missing the real goal of preserving the effects of a good prior. In fact, approximating the full posterior may eventually lead to poor generalization and overfitting by focusing too much on the tails of the posterior. Our approach directly approximates the posterior, unlike VB methods that approximate it by minimizing an upper bound on the loss.
Finally, our approach leverages the sparsity of the problem itself: each update touches only the d_t features present in the current example.
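To illustrate the kind of analytical sparse update this points to, the sketch below gives a closed-form Gaussian update for online probit regression in the spirit of assumed-density filtering, touching only the active features. This is an illustration under our own assumptions (including the noise scale `beta`), not the paper's exact closed form; it shares the key properties claimed above: analytical (no sampling, no tuned learning rate) and O(d_t) per example.

```python
import math
import numpy as np

def norm_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def norm_cdf(t):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def probit_closed_form_step(mu, var, idx, vals, y, beta=1.0):
    """One closed-form Bayesian update for online probit regression
    with likelihood Phi(y * <w, x> / beta) and an independent Gaussian
    posterior per weight (means `mu`, variances `var`)."""
    x = np.asarray(vals)
    m = np.dot(mu[idx], x)                             # predictive mean
    s = math.sqrt(beta**2 + np.dot(var[idx], x * x))   # predictive std
    t = y * m / s
    v = norm_pdf(t) / norm_cdf(t)                      # mean correction factor
    u = v * (v + t)                                    # variance shrink factor
    mu[idx] += y * x * var[idx] / s * v                # closed-form mean update
    var[idx] *= 1.0 - (x * x) * var[idx] / (s * s) * u # closed-form variance update
    return mu, var
```

Because both updates are single closed-form expressions over the d_t active coordinates, there is no surrogate loss, no inner optimization loop, and the variances double as per-feature uncertainty measures.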

