LOW COMPLEXITY APPROXIMATE BAYESIAN LOGISTIC REGRESSION FOR SPARSE ONLINE LEARNING

Abstract

Theoretical results show that Bayesian methods can achieve lower bounds on regret for online logistic regression. In practice, however, such techniques may not be feasible, especially for very large feature sets. Various approximations must be used that, for huge sparse feature sets, diminish the theoretical advantages. Often, these approximations apply stochastic gradient methods with hyper-parameters that must be tuned on some surrogate loss, defeating the theoretical advantages of Bayesian methods. The surrogate loss, defined to approximate the mixture, requires techniques such as Monte Carlo sampling, increasing the computation per example. We propose low complexity analytical approximations for sparse online logistic and probit regressions. Unlike variational inference and other methods, our methods use analytical closed forms, substantially lowering computation. Unlike dense solutions, such as Gaussian Mixtures, our methods allow for sparse problems with huge feature sets without increasing complexity. With the analytical closed forms, there is also no need for applying stochastic gradient methods on surrogate losses, nor for tuning and balancing learning and regularization hyper-parameters. Empirical results top the performance of the more computationally involved methods. Like such methods, our methods still reveal per-feature and per-example uncertainty measures.

1. INTRODUCTION

We consider online (Bottou, 1998; Shalev-Shwartz et al., 2011) binary logistic regression over a series of rounds $t \in \{1, 2, \ldots, T\}$. At round $t$, a sparse feature vector $x_t \in [-1, 1]^d$ with $d_t \ll d$ nonzero values is revealed, and a prediction for the label $y_t \in \{-1, 1\}$ must be generated. The dimension $d$ can be huge (billions), but $d_t$ is usually tens or hundreds. Logistic regression is used in a huge portion of existing learning problems. It can be used to predict medical risk factors, world phenomena, stock market movements, or click-through-rate in online advertising. The online sparse setup is also very common in these application areas, particularly if predictions need to be streamed in real time as the model keeps updating from newly seen examples. A prediction algorithm attempts to maximize the probabilities of the observed labels. Online methods sequentially learn parameters for the $d$ features. With stochastic gradient methods (Bottou, 2010; Duchi et al., 2011), these are weights $w_{i,t}$ associated with feature $i \in \{1, \ldots, d\}$ at round $t$. Bayesian methods keep track of some distribution over the parameters, and assign an expected mixture probability to the generated prediction (Hoffman et al., 2010; Opper & Winther, 1998). The overall objective is to maximize a sequence likelihood probability, or to minimize its negative logarithm. A benchmark measure of an algorithm's performance is its regret, the excess loss it attains over an algorithm that uses some fixed comparator values $w^* = (w_1, w_2, \ldots, w_d)^T$ ($T$ denoting transpose). A comparator $w^*$ that minimizes the cumulative loss can be picked to measure the regret relative to the best possible comparator in some space of parameter values. Kakade & Ng (2005); Foster et al. (2018); Shamir (2020) demonstrated that, in theory, Bayesian methods are capable of achieving regret, logarithmic in the horizon $T$ and linear in $d$, that even matches regret lower bounds for $d = o(T)$.
Classical stochastic gradient methods are usually implemented as proper learning algorithms, which determine $w_t$ prior to observing $x_t$, and are inferior in the worst case (Hazan et al., 2014), although in many cases, depending on the data, they can still achieve logarithmic regret (Bach, 2010; 2014; Bach & Moulines, 2013). Recent work (Jézéquel et al., 2020) demonstrated non-Bayesian improper gradient based algorithms with better regret. Unfortunately, the superiority of Bayesian methods is diminished by their intractability. A theoretically optimal prior has a diagonal covariance matrix, with each component either uniformly or normally distributed with large variance. The effect of such a prior cannot be maintained in practical online problems with a large sparse feature set, as the posterior induced by such a prior no longer has the properties of the prior, yet must be maintained as the subsequent prior. Gaussian approximations that rely on diagonalization of the covariance must be used. Neither the normal nor the diagonal assumption is true for the real posterior (even with a diagonal prior). They thus lead to performance degradations. Diagonalization is similar to linearization in convex optimization, specifically for stochastic gradient descent (SGD) (Zinkevich, 2003). It allows handling features independently, but limits performance. The Bayesian learning literature focused on applying such methods to predict posterior probabilities, and to provide model (epistemic) uncertainty measurements (Bishop, 2006; Dempster, 1968; Huelsenbeck & Ronquist, 2001; Knill & Richards, 1996). However, the uncertainty of a feature is, in fact, mostly a function of the number of examples in which it was present; a measure that can be tracked, not estimated. Methods such as Variational Bayesian (VB) Inference (Bishop, 2006; Blei et al., 2017; Drugowitsch, 2019; Drugowitsch et al., 2019; Ranganath et al., 2014) track such measurements by matching the posterior.
However, as demonstrated in Rissanen's (1984) seminal work, minimizing regret is identical to uncertainty reduction, as regret is merely a different representation of uncertainty. Regret can be universally minimized over the possible parameter space through a good choice of a prior. Hence, to minimize uncertainty, the role of an approximation is to preserve the effect of such a prior, at least in the region of the distribution that dominates the ultimate posterior at the horizon $T$. This is a simpler problem than matching the posterior, and opens possibilities for simpler approximations that can lead to results identical to those of heavy methods such as VB. VB methods are typically used offline to match a tractable posterior to the true one by upper bounding the overall loss. They are computationally involved, requiring either iterative techniques (Bishop, 2006; Blei et al., 2017; Murphy, 2012) like Expectation Maximization (EM) (Dempster et al., 1977; Moon, 1996), or Monte Carlo (MC) sampling, which replaces an analytical expectation by an empirical one over a randomly drawn set. To converge, MC can be combined with gradient descent, either requiring heavy computations, or adapting stochastic gradient methods to update posteriors (Broderick et al., 2013; Knowles, 2015; Nguyen et al., 2017a; b). For online problems, the posterior of an example is the prior of the subsequent one. To minimize losses, online VB must converge to the optimal approximation at every example. Otherwise, additional losses may be incurred, as the algorithm may not converge at each example while the target posterior keeps moving with subsequent examples. Moreover, combining methods that need to tune hyper-parameters defeats the parameter free nature (Abdellaoui, 2000; Mcmahan & Streeter, 2012; Orabona & Pál, 2016) of Bayesian methods. Most Bayesian literature addressed the dense problem, where $x_t$ consists of mostly nonzero entries for every $t$, and the dimension $d$ of the feature space is relatively small.
Techniques, like Gaussian Mixtures (Herbrich et al., 2003; Montuelle et al., 2013; 2014; Rasmussen, 2000; Reynolds et al., 2000; Sung, 2004) , that may use VB, usually apply matrix computations quadratic in d on the covariance matrix. In many practical problems, however, a very small feature subset is present in each example. For categorical features, only one of the features in the vector is present at any example. Techniques, useful for the low dimensional dense problem, may thus not be practical.

Paper Contributions:

We provide a simple analytical Bayesian method for online sparse logistic and probit regressions with closed form updates. We generalize the method also to dense multidimensional updates, if the problem is not completely sparse. Our results are the first to study regret for Bayesian methods that are simple enough to be applied in practice. They provide an example of the connection between uncertainty and regret, and more broadly the Minimum Description Length (MDL) principle (Grunwald, 2007; Rissanen, 1978a; b; 1984; 1986; Shamir, 2015; 2020). Empirical results demonstrate the advantages of our method over computationally involved methods and over other simpler approximations, achieving both better regret and better loss on real data. As part of the algorithm, uncertainty measures are provided at no added complexity. We specifically demonstrate that it is sufficient for an approximation to focus on the location of the peak of the posterior and its curvature or value, which are most likely to dominate regret, instead of approximating the full posterior, which adds unnecessary complexity and misses the real goal of preserving the effects of a good prior. In fact, approximating the full posterior may eventually lead to poor generalization and overfitting by focusing too much on the tails of the posterior. Our approach directly approximates the posterior, unlike VB methods that approximate it by minimizing an upper bound on the loss. Finally, our approach leverages sparsity to solve a sparse problem.

Related Work:

The simplest single dimensional online logistic regression problem ($d = 1$ and $x_{1,t} = 1, \forall t$) was widely studied. Jeffreys' prior, $\rho(\theta) = 1/[\pi\sqrt{\theta(1-\theta)}]$, is asymptotically optimal (Clarke & Barron, 1994; Xie & Barron, 1997; 2000; Drmota & Szpankowski, 2004). It can be expressed in terms of log-odds weights $w$ as $\rho(w) = e^{w/2}/[\pi(1 + e^w)]$. Applying a mixture leads to the Krichevsky & Trofimov (1981) (KT) add-1/2 estimator $Q(y_t | y^{t-1}) = [n_{t-1}(y_t) + 0.5]/t$, where $n_{t-1}(y_t)$ counts occurrences of $y_t$. We use $y^t$ to denote a sequence from 1 to $t$. Applying this prior in a Follow The Regularized Leader (FTRL) setting (McMahan, 2011) also leads to the KT estimator. This raised the question of whether regret optimality generalizes to large dimensions (McMahan & Streeter, 2012). Hazan et al. (2014) showed that this is not the case for proper methods. The bounds, however, theoretically generalize for Bayesian methods (Kakade & Ng, 2005; Foster et al., 2018; Shamir, 2020), with a large variance Gaussian or uniform prior with diagonal covariance. Peaked priors fail because, for each feature in an example, the other features provide a self excluding log-odds prior that shifts the relation between the overall distribution and the feature weight. While wide priors are good theoretically, because of the intractability of the Bayesian mixture integrals, the diagonal approximations that are used unfortunately degrade their effect. Bayesian methods have been studied extensively for estimating posteriors and uncertainty (Bishop, 2006; Makowski et al., 2002; Sen & Stoffa, 1996). There is ample literature researching such techniques in deep networks (see, e.g., Blundell et al. (2015); Hwang et al. (2020); Kendall & Gal (2017); Lakshminarayanan et al. (2017); Malinin & Gales (2018); Wilson (2020)). Most of the work focuses on the ultimate posterior after the full training dataset has been visited.
One attempts to leverage the uncertainty measurements to aid inference on unseen examples. Techniques like expectation propagation (EP) (Minka, 2001; 2013) (see also Bishop (2006); Chu et al. (2011); Cunningham et al. (2011); Graepel et al. (2010)) and VB are used to generate estimates of the posterior. In a dense setup, where there is a relatively small number of features (or units in a deep network), Gaussian Mixture models can also be learned, where a jointly Gaussian posterior is learned, usually with some kernel that reduces the dimensionality of the parameters that are actually being trained. Such methods, however, do not fit the sparse online setup. Variational methods are derived by utilizing Jensen's inequality to upper bound the loss of the expectation by the expectation of the negative logarithm of the product of the prior $\rho(w)$ and the data likelihood $P(y^T | x^T, w)$. Normalizing this joint by the expected label sequence probability gives the posterior $P(w | x^T, y^T)$. Then, a posterior $Q(\cdot)$ with a desired form is matched by minimizing the KL divergence $KL(Q \| P)$, which decomposes into an expectation w.r.t. $Q(\cdot)$ over the loss on $y^T$, and $KL(Q \| \rho)$ between the approximated posterior and the true prior. The first term may require techniques like the iterative mean field approximation EM (Bishop, 2006; Jaakkola & Jordan, 1998), or MC sampling, to be approximated. Gradient methods can also minimize $KL(Q \| P)$. In the sparse setup, it is standard to assume a diagonal $Q(\cdot)$. In an online setting, the process can be iterated over the examples (or mini-batches), where the posterior at $t$ is the prior at $t + 1$. Computing the approximate posterior may be very expensive if done for every example. SGD can be used with MC sampling, but that would incur additional losses, as the posterior changes between successive examples. Unlike VB, EP minimizes the opposite divergence $KL(P \| Q)$ between the posterior and its approximation.

2. PRELIMINARIES

Let $\rho_t(w)$ be the prior on the weights at round $t$, where we start by initializing some $\rho_1(w)$. We will assume that $\rho(\cdot)$ is approximated by a diagonal covariance Gaussian, with means $\mu_{i,t}$ and variances $\sigma^2_{i,t}$ for component $i$ at time $t$. Leveraging results in Kakade & Ng (2005); Foster et al. (2018); Shamir (2020), if we restrict $w_i \in [-B, B]$, a uniform prior over this interval or a normal prior with standard deviation proportional to $B$ can be picked. (To approximate a Dirichlet-1/2, a 0-mean normal prior with variance $2\pi$ can be used.) Observing sparse $x_t$, the prediction for $y_t$ is given by

$$p_t = P(y_t | x_t) = \int_w p(y_t | x_t, w)\, \rho_t(w)\, dw = \int_w p_t(y_t, w | x_t)\, dw, \qquad (1)$$

where for binary logistic regression, the probability of the label given the example and weights is given by the Sigmoid of the label-weighted dot product of the example and weights:

$$p(y_t | x_t, w) = \frac{1}{1 + \exp\left(-y_t x_t^T w\right)} = \sigma\left(y_t x_t^T w\right). \qquad (2)$$

The expected prediction $p_t$ in (1) marginalizes out the weights $w$ according to the prior $\rho_t(\cdot)$ from the joint probability of $w$ and $y_t$. The prediction $p_t$ is also a function of the sequence of all prior pairs $\{x^{t-1}, y^{t-1}\}$ through the prior $\rho_t(\cdot)$. After observing $y_t$, we try to match a (diagonal) posterior $Q(\cdot)$ to the weights that will equal the next round's prior:

$$\rho_{t+1}(w) = Q_t(w) \approx p(w | x_t, y_t) = \frac{p(y_t | x_t, w)\, \rho_t(w)}{P(y_t | x_t)} = \frac{p(y_t | x_t, w)\, \rho_t(w)}{p_t}. \qquad (3)$$

Using $S^T = \{x^T, y^T\}$, the logarithmic loss incurred by approximation $Q(\cdot)$ on the sequence of predictions is

$$L(S^T, Q) = -\sum_{t=1}^T \log p_t. \qquad (4)$$

Let $w^*$ be some fixed comparator in the space of parameter values. Then, the regret of approximation $Q(\cdot)$ relative to comparator $w^*$ is given by

$$R(S^T, Q, w^*) = L(S^T, Q) - L(S^T, w^*) = -\sum_{t=1}^T \log p_t + \sum_{t=1}^T \log\left(1 + \exp\left(-y_t x_t^T w^*\right)\right).$$

The regret can measure the excess loss relative to the best possible comparator, if such a $w^*$ is chosen.
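To make the loss and regret definitions concrete, here is a minimal Python sketch (the function names are ours, not part of the paper) that evaluates the logarithmic loss of a sequence of predictions and the regret relative to a fixed comparator $w^*$:

```python
import math

def log_loss(preds):
    # L(S^T, Q) = -sum_t log p_t, where p_t is the probability the
    # algorithm assigned to the observed label y_t.
    return -sum(math.log(p) for p in preds)

def comparator_loss(xs, ys, w_star):
    # Loss of a fixed comparator w*: sum_t log(1 + exp(-y_t x_t^T w*)).
    total = 0.0
    for x, y in zip(xs, ys):
        dot = sum(wi * xi for wi, xi in zip(w_star, x))
        total += math.log(1.0 + math.exp(-y * dot))
    return total

def regret(preds, xs, ys, w_star):
    # R(S^T, Q, w*) = L(S^T, Q) - L(S^T, w*).
    return log_loss(preds) - comparator_loss(xs, ys, w_star)
```

For instance, predicting 0.5 on every round against the zero comparator yields zero regret, since both incur loss $\log 2$ per example.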

3. MARGINALIZED BAYESIAN GAUSSIAN APPROXIMATION

In this section, we describe the proposed method. First, the Sigmoid is approximated by a normal Cumulative Distribution Function (CDF). A prediction for the label of the current example is generated by shrinking the cumulative mean score as a function of the cumulative variance over all features. The main idea for updating feature distributions is marginalizing away all other covariates for each feature in an example at a given round, such that the mean and variance of the feature can be updated to match the location of the peak, and either its curvature or value, of the true marginalized posterior. In Appendix B, we demonstrate the same approach for Probit Regression. It follows the same steps, except that it does not require the initial approximation. Finally, Appendix D.1 shows how a similar approximation methodology can be used to apply simple multi-dimensional updates instead of marginalized ones, which can be performed when sparsity is limited.

Gaussian Approximation of a Sigmoid: The relation between the logistic distribution and the Normal one was well studied in the statistics literature (see, e.g., Bishop (2006); Murphy (2012)). The Sigmoid function in (2) can be viewed as a CDF, which can be approximated by a normal CDF $\Phi(z)$. (The inverse of $\Phi(\cdot)$ is the Probit function.) The derivative of the Sigmoid function is the 0-mean Logistic Probability Density Function (PDF). Matching the PDFs, we have $e^w/(1+e^w)^2 \approx (1/\sqrt{2\pi\sigma^2}) \exp\left(-w^2/(2\sigma^2)\right)$. This yields that the Sigmoid function can be approximated by a 0-mean Gaussian CDF with variance $8/\pi$. Using the standard 0-mean normal $\Phi(\cdot)$ function, the argument is scaled by the inverse of the standard deviation, $\sqrt{\pi/8}$, giving

$$\sigma(w) = \frac{1}{1 + e^{-w}} \approx \Phi\left(\sqrt{\frac{\pi}{8}} \cdot w\right). \qquad (5)$$

More details about this approximation are in Appendix A.
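The quality of this approximation is easy to check numerically. The sketch below (names are ours) compares the Sigmoid to $\Phi(\sqrt{\pi/8}\, w)$ on a grid, using the error function to evaluate the normal CDF; the maximum gap stays below the 0.02 bound discussed in Appendix A:

```python
import math

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

def probit_approx(w):
    # Phi(sqrt(pi/8) * w), with Phi the standard normal CDF
    # evaluated via the error function.
    z = math.sqrt(math.pi / 8.0) * w
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Largest absolute gap between the two curves on a dense grid.
max_gap = max(abs(sigmoid(w) - probit_approx(w))
              for w in [i / 100.0 for i in range(-800, 801)])
```

The two functions agree exactly at the origin, and the gap peaks at roughly 0.018 a couple of units away from it.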
Approximation approach and some notation: With the diagonal and Gaussian assumptions, for each sparse example (with only $d_t \ll d$ nonzero entries in $x_t$), we can assume that we have a single normal random variable, whose mean is the $x_t$-weighted mean, and whose variance is the quadratically weighted sum of variances. Denote the example's total weight, mean, and variance by

$$w_t = \sum_{i=1}^d x_{i,t} w_{i,t}, \qquad \mu_t = \sum_{i=1}^d x_{i,t} \cdot \mu_{i,t}, \qquad \sigma_t^2 = \sum_{i=1}^d x_{i,t}^2 \cdot \sigma_{i,t}^2 \qquad (6)$$

(where the diagonalization assumption is important for the simplicity of the approximation of $\sigma_t^2$). Since we consider a sparse problem, there is benefit to breaking the dependencies between features present in a given example and updating each independently. We can achieve that by marginalizing the prior at $t$ over all other features. Because we assume all features are jointly independent Gaussians, we can break the joint prior into a product of two components: one, the marginal of the feature, and the other, the marginal of all other features together, i.e., the self excluding prior. To match the posterior, we then marginalize over the latter, and match a single dimensional posterior for each feature. We define the self excluding prior for feature $i$ at time $t$, its mean, and its variance as

$$w_{-i,t} = \sum_{j=1}^d x_{j,t} w_{j,t} - x_{i,t} w_{i,t} = \sum_{j \neq i} x_{j,t} w_{j,t}; \quad \mu_{-i,t} = \mu_t - x_{i,t} \mu_{i,t}; \quad \sigma_{-i,t}^2 = \sigma_t^2 - x_{i,t}^2 \sigma_{i,t}^2. \qquad (7)$$

Prediction: With the probit approximation in (5) and the single dimensional variable $w_t$, we can compute $p_t$ in (1), replacing $p(y_t | x_t, w)$ in (2) by a normal CDF. Approximating this integral (see, e.g., Murphy (2012), Section 8.4.4.2, and Bishop (2006)) gives

$$p_t \approx \sigma\left(\frac{y_t \mu_t}{\sqrt{1 + \frac{\pi}{8} \sigma_t^2}}\right). \qquad (8)$$

This result demonstrates how the prediction variance shrinks the prediction towards probability 0.5.

Marginalization: Given the diagonalization assumption, the prior at $t$ can be expressed as $\rho_t(w) = \rho_{i,t}(w_i) \cdot \rho_{-i,t}(w_{-i})$, where $\rho_{-i,t}(\cdot)$ is the prior on the self excluding weight $w_{-i,t}$.
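A minimal sketch of the per-example statistics (6), the shrunk prediction (8), and the self excluding quantities (7), assuming sparse examples stored as dicts mapping feature index to value (the function names are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def example_stats(x, mu, var):
    # Eq. (6): example mean and variance; x is a sparse example,
    # mu/var hold per-feature means and variances.
    mu_t = sum(v * mu[i] for i, v in x.items())
    var_t = sum(v * v * var[i] for i, v in x.items())
    return mu_t, var_t

def predict(y, mu_t, var_t):
    # Eq. (8): the variance shrinks the prediction toward 0.5.
    return sigmoid(y * mu_t / math.sqrt(1.0 + math.pi / 8.0 * var_t))

def self_excluding(i, x, mu, var, mu_t, var_t):
    # Eq. (7): mean and variance with feature i removed.
    return mu_t - x[i] * mu[i], var_t - x[i] ** 2 * var[i]
```

Note that a nonzero variance always pulls the prediction closer to 0.5 than the mean score alone would give.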
Hence, $p(y_t, w | x_t) = p(y_t | x_t, w)\, \rho_{i,t}(w_i)\, \rho_{-i,t}(w_{-i})$. Marginalizing over $w_{-i}$ gives

$$p(y_t, w_i | x_t) = \rho_{i,t}(w_i) \int_{-\infty}^{\infty} p(y_t | x_t, w)\, \rho_{-i,t}(w_{-i})\, dw_{-i} = \rho_{i,t}(w_i) \cdot I_{w_{-i,t}}. \qquad (10)$$

The inner integral, which marginalizes over $w_{-i}$ with its prior $\rho_{-i,t}(w_{-i})$, can be approximated by

$$\begin{aligned}
I_{w_{-i,t}} &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma_{-i,t}^2}} \exp\left(-\frac{(w_{-i} - \mu_{-i,t})^2}{2\sigma_{-i,t}^2}\right) \cdot \sigma\left[y_t (x_{i,t} w_i + w_{-i})\right] dw_{-i} \\
&\overset{(a)}{\approx} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma_{-i,t}^2}} \exp\left(-\frac{(w_{-i} - \mu_{-i,t})^2}{2\sigma_{-i,t}^2}\right) \cdot \Phi\left(\sqrt{\frac{\pi}{8}}\, y_t (x_{i,t} w_i + w_{-i})\right) dw_{-i} \\
&\overset{(b)}{=} \int_{-\infty}^{\infty} \phi(z) \cdot \Phi\left(\sqrt{\frac{\pi}{8}}\, y_t (x_{i,t} w_i + \mu_{-i,t} + \sigma_{-i,t} z)\right) dz \\
&\overset{(c)}{=} \Phi\left(\frac{\sqrt{\frac{\pi}{8}}\, y_t (\mu_{-i,t} + x_{i,t} w_i)}{\sqrt{1 + \frac{\pi}{8}\sigma_{-i,t}^2}}\right) \overset{(d)}{\approx} \sigma\left(\frac{y_t (\mu_{-i,t} + x_{i,t} w_i)}{\sqrt{1 + \frac{\pi}{8}\sigma_{-i,t}^2}}\right). \qquad (11)
\end{aligned}$$

Step (a) follows, again, from the approximation in (5). For (b), we apply the change of variables $z = (w_{-i} - \mu_{-i,t})/\sigma_{-i,t}$, where $\phi(\cdot)$ is the standard Gaussian PDF. The integral in (b) gives $\Phi\left(a/\sqrt{1 + b^2}\right)$, with $a = \sqrt{\frac{\pi}{8}}\, y_t (\mu_{-i,t} + x_{i,t} w_i)$ and $b^2 = \frac{\pi}{8}\sigma_{-i,t}^2$, leading to (c). Finally, the approximation in (5) is used to go back from a normal CDF to a Sigmoid in (d).

Posterior: The posterior on $w_i$ is given by plugging (11) into (10), normalizing by $p_t$ given in (1):

$$\rho_{i,t+1}(w_i) = Q_{i,t}(w_i) \approx p(w_i | x_t, y_t) = \frac{1}{p_t} \cdot \rho_{i,t}(w_i) \cdot \sigma\left(\frac{y_t (\mu_{-i,t} + x_{i,t} w_i)}{\sqrt{1 + \frac{\pi}{8}\sigma_{-i,t}^2}}\right). \qquad (12)$$

The approximation on the right implies matching the current true posterior with the $i$th component of the approximate posterior $Q(\cdot)$. It can be simplified to

$$\frac{1}{\sigma_{i,t+1}} \exp\left(-\frac{(w_i - \mu_{i,t+1})^2}{2\sigma_{i,t+1}^2}\right) \approx \frac{1}{p_t \sigma_{i,t}} \exp\left(-\frac{(w_i - \mu_{i,t})^2}{2\sigma_{i,t}^2}\right) \cdot \sigma\left(\frac{y_t (\mu_{-i,t} + x_{i,t} w_i)}{\sqrt{1 + \frac{\pi}{8}\sigma_{-i,t}^2}}\right). \qquad (13)$$

Approximations: Because the functional form of the posterior is not Gaussian, there are multiple ways to fit a Gaussian. We review alternatives in Appendix D. However, we want to ensure that the regions of the true posterior we are most likely to converge to at the horizon are not scaled down too much, as this will incur additional loss.
It is thus desirable to match the peak of the true posterior with the peak of the approximation. One method is to match both the location and height of the peak. The other, Laplace approximation (Bishop, 2006), matches the location and curvature at the peak. Both methods give the same approximation for $\mu_{i,t+1}$, but somewhat different ones for $\sigma_{i,t+1}^2$. To find $\mu_{i,t+1}$, we find the $w_i$ that maximizes the r.h.s. of (13), or minimizes its negative logarithm. Let

$$p_{i,t+} = \sigma\left(\frac{y_t (\mu_{-i,t} + x_{i,t}\mu_{i,t+1})}{\sqrt{1 + \frac{\pi}{8}\sigma_{-i,t}^2}}\right) = \left[1 + \exp\left(-\frac{y_t (\mu_{-i,t} + x_{i,t}\mu_{i,t+1})}{\sqrt{1 + \frac{\pi}{8}\sigma_{-i,t}^2}}\right)\right]^{-1} \qquad (14)$$

be almost $p_t$ in (8), except that $\mu_{i,t+1}$ replaces $\mu_{i,t}$ and $\sigma_{-i,t}^2$ replaces $\sigma_t^2$. Thus $p_{i,t+}$ is the probability predicted for $y_t$ if we update $\mu_{i,t}$ and shrink as a function of $\sigma_{-i,t}^2$. The minimization gives

$$\mu_{i,t+1} = \mu_{i,t} + \frac{y_t x_{i,t} \sigma_{i,t}^2}{\sqrt{1 + \frac{\pi}{8}\sigma_{-i,t}^2}} \cdot (1 - p_{i,t+}). \qquad (15)$$

Eq. (15) can be solved iteratively, where Newton's method can be used, as described in Appendix C. The solution for $\mu_{i,t+1}$ can also be expressed in terms of the $r$ generalized Lambert W function (Corless et al., 1996; Mezo & Baricz, 2015). Alternatively, to avoid multiple iterations per update when using Newton's method, we can use a Taylor series approximation of $1 - p_{i,t+}$ around $1 - p_{i,t}$, where

$$p_{i,t} = \sigma\left(\frac{y_t (\mu_{-i,t} + x_{i,t}\mu_{i,t})}{\sqrt{1 + \frac{\pi}{8}\sigma_{-i,t}^2}}\right) = \sigma\left(\frac{y_t \mu_t}{\sqrt{1 + \frac{\pi}{8}\sigma_{-i,t}^2}}\right). \qquad (16)$$

Like $p_{i,t+}$, $p_{i,t}$ is not $p_t$. Instead, it is the probability of $y_t$ as projected by the means of the weights at $t$, shrunk as a function of $\sigma_{-i,t}^2$ instead of $\sigma_t^2$. More importantly, it depends only on parameters before the update at $t + 1$ is applied, giving a closed form solution. Applying a first order approximation, we have

$$\mu_{i,t+1} = \mu_{i,t} + \frac{\dfrac{y_t x_{i,t} \sigma_{i,t}^2 (1 - p_{i,t})}{\sqrt{1 + \frac{\pi}{8}\sigma_{-i,t}^2}}}{1 + \dfrac{y_t^2 x_{i,t}^2 \sigma_{i,t}^2 (1 - p_{i,t})\, p_{i,t}}{1 + \frac{\pi}{8}\sigma_{-i,t}^2}}. \qquad (17)$$
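The closed-form update (17) takes only a few lines. The following sketch (names are ours) computes one mean update from the current mean and variance of feature $i$ and the self excluding mean and variance:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_update(mu_i, var_i, x_i, y, mu_ex, var_ex):
    # One closed-form update of mu_{i,t} per eq. (17), using the
    # first-order Taylor approximation around p_{i,t}; mu_ex and
    # var_ex are the self-excluding mean and variance of eq. (7).
    s = math.sqrt(1.0 + math.pi / 8.0 * var_ex)
    p = sigmoid(y * (mu_ex + x_i * mu_i) / s)          # p_{i,t}, eq. (16)
    num = y * x_i * var_i * (1.0 - p) / s
    den = 1.0 + (y * x_i) ** 2 * var_i * (1.0 - p) * p / (s * s)
    return mu_i + num / den
```

With a unit-variance feature, a zero self excluding prior, and a positive label, a single update moves the mean from 0 to 0.4, since $p_{i,t} = 0.5$ gives a numerator of 0.5 and a denominator of 1.25.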
It may be simpler to store the precision $1/\sigma_{i,t}^2$, in which case (17) may be easier to compute by normalizing both numerator and denominator by $\sigma_{i,t}^2$, applying this normalization to the right term of the denominator. Second or higher order approximations can also be applied, but may not be necessary, as the first order one already gives performance identical to the iterative method. After updating $\mu_{i,t+1}$, we can apply (13) to update $\sigma_{i,t+1}$. Plugging (15) into (13), we solve for $\sigma_{i,t+1}$:

$$\sigma_{i,t+1} = \frac{p_t \sigma_{i,t}}{p_{i,t+}} \cdot \exp\left(\frac{(\mu_{i,t+1} - \mu_{i,t})^2}{2\sigma_{i,t}^2}\right) = \frac{p_t \sigma_{i,t}}{p_{i,t+}} \cdot \exp\left(\frac{y_t^2 x_{i,t}^2 \sigma_{i,t}^2}{2\left(1 + \frac{\pi}{8}\sigma_{-i,t}^2\right)} \cdot (1 - p_{i,t+})^2\right). \qquad (18)$$

Alternatively to (18), the Laplace approximation can be used by finding the second derivative of the negative logarithm of the posterior, giving

$$\sigma_{i,t+1}^2 = \left[\frac{1}{\sigma_{i,t}^2} + \frac{y_t^2 x_{i,t}^2}{1 + \frac{\pi}{8}\sigma_{-i,t}^2} \cdot p_{i,t+} \cdot (1 - p_{i,t+})\right]^{-1}. \qquad (19)$$

The procedures described are summarized in Algorithm 1. In Appendix D, we describe several different methods and approximations that can be used, including one that applies the same approximation steps we applied in this section without the marginalization. As the empirical results in the next section show, however, there is no performance advantage to applying any of the more involved methods.

Algorithm 1 Marginalized Bayesian Gaussian Approximation
1: procedure MARGINALIZED BAYESIAN GAUSSIAN APPROXIMATION(Parameters: $\mu_0$, $\sigma_0^2$)
2:   $\forall i \in \{1, \ldots, d\}$: $\mu_{i,1} \leftarrow \mu_0$, $\sigma_{i,1}^2 \leftarrow \sigma_0^2$.
3:   for $t = 1, 2, \ldots, T$ do
4:     Get $x_t$.
5:     Compute $\mu_t$, $\sigma_t^2$ with (6).
6:     Generate $p_t$ for $y_t \in \{-1, 1\}$ with (8).
7:     Observe $y_t$.
8:     for $i : x_{i,t} \neq 0$ do
9:       Compute $p_{i,t}$, $p_{i,t+}$ with (16) and (14), respectively, using $\mu_{i,t+1} = \mu_{i,t}$ in (14).
10:      Iterate on (15) and (14) with Newton's method, or use (17), to update $\mu_{i,t+1}$.
11:      Update $\sigma_{i,t+1}^2$ with either (18) or (19).
12:    end for
13:  end for
14: end procedure
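Putting the pieces together, the following is a compact Python sketch of Algorithm 1, using the Taylor mean update (17) and the Laplace variance update (19); the structure and the $2\pi$ default prior variance follow the text, while the function name and data layout are ours:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def marginalized_bga(examples, labels, mu0=0.0, var0=2.0 * math.pi):
    # Sketch of Algorithm 1. Examples are sparse dicts {feature: value};
    # labels are in {-1, +1}. Returns the final per-feature means and
    # variances, plus the per-round predicted label probabilities.
    mu, var, preds = {}, {}, []
    c = math.pi / 8.0
    for x, y in zip(examples, labels):
        for i in x:
            mu.setdefault(i, mu0)
            var.setdefault(i, var0)
        # Per-example mean and variance, eq. (6).
        mu_t = sum(v * mu[i] for i, v in x.items())
        var_t = sum(v * v * var[i] for i, v in x.items())
        # Shrunk prediction for the observed label, eq. (8).
        preds.append(sigmoid(y * mu_t / math.sqrt(1.0 + c * var_t)))
        for i, x_i in x.items():
            # Self excluding mean and variance, eq. (7).
            mu_ex = mu_t - x_i * mu[i]
            var_ex = var_t - x_i * x_i * var[i]
            s2 = 1.0 + c * var_ex
            s = math.sqrt(s2)
            p = sigmoid(y * (mu_ex + x_i * mu[i]) / s)     # eq. (16)
            # Closed-form first-order mean update, eq. (17).
            num = y * x_i * var[i] * (1.0 - p) / s
            den = 1.0 + (y * x_i) ** 2 * var[i] * (1.0 - p) * p / s2
            mu[i] += num / den
            # Laplace variance update, eq. (19), with p_{i,t+} from (14).
            p_plus = sigmoid(y * (mu_ex + x_i * mu[i]) / s)
            var[i] = 1.0 / (1.0 / var[i]
                            + (y * x_i) ** 2 * p_plus * (1.0 - p_plus) / s2)
    return mu, var, preds
```

On a stream of repeated positive labels for a single feature, the mean grows positive, the variance shrinks monotonically, and the predicted probability of the observed label rises from 0.5.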

4. NUMERICAL RESULTS

To benchmark the regret of an algorithm, one needs the ground truth of the real or loss minimizing parameters. On real data, the loss minimizing parameters are unknown. Furthermore, true benchmark datasets can consist of non-stationary data, and the feature sets selected by a model one trains may misspecify the "true" features of such a model. Thus real data may not give a clean evaluation of the proposed methods. We present results on a benchmark dataset at the end of this section, but to measure regret performance of different algorithms, we present results on synthetic data.

Synthetic Data: We simulated data by setting $d$ features with true log-odds weights that were drawn randomly with some prior (or with multiple priors, each governing a subset of the features). At $t$, $d_t \ll d$ features were selected at random from the set of $d$ features, where different rules were used for different simulations to draw the $d_t$ features. In the fully random case, we set a random fraction parameter $\alpha = d_t/d$, and each feature was activated with probability $\alpha$. We either used binary features $x_{i,t} \in \{0, 1\}$, or also drew $x_{i,t}$ randomly in $[0, 1]$. For categorical features, the $d$ features were partitioned into categories, and for every example, one or a preset number of features from each category were randomly selected, with different types of randomness, including fast decaying long tail distributions. Let $\theta$ be the vector of true parameters; then $\Pr(Y_t = 1)$ was computed as $\sigma(x_t^T \theta)$. This probability was used to randomly draw $y_t$. We used Algorithm 1, as well as the other algorithms described in Appendix D, to sequentially predict $y_t$ and update. For gradient methods, we used Stochastic Gradient Descent (SGD) with AdaGrad scheduling (Duchi et al., 2011). We ran grids of hyper-parameters for all algorithms to find optimal ones, and we show results for these optimal hyper-parameters for all algorithms. Since we know the true weights, we can measure regret directly relative to them. Labels in the graphs designate the algorithm and which mean update was used ((15) or (17)).
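The fully random binary-feature generator described above can be sketched as follows (a minimal Python version; the function name and defaults are ours):

```python
import math
import random

def gen_synthetic(d, T, alpha, weight_std=2.0, seed=0):
    # Fully random sparse binary features: each of the d features is
    # active with probability alpha; true log-odds weights theta are
    # drawn i.i.d. N(0, weight_std^2); labels are drawn with
    # Pr(Y_t = 1) = sigmoid(x_t^T theta).
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, weight_std) for _ in range(d)]
    examples, labels = [], []
    for _ in range(T):
        x = {i: 1.0 for i in range(d) if rng.random() < alpha}
        score = sum(theta[i] for i in x)
        p_pos = 1.0 / (1.0 + math.exp(-score))
        examples.append(x)
        labels.append(1 if rng.random() < p_pos else -1)
    return theta, examples, labels
```

Since `theta` is known, the regret of any online algorithm run on this stream can be measured directly against the true weights.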
Labels also designate the prior used ($\sigma_0^2$ and $\mu_0$), and which variance approximation was used (G for (18) and L for (19)). Reference results are shown for SGD, the multi-dimensional Gaussian approximation update (DimGauss) described in Appendix D.1, EP using Assumed Density Filtering (ADF), and marginalized VB (VBApprox), described in Appendix D.3. Fig. 1 shows normalized regret for two different true data configurations, described at the top of each graph, with random binary features. More detailed results on multiple configurations are shown in Appendix E. Unfortunately, unlike the theoretical results (Shamir, 2020), the prior has to fit the data for all Gaussian approximation methods to achieve good regret. This is true for any known practical Bayesian method, as well as for the learning parameters of SGD. However, if we choose a prior that matches the true prior, regret rates logarithmic in $T$, usually close to the lower bound of $0.5 \log T$ per parameter, are achieved in all experiments with Algorithm 1. The same results are achieved whether we use the iterative version in (15) or its simpler approximation (17), and whether the variance is updated by (18) or (19). Unlike Algorithm 1, both DimGauss and ADF appear to be optimized by priors that are different from the true one, and that depend on the fraction or number of features that occur in each example. DimGauss requires a larger prior with more features. As the sparsity is reduced, DimGauss with its optimal prior seems to improve relative to Algorithm 1. This is expected, as the sparsity assumption becomes less valid. Algorithm 1 outperforms both ADF and SGD in all cases. We can find SGD hyper-parameters that seem to still exhibit logarithmic regret for each configuration, but they are inferior to Algorithm 1, with increasing gaps with more active features. Table 1 shows both execution runtimes and regret coefficients $r_T = R_T / \log T$ for the full simulations with the algorithms and configurations in Fig. 1.
Simulations were run on a single Ubuntu machine and included synthetic data generation and similar outputs for all algorithms compared. DimGauss, VB, and ADF may have a slight advantage, as they were implemented with the Eigen package, which is highly optimized for matrix operations. We show benchmarks for $10^6$ and $10^7$ examples. We observe roughly equal runtimes for SGD, ADF, DimGauss, and Algorithm 1, with a slight advantage to ADF, which may be due to the highly optimized matrix operations. Good regret is obtained for both methods of Algorithm 1, but DimGauss slowly improves over Algorithm 1. This is because selecting 20 or 40 features out of 200 results in repetitions of co-occurrences as more data is observed, and DimGauss utilizes these co-occurrences. We show below, and also observe from the results in Appendix E, however, that with higher degrees of sparsity the DimGauss algorithm degrades, while Algorithm 1 retains its advantages. Interestingly, the Newton method in the Gauss algorithm does not require more time than its Taylor approximation, and both regret and runtime are matched between the two. This is attributed to the fact that the Newton method requires very few iterations to converge. Boldfaced in the table are the poor runtime results of the VB methods (shown with 100 and 1000 samples), whose worst-case complexity is $O(d_t N J T)$, where $N$ is the number of samples and $J$ is the maximal number of iterations. The table illustrates how runtime increases substantially relative to all other algorithms, whose complexity is $O(d_t T)$. Regret only approaches that of Algorithm 1 with 1000 samples, and is far inferior with 100 samples. Fig. 2 (left) shows normalized regret of the different methods for a synthetic categorical model with 22 categories. Category $j$, $j = 1, 2, \ldots$, consists of $2^j$ features (a total of over 8M).
For each example, a single feature is randomly drawn from each category with a Zipf distribution $c/(n+1)^{1.75}$, where $c$ is a normalizer and $n$ is the feature index in its category. The true weights are randomly drawn as before, with standard deviation 2. This gives a long tail distribution over features, where for each category small indices are drawn more often, but many features from the long tail do occur in examples. This models a realistic sparse dataset. Algorithm 1 outperforms the other methods, and even the best configurations of the DimGauss algorithm are inferior due to the sparsity.

Criteo display advertising challenge benchmark Dataset:

The right graph in Fig. 2 gives the relative percent aggregate loss performance of the Bayesian algorithms relative to the best configuration we found for AdaGrad SGD on the Criteo dataset. We trained all algorithms on the over 45M examples in this dataset, which consists of 13 integer valued features and 26 categorical features with different category counts. For each example, we generated a prediction, computed its log loss on the label, and applied an update. The aggregate log loss is a sum of data uncertainty and regret. The first term is equal for all algorithms and linear in the size of the data, while the second, the regret, is sub-linear in the data size for a good algorithm. Thus even small noticeable percent improvements imply possible substantial improvements in regret. With only linear models, Algorithm 1 achieved 0.465 progressive validation log loss, which is better, for example, than results reported using deep networks in Cheng et al. (2016). We observe advantages of the Bayesian methods over SGD, where Algorithm 1 was superior to all methods. The DimGauss method slowly degrades relative to the other methods due to the sparsity of some of the categorical features.

5. CONCLUSIONS

We introduced a simple Bayesian mixture diagonal Gaussian approximation method based on marginalization for sparse online logistic regression and probit regression, which attempts to retain the effects of a good prior around the optimal values of the weights. The method does not require the complexities of standard Bayesian methods, such as VB, but was empirically shown to achieve regret rates as good or even better. With proper priors, empirical results were close to regret lower bounds, and superior to other Bayesian methods even when those were measured with their best choices of priors. Given the strong relation between regret and uncertainty, this approach also gives good uncertainty estimates. The methodology was proposed for logistic regression and extended to probit regression, but can be further extended to other settings. We also demonstrated a matching approach that performs high dimensional updates and can be used for dense problems.

A RELATION BETWEEN GAUSSIAN AND SIGMOID

The Sigmoid function, which converts log-odds to probability, is very close in shape to the Gaussian Cumulative Distribution Function (CDF) Φ(z), as is well established in the statistics literature (see, e.g., Bishop (2006); Murphy (2012)). The derivative of the Sigmoid function is given by

$$\frac{d\,\mathrm{Sigma}(w)}{dw} = \frac{e^w}{(1+e^w)^2} \qquad (20)$$

and equals the PDF of a 0-mean Logistic distribution. We can approximate the logistic PDF by a Gaussian by matching the PDFs,

$$\frac{e^w}{(1+e^w)^2} \approx \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{w^2}{2\sigma^2}\right). \qquad (21)$$

Matching the distributions at w = 0 yields σ² = 8/π,

$$\frac{e^w}{(1+e^w)^2} \approx \frac{1}{4}\exp\left(-\frac{\pi w^2}{16}\right) = \sqrt{\frac{\pi}{8}}\cdot\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{w^2}{2\cdot 8/\pi}\right). \qquad (22)$$

Thus, we can approximate the Sigmoid with a 0-mean Gaussian CDF with variance 8/π, giving (5). It remains to demonstrate that the PDFs (and CDFs) are close to each other not only at the peak. Figure 3 demonstrates the approximation of the Logistic distribution (left) and the Sigmoid function (middle) by a normal PDF and a normal CDF, both with variance 8/π. The green curve shows the differences between the logistic/Sigmoid and the normal, which are also plotted on the right at larger scale. The magnitude of the difference between the Sigmoid and the normal CDF is bounded by 0.02 over the whole region. The differences are asymmetric around the origin, and are particularly small within 1/3 of a standard deviation from the origin. While they can accumulate over multiple examples, the most probable scenario is that positive and negative differences over multiple examples cancel each other. Furthermore, the motivation of a Bayesian method is to converge toward a peaked point mass at the loss minimizing value of the parameter. As the variance narrows toward such a point mass, the approximation operates in the flat region around the origin, where the difference between the Sigmoid and the normal CDF is very small.
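The closeness claim is easy to check numerically. The following sketch (ours, not from the paper) evaluates the gap between the Sigmoid and the N(0, 8/π) CDF on a grid and confirms the 0.02 bound:

```python
import math

def sigmoid(w):
    return 1.0 / (1.0 + math.exp(-w))

def normal_cdf(w, var):
    # CDF of N(0, var) expressed via the error function
    return 0.5 * (1.0 + math.erf(w / math.sqrt(2.0 * var)))

var = 8.0 / math.pi  # variance matching the logistic PDF peak at w = 0
max_diff = max(
    abs(sigmoid(w) - normal_cdf(w, var))
    for w in [i / 100.0 for i in range(-1000, 1001)]  # grid over [-10, 10]
)
# max_diff stays below the 0.02 bound stated in the text
```

The maximum gap occurs a few units away from the origin, consistent with the figure's description of the difference being very small near w = 0.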

B PROBIT REGRESSION

In this appendix, we show the derivation of the method proposed in this paper for Probit Regression, where, in a similar manner to (2), the predicted label probability with weight vector w, label y_t, and covariates x_t is given by the normal CDF

$$p(y_t|x_t, w) = \int_{-\infty}^{y_t x_t^T w} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{\alpha^2}{2}\right)d\alpha = \int_{-\infty}^{y_t x_t^T w} \phi(\alpha)\,d\alpha = \Phi\left(y_t x_t^T w\right) \qquad (23)$$

where, as we recall, φ(·) and Φ(·) are the standard Gaussian (normal) PDF and CDF, respectively. While for logistic regression we used a Gaussian approximation to obtain analytical expressions for the prediction in (8) and the marginalization integral in (11), for probit regression these are no longer approximations. For the posterior, we will still apply Gaussian and diagonal approximations, as in the derivations based on (13).

Prediction: The approach for probit regression is similar to the one described in Section 3 for logistic regression. For each feature we track the mean μ_{i,t} and the variance σ²_{i,t} of the ith feature. For example t, we use (6) to compute the total weight w_t, its mean μ_t, and variance σ²_t. Eq. (7) gives the self excluding weights, their means, and their variances. Similarly to (8), using the approximate normal prior at t, we can derive the label prediction for y_t,

$$
\begin{aligned}
p_t = P(y_t|x_t) &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma_t^2}}\exp\left(-\frac{(w_t-\mu_t)^2}{2\sigma_t^2}\right)\Phi\left(y_t x_t^T w\right)dw_t \\
&\overset{(a)}{=} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma_t^2}}\exp\left(-\frac{(w_t-\mu_t)^2}{2\sigma_t^2}\right)\int_{-\infty}^{y_t w_t}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z^2}{2}\right)dz\,dw_t \\
&\overset{(b)}{=} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{v^2}{2}\right)\int_{-\infty}^{y_t(\sigma_t v+\mu_t)}\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z^2}{2}\right)dz\,dv \\
&\overset{(c)}{=} \int_{-\infty}^{\infty} \phi(v)\cdot\Phi\left[y_t(\sigma_t v+\mu_t)\right]dv \\
&\overset{(d)}{=} \Phi\left(\frac{y_t\mu_t}{\sqrt{1+\sigma_t^2}}\right). \qquad (24)
\end{aligned}
$$

For (a), we use the definition of w_t in (6). Step (b) follows from substituting v = (w_t − μ_t)/σ_t. Step (c) identifies the integrand as the product of the standardized N(0, 1) normal PDF and a standard normal CDF evaluated at y_t(σ_t v + μ_t). This integral gives a normal CDF Φ(a/√(1+b²)) for a = y_t μ_t and b² = y²_t σ²_t = σ²_t, leading to (d).
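Step (d) is a standard Gaussian identity and can be verified with a quick Monte Carlo check; the sketch below (our illustrative values, not the paper's) compares the sampled integral against the closed form:

```python
import math
import random

random.seed(0)

def phi_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_prediction(y, mu, sigma2):
    """Closed form of the integral of phi(v)*Phi(y*(sigma*v + mu)) dv."""
    return phi_cdf(y * mu / math.sqrt(1.0 + sigma2))

# Monte Carlo estimate of the same integral (illustrative parameters)
y, mu, sigma2 = 1, 0.7, 2.0
n = 100_000
sigma = math.sqrt(sigma2)
mc = sum(phi_cdf(y * (sigma * random.gauss(0.0, 1.0) + mu)) for _ in range(n)) / n
closed = probit_prediction(y, mu, sigma2)
```

The two values agree to within the Monte Carlo error, confirming that for probit regression the prediction requires no approximation.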
Marginalization: Following the marginalization steps in Section 3, we can express the joint probability of weight w_i and label y_t, conditioned on the covariates x_t and marginalized over all the other nonzero covariates at example t, as in (10), by

$$p(y_t, w_i|x_t) = \rho_{i,t}(w_i)\cdot\Phi\left(\frac{y_t(\mu_{-i,t}+x_{i,t}w_i)}{\sqrt{1+\sigma_{-i,t}^2}}\right) \qquad (25)$$

where we use the steps of (11), excluding the approximations, to derive (25).

Posterior: The posterior on w_i is given as in (12), normalizing p(y_t, w_i|x_t) by p_t from (24),

$$\rho_{i,t+1}(w_i) \approx p(w_i|x_t, y_t) = \frac{1}{p_t}\cdot\rho_{i,t}(w_i)\cdot\Phi\left(\frac{y_t(\mu_{-i,t}+x_{i,t}w_i)}{\sqrt{1+\sigma_{-i,t}^2}}\right). \qquad (26)$$

This posterior can now be matched by a normal posterior Q_{i,t}(w_i) as in (13).

Approximation: Here, we follow the Laplace approximation applied in Section 3. All other methods mentioned in the paper are also possible. To find μ_{i,t+1} that minimizes the negative logarithm of the r.h.s. of (26), define, similarly to (14),

$$z_{i,t+} = \frac{y_t(\mu_{-i,t}+x_{i,t}\mu_{i,t+1})}{\sqrt{1+\sigma_{-i,t}^2}} \qquad (27)$$

as the probit score, which serves as the argument of the normal CDF, where the ith mean has been updated, but all other means have not. We can now express the update of the ith mean by

$$\mu_{i,t+1} = \mu_{i,t} + \frac{y_t x_{i,t}\sigma_{i,t}^2}{\sqrt{1+\sigma_{-i,t}^2}}\cdot\frac{\phi(z_{i,t+})}{\Phi(z_{i,t+})}. \qquad (28)$$

Eq. (28) is the probit regression analog of the update (15) for logistic regression, where the ratio φ(z_{i,t+})/Φ(z_{i,t+}) replaces 1 − p_{i,t+} (and the scaling of the self excluding variance is unnecessary). As with (15), (28) must be solved iteratively because the term φ(z_{i,t+})/Φ(z_{i,t+}) is a function of μ_{i,t+1} through the definition of z_{i,t+}. As in Section 3, we can use a first order Taylor approximation of φ(z_{i,t+})/Φ(z_{i,t+}) around its value for μ_{i,t}. Similarly to (16), we define

$$z_{i,t} = \frac{y_t(\mu_{-i,t}+x_{i,t}\mu_{i,t})}{\sqrt{1+\sigma_{-i,t}^2}} \qquad (29)$$

which is the score before the update of all the means μ_{i,t}, but, unlike the one used to compute p_t, normalized by the ith self excluding variance σ²_{−i,t} instead of σ²_t.
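The iterative solution of (28) can be sketched as follows. For clarity we use a simple fixed-point iteration rather than the Newton method the paper applies, and all variable names and values are illustrative:

```python
import math

def phi_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def phi_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_mean_update(mu_i, var_i, mu_rest, var_rest, x_i, y, iters=30):
    """Solve mu_new = mu_i + (y*x_i*var_i/denom) * phi(z)/Phi(z),
    where z = y*(mu_rest + x_i*mu_new)/denom depends on mu_new."""
    denom = math.sqrt(1.0 + var_rest)       # sqrt(1 + sigma^2_{-i,t})
    mu_new = mu_i
    for _ in range(iters):
        z = y * (mu_rest + x_i * mu_new) / denom   # probit score z_{i,t+}
        ratio = phi_pdf(z) / phi_cdf(z)            # inverse Mills ratio
        mu_next = mu_i + y * x_i * var_i / denom * ratio
        converged = abs(mu_next - mu_new) < 1e-10
        mu_new = mu_next
        if converged:
            break
    return mu_new

mu1 = probit_mean_update(mu_i=0.0, var_i=1.0, mu_rest=0.5, var_rest=2.0, x_i=1.0, y=1)
```

As expected, the mean moves from μ_{i,t} in the direction of y_t x_{i,t}, with a step scaled by the feature variance and shrunk by the self excluding uncertainty.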
With some algebra, this gives a single operation update, similar to that in (17), given by

$$\mu_{i,t+1} = \mu_{i,t} + \frac{\dfrac{y_t x_{i,t}\sigma_{i,t}^2\,\phi(z_{i,t})/\Phi(z_{i,t})}{\sqrt{1+\sigma_{-i,t}^2}}}{1+\dfrac{y_t^2 x_{i,t}^2\sigma_{i,t}^2}{1+\sigma_{-i,t}^2}\cdot\dfrac{\phi(z_{i,t})}{\Phi(z_{i,t})}\cdot\left(z_{i,t}+\dfrac{\phi(z_{i,t})}{\Phi(z_{i,t})}\right)}. \qquad (30)$$

The term z_{i,t} + φ(z_{i,t})/Φ(z_{i,t}) in the denominator replaces p_{i,t} in the logistic regression update equation. Taking the second derivative of the negative logarithm of the posterior and approximating 1/σ²_{i,t+1} by it gives a single operation update of the variance, similarly to (19),

$$\sigma_{i,t+1}^2 = \left[\frac{1}{\sigma_{i,t}^2}+\frac{y_t^2 x_{i,t}^2}{1+\sigma_{-i,t}^2}\cdot\frac{\phi(z_{i,t+})}{\Phi(z_{i,t+})}\cdot\left(z_{i,t+}+\frac{\phi(z_{i,t+})}{\Phi(z_{i,t+})}\right)\right]^{-1}. \qquad (31)$$

C MEAN AND VARIANCE UPDATES

Eq. (15) gives an update for the mean μ_{i,t+1} that cannot be solved in closed form. This is because p_{i,t+} is a function of μ_{i,t+1}. However, the update is easily solvable with a few iterations of Newton's method.

Since Σ_t is diagonal, the transpose on the last term of the numerator in the first equality is unnecessary. The second equality gives vector multiplications, showing that the complexity is linear in the dimension of the vectors, d_t. (This is also true of the computation of v_t when Σ_t is diagonal.) Next, u_{t+1} can be updated as

$$u_{t+1} = u_t + y_t(1-\tilde{p}_t)\tilde{\Sigma}_{t+1}x_t.$$

Now, we can update p_{t+} in (36), using u_{t+1}, and use it to update Σ_{t+1} using Sherman-Morrison,

$$\Sigma_{t+1} = \Sigma_t - \frac{y_t^2 p_{t+}(1-p_{t+})v_t v_t^T}{1+y_t^2 p_{t+}(1-p_{t+})x_t^T v_t}.$$

In the sparse case, we can now take the terms of the diagonal of Σ_{t+1} to update σ²_{i,t+1} of the nonzero covariates at round t. Finally, it may be simpler to update the precision matrix H_{t+1} = Σ⁻¹_{t+1} instead of the covariance Σ_{t+1}. Specifically, if multiple updates are performed in a mini batch, the update applied to the covariance cannot be applied additively. However, additive updates on the precision are valid.
Thus the updates in (44) and (46) can be replaced by

$$\tilde{H}_{t+1} = H_t + y_t^2\tilde{p}_t(1-\tilde{p}_t)x_t x_t^T \qquad (47)$$

and

$$H_{t+1} = H_t + y_t^2 p_{t+}(1-p_{t+})x_t x_t^T \qquad (48)$$

respectively. To update u_{t+1}, we still need to invert H̃_{t+1}. We can use (44) if an update was applied at a single round only. If a mini-batch update additively applied multiple updates at once in (47), the updated H̃_{t+1} must be inverted to obtain Σ̃_{t+1}. The multi-dimensional approach described in this section can be applied to sparse problems, but also to dense ones. In the dense case, the operation in (43) is no longer linear in d_t, as the covariance matrix is not necessarily diagonal. The Sherman-Morrison formula, however, still applies for inverting the precision and covariance, and lowers the complexity of the approach. In the sparse case, however, this approach may try to force correlations that are not present, which are then ignored. As the empirical results suggest, it may therefore not be as good as the marginalization approach. Furthermore, unlike the marginalization approach in Section 3, which achieves its best performance when the true prior matches the one used to initialize the algorithm, empirical results demonstrate that on sparse problems the multi-dimensional method obtains its best performance with priors that differ from the true one.

D.2 EXPECTATION PROPAGATION -ASSUMED DENSITY FILTERING

Instead of minimizing the divergence between the approximate Q and the true posterior, we can use the expectation propagation approach, as proposed in Minka (2001), which essentially minimizes the opposite KL divergence and attempts to match the first two moments. More details can be found in Minka (2001).

D.3 MARGINALIZED VARIATIONAL BAYES

Instead of using a Laplace approximation, or matching the location of the peak of the true posterior and the estimated one together with either its curvature or its value, we can apply full VB, by matching the approximate posterior Q with the true one through minimizing the KL divergence KL(Q||P) between Q, the approximate posterior, and the true posterior. This requires either iterative approaches, such as mean field approximation EM, or Monte Carlo sampling in order to approximate expectations over a yet unknown Q. One can apply this approach in d_t dimensions, as with the Laplace approximation in Subsection D.1. However, due to the inability to separate the covariates (in the Sigmoid), we would require a power set of samples: with N samples per dimension, this approach would use N^{d_t} samples. This can be infeasible and complex if there is a large number d_t of nonzero covariates. Instead, we can use VB only in the ith dimension for each feature separately, together with the marginalization proposed in Section 3. This can be done by matching the approximate posterior Q_{i,t}(w_i) with the posterior p(w_i|x_t, y_t) we obtained on the r.h.s. of (12) on feature i after we marginalized over all other features. The KL divergence can be decomposed into three terms: the KL divergence between Q and the prior ρ_{i,t}, the contribution of conditioning the posterior on the probability p_t predicted for y_t, and the log loss (negative log likelihood) term, emerging from the Sigmoid.
$$
\begin{aligned}
KL(Q_{i,t}(W_i)\,\|\,p(W_i|x_t,y_t)) &= KL(Q_{i,t}\|\rho_{i,t}) + E_{Q_{i,t}}\log p_t + E_{Q_{i,t}}\left[\log\left(1+\exp\left(-\frac{y_t(\mu_{-i,t}+x_{i,t}W_i)}{\sqrt{1+\frac{\pi}{8}\sigma_{-i,t}^2}}\right)\right)\right] \\
&= E_{Q_{i,t}}\left[\log\frac{\sigma_{i,t}\,p_t}{\sigma_{i,t+1}} - \frac{(W_i-\mu_{i,t+1})^2}{2\sigma_{i,t+1}^2} + \frac{(W_i-\mu_{i,t})^2}{2\sigma_{i,t}^2} + \log\left(1+\exp\left(-\frac{y_t(\mu_{-i,t}+x_{i,t}W_i)}{\sqrt{1+\frac{\pi}{8}\sigma_{-i,t}^2}}\right)\right)\right] \\
&= \log\frac{\sigma_{i,t}\,p_t}{\sigma_{i,t+1}} - \frac{1}{2} + \frac{1}{2\sigma_{i,t}^2}\left(\sigma_{i,t+1}^2+\mu_{i,t+1}^2+\mu_{i,t}^2-2\mu_{i,t}\mu_{i,t+1}\right) + E_{Q_{i,t}}\left[\log\left(1+\exp\left(-\frac{y_t(\mu_{-i,t}+x_{i,t}W_i)}{\sqrt{1+\frac{\pi}{8}\sigma_{-i,t}^2}}\right)\right)\right]. \qquad (49)
\end{aligned}
$$

All expectations are w.r.t. Q_{i,t}. The KL term can be computed in closed form, giving the second and third equalities. The last term on the r.h.s. of (49) cannot be analytically computed without knowledge of the posterior Q_{i,t} at t. Instead, we use Monte Carlo, drawing N samples S_j ∼ N(0, 1) and letting W_{i,j} = μ_{i,t+1} + S_j σ_{i,t+1}. With known μ_{i,t+1} and σ_{i,t+1}, we can now approximate the expectation term in (49) as

$$\frac{1}{N}\sum_{j=1}^{N}\log\left(1+\exp\left(-\frac{y_t(\mu_{-i,t}+x_{i,t}(\mu_{i,t+1}+s_j\sigma_{i,t+1}))}{\sqrt{1+\frac{\pi}{8}\sigma_{-i,t}^2}}\right)\right) \qquad (50)$$

where s_j is the jth randomly drawn sample. Unfortunately, μ_{i,t+1} and σ²_{i,t+1} must be updated in this step, and are not known. This requires, again, an iterative update using Newton's method. Similarly to (14) in Section 3, we need to define a prediction p_{i,j,t+} for which the prior (time t) means and variances are used for all covariates except the ith one, and the updated μ_{i,t+1} and σ²_{i,t+1} are used for the ith mean and variance, respectively. This time, however, this prediction is defined N times, once for each of the j samples,

$$p_{i,j,t+} = \mathrm{Sigma}\left(\frac{y_t(\mu_{-i,t}+x_{i,t}(\mu_{i,t+1}+s_j\sigma_{i,t+1}))}{\sqrt{1+\frac{\pi}{8}\sigma_{-i,t}^2}}\right). \qquad (51)$$

Then, the updated mean satisfies

$$\mu_{i,t+1} = \mu_{i,t} + \frac{y_t x_{i,t}\sigma_{i,t}^2}{\sqrt{1+\frac{\pi}{8}\sigma_{-i,t}^2}}\cdot\frac{1}{N}\sum_{j=1}^{N}(1-p_{i,j,t+}). \qquad (52)$$

However, in order to compute both μ_{i,t+1} and σ_{i,t+1}, we need to apply Newton's method, optimizing μ_{i,t+1} and σ_{i,t+1} together.
We start with μ^{(0)}_{i,t+1} = μ_{i,t} and σ^{(0)}_{i,t+1} = σ_{i,t}. For simplicity, we omit the iteration number ℓ from the notation. The following should be read as updates at iteration ℓ which use p^{(ℓ-1)}_{i,j,t+} for the updates. For simplicity, let

$$\alpha_t = \frac{y_t x_{i,t}}{\sqrt{1+\frac{\pi}{8}\sigma_{-i,t}^2}}.$$

Then, the joint gradient w.r.t. μ_{i,t+1} and σ_{i,t+1} is given by

$$g = \begin{pmatrix}\dfrac{\mu_{i,t+1}-\mu_{i,t}}{\sigma_{i,t}^2}-\dfrac{\alpha_t}{N}\displaystyle\sum_{j=1}^{N}(1-p_{i,j,t+}) \\[2ex] -\dfrac{1}{\sigma_{i,t+1}}+\dfrac{\sigma_{i,t+1}}{\sigma_{i,t}^2}-\dfrac{\alpha_t}{N}\displaystyle\sum_{j=1}^{N}s_j(1-p_{i,j,t+})\end{pmatrix}.$$

The joint Hessian is given by

$$H = \begin{pmatrix}\dfrac{1}{\sigma_{i,t}^2}+\dfrac{\alpha_t^2}{N}\displaystyle\sum_{j=1}^{N}p_{i,j,t+}(1-p_{i,j,t+}) & \dfrac{\alpha_t^2}{N}\displaystyle\sum_{j=1}^{N}s_j p_{i,j,t+}(1-p_{i,j,t+}) \\[2ex] \dfrac{\alpha_t^2}{N}\displaystyle\sum_{j=1}^{N}s_j p_{i,j,t+}(1-p_{i,j,t+}) & \dfrac{1}{\sigma_{i,t+1}^2}+\dfrac{1}{\sigma_{i,t}^2}+\dfrac{\alpha_t^2}{N}\displaystyle\sum_{j=1}^{N}s_j^2 p_{i,j,t+}(1-p_{i,j,t+})\end{pmatrix}. \qquad (53)$$

Finally, at iteration ℓ,

$$\begin{pmatrix}\mu^{(\ell)}_{i,t+1} \\ \sigma^{(\ell)}_{i,t+1}\end{pmatrix} = \begin{pmatrix}\mu^{(\ell-1)}_{i,t+1} \\ \sigma^{(\ell-1)}_{i,t+1}\end{pmatrix} - H^{-1}g. \qquad (54)$$

As before, the update terminates when the differences between the values of two iterations are less than some threshold, or after a set number of iterations. As observed, this method requires O(d_t N J) operations for a single update (and O(d_t N J T) operations overall), where J is the set number of Newton iterations. Empirical results demonstrate that even with N as large as 1000, results are not as good as those with the marginalization with Laplace approximation presented in Section 3. We note that one can use a two dimensional first order Taylor approximation of p_{i,j,t+} around p_{i,j,t}, which is defined similarly to p_{i,j,t+} except that it uses μ_{i,t} and σ_{i,t} instead of μ_{i,t+1} and σ_{i,t+1}, to obtain an approximation for updating μ_{i,t+1} and σ_{i,t+1}, as in (17). The approximation must be made for every j. It does not, however, require the O(J) operations of the Newton method. The complexity is still a factor of O(N) greater than that of the method in Section 3.
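To make the sampling step concrete, here is a sketch of the Monte Carlo estimate of the expected log loss term under Q. The names and values are ours for illustration; the actual method couples this estimate with the joint Newton iteration above:

```python
import math
import random

random.seed(1)

def mc_expected_log_loss(mu_new, sigma_new, mu_rest, var_rest, x_i, y, n=20_000):
    """Monte Carlo estimate of E_Q[log(1 + exp(-y*(mu_rest + x_i*W_i)/shrink))]
    with W_i = mu_new + S*sigma_new, S ~ N(0, 1)."""
    shrink = math.sqrt(1.0 + (math.pi / 8.0) * var_rest)  # sqrt(1 + (pi/8)*sigma^2_{-i,t})
    total = 0.0
    for _ in range(n):
        w_i = mu_new + random.gauss(0.0, 1.0) * sigma_new  # a draw of W_{i,j}
        z = y * (mu_rest + x_i * w_i) / shrink
        total += math.log(1.0 + math.exp(-z))
    return total / n

est = mc_expected_log_loss(mu_new=0.3, sigma_new=0.8, mu_rest=0.5,
                           var_rest=2.0, x_i=1.0, y=1)
```

The estimate carries an O(1/√N) error, which is why the text reports that even N = 1000 samples do not match the analytical marginalization of Section 3.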

E ADDITIONAL EMPIRICAL RESULTS

In this appendix, we present a collection of simulation results demonstrating the performance of Algorithm 1 and the other methods in different settings. In all the simulations we performed, we observed that Algorithm 1, with a proper prior for the setting, consistently gives regret close to the lower bound (e.g., a 0.5 log T cost for each unknown parameter). The other methods, while in some cases exhibiting performance close to that of Algorithm 1, fail to do so consistently across all conditions. The marginalized VB approach requires very high complexity, as shown in Table 1 and Fig. 1, to approach the performance of Algorithm 1. Fig. 4 gives a two dimensional grid of varying d_t and varying true feature weights. In all these simulations, Algorithm 1 with a prior matched to the true one gives the minimal regret curves. While SGD seems, with the right choices of parameters, to approach logarithmic regret, its regret is larger, and increases with the true parameters' variance and the number of active features d_t. The multi-dimensional Gaussian approximation appears to be mis-calibrated on the prior, and depending on the feature density d_t, tends to improve only with much larger priors. EP ADF gives larger regrets, and while appearing reasonable with higher feature density, seems to be inferior with lower feature densities. Similar results are obtained when replacing the data generation model with models with uniform priors on the feature weights with various ranges. Fig. 5 shows curves for simulations with different models. On the left, features are nonbinary, and on the right, the model consists of an order of magnitude more features, where on average an order of magnitude more features occur in each example. In both cases, Algorithm 1 persists with similar regret rates, whereas the other algorithms exhibit larger regrets.
With a large number of features, both the multi-dimensional Gaussian approximation and SGD have larger regrets, although with even lower priors / learning rates slightly better regret may be possible. Fig. 6 demonstrates curves of the various algorithms for categorical features, where in each example a fixed set of features is selected from each category. On the right, the selection follows an exponentially decaying distribution over the features, so that some features are selected very often while others rarely occur. Again, regret rates behave similarly for Algorithm 1 and SGD. The multi-dimensional Gaussian approximation completely breaks in this setting. This is because it hypothesizes correlations in its updates with features that rarely reoccur.



https://www.kaggle.com/c/criteo-display-ad-challenge



Figure 1: R t / log t vs. round t for different methods for randomly drawn d binary features, expected d t /d features per example, and standard deviation of true log-odds noted in each graph.

Figure 2: Left: R t / log t vs. t for synthetic data with millions of long tail features. Right: Loss relative to the best SGD on the Criteo benchmark dataset for multiple algorithms.

Figure 3: Left: Logistic and Normal N (0, 8/π) distributions and their differences. Middle: Sigmoid and Normal CDF N (0, 8/π) and their differences. Right: Differences between logistic and normal (PDFs and CDFs).

Figure 4: Normalized R t / log t vs. round t for various methods with randomly drawn binary features, with d, expected d t /d, and standard deviation of true log-odds noted in each graph. Graphs shown for d = 200, E[d t ] ∈ {5, 20, 40} and true log-odds std in {1, 2, 3}.

Figure 5: Normalized R t / log t vs. round t for different algorithms and different data generation models. On the left, a model with d = 200 features, out of which in average d t = 40 occur in an example, and weight standard deviation is 2, with nonbinary feature values uniform in [0, 1]. Right: models with d = 2000, average d t = 200, data generation standard deviation of 2, with binary features.

Figure 6: Normalized regret R_t/log t vs. round t for categorical models and different algorithms. Left: 2000 features in 10 categories of 200 features each, where in each example 5 features from each category are present, with true weight standard deviation of 2. Right: a similar setting, except that in each category features are selected with a long tail exponential distribution (prioritizing a few features and rarely selecting others).

Runtime and regret coefficients r_T = R_T/log T for different algorithms on synthetic 200 feature models with true log-odds std of 1 and algorithm parameters as in Fig. 1.

Instead of showing R_t, we plot R_t/log t. If an algorithm has logarithmic regret, its normalized curve will converge to a constant. This methodology thus allows us to observe whether or not an algorithm has logarithmic regret. Results are shown for Algorithm 1 with its different variations (labeled by Gauss for updates with (15), and by Gauss Approx with (

Gil I. Shamir. Minimum description length (MDL) regularization for online learning. In Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015, pp. 260-276, 2015.

Gil I. Shamir. Logistic regression regret: What's the catch? In COLT - Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 3296-3319, 2020.

annex

We start by plugging μ^{(0)}_{i,t+1} = μ_{i,t} for iteration ℓ = 0. We follow by computing p^{(0)}_{i,t+} with (14). The solution for μ_{i,t+1} is the value of w_i that minimizes the negative logarithm of the r.h.s. of (12), where ρ_{i,t}(w_i) is Gaussian with mean μ_{i,t} and variance σ²_{i,t}. At iteration ℓ, we can compute p^{(ℓ-1)}_{i,t+} with (14), using μ^{(ℓ-1)}_{i,t+1}. The gradient of the negative logarithm of the posterior w.r.t. w_i at w_i = μ^{(ℓ-1)}_{i,t+1} is given by

$$g = \frac{\mu^{(\ell-1)}_{i,t+1}-\mu_{i,t}}{\sigma_{i,t}^2} - \frac{y_t x_{i,t}\left(1-p^{(\ell-1)}_{i,t+}\right)}{\sqrt{1+\frac{\pi}{8}\sigma_{-i,t}^2}}. \qquad (32)$$

The second derivative w.r.t. w_i is given by

$$h = \frac{1}{\sigma_{i,t}^2} + \frac{y_t^2 x_{i,t}^2\, p^{(\ell-1)}_{i,t+}\left(1-p^{(\ell-1)}_{i,t+}\right)}{1+\frac{\pi}{8}\sigma_{-i,t}^2}. \qquad (33)$$

Then, μ_{i,t+1} is updated by

$$\mu^{(\ell)}_{i,t+1} = \mu^{(\ell-1)}_{i,t+1} - \frac{g}{h}. \qquad (34)$$

The process terminates when the change in μ_{i,t+1} is smaller than some threshold, or after a specified number of iterations. Note that the second derivative in (33) (the Newton Hessian step) gives the same expression as the update to the precision (inverse variance) in (19).

Eq. (18) gives an update to the variance. It is interesting to interpret some observations from this update. For x_{i,t} ≥ 0, p_{i,t+} ≥ p_t. This holds because y_t x_{i,t}(μ_{i,t+1} − μ_{i,t}) > 0, as for every y_t, μ_{i,t+1} must move away from μ_{i,t} in the direction of the sign of y_t. This implies that we add a positive term to the log-odds converted to p_t relative to those converted into p_{i,t+}, and also apply less shrinkage by using σ²_{−i,t} instead of σ²_t, yielding this claim. Hence, the coefficient of σ_{i,t} outside the exponential term in (18) is upper bounded by 1. The argument of the exponential term is always nonnegative, which implies that the exponential term is lower bounded by 1. However, if the self-excluding variance σ²_{−i,t} is large, it makes the argument small. Similarly, if p_{i,t+} is large relative to p_t, it makes the whole expression smaller. These two observations imply that if the self excluding prior is less certain, or if the self excluding prior has a different belief about the label from the one observed overall, the uncertainty of the current feature is reduced more, because it is deemed responsible for the observation y_t.
On the other hand, if the opposite holds, i.e., either the uncertainty of the self excluding prior is low, or the self excluding prior agrees more with the observed label y t , then, feature i matters less for the observation, and therefore, its uncertainty is not reduced as much.
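The Newton iteration for the logistic mean update described above can be sketched as follows. This assumes the negative log posterior is the Gaussian prior term plus the shrunk logistic loss; the variable names and parameter values are illustrative, not from the paper's code:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_mean_update(mu_i, var_i, mu_rest, var_rest, x_i, y,
                         iters=20, tol=1e-10):
    """Newton's method for the logistic mean update, with shrinkage factor
    sqrt(1 + (pi/8)*sigma^2_{-i,t}) from the Gaussian-CDF approximation."""
    shrink = math.sqrt(1.0 + (math.pi / 8.0) * var_rest)
    mu_new = mu_i                                  # mu^{(0)}_{i,t+1} = mu_{i,t}
    for _ in range(iters):
        p = sigmoid(y * (mu_rest + x_i * mu_new) / shrink)        # p_{i,t+}
        g = (mu_new - mu_i) / var_i - y * x_i * (1.0 - p) / shrink  # gradient
        h = 1.0 / var_i + (x_i / shrink) ** 2 * p * (1.0 - p)       # 2nd deriv.
        step = g / h
        mu_new -= step
        if abs(step) < tol:
            break
    return mu_new

mu1 = logistic_mean_update(mu_i=0.0, var_i=1.0, mu_rest=0.5,
                           var_rest=2.0, x_i=1.0, y=1)
```

In practice very few iterations are needed, consistent with the runtime observation in the main text that Newton updates cost about the same as the Taylor approximation.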

D OTHER METHODS

In this appendix, we describe updates we perform with other Gaussian approximation methods. We can update the mean vector and covariance of the d_t features that occur at t, without marginalization, discarding off-diagonal covariance terms. This approach, which otherwise uses approximations similar to the ones we used in Section 3, is described in Section D.1, where some of the approximation steps are novel to this paper. Alternatively, one can choose any two points on the true posterior and match Q(·) on these. Least squares can be used with several points that represent a region of w_i for which the posterior at T is likely to have most mass. EP and VB can also be applied, as discussed in Sections D.2 and D.3, respectively. For the latter, specifically, single dimensional VB on the marginalized posterior can be applied to minimize KL(Q_{i,t}(w_i)||p(w_i|x_t, y_t)), where p(w_i|x_t, y_t) is the true posterior on the r.h.s. of (12), given by p(y_t, w_i|x_t)/p_t, with p(y_t, w_i|x_t) in (10). This still requires expensive Monte Carlo estimation of expectations within an iterative Newton method. Without marginalization, however, a similar VB approach would require even more expensive Monte Carlo sampling, or an iterative mean field approximation EM to converge on all components.

D.1 MULTI-DIMENSIONAL GAUSSIAN APPROXIMATION

Instead of marginalizing over all other features to update w_i for which x_{i,t} ≠ 0, we can apply a multi-dimensional update on all features for which x_{i,t} ≠ 0 at round t. Such updates capture correlations between these features, and may be a better fit to problems in which such correlations are expected. For this update, we assume that the true posterior consists of a product between a prior with a diagonal covariance matrix and a Sigmoid, and we apply a Laplace approximation to obtain the new mean vector and covariance. With some abuse of notation, let all values at t consist only of the d_t nonzero components of x_t. Let Σ_t be the diagonal covariance matrix, with diagonal elements σ²_{i,t}, and let u_t be the estimated mean vector at t. Then, the true posterior at t is given by

$$p(w|x_t, y_t) \propto \exp\left(-\frac{1}{2}(w-u_t)^T\Sigma_t^{-1}(w-u_t)\right)\cdot\mathrm{Sigma}\left(y_t x_t^T w\right). \qquad (35)$$

Similarly to (14), define

$$p_{t+} = \mathrm{Sigma}\left(y_t x_t^T u_{t+1}\right) \qquad (36)$$

as the probability of y_t computed with weights after they have been updated (this time with no shrinkage), where u_{t+1} is the updated vector of means. Then, with a Laplace approximation, taking the value of the mean vector that maximizes the posterior, the mean can be updated as in (15) by

$$u_{t+1} = u_t + y_t(1-p_{t+})\Sigma_t x_t. \qquad (37)$$

This is, again, an equation that must be solved either numerically or using methods such as Newton's method. Again, we can assign u^{(0)}_{t+1} = u_t, and apply (36) to obtain p̃_{t+}. Then, at iteration ℓ, the gradient and Hessian of the negative logarithm of the posterior are

$$g = \Sigma_t^{-1}\left(u^{(\ell-1)}_{t+1}-u_t\right) - y_t\left(1-p^{(\ell-1)}_{t+}\right)x_t \qquad (38)$$

and

$$H = \Sigma_t^{-1} + y_t^2\,p^{(\ell-1)}_{t+}\left(1-p^{(\ell-1)}_{t+}\right)x_t x_t^T. \qquad (39)$$

Then, u_{t+1} is updated by

$$u^{(\ell)}_{t+1} = u^{(\ell-1)}_{t+1} - H^{-1}g. \qquad (40)$$

Termination is either when the update on all components of u_{t+1} is less than some threshold, or after a set number of iterations. Inverting the Hessian H also gives the updated covariance Σ_{t+1}, whose diagonal elements can be used to update σ²_{i,t+1} if we apply the algorithm to a sparse problem, where it is infeasible to store all covariances. Instead of updating H, we can keep track of its inverse H⁻¹_{(ℓ)}, and there is no need to invert the covariance matrix Σ_t.
With the diagonal form of Σ_t, all operations can be implemented with complexity linear in d_t using the Sherman & Morrison (1950) formula, which simplifies matrix inversions under rank-one updates. For our specific need here, if A is some matrix, α some constant, and x some vector, then the Sherman-Morrison formula is

$$\left(A+\alpha x x^T\right)^{-1} = A^{-1} - \frac{\alpha A^{-1}x x^T A^{-1}}{1+\alpha x^T A^{-1}x}. \qquad (41)$$

With it, we update H⁻¹_{(ℓ)}, inverting (39). As in the marginalization method described in Section 3, we can avoid the iterative Newton method with a first order Taylor approximation of 1 − p_{t+} around 1 − p̃_t, where p̃_t is defined in analogy to (16) as

$$\tilde{p}_t = \mathrm{Sigma}\left(y_t x_t^T u_t\right) \qquad (42)$$

the un-shrunk prediction of y_t at round t (which differs from p_t, which is shrunk by the variance). The approximation leads to the following set of equations to update both u_{t+1} and Σ_{t+1}.
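As a quick numeric sanity check of the Sherman-Morrison identity (41), the following self-contained sketch with illustrative values inverts a rank-one update of a diagonal matrix and verifies the result directly:

```python
# Check: (A + a*x*x^T)^{-1} = A^{-1} - a * A^{-1} x x^T A^{-1} / (1 + a * x^T A^{-1} x)
# Small 2x2 example with a diagonal A, as in the diagonal-covariance case above.
a = 0.5
x = [1.0, 2.0]
A_inv = [[1.0 / 2.0, 0.0], [0.0, 1.0 / 3.0]]   # A = diag(2, 3)

# A^{-1} x (cheap because A is diagonal)
Ax = [A_inv[0][0] * x[0], A_inv[1][1] * x[1]]
denom = 1.0 + a * (x[0] * Ax[0] + x[1] * Ax[1])  # 1 + a * x^T A^{-1} x

# Sherman-Morrison inverse of (A + a x x^T)
sm = [[A_inv[i][j] - a * Ax[i] * Ax[j] / denom for j in range(2)]
      for i in range(2)]

# Direct verification: M = A + a x x^T, so M @ sm should be the identity
M = [[2.0 + a * x[0] * x[0], a * x[0] * x[1]],
     [a * x[1] * x[0], 3.0 + a * x[1] * x[1]]]
prod = [[sum(M[i][k] * sm[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)]
```

Because A⁻¹x and the outer product involve only the d_t active components, this is the O(d_t) inversion step exploited by the multi-dimensional update.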

