EXACT MANIFOLD GAUSSIAN VARIATIONAL BAYES

Abstract

We propose an optimization algorithm for Variational Inference (VI) in complex models. Our approach relies on natural-gradient updates, where the variational space is a Riemannian manifold. We develop an efficient algorithm for Gaussian Variational Inference that implicitly satisfies the positive-definiteness constraint on the variational covariance matrix. Our Exact manifold Gaussian Variational Bayes (EMGVB) provides exact yet simple update rules and is straightforward to implement. Due to its black-box nature, EMGVB stands as a ready-to-use solution for VI in complex models. Across five datasets, we empirically validate our approach on different statistical and econometric models, discussing its performance relative to baseline methods.

1. INTRODUCTION

Although Bayesian principles are not new to Machine Learning (ML) (e.g. Mackay, 1992; 1995; Lampinen & Vehtari, 2001), only recently have feasible methods boosted a growing use of Bayesian methods within the field (e.g. Zhang et al., 2018; Trusheim et al., 2018; Osawa et al., 2019; Khan et al., 2018b; Khan & Nielsen, 2018). In typical ML settings, sampling methods for the challenging computation of the posterior are prohibitively expensive; however, approximate methods such as Variational Inference (VI) have proven suitable and successful (Saul et al., 1996; Wainwright & Jordan, 2008; Hoffman et al., 2013; Blei et al., 2017). VI is generally performed with Stochastic Gradient Descent (SGD) methods (Robbins & Monro, 1951; Hoffman et al., 2013; Salimans & Knowles, 2014), boosted by the use of natural gradients (Hoffman et al., 2013; Wierstra et al., 2014; Khan et al., 2018b), and the updates often take a simple form (Khan & Nielsen, 2018; Osawa et al., 2019; Magris et al., 2022). The majority of VI algorithms rely on the extensive use of models' gradients, and the form of the variational posterior implies additional model-specific derivations that are not easy to adapt to a general, plug-and-play optimizer. Black-box methods (Ranganath et al., 2014) are straightforward to implement and versatile to use, as they avoid model-specific derivations by relying on stochastic sampling (Salimans & Knowles, 2014; Paisley et al., 2012; Kingma & Welling, 2013). The increased variance in the gradient estimates, as opposed to e.g. methods relying on the Reparametrization Trick (RT) (Blundell et al., 2015; Xu et al., 2019), can be alleviated with variance-reduction techniques (e.g. Magris et al., 2022). Furthermore, the majority of existing algorithms do not directly address parameters' constraints.
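As an illustration of the black-box idea (a minimal sketch, not the paper's implementation): gradients of an expectation under $q_\zeta$ can be estimated from samples alone via the score-function (log-derivative) identity, with no model-specific derivations, while reparametrization-trick estimators, when applicable, typically exhibit lower variance. The toy integrand `f` below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta):
    # Hypothetical integrand standing in for log p(theta)p(y|theta) - log q(theta)
    return theta ** 2

def score_grad_mu(mu, sigma, n=100_000):
    """Black-box (score-function) estimate of d/dmu E_q[f(theta)] for
    q = N(mu, sigma^2). Uses grad_mu log q(theta) = (theta - mu) / sigma^2;
    only samples and log-density gradients are needed, never the gradient of f."""
    theta = rng.normal(mu, sigma, size=n)
    return np.mean(f(theta) * (theta - mu) / sigma ** 2)

def rt_grad_mu(mu, sigma, n=100_000):
    """Reparametrization-trick estimate: theta = mu + sigma * eps with eps ~ N(0, 1),
    so d/dmu E[f] = E[f'(mu + sigma * eps)]; here f'(theta) = 2 * theta."""
    eps = rng.normal(size=n)
    return np.mean(2.0 * (mu + sigma * eps))

# Exact value: d/dmu E[theta^2] = d/dmu (mu^2 + sigma^2) = 2 mu
print(score_grad_mu(1.0, 0.5), rt_grad_mu(1.0, 0.5))  # both close to 2.0
```

Repeating the two calls shows the score-function estimate fluctuating noticeably more around 2.0 than the RT estimate, which is the variance gap the variance-reduction techniques above target.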
Under the typical Gaussian variational assumption, guaranteeing positive-definiteness of the covariance matrix is an acknowledged problem (e.g. Tran et al., 2021a; Khan et al., 2018b; Lin et al., 2020). Only a few algorithms directly tackle the problem (Osawa et al., 2019; Lin et al., 2020), see Section 3. A recent approximate approach based on manifold optimization is provided by Tran et al. (2021a). Building on the theoretical results of Khan & Lin (2017) and Khan et al. (2018a), we develop an exact version of Tran et al. (2021a), resulting in an algorithm that explicitly tackles the positive-definiteness constraint for the variational covariance matrix and resembles the readily applicable natural-gradient black-box framework of Magris et al. (2022). For its implementation, we discuss recommendations and practicalities, show that EMGVB is simple to implement, and demonstrate its feasibility in extensive experiments over four datasets, 12 models, and three competing VI optimizers. In Section 2 we review the basics of VI, in Section 3 we review the Manifold Gaussian Variational Bayes approach and other related work, and in Section 4 we discuss our proposed approach. Experiments are found in Section 5, while Section 6 concludes. Appendices A and B complement the main discussion, Appendix C.4 reinforces and expands the experiments, and Appendix D provides proofs.

2. VARIATIONAL INFERENCE

Variational Inference (VI) stands as a convenient and feasible approximate method for Bayesian inference. Let $y$ denote the data and $p(y|\theta)$ the likelihood of the data under some model whose $k$-dimensional parameter is $\theta$. Let $p(\theta)$ be the prior distribution on $\theta$. In standard Bayesian inference the posterior is retrieved via Bayes' theorem as $p(\theta|y) = p(\theta)p(y|\theta)/p(y)$. As the marginal likelihood $p(y)$ is generally intractable, Bayesian inference is often difficult for complex models. The problem can be tackled with sampling-based Monte Carlo techniques, which are nonparametric and asymptotically exact, but these may be slow, especially in high-dimensional applications (Salimans et al., 2015). VI approximates the true unknown posterior with a probability density $q$ within a tractable class of distributions $\mathcal{Q}$, such as the exponential family. VI turns the Bayesian inference problem into that of finding the best variational distribution $q^\star \in \mathcal{Q}$ minimizing the Kullback-Leibler (KL) divergence from $q$ to $p(\theta|y)$: $q^\star = \arg\min_{q \in \mathcal{Q}} D_{\mathrm{KL}}(q \,\|\, p(\theta|y))$. It can be shown that the KL minimization problem is equivalent to the maximization of the so-called Lower Bound (LB) on $\log p(y)$ (e.g. Tran et al., 2021b). In fixed-form variational Bayes, the parametric form of the variational posterior is set. The optimization problem amounts to finding the optimal variational parameter $\zeta$ parametrizing $q \equiv q_\zeta$ that maximizes the LB $\mathcal{L}$, that is:
$$\zeta^\star = \arg\max_{\zeta \in \mathcal{Z}} \mathcal{L}(\zeta) := \int q_\zeta(\theta) \log \frac{p(\theta)p(y|\theta)}{q_\zeta(\theta)}\, d\theta = \mathbb{E}_{q_\zeta}\!\left[\log \frac{p(\theta)p(y|\theta)}{q_\zeta(\theta)}\right],$$
where $\mathbb{E}_{q_\zeta}$ denotes the expectation taken with respect to the distribution $q_\zeta$, and $\mathcal{Z}$ is the parameter space for $\zeta$. The maximization of the LB is generally tackled with a gradient-descent method such as SGD (Robbins & Monro, 1951), ADAM (Kingma & Ba, 2014), or ADAGRAD (Duchi et al., 2011).
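The expectation form of the LB above lends itself to Monte Carlo estimation by sampling from $q_\zeta$. A sketch for a hypothetical conjugate toy model (one Gaussian observation with a Gaussian prior), whose exact posterior $\mathcal{N}(0.25, 0.5)$ is known and where the LB attains its maximum, $\log p(y)$:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_joint(theta):
    # Hypothetical toy model: prior N(0, 1), one observation y = 0.5 with
    # likelihood N(y | theta, 1); plays the role of log p(theta) + log p(y|theta)
    y = 0.5
    return -0.5 * theta ** 2 - 0.5 * (y - theta) ** 2 - np.log(2 * np.pi)

def elbo(mu, sigma, n=200_000):
    """Monte Carlo estimate of L(zeta) = E_q[log p(theta)p(y|theta) - log q(theta)]
    for the Gaussian variational posterior q_zeta = N(mu, sigma^2)."""
    theta = rng.normal(mu, sigma, size=n)
    log_q = -0.5 * np.log(2 * np.pi * sigma ** 2) - (theta - mu) ** 2 / (2 * sigma ** 2)
    return np.mean(log_joint(theta) - log_q)

# For this model the exact posterior is N(0.25, 0.5); the LB is largest there
# and drops for any other choice of the variational parameters
print(elbo(0.25, np.sqrt(0.5)), elbo(0.0, 1.0))
```

When $q_\zeta$ equals the exact posterior, the integrand is constant and the estimate equals $\log p(y)$ with zero Monte Carlo variance, which is why the first call is markedly larger than the second.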
Learning the parameter $\zeta$ by standard gradient descent is, however, problematic, as it ignores the information geometry of the distribution $q_\zeta$, is not scale invariant, is unstable, and is very susceptible to the initial values (Wierstra et al., 2014). SGD implicitly relies on the Euclidean norm for capturing the dissimilarity between two distributions, which can be a poor and misleading measure of dissimilarity (Khan & Nielsen, 2018). By using the KL divergence in place of the Euclidean norm, the SGD update results in the following natural-gradient update: $\zeta_{t+1} = \zeta_t + \beta_t \tilde{\nabla}_\zeta \mathcal{L}(\zeta)$. A major issue in following this approach is that $\zeta$ is unconstrained. Think of a Gaussian variational posterior: under the above update, there is no guarantee that the covariance matrix is iteratively updated onto a symmetric and positive-definite matrix. As discussed in the introduction, manifold optimization is an attractive possibility.
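A quick numeric illustration of this issue (hypothetical numbers, not from the paper): a plain additive step on a perfectly valid covariance matrix can leave the positive-definite cone.

```python
import numpy as np

# Current variational covariance: valid (symmetric, positive definite)
Sigma = np.eye(2)

# A hypothetical symmetric ascent direction for Sigma and an ordinary additive step
grad_Sigma = np.array([[-1.5, 0.2], [0.2, -0.3]])
beta = 1.0
Sigma_next = Sigma + beta * grad_Sigma

eigvals = np.linalg.eigvalsh(Sigma_next)
print(eigvals)  # the smallest eigenvalue is negative: Sigma_next is no covariance matrix
```

Nothing in the unconstrained update penalizes such a step, which is precisely what a manifold formulation rules out by construction.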

3. RELATED WORK

In Tran et al. (2021a), a $d$-dimensional Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ provides the fixed form of the variational posterior $q_\zeta$, with $\zeta = (\mu, \mathrm{vec}(\Sigma))$. There are no restrictions on $\mu$, yet the covariance matrix $\Sigma$ is constrained to the manifold $\mathcal{M}$ of symmetric positive-definite matrices, $\mathcal{M} = \{\Sigma \in \mathbb{R}^{d \times d} : \Sigma = \Sigma^\top,\ \Sigma \succ 0\}$, see e.g. (Abraham et al., 2012; Hu et al., 2020).
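For concreteness, a common second-order retraction on this manifold, $R_\Sigma(\xi) = \Sigma + \xi + \tfrac{1}{2}\xi\Sigma^{-1}\xi$, maps any symmetric tangent step $\xi$ back to a symmetric positive-definite matrix. A sketch (illustrative; not necessarily the exact retraction of the cited works):

```python
import numpy as np

def retract_spd(Sigma, xi):
    """Second-order retraction on the SPD manifold:
    R_Sigma(xi) = Sigma + xi + 0.5 * xi @ Sigma^{-1} @ xi.
    For symmetric xi the result is symmetric positive definite, since it equals
    Sigma^{1/2} (I + S + S^2/2) Sigma^{1/2} with S symmetric and
    1 + s + s^2/2 = ((1 + s)^2 + 1)/2 > 0 for every eigenvalue s of S."""
    inv_term = np.linalg.solve(Sigma, xi)   # Sigma^{-1} xi
    out = Sigma + xi + 0.5 * xi @ inv_term
    return 0.5 * (out + out.T)              # symmetrize against round-off

Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
xi = np.array([[-1.5, 0.2], [0.2, -0.3]])   # a step that breaks plain addition
Sigma_new = retract_spd(Sigma, xi)
print(np.linalg.eigvalsh(Sigma_new))        # all eigenvalues remain positive
```

With the same $\xi$, the naive update $\Sigma + \xi$ has a negative eigenvalue, while the retracted point stays on $\mathcal{M}$.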



In the natural-gradient update above, $\beta_t$ is a possibly adaptive learning rate, and $t$ denotes the iteration. The update results in improved steps towards the maximum of the LB when optimizing it for the variational parameter $\zeta$. The natural gradient $\tilde{\nabla}_\zeta \mathcal{L}(\zeta)$ is obtained by rescaling the Euclidean gradient $\nabla_\zeta \mathcal{L}(\zeta)$ by the inverse of the Fisher Information Matrix (FIM) $\mathcal{I}_\zeta$, i.e. $\tilde{\nabla}_\zeta \mathcal{L}(\zeta) = \mathcal{I}_\zeta^{-1} \nabla_\zeta \mathcal{L}(\zeta)$. For readability, we shall write $\mathcal{L}$ in place of $\mathcal{L}(\zeta)$.
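For intuition, in the univariate Gaussian case with $\zeta = (\mu, \sigma^2)$ the FIM is diagonal, $\mathcal{I}_\zeta = \mathrm{diag}(1/\sigma^2, 1/(2\sigma^4))$, so the natural-gradient update simply rescales each Euclidean gradient component. A sketch (illustrative only, not the EMGVB update), applied to a toy LB whose exact posterior is $\mathcal{N}(0.25, 0.5)$:

```python
import numpy as np

def natural_gradient_step(mu, sigma2, grad_mu, grad_sigma2, beta=0.1):
    """One natural-gradient ascent step for zeta = (mu, sigma^2) of a univariate
    Gaussian q_zeta. Since I_zeta = diag(1/sigma^2, 1/(2 sigma^4)), the natural
    gradient I_zeta^{-1} grad just multiplies the components by sigma^2 and
    2 sigma^4, respectively."""
    nat_grad_mu = sigma2 * grad_mu
    nat_grad_sigma2 = 2.0 * sigma2 ** 2 * grad_sigma2
    return mu + beta * nat_grad_mu, sigma2 + beta * nat_grad_sigma2

# Toy usage: the LB of a conjugate model with exact posterior N(m, s2) has
# Euclidean gradients (m - mu)/s2 and 1/(2 sigma2) - 1/(2 s2)
m, s2 = 0.25, 0.5
mu, sigma2 = 0.0, 4.0
for _ in range(200):
    g_mu = (m - mu) / s2
    g_s2 = 0.5 / sigma2 - 0.5 / s2
    mu, sigma2 = natural_gradient_step(mu, sigma2, g_mu, g_s2, beta=0.1)
print(mu, sigma2)  # converges to approximately (0.25, 0.5)
```

Note that $\sigma^2$ stays positive along this particular trajectory, but nothing in the update enforces it; in the multivariate case this is exactly the positive-definiteness problem the manifold formulation addresses.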

