MISSO: MINIMIZATION BY INCREMENTAL STOCHASTIC SURROGATE OPTIMIZATION FOR LARGE-SCALE NONCONVEX AND NONSMOOTH PROBLEMS

Abstract

Many constrained, nonconvex, and nonsmooth optimization problems can be tackled by the majorization-minimization (MM) method, which alternates between constructing a surrogate function that upper bounds the objective and minimizing this surrogate. When the objective is a finite sum of functions, a stochastic version of the MM method selects a batch of functions at random at each iteration and optimizes the accumulated surrogate. However, in many cases of interest, such as variational inference for latent variable models, the surrogate functions are expressed as an expectation. In this contribution, we propose a doubly stochastic MM method based on Monte Carlo approximations of these stochastic surrogates. We establish asymptotic and non-asymptotic convergence of our scheme in a constrained, nonconvex, and nonsmooth optimization setting. We apply our new framework to inference in a logistic regression model with missing data and to variational inference of Bayesian variants of LeNet-5 and ResNet-18 on the MNIST and CIFAR-10 datasets, respectively.

1. INTRODUCTION

We consider the constrained minimization of a finite sum of functions:

$$\min_{\theta \in \Theta} \; \mathcal{L}(\theta) := \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_i(\theta) \;, \tag{1}$$

where $\Theta$ is a convex, compact subset of $\mathbb{R}^p$ and, for any $i \in \llbracket 1, n \rrbracket$, the function $\mathcal{L}_i : \mathbb{R}^p \to \mathbb{R}$ is bounded from below and is (possibly) nonconvex and nonsmooth. To tackle the optimization problem (1), a popular approach is to apply the majorization-minimization (MM) method, which iteratively minimizes a majorizing surrogate function. A large number of existing procedures fall into this general framework, for instance gradient-based or proximal methods, the Expectation-Maximization (EM) algorithm (McLachlan & Krishnan, 2008), and some variational Bayes inference techniques (Jordan et al., 1999); see for example (Razaviyayn et al., 2013), (Lange, 2016), and the references therein. When the number of terms $n$ in (1) is large, the vanilla MM method may be intractable because it requires constructing a surrogate function for all $n$ terms $\mathcal{L}_i$ at each iteration. A remedy is the Minimization by Incremental Surrogate Optimization (MISO) method proposed by Mairal (2015), in which the surrogate functions are updated incrementally. The MISO method can be interpreted as a combination of MM with ideas that emerged for variance reduction in stochastic gradient methods (Schmidt et al., 2017); an extended analysis of MISO is given in (Qian et al., 2019). The success of the MISO method rests upon the efficient minimization of surrogates such as convex functions, see (Mairal, 2015, Section 2.3). A notable application of MISO-like algorithms is described in (Mensch et al., 2017), where the authors build upon the stochastic majorization-minimization framework of Mairal (2015) to introduce a method for sparse matrix factorization. Yet, in many applications of interest, the natural surrogate functions are intractable, although they are defined as expectations of tractable functions.
For instance, this is the case for inference in latent variable models via maximum likelihood (McLachlan & Krishnan, 2008). Another application is variational inference (Ghahramani, 2015), in which the goal is to approximate the posterior distribution of parameters given the observations; see for example (Neal, 2012; Blundell et al., 2015; Polson et al., 2017; Rezende et al., 2014; Li & Gal, 2017). This paper fills the gap in the literature by proposing a method called Minimization by Incremental Stochastic Surrogate Optimization (MISSO), designed for nonconvex and nonsmooth finite-sum optimization with a finite-time convergence guarantee. Our work aims at formulating a generic class of incremental stochastic surrogate methods for nonconvex optimization and building the theory needed to understand its behavior. In particular, we provide convergence guarantees for stochastic EM and variational inference-type methods under mild conditions. In summary, our contributions are:

• we propose a unifying framework of analysis for incremental stochastic surrogate optimization when the surrogates are defined as expectations of tractable functions. The proposed MISSO method is built on Monte Carlo integration of the intractable surrogate function, i.e., a doubly stochastic surrogate optimization scheme;

• we present incremental updates of the commonly used variational inference and Monte Carlo EM methods as special cases of our newly introduced framework, so that the analysis of these two algorithms is conducted under this unifying framework;

• we establish both asymptotic and non-asymptotic convergence for the MISSO method. In particular, the MISSO method converges almost surely to a stationary point and reaches an $\epsilon$-stationary point within $\mathcal{O}(n/\epsilon)$ iterations, see Theorem 1;

• in essence, we relax the class of surrogate functions used in MISO (Mairal, 2015) and allow for intractable surrogates that can only be evaluated by Monte Carlo approximations.
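The doubly stochastic pattern (random component selection plus Monte Carlo approximation of an expected surrogate) can be sketched on a toy Gaussian latent-variable model. This model, its posterior, and the growing sample-size schedule are our illustrative construction, not one of the paper's experiments.

```python
import numpy as np

# Toy latent-variable model: z ~ N(theta, 1) and y_i | z ~ N(z, 1), so that
# marginally y_i ~ N(theta, 2) and the exact MLE is theta* = mean(y).
# The EM surrogate of L_i at an anchor theta_bar,
#   Q_i(theta; theta_bar) = E_{z ~ p(z | y_i; theta_bar)}[(z - theta)^2 / 2] + cst,
# is an expectation; a MISSO-style scheme replaces it by a Monte Carlo average.
rng = np.random.default_rng(1)
n = 50
y = rng.normal(loc=1.0, scale=np.sqrt(2.0), size=n)

s = np.zeros(n)   # s_i: Monte Carlo estimate of E[z | y_i; anchor of surrogate i]
theta = 0.0

for k in range(2000):
    i = rng.integers(n)   # stochasticity 1: random component selection
    M = k + 10            # growing Monte Carlo batch size
    # exact posterior here: p(z | y_i; theta) = N((y_i + theta)/2, 1/2)
    z = rng.normal((y[i] + theta) / 2.0, np.sqrt(0.5), size=M)
    s[i] = z.mean()       # stochasticity 2: Monte Carlo surrogate refresh
    theta = s.mean()      # minimizer of (1/n) sum_i mean_m (z_i^m - theta)^2 / 2

print(abs(theta - y.mean()))  # small: theta approaches the exact MLE
```

Despite never evaluating the expected surrogate exactly, the iterates settle near the maximum likelihood estimate, which is the behavior the convergence analysis below makes precise in far greater generality.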
Working at the crossroads of optimization and sampling constitutes what we believe to be the novelty and the technicality of our framework and theoretical results. In Section 2, we review the techniques for incremental minimization of finite-sum functions based on the MM principle; specifically, we review the MISO method (Mairal, 2015) and present a class of surrogate functions expressed as an expectation over a latent space. The MISSO method is then introduced for this class of intractable surrogate functions requiring approximation. In Section 3, we provide the asymptotic and non-asymptotic convergence analysis of the MISSO method (and of MISO (Mairal, 2015) as a special case). Section 4 presents numerical applications, including parameter inference for logistic regression with missing data and variational inference for two types of Bayesian neural networks. The proofs of the theoretical results are given in the Supplement.

Notation. We denote $\llbracket 1, n \rrbracket = \{1, \ldots, n\}$. Unless otherwise specified, $\|\cdot\|$ denotes the standard Euclidean norm and $\langle \cdot \,|\, \cdot \rangle$ the associated inner product. For any function $f : \Theta \to \mathbb{R}$, $f'(\theta, d)$ is the directional derivative of $f$ at $\theta$ along the direction $d$, i.e.,

$$f'(\theta, d) := \lim_{t \to 0^+} \frac{f(\theta + t d) - f(\theta)}{t} \;.$$

The directional derivative is assumed to exist for the functions introduced throughout this paper.
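As a quick numerical illustration of the directional-derivative notation on a nonsmooth function (a toy check of ours, not part of the paper): for $f(\theta) = \|\theta\|_1$ at $\theta = 0$, the one-sided limit exists along every direction even though $f$ is not differentiable there.

```python
import numpy as np

# Finite-difference check of the directional derivative for the nonsmooth
# function f(theta) = ||theta||_1 at theta = 0 along d = (1, -2); the
# one-sided limit equals |1| + |-2| = 3.
def f(theta):
    return np.abs(theta).sum()

theta = np.zeros(2)
d = np.array([1.0, -2.0])
t = 1e-8  # small positive step, approximating the limit t -> 0+
dd = (f(theta + t * d) - f(theta)) / t
print(dd)  # approximately 3.0
```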

2. INCREMENTAL MINIMIZATION OF FINITE SUM NONCONVEX FUNCTIONS

The objective function in (1) is composed of a finite sum of possibly nonsmooth and nonconvex functions. A popular approach here is to apply the MM method, which tackles (1) by alternating between two steps: (i) minimizing a surrogate function which upper bounds the original objective function; and (ii) updating the surrogate function to tighten the upper bound. As mentioned in the introduction, the MISO method (Mairal, 2015) is an iterative scheme that updates only part of the surrogate functions at each iteration. Formally, for any $i \in \llbracket 1, n \rrbracket$, we consider a surrogate function $\widehat{\mathcal{L}}_i(\theta; \bar{\theta})$ which satisfies the assumptions (H1, H2):

H1. For all $i \in \llbracket 1, n \rrbracket$ and $\bar{\theta} \in \Theta$, the function $\theta \mapsto \widehat{\mathcal{L}}_i(\theta; \bar{\theta})$ is convex and satisfies
$$\widehat{\mathcal{L}}_i(\theta; \bar{\theta}) \ge \mathcal{L}_i(\theta) \quad \text{for all } \theta \in \Theta \;,$$
with equality when $\theta = \bar{\theta}$.
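One classical surrogate satisfying (H1) is the Lipschitz-gradient quadratic majorizer, an illustrative instance rather than the only one the framework allows. The sketch below checks both the majorization and the tightness conditions numerically on a one-dimensional logistic loss.

```python
import numpy as np

# Numerical check of (H1) for the Lipschitz-gradient quadratic surrogate on
# the 1-D loss L(t) = log(1 + exp(-t)), whose gradient is 1/4-Lipschitz:
#   Lhat(t; tbar) = L(tbar) + L'(tbar)*(t - tbar) + (Lip/2)*(t - tbar)^2.
def loss(t):
    return np.log1p(np.exp(-t))

def grad(t):
    return -1.0 / (1.0 + np.exp(t))

Lip = 0.25  # Lipschitz constant of the gradient

def surrogate(t, tbar):
    return loss(tbar) + grad(tbar) * (t - tbar) + 0.5 * Lip * (t - tbar) ** 2

tbar = 0.7
ts = np.linspace(-5.0, 5.0, 1001)
gap = surrogate(ts, tbar) - loss(ts)

print(gap.min() >= -1e-12)                       # majorization: Lhat >= L on the grid
print(abs(surrogate(tbar, tbar) - loss(tbar)))   # tightness at t = tbar: zero
```

The surrogate is convex in $\theta$ (a quadratic with positive curvature), sits above the loss everywhere, and touches it exactly at the anchor point, which is precisely the structure (H1) demands.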

