MISSO: MINIMIZATION BY INCREMENTAL STOCHASTIC SURROGATE OPTIMIZATION FOR LARGE SCALE NONCONVEX AND NONSMOOTH PROBLEMS

Abstract

Many constrained, nonconvex and nonsmooth optimization problems can be tackled using the majorization-minimization (MM) method, which alternates between constructing a surrogate function that upper bounds the objective function and minimizing this surrogate. For problems which minimize a finite sum of functions, a stochastic version of the MM method selects a batch of functions at random at each iteration and optimizes the accumulated surrogate. However, in many cases of interest, such as variational inference for latent variable models, the surrogate functions are expressed as an expectation. In this contribution, we propose a doubly stochastic MM method based on Monte Carlo approximation of these stochastic surrogates. We establish asymptotic and non-asymptotic convergence of our scheme in a constrained, nonconvex, nonsmooth optimization setting. We apply our new framework to inference for a logistic regression model with missing data and to variational inference for Bayesian variants of LeNet-5 and ResNet-18 on the MNIST and CIFAR-10 datasets, respectively.

1. INTRODUCTION

We consider the constrained minimization problem of a finite sum of functions:

$$\min_{\theta \in \Theta} \mathcal{L}(\theta) := \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_i(\theta), \tag{1}$$

where $\Theta$ is a convex, compact, and closed subset of $\mathbb{R}^p$, and for any $i \in \llbracket 1, n \rrbracket$, the function $\mathcal{L}_i : \mathbb{R}^p \to \mathbb{R}$ is bounded from below and is (possibly) nonconvex and nonsmooth. To tackle the optimization problem (1), a popular approach is to apply the majorization-minimization (MM) method, which iteratively minimizes a majorizing surrogate function. A large number of existing procedures fall into this general framework, for instance gradient-based or proximal methods, the Expectation-Maximization (EM) algorithm (McLachlan & Krishnan, 2008) and some variational Bayes inference techniques (Jordan et al., 1999); see for example (Razaviyayn et al., 2013) and (Lange, 2016) and the references therein. When the number of terms $n$ in (1) is large, the vanilla MM method may be intractable because it requires constructing a surrogate function for all the $n$ terms $\mathcal{L}_i$ at each iteration. Here, a remedy is to apply the Minimization by Incremental Surrogate Optimization (MISO) method proposed by Mairal (2015), where the surrogate functions are updated incrementally. The MISO method can be interpreted as a combination of MM and ideas which have emerged for variance reduction in stochastic gradient methods (Schmidt et al., 2017). An extended analysis of MISO has been proposed in (Qian et al., 2019). The success of the MISO method rests upon the efficient minimization of surrogates such as convex functions, see (Mairal, 2015, Section 2.3). A notable application of MISO-like algorithms is described in (Mensch et al., 2017), where the authors build upon the stochastic majorization-minimization framework of Mairal (2015) to introduce a method for sparse matrix factorization. However, in many applications of interest, the natural surrogate functions are intractable, although they are defined as expectations of tractable functions.
For instance, this is the case for inference in latent variable models via maximum likelihood (McLachlan & Krishnan, 2008). Another application is variational inference (Ghahramani, 2015), in which the goal is to approximate the posterior distribution of parameters given the observations; see for example (Neal, 2012; Blundell et al., 2015; Polson et al., 2017; Rezende et al., 2014; Li & Gal, 2017). This paper fills the gap in the literature by proposing a method called Minimization by Incremental Stochastic Surrogate Optimization (MISSO), designed for nonconvex and nonsmooth finite sum optimization, with a finite-time convergence guarantee. Our work aims at formulating a generic class of incremental stochastic surrogate methods for nonconvex optimization and building the theory to understand its behavior. In particular, we provide convergence guarantees for stochastic EM and Variational Inference-type methods, under mild conditions. In summary, our contributions are:
• we propose a unifying framework of analysis for incremental stochastic surrogate optimization when the surrogates are defined as expectations of tractable functions. The proposed MISSO method is built on the Monte Carlo integration of the intractable surrogate function, i.e., a doubly stochastic surrogate optimization scheme.
• we present an incremental update of the commonly used variational inference and Monte Carlo EM methods as special cases of our newly introduced framework. The analysis of those two algorithms is thus conducted under this unifying framework.
• we establish both asymptotic and non-asymptotic convergence for the MISSO method. In particular, the MISSO method converges almost surely to a stationary point and in $\mathcal{O}(n/\epsilon)$ iterations to an $\epsilon$-stationary point, see Theorem 1.
• in essence, we relax the class of surrogate functions used in MISO (Mairal, 2015) and allow for intractable surrogates that can only be evaluated by Monte Carlo approximations.
Working at the crossroads of Optimization and Sampling constitutes what we believe to be the novelty and the technicality of our framework and theoretical results. In Section 2, we review the techniques for incremental minimization of finite sum functions based on the MM principle; specifically, we review the MISO method (Mairal, 2015), and present a class of surrogate functions expressed as an expectation over a latent space. The MISSO method is then introduced for the latter class of intractable surrogate functions requiring approximation. In Section 3, we provide the asymptotic and non-asymptotic convergence analysis for the MISSO method (and for the MISO method (Mairal, 2015) as a special case). Section 4 presents numerical applications including parameter inference for logistic regression with missing data and variational inference for two types of Bayesian neural networks. The proofs of the theoretical results are reported in the Supplement.

Notations. We denote $\llbracket 1, n \rrbracket = \{1, \dots, n\}$. Unless otherwise specified, $\|\cdot\|$ denotes the standard Euclidean norm and $\langle \cdot \,|\, \cdot \rangle$ is the inner product in the Euclidean space. For any function $f : \Theta \to \mathbb{R}$, $f'(\theta, d)$ is the directional derivative of $f$ at $\theta$ along the direction $d$, i.e.,

$$f'(\theta, d) := \lim_{t \to 0^+} \frac{f(\theta + t d) - f(\theta)}{t}.$$

The directional derivative is assumed to exist for the functions introduced throughout this paper.

2. INCREMENTAL MINIMIZATION OF FINITE SUM NONCONVEX FUNCTIONS

The objective function in (1) is composed of a finite sum of possibly nonsmooth and nonconvex functions. A popular approach here is to apply the MM method, which tackles (1) by alternating between two steps: (i) minimizing a surrogate function which upper bounds the original objective function; and (ii) updating the surrogate function to tighten the upper bound. As mentioned in the introduction, the MISO method (Mairal, 2015) is developed as an iterative scheme that only updates the surrogate functions partially at each iteration. Formally, for any $i \in \llbracket 1, n \rrbracket$, we consider a surrogate function $\widehat{\mathcal{L}}_i(\theta; \overline{\theta})$ which satisfies the assumptions (H1, H2):

H1. For all $i \in \llbracket 1, n \rrbracket$ and $\overline{\theta} \in \Theta$, $\widehat{\mathcal{L}}_i(\theta; \overline{\theta})$ is convex w.r.t. $\theta$, and it holds

$$\widehat{\mathcal{L}}_i(\theta; \overline{\theta}) \ge \mathcal{L}_i(\theta), \quad \forall\, \theta \in \Theta,$$

where the equality holds when $\theta = \overline{\theta}$.

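H2. For any $\theta_i \in \Theta$, $i \in \llbracket 1, n \rrbracket$ and some $\epsilon > 0$, the difference function $\widehat{e}(\theta; \{\theta_i\}_{i=1}^n) := \frac{1}{n} \sum_{i=1}^n \widehat{\mathcal{L}}_i(\theta; \theta_i) - \mathcal{L}(\theta)$ is defined for all $\theta \in \Theta_\epsilon$ and differentiable for all $\theta \in \Theta$, where $\Theta_\epsilon = \{\theta \in \mathbb{R}^d,\ \inf_{\theta' \in \Theta} \|\theta - \theta'\| < \epsilon\}$ is an $\epsilon$-neighborhood set of $\Theta$. Moreover, for some constant $L$, the gradient satisfies

$$\|\nabla \widehat{e}(\theta; \{\theta_i\}_{i=1}^n)\|^2 \le 2L\, \widehat{e}(\theta; \{\theta_i\}_{i=1}^n), \quad \forall\, \theta \in \Theta. \tag{3}$$

Algorithm 1 The MISO method (Mairal, 2015).
1: Input: initialization $\theta^{(0)}$.
2: Initialize the surrogate function as $\widehat{A}_i^0(\theta) := \widehat{\mathcal{L}}_i(\theta; \theta^{(0)})$, $i \in \llbracket 1, n \rrbracket$.
3: for $k = 0, 1, \dots, K_{\max}$ do
4:   Pick $i_k$ uniformly from $\llbracket 1, n \rrbracket$.
5:   Update $\widehat{A}_i^{k+1}(\theta)$ as: $\widehat{A}_i^{k+1}(\theta) = \widehat{\mathcal{L}}_i(\theta; \theta^{(k)})$ if $i = i_k$, and $\widehat{A}_i^{k+1}(\theta) = \widehat{A}_i^{k}(\theta)$ otherwise.
6:   Set $\theta^{(k+1)} \in \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \widehat{A}_i^{k+1}(\theta)$.
7: end for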
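As an illustration of Algorithm 1, the following minimal sketch runs MISO with quadratic surrogates on a one-dimensional toy problem. The toy objective $\mathcal{L}_i(\theta) = \log(1 + (\theta - a_i)^2)$, the interval constraint, and the names `grad_li` and `miso_quadratic` are assumptions made for this example only; they are not taken from the paper. With surrogates of common curvature, the minimizer in Line 6 has a closed form, so only one anchor point and one gradient per function need to be stored.

```python
import random


def grad_li(theta, a):
    """Derivative of the nonconvex toy term L_i(theta) = log(1 + (theta - a)^2)."""
    u = theta - a
    return 2.0 * u / (1.0 + u * u)


def miso_quadratic(a_list, theta0, curvature, lo, hi, num_iters, seed=0):
    """MISO (Algorithm 1) with quadratic surrogates on the interval [lo, hi].

    Surrogate i is L_i(anchor_i) + grad_i * (theta - anchor_i)
    + (curvature / 2) * (theta - anchor_i)^2, so storing (anchor_i, grad_i)
    is enough to minimise the averaged surrogate.
    """
    rng = random.Random(seed)
    n = len(a_list)
    # Line 2: every surrogate is initialised at theta0.
    anchors = [theta0] * n
    grads = [grad_li(theta0, a) for a in a_list]
    theta = theta0
    for _ in range(num_iters):
        i = rng.randrange(n)                  # Line 4: pick i_k uniformly
        anchors[i] = theta                    # Line 5: refresh one surrogate
        grads[i] = grad_li(theta, a_list[i])
        # Line 6: closed-form minimiser of the averaged quadratics,
        # clipped to [lo, hi] (valid because the problem is one-dimensional).
        unconstrained = sum(anchors) / n - sum(grads) / (n * curvature)
        theta = min(hi, max(lo, unconstrained))
    return theta


# The second derivative of the toy L_i is at most 2, so curvature = 2
# makes each quadratic a valid upper bound (assumption H1).
data = [0.0, 0.2, 0.4, 0.6, 0.8]
theta_hat = miso_quadratic(data, theta0=3.0, curvature=2.0,
                           lo=-5.0, hi=5.0, num_iters=3000)
full_grad = sum(grad_li(theta_hat, a) for a in data) / len(data)
```

For this symmetric data set the only stationary point in the interval is $\theta = 0.4$, so the final iterate can be checked against it. The per-iteration cost touches a single $\mathcal{L}_i$, which is the point of the incremental scheme.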

We remark that H1 is a common assumption used for surrogate functions, see (Mairal, 2015, Section 2.3). H2 can be satisfied when the difference function $\widehat{e}(\theta; \{\theta_i\}_{i=1}^n)$ is $L$-smooth, i.e., $\widehat{e}$ is differentiable on $\Theta$ and its gradient $\nabla \widehat{e}$ is $L$-Lipschitz, $\forall\, \theta \in \Theta$. H2 can be implied by applying (Razaviyayn et al., 2013, Proposition 1).

The inequality (3) implies $\widehat{\mathcal{L}}_i(\theta; \overline{\theta}) \ge \mathcal{L}_i(\theta) > -\infty$ for any $\theta \in \Theta$. The MISO method is an incremental version of the MM method, as summarized by Algorithm 1, which shows that the MISO method maintains an iteratively updated set of upper-bounding surrogate functions $\{\widehat{A}_i^k(\theta)\}_{i=1}^n$ and updates the iterate via minimizing the average of the surrogate functions. In particular, only one out of the $n$ surrogate functions is updated at each iteration [cf. Line 5] and the sum function $\frac{1}{n} \sum_{i=1}^n \widehat{A}_i^{k+1}(\theta)$ is designed to be 'easy to optimize', which, for example, can be a sum of quadratic functions. As such, the MISO method is suitable for large-scale optimization as the computation cost per iteration is independent of $n$. Under H1, H2, it was shown that the MISO method converges almost surely to a stationary point of (1) (Mairal, 2015, Prop. 3.1).

We now consider the case when the surrogate functions $\widehat{\mathcal{L}}_i(\theta; \overline{\theta})$ are intractable. Let $\mathsf{Z}$ be a measurable set, $p_i : \mathsf{Z} \times \Theta \to \mathbb{R}_+$ a probability density function, $r_i : \Theta \times \Theta \times \mathsf{Z} \to \mathbb{R}$ a measurable function and $\mu_i$ a $\sigma$-finite measure. We consider surrogate functions which satisfy H1, H2 and that can be expressed as an expectation, i.e.:

$$\widehat{\mathcal{L}}_i(\theta; \overline{\theta}) := \int_{\mathsf{Z}} r_i(\theta; \overline{\theta}, z_i)\, p_i(z_i; \overline{\theta})\, \mu_i(dz_i) \quad \forall\, (\theta, \overline{\theta}) \in \Theta \times \Theta. \tag{5}$$

Plugging (5) into the MISO method is not feasible since the update in Step 6 involves a minimization of an expectation. Several motivating examples of (1) are given in Section 2. In this paper, we propose the Minimization by Incremental Stochastic Surrogate Optimization (MISSO) method, which replaces the expectation in (5) by Monte Carlo integration and then optimizes the objective function (1) in an incremental manner. Denote by $M \in \mathbb{N}$ the Monte Carlo batch size and let $\{z_m \in \mathsf{Z}\}_{m=1}^M$ be a set of samples. These samples can be drawn (Case 1) i.i.d. from the distribution $p_i(\cdot; \overline{\theta})$ or (Case 2) from a Markov chain with stationary distribution $p_i(\cdot; \overline{\theta})$; see Section 3 for illustrations.
To this end, we define the stochastic surrogate as follows:

$$\widetilde{\mathcal{L}}_i(\theta; \overline{\theta}, \{z_m\}_{m=1}^M) := \frac{1}{M} \sum_{m=1}^M r_i(\theta; \overline{\theta}, z_m), \tag{6}$$

and we summarize the proposed MISSO method in Algorithm 2. Compared to the MISO method, there is a crucial difference in that the MISSO method involves two types of randomness. The first level of randomness comes from the selection of $i_k$ in Line 5. The second level of randomness stems from the set of Monte Carlo approximated functions $\widetilde{A}_i^k(\theta)$ used in lieu of $\widehat{A}_i^k(\theta)$ in Line 6 when optimizing for the next iterate $\theta^{(k)}$.

Algorithm 2 The MISSO method.
1: Input: initialization $\theta^{(0)}$; a sequence of non-negative numbers $\{M_{(k)}\}_{k=0}^{\infty}$.
2: For all $i \in \llbracket 1, n \rrbracket$, draw $M_{(0)}$ Monte Carlo samples with the stationary distribution $p_i(\cdot; \theta^{(0)})$.
3: Initialize the surrogate function as $\widetilde{A}_i^0(\theta) := \widetilde{\mathcal{L}}_i(\theta; \theta^{(0)}, \{z_{i,m}\}_{m=1}^{M_{(0)}})$, $i \in \llbracket 1, n \rrbracket$.
4: for $k = 0, 1, \dots, K_{\max}$ do
5:   Pick a function index $i_k$ uniformly on $\llbracket 1, n \rrbracket$.
6:   Draw $M_{(k)}$ Monte Carlo samples with the stationary distribution $p_i(\cdot; \theta^{(k)})$.
7:   Update the individual surrogate functions recursively as: $\widetilde{A}_i^{k+1}(\theta) = \widetilde{\mathcal{L}}_i(\theta; \theta^{(k)}, \{z_{i,m}^{(k)}\}_{m=1}^{M_{(k)}})$ if $i = i_k$, and $\widetilde{A}_i^{k+1}(\theta) = \widetilde{A}_i^{k}(\theta)$ otherwise.
8:   Set $\theta^{(k+1)} \in \arg\min_{\theta \in \Theta} \widetilde{\mathcal{L}}^{(k+1)}(\theta) := \frac{1}{n} \sum_{i=1}^n \widetilde{A}_i^{k+1}(\theta)$.
9: end for

We now discuss two applications of the MISSO method.

Example 1: Maximum Likelihood Estimation for Latent Variable Model. Latent variable models (Bishop, 2006) are constructed by introducing unobserved (latent) variables which help explain the observed data. We consider $n$ independent observations $((y_i, z_i), i \in \llbracket 1, n \rrbracket)$ where $y_i$ is observed and $z_i$ is latent. In this incomplete data framework, define $\{f_i(z_i, \theta), \theta \in \Theta\}$ to be the complete data likelihood models, i.e., the joint likelihood of the observations and latent variables. Let

$$g_i(\theta) := \int_{\mathsf{Z}} f_i(z_i, \theta)\, \mu_i(dz_i), \quad i \in \llbracket 1, n \rrbracket,\ \theta \in \Theta$$

denote the incomplete data likelihood, i.e., the marginal likelihood of the observations $y_i$. For ease of notation, the dependence on the observations is made implicit. The maximum likelihood (ML) estimation problem sets the individual objective function $\mathcal{L}_i(\theta)$ to be the $i$-th negated incomplete data log-likelihood $\mathcal{L}_i(\theta) := -\log g_i(\theta)$. Assume, without loss of generality, that $g_i(\theta) \neq 0$ for all $\theta \in \Theta$. We define by $p_i(z_i, \theta) := f_i(z_i, \theta)/g_i(\theta)$ the conditional distribution of the latent variable $z_i$ given the observations $y_i$. A surrogate function $\widehat{\mathcal{L}}_i(\theta; \overline{\theta})$ satisfying H1 can be obtained through writing $f_i(z_i, \theta) = \frac{f_i(z_i, \theta)}{p_i(z_i, \overline{\theta})}\, p_i(z_i, \overline{\theta})$ and applying the Jensen inequality:

$$\widehat{\mathcal{L}}_i(\theta; \overline{\theta}) = \int_{\mathsf{Z}} \underbrace{\log\big(p_i(z_i, \overline{\theta}) / f_i(z_i, \theta)\big)}_{= r_i(\theta; \overline{\theta}, z_i)}\, p_i(z_i, \overline{\theta})\, \mu_i(dz_i).$$

We note that H2 can also be verified for common distribution models. We can apply the MISSO method following the above specification of $r_i(\theta; \overline{\theta}, z_i)$ and $p_i(z_i, \overline{\theta})$.

Example 2: Variational Inference. Let $((x_i, y_i), i \in \llbracket 1, n \rrbracket)$ be i.i.d. input-output pairs and $w \in \mathsf{W} \subseteq \mathbb{R}^d$ be a latent variable. When conditioned on the input data $x = (x_i, i \in \llbracket 1, n \rrbracket)$, the joint distribution of $y = (y_i, i \in \llbracket 1, n \rrbracket)$ and $w$ is given by:

$$p(y, w | x) = \pi(w) \prod_{i=1}^n p(y_i | x_i, w). \tag{8}$$

Our goal is to compute the posterior distribution $p(w | y, x)$. In most cases, the posterior distribution $p(w | y, x)$ is intractable and is approximated using a family of parametric distributions, $\{q(w, \theta), \theta \in \Theta\}$.
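The variational inference (VI) problem (Blei et al., 2017) boils down to minimizing the Kullback-Leibler (KL) divergence between $q(w, \theta)$ and the posterior distribution $p(w | y, x)$:

$$\min_{\theta \in \Theta} \mathcal{L}(\theta) := \mathrm{KL}\big(q(w; \theta)\, \|\, p(w | y, x)\big) := \mathbb{E}_{q(w; \theta)}\big[\log\big(q(w; \theta) / p(w | y, x)\big)\big]. \tag{9}$$

Using (8), we decompose $\mathcal{L}(\theta) = n^{-1} \sum_{i=1}^n \mathcal{L}_i(\theta) + \text{const.}$ where:

$$\mathcal{L}_i(\theta) := -\mathbb{E}_{q(w; \theta)}\big[\log p(y_i | x_i, w)\big] + \frac{1}{n} \mathbb{E}_{q(w; \theta)}\big[\log q(w; \theta) / \pi(w)\big] := r_i(\theta) + d(\theta). \tag{10}$$

Directly optimizing the finite sum objective function in (9) can be difficult. First, with $n \gg 1$, evaluating the objective function $\mathcal{L}(\theta)$ requires a full pass over the entire dataset. Second, for some complex models, the expectations in (10) can be intractable even if we assume a simple parametric model for $q(w; \theta)$. Assume that $\mathcal{L}_i$ is $L$-smooth. We apply the MISSO method with a quadratic surrogate function defined as:

$$\widehat{\mathcal{L}}_i(\theta; \overline{\theta}) := \mathcal{L}_i(\overline{\theta}) + \big\langle \nabla_\theta \mathcal{L}_i(\overline{\theta}) \,|\, \theta - \overline{\theta} \big\rangle + \frac{L}{2} \|\overline{\theta} - \theta\|^2, \quad (\theta, \overline{\theta}) \in \Theta^2. \tag{11}$$

It is easily checked that the quadratic function $\widehat{\mathcal{L}}_i(\theta; \overline{\theta})$ satisfies H1, H2. To compute the gradient $\nabla \mathcal{L}_i(\overline{\theta})$, we apply the re-parametrization technique suggested in (Paisley et al., 2012; Kingma & Welling, 2014; Blundell et al., 2015). Let $t : \mathbb{R}^d \times \Theta \to \mathbb{R}^d$ be a differentiable function w.r.t. $\theta \in \Theta$ which is designed such that the law of $w = t(z, \overline{\theta})$ is $q(\cdot, \overline{\theta})$, where $z \sim \mathcal{N}_d(0, I)$. By (Blundell et al., 2015, Proposition 1), the gradient of $-r_i(\cdot)$ in (10) is:

$$\nabla_\theta \mathbb{E}_{q(w; \overline{\theta})}\big[\log p(y_i | x_i, w)\big] = \mathbb{E}_{z \sim \mathcal{N}_d(0, I)}\Big[ J_\theta^t(z, \overline{\theta})\, \nabla_w \log p(y_i | x_i, w)\big|_{w = t(z, \overline{\theta})} \Big], \tag{12}$$

where for each $z \in \mathbb{R}^d$, $J_\theta^t(z, \overline{\theta})$ is the Jacobian of the function $t(z, \cdot)$ with respect to $\theta$ evaluated at $\overline{\theta}$. In addition, for most cases, the term $\nabla d(\overline{\theta})$ can be evaluated in closed form as the gradient of the KL between the prior distribution $\pi(\cdot)$ and the variational candidate $q(\cdot, \theta)$. Accordingly, we set:

$$r_i(\theta; \overline{\theta}, z) := \Big\langle \nabla_\theta d(\overline{\theta}) - J_\theta^t(z, \overline{\theta})\, \nabla_w \log p(y_i | x_i, w)\big|_{w = t(z, \overline{\theta})} \,\Big|\, \theta - \overline{\theta} \Big\rangle + \frac{L}{2} \|\theta - \overline{\theta}\|^2. \tag{13}$$

Finally, using (11) and (13), the surrogate function (6) is given by $\widetilde{\mathcal{L}}_i(\theta; \overline{\theta}, \{z_m\}_{m=1}^M) := M^{-1} \sum_{m=1}^M r_i(\theta; \overline{\theta}, z_m)$ where $\{z_m\}_{m=1}^M$ are i.i.d. samples drawn from $\mathcal{N}(0, I)$.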
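To make the doubly stochastic update of Algorithm 2 concrete, here is a minimal sketch on a toy problem in which each gradient is an expectation over $z \sim \mathcal{N}(0, 1)$. The toy objective $\mathcal{L}_i(\theta) = \mathbb{E}_z[\tfrac{1}{2}(\theta + z - a_i)^2]$, the interval constraint, and the names `mc_gradient` and `misso_quadratic` are assumptions made for illustration, not the paper's experimental setup. The stochastic surrogate mirrors the quadratic form of (13): a Monte Carlo gradient estimate at the anchor plus a quadratic penalty.

```python
import random


def mc_gradient(theta_bar, a, num_samples, rng):
    """Monte Carlo estimate of the gradient of E_z[0.5 * (theta + z - a)^2]
    at theta_bar, with z ~ N(0, 1) drawn i.i.d. (Case 1 sampling)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        total += theta_bar + z - a
    return total / num_samples


def misso_quadratic(a_list, theta0, curvature, lo, hi, num_iters,
                    batch_size, seed=0):
    """MISSO (Algorithm 2) with quadratic stochastic surrogates on [lo, hi].

    batch_size(k) returns the Monte Carlo batch size M_(k) at iteration k.
    """
    rng = random.Random(seed)
    n = len(a_list)
    # Lines 2-3: initialise every stochastic surrogate at theta0.
    anchors = [theta0] * n
    grads = [mc_gradient(theta0, a, batch_size(0), rng) for a in a_list]
    theta = theta0
    for k in range(num_iters):
        i = rng.randrange(n)                                  # Line 5
        anchors[i] = theta                                    # Lines 6-7
        grads[i] = mc_gradient(theta, a_list[i], batch_size(k), rng)
        # Line 8: closed-form minimiser of the averaged quadratics, clipped.
        unconstrained = sum(anchors) / n - sum(grads) / (n * curvature)
        theta = min(hi, max(lo, unconstrained))
    return theta


targets = [0.0, 0.2, 0.4, 0.6, 0.8]
theta_misso = misso_quadratic(targets, theta0=3.0, curvature=2.0,
                              lo=-5.0, hi=5.0, num_iters=120,
                              batch_size=lambda k: 10 + k * k)
```

Here the exact minimizer is the mean of `targets`, so the output can be compared with 0.4. The growing batch size `10 + k * k` follows the schedule reported in the logistic regression experiment; the choice of 120 iterations is arbitrary.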

3. CONVERGENCE ANALYSIS

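We now provide asymptotic and non-asymptotic convergence results of our method. Assume:

H3. For all $i \in \llbracket 1, n \rrbracket$, $\overline{\theta} \in \Theta$, $z_i \in \mathsf{Z}$, $r_i(\cdot; \overline{\theta}, z_i)$ is convex on $\Theta$ and is lower bounded.

We are particularly interested in the constrained optimization setting where $\Theta$ is a bounded set. To this end, we control the supremum norm of the MC approximation, introduced in (6), as:

H4. For the samples $\{z_{i,m}\}_{m=1}^M$, there exist finite constants $C_r$ and $C_{gr}$ such that

$$C_r := \sup_{\overline{\theta} \in \Theta} \sup_{M > 0} \frac{1}{\sqrt{M}}\, \mathbb{E}_{\overline{\theta}} \left[ \sup_{\theta \in \Theta} \left| \sum_{m=1}^M \left\{ r_i(\theta; \overline{\theta}, z_{i,m}) - \widehat{\mathcal{L}}_i(\theta; \overline{\theta}) \right\} \right| \right]$$

$$C_{gr} := \sup_{\overline{\theta} \in \Theta} \sup_{M > 0} \sqrt{M}\, \mathbb{E}_{\overline{\theta}} \left[ \sup_{\theta \in \Theta} \left| \frac{1}{M} \sum_{m=1}^M \frac{\widehat{\mathcal{L}}_i'(\theta, \overline{\theta} - \theta; \overline{\theta}) - r_i'(\theta, \overline{\theta} - \theta; \overline{\theta}, z_{i,m})}{\|\overline{\theta} - \theta\|} \right|^2 \right]$$

for all $i \in \llbracket 1, n \rrbracket$, and we denoted by $\mathbb{E}_{\overline{\theta}}[\cdot]$ the expectation w.r.t. a Markov chain $\{z_{i,m}\}_{m=1}^M$ with initial distribution $\xi_i(\cdot; \overline{\theta})$, transition kernel $\Pi_{i, \overline{\theta}}$, and stationary distribution $p_i(\cdot; \overline{\theta})$.

Some intuitions behind the controlling terms: It is common in statistical and optimization problems to deal with the manipulation and the control of random variables indexed by sets with an infinite number of elements. Here, the controlled random variable is an image of a continuous function defined as $r_i(\theta; \overline{\theta}, z_{i,m}) - \widehat{\mathcal{L}}_i(\theta; \overline{\theta})$ for all $z \in \mathsf{Z}$ and for fixed $(\theta, \overline{\theta}) \in \Theta^2$. To characterize such control, we will have recourse to the notion of metric entropy (or bracketing number) as developed in (Van der Vaart, 2000; Vershynin, 2018; Wainwright, 2019). A collection of results from those references gives intuition behind our assumption H4, which is classical in empirical processes. In (Vershynin, 2018, Theorem 8.2.3), the author recalls the uniform law of large numbers:

$$\mathbb{E}\left[ \sup_{f \in \mathcal{F}} \left| \frac{1}{M} \sum_{i=1}^M f(z_{i,m}) - \mathbb{E}[f(z_i)] \right| \right] \le \frac{C L}{\sqrt{M}} \quad \text{for all } z_{i,m},\ i \in \llbracket 1, M \rrbracket,$$

where $\mathcal{F}$ is a class of $L$-Lipschitz functions.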
Moreover, in (Vershynin, 2018, Theorem 8.1.3) and (Wainwright, 2019, Theorem 5.22), the application of the Dudley inequality yields:

$$\mathbb{E}\Big[\sup_{f \in \mathcal{F}} |X_f - X_0|\Big] \le \frac{1}{\sqrt{M}} \int_0^1 \sqrt{\log N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)}\, d\varepsilon,$$

where $N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$ is the bracketing number and $\varepsilon$ denotes the level of approximation (the bracketing number goes to infinity when $\varepsilon \to 0$). Finally, in (Van der Vaart, 2000, p.271, Example), $N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon)$ is bounded from above for a class of parametric functions $\mathcal{F} = \{f_\theta : \theta \in \Theta\}$:

$$N(\mathcal{F}, \|\cdot\|_\infty, \varepsilon) \le K \left( \frac{\operatorname{diam} \Theta}{\varepsilon} \right)^d, \quad \text{for all } 0 < \varepsilon < \operatorname{diam} \Theta.$$

The authors acknowledge that those bounds are a dramatic manifestation of the curse of dimensionality happening when sampling is needed. Nevertheless, the dependence on the dimension highly depends on the class of surrogate functions $\mathcal{F}$ used in our scheme, as smaller bounds on these controlling terms can be derived for simpler classes of functions, such as quadratic functions.

Stationarity measure. As problem (1) is a constrained optimization task, we consider the following stationarity measure:

$$g(\overline{\theta}) := \inf_{\theta \in \Theta} \frac{\mathcal{L}'(\overline{\theta}, \theta - \overline{\theta})}{\|\overline{\theta} - \theta\|} \quad \text{and} \quad g(\overline{\theta}) = g_+(\overline{\theta}) - g_-(\overline{\theta}),$$

where $g_+(\overline{\theta}) := \max\{0, g(\overline{\theta})\}$ and $g_-(\overline{\theta}) := -\min\{0, g(\overline{\theta})\}$ denote the positive and negative part of $g(\overline{\theta})$, respectively. Note that $\overline{\theta}$ is a stationary point if and only if $g_-(\overline{\theta}) = 0$ (Fletcher et al., 2002). Furthermore, suppose that the sequence $\{\theta^{(k)}\}_{k \ge 0}$ has a limit point $\overline{\theta}$ that is a stationary point, then one has $\lim_{k \to \infty} g_-(\theta^{(k)}) = 0$. Thus, the sequence $\{\theta^{(k)}\}_{k \ge 0}$ is said to satisfy an asymptotic stationary point condition. This is equivalent to (Mairal, 2015, Definition 2.4). To facilitate our analysis, we define $\tau_i^k$ as the iteration index where the $i$-th function is last accessed in the MISSO method prior to iteration $k$; for instance, $\tau_{i_k}^{k+1} = k$. We define:

$$\widehat{\mathcal{L}}^{(k)}(\theta) := \frac{1}{n} \sum_{i=1}^n \widehat{\mathcal{L}}_i(\theta; \theta^{(\tau_i^k)}), \quad \widehat{e}^{(k)}(\theta) := \widehat{\mathcal{L}}^{(k)}(\theta) - \mathcal{L}(\theta), \quad \overline{M}_{(k)} := \sum_{k=0}^{K_{\max}-1} M_{(k)}^{-1/2}.$$

We first establish a non-asymptotic convergence rate for the MISSO method:

Theorem 1. Under H1-H4.
For any $K_{\max} \in \mathbb{N}$, let $K$ be an independent discrete r.v. drawn uniformly from $\{0, \dots, K_{\max} - 1\}$ and define the following quantity:

$$\Delta_{(K_{\max})} := 2nL\, \mathbb{E}\big[\widetilde{\mathcal{L}}^{(0)}(\theta^{(0)}) - \widetilde{\mathcal{L}}^{(K_{\max})}(\theta^{(K_{\max})})\big] + 4 L C_r \overline{M}_{(k)}.$$

Then we have the following non-asymptotic bounds:

$$\mathbb{E}\big[\|\nabla \widehat{e}^{(K)}(\theta^{(K)})\|^2\big] \le \frac{\Delta_{(K_{\max})}}{K_{\max}} \quad \text{and} \quad \mathbb{E}\big[g_-(\theta^{(K)})\big] \le \sqrt{\frac{\Delta_{(K_{\max})}}{K_{\max}}} + \frac{C_{gr}}{K_{\max}} \overline{M}_{(k)}. \tag{16}$$

Note that $\Delta_{(K_{\max})}$ is finite for any $K_{\max} \in \mathbb{N}$.

Iteration Complexity of MISSO. As expected, the MISSO method converges to a stationary point of (1) asymptotically and at a sublinear rate $\mathbb{E}[g_-^{(K)}] \le \mathcal{O}(\sqrt{\Delta_{(K_{\max})} / K_{\max}})$. In other terms, MISSO requires $\mathcal{O}(nL/\epsilon)$ iterations to reach an $\epsilon$-stationary point when the suboptimality condition that characterizes stationarity is $\mathbb{E}[\|g_-(\theta^{(K)})\|^2]$. Note that this stationarity criterion is similar to the usual quantity used in stochastic nonconvex optimization, i.e., $\mathbb{E}[\|\nabla \mathcal{L}(\theta^{(K)})\|^2]$. In fact, when the optimization problem (1) is unconstrained, i.e., $\Theta = \mathbb{R}^p$, then $\mathbb{E}[|g(\theta^{(K)})|] = \mathbb{E}[\|\nabla \mathcal{L}(\theta^{(K)})\|]$.

Sample Complexity of MISSO. Regarding the sample complexity of our method, setting $M_{(k)} = k^2/n^2$, as a non-decreasing sequence of integers satisfying $\sum_{k=0}^{\infty} M_{(k)}^{-1/2} < \infty$, in order to keep $\Delta_{(K_{\max})} \lesssim nL$, the MISSO method requires $\sum_{k=0}^{nL/\epsilon} k^2/n^2 = nL^3/\epsilon^3$ samples to reach an $\epsilon$-stationary point. Furthermore, we remark that the MISO method can be analyzed in Theorem 1 as a special case of the MISSO method satisfying $C_r = C_{gr} = 0$. In this case, while the asymptotic convergence is well known from (Mairal, 2015) [cf. H4], Eq. (16) gives a non-asymptotic rate of $\mathbb{E}[g_-^{(K)}] \le \mathcal{O}(\sqrt{nL/K_{\max}})$, which is new to the best of our knowledge. Next, we show that under an additional assumption on the sequence of batch sizes $M_{(k)}$, the MISSO method converges almost surely to a stationary point:

Theorem 2. Under H1-H4. In addition, assume that $\{M_{(k)}\}_{k \ge 0}$ is a non-decreasing sequence of integers which satisfies $\sum_{k=0}^{\infty} M_{(k)}^{-1/2} < \infty$. Then:
1. the negative part of the stationarity measure converges a.s. to zero, i.e., $\lim_{k \to \infty} g_-(\theta^{(k)}) \overset{a.s.}{=} 0$.
2. the objective value $\mathcal{L}(\theta^{(k)})$ converges a.s. to a finite number $\underline{\mathcal{L}}$, i.e., $\lim_{k \to \infty} \mathcal{L}(\theta^{(k)}) \overset{a.s.}{=} \underline{\mathcal{L}}$.

In particular, the first result above shows that the sequence {θ (k) } k≥0 produced by the MISSO method satisfies an asymptotic stationary point condition.
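The stationarity measure $g_-$ used in Theorems 1 and 2 can be illustrated in one dimension, where the infimum over feasible directions reduces to a minimum over at most two unit directions. The sketch below assumes a differentiable objective on an interval $[lo, hi]$ with $lo < hi$; the function name `g_minus_1d` and its tolerance argument are hypothetical and introduced only for this example.

```python
def g_minus_1d(grad, theta, lo, hi, tol=1e-12):
    """Negative part g_-(theta) of the stationarity measure on [lo, hi].

    For a differentiable objective, the directional derivative along a unit
    direction d is grad * d. Direction +1 is feasible unless theta sits at
    the upper bound, direction -1 unless theta sits at the lower bound.
    """
    directions = []
    if theta < hi - tol:
        directions.append(1.0)
    if theta > lo + tol:
        directions.append(-1.0)
    g = min(grad * d for d in directions)   # infimum over feasible directions
    return max(0.0, -g)                     # g_- = -min{0, g}


# Interior point: g_- equals the absolute value of the gradient.
interior = g_minus_1d(grad=0.3, theta=0.0, lo=-1.0, hi=1.0)
# Lower bound with gradient pointing out of the set: stationary, g_- = 0.
boundary_stationary = g_minus_1d(grad=0.5, theta=-1.0, lo=-1.0, hi=1.0)
# Lower bound with a descent direction into the set: not stationary.
boundary_descent = g_minus_1d(grad=-0.5, theta=-1.0, lo=-1.0, hi=1.0)
```

In the interior the measure coincides with the gradient norm, which matches the remark that the criterion reduces to the usual one when the problem is unconstrained.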

4.1. BINARY LOGISTIC REGRESSION WITH MISSING VALUES

This application follows Example 1 described in Section 2. We consider a binary regression setup, $((y_i, z_i), i \in \llbracket 1, n \rrbracket)$ where $y_i \in \{0, 1\}$ is a binary response and $z_i = (z_{i,j} \in \mathbb{R}, j \in \llbracket 1, p \rrbracket)$ is a covariate vector. The vector of covariates $z_i = [z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}]$ is not fully observed, where we denote by $z_{i,\mathrm{mis}}$ the missing values and by $z_{i,\mathrm{obs}}$ the observed covariates. It is assumed that $(z_i, i \in \llbracket 1, n \rrbracket)$ are i.i.d. and marginally distributed according to $\mathcal{N}(\beta, \Omega)$ where $\beta \in \mathbb{R}^p$ and $\Omega$ is a positive definite $p \times p$ matrix. We define the conditional distribution of the observations $y_i$ given $z_i = (z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}})$ as:

$$p_i(y_i | z_i) = S(\delta^\top \bar{z}_i)^{y_i} \big(1 - S(\delta^\top \bar{z}_i)\big)^{1 - y_i},$$

where for $u \in \mathbb{R}$, $S(u) = 1/(1 + e^{-u})$, $\delta = (\delta_0, \dots, \delta_p)$ are the logistic parameters and $\bar{z}_i = (1, z_i)$. Here, $\theta = (\delta, \beta, \Omega)$ is the parameter to estimate. For $i \in \llbracket 1, n \rrbracket$, the complete log-likelihood reads:

$$\log f_i(z_{i,\mathrm{mis}}, \theta) \propto y_i \delta^\top \bar{z}_i - \log\big(1 + \exp(\delta^\top \bar{z}_i)\big) - \frac{1}{2} \log(|\Omega|) - \frac{1}{2} \operatorname{Tr}\big(\Omega^{-1} (z_i - \beta)(z_i - \beta)^\top\big).$$

Fitting a logistic regression model on the TraumaBase dataset: We apply the MISSO method to fit a logistic regression model on the TraumaBase (http://traumabase.eu) dataset, which consists of data collected from 15 trauma centers in France, covering measurements on patients from the initial to the last stage of trauma. It includes information ranging from the first stage of the trauma, namely initial observations at the accident site, to the last stage, intensive care at the hospital, and counts more than 200 variables measured for more than 7 000 patients. Since the dataset considered is heterogeneous, coming from multiple sources with frequently missing entries, we apply the latent data model described in (17) to predict the risk of a severe hemorrhage, which is one of the main causes of death after a major trauma. Similar to (Jiang et al., 2018), we select $p = 16$ influential quantitative measurements on $n = 6384$ patients.
For the Monte Carlo sampling of $z_{i,\mathrm{mis}}$, required while running MISSO, we run a Metropolis-Hastings algorithm with the target distribution $p(\cdot | z_{i,\mathrm{obs}}, y_i; \theta^{(k)})$. [...] & Tanner, 1990) and the proposed MISSO method. For the MISSO method, we set the batch size to $M_{(k)} = 10 + k^2$ and we examine the effect of selecting a different number of functions in Line 5 of the method: the default setting with 1 function (MISSO), and minibatches of 10% (MISSO10) and 50% (MISSO50) of the functions per iteration. From Figure 1, the MISSO method converges to a static value in fewer epochs than the MCEM and SAEM methods. It is worth noting that the difference among the MISSO runs for different numbers of selected functions demonstrates a variance-cost tradeoff. Though wall clock times are similar for all methods, they are reported in the appendix for completeness.
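The Metropolis-Hastings step mentioned above can be sketched as follows for a single missing covariate. This is a simplified illustration rather than the sampler used in the experiments: it assumes one observed and one missing covariate, a diagonal $\Omega$ (so the missing covariate is a priori independent of the observed one), and a Gaussian random-walk proposal. The names `log_target` and `metropolis_hastings`, the step size and all numerical values are hypothetical.

```python
import math
import random


def log_sigmoid(u):
    """Numerically stable log(1 / (1 + exp(-u)))."""
    if u >= 0.0:
        return -math.log1p(math.exp(-u))
    return u - math.log1p(math.exp(u))


def log_target(z_mis, z_obs, y, delta, beta_mis, var_mis):
    """Unnormalised log density of z_mis given (z_obs, y).

    Logistic likelihood times a Gaussian prior N(beta_mis, var_mis) on the
    missing covariate (diagonal-Omega simplification, for this sketch only).
    """
    lin = delta[0] + delta[1] * z_obs + delta[2] * z_mis
    log_lik = log_sigmoid(lin) if y == 1 else log_sigmoid(-lin)
    log_prior = -0.5 * (z_mis - beta_mis) ** 2 / var_mis
    return log_lik + log_prior


def metropolis_hastings(num_samples, burn_in, step, z_obs, y, delta,
                        beta_mis, var_mis, seed=0):
    """Random-walk Metropolis-Hastings chain targeting p(z_mis | z_obs, y)."""
    rng = random.Random(seed)
    z = beta_mis
    current = log_target(z, z_obs, y, delta, beta_mis, var_mis)
    samples = []
    for it in range(burn_in + num_samples):
        proposal = z + step * rng.gauss(0.0, 1.0)
        candidate = log_target(proposal, z_obs, y, delta, beta_mis, var_mis)
        # Accept with probability min(1, target(proposal) / target(z)).
        if math.log(1.0 - rng.random()) < candidate - current:
            z, current = proposal, candidate
        if it >= burn_in:
            samples.append(z)
    return samples


draws = metropolis_hastings(num_samples=5000, burn_in=500, step=1.0,
                            z_obs=0.0, y=1, delta=(0.0, 1.0, 2.0),
                            beta_mis=0.0, var_mis=1.0)
posterior_mean = sum(draws) / len(draws)
```

With a positive coefficient on the missing covariate and $y = 1$, the chain shifts mass toward positive values relative to the prior mean of zero. Samples of this kind correspond to Case 2 of Section 2 (a Markov chain with the conditional distribution as stationary law).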

4.2. TRAINING BAYESIAN CNN USING MISSO

This application follows Example 2 described in Section 2. We use variational inference and the ELBO loss (10) to fit Bayesian Neural Networks on different datasets. At iteration $k$, minimizing the sum of stochastic surrogates defined as in (6) and (13) yields the following MISSO update: step (i) pick a function index $i_k$ uniformly on $\llbracket 1, n \rrbracket$; step (ii) sample a Monte Carlo batch $\{z_m^{(k)}\}_{m=1}^{M_{(k)}}$ from $\mathcal{N}(0, I)$; and step (iii) update the parameters, with $w = t(\theta^{(k-1)}, z_m)$, as

$$\mu_\ell^{(k)} = \hat{\mu}_\ell^{(\tau^k)} - \frac{\gamma}{n} \sum_{i=1}^n \hat{\delta}_{\mu_\ell, i}^{(k)}, \qquad \hat{\delta}_{\mu_\ell, i_k}^{(k)} = -\frac{1}{M_{(k)}} \sum_{m=1}^{M_{(k)}} \nabla_w \log p(y_{i_k} | x_{i_k}, w) + \nabla_{\mu_\ell} d(\theta^{(k-1)}),$$

where $\hat{\mu}_\ell^{(\tau^k)} = \frac{1}{n} \sum_{i=1}^n \mu_\ell^{(\tau_i^k)}$ and $d(\theta) = n^{-1} \sum_{\ell=1}^d \big( -\log(\sigma) + (\sigma^2 + \mu_\ell^2)/2 - 1/2 \big)$.

Bayesian LeNet-5 on MNIST (LeCun et al., 1998): We apply the MISSO method to fit a Bayesian variant of LeNet-5 (LeCun et al., 1998). We train this network on the MNIST dataset (LeCun, 1998). The training set is composed of $n = 55\,000$ handwritten digits, $28 \times 28$ images. Each image is labelled with its corresponding number (from zero to nine). Under the prior distribution $\pi$, see (8), the weights are assumed independent and identically distributed according to $\mathcal{N}(0, 1)$. We also assume that $q(\cdot; \theta) \equiv \mathcal{N}(\mu, \sigma^2 I)$. The variational posterior parameters are thus $\theta = (\mu, \sigma)$ where $\mu = (\mu_\ell, \ell \in \llbracket 1, d \rrbracket)$ and $d$ is the number of weights in the neural network. We use the re-parametrization $w = t(\theta, z) = \mu + \sigma z$ with $z \sim \mathcal{N}(0, I)$.

Bayesian ResNet-18 (He et al., 2016) on CIFAR-10 (Krizhevsky et al., 2012): We train here the Bayesian variant of the ResNet-18 neural network introduced in (He et al., 2016) on CIFAR-10. The latter dataset is composed of $n = 60\,000$ $32 \times 32$ colour images in 10 classes, with 6 000 images per class. As in the previous example, the weights are assumed independent and identically distributed according to $\mathcal{N}(0, I)$.
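The closed-form regularization term $d(\theta)$ and the re-parametrization $w = \mu + \sigma z$ used in the update above can be written out in a few lines. The sketch below is a minimal illustration with a scalar $\sigma$, as in $q(\cdot; \theta) \equiv \mathcal{N}(\mu, \sigma^2 I)$; the function names `kl_term`, `kl_term_grad` and `reparametrize` are hypothetical, and the likelihood gradient $\nabla_w \log p(y_i | x_i, w)$ is deliberately left out since it depends on the network.

```python
import math


def kl_term(mu, sigma, n):
    """d(theta) = (1/n) * sum_l [ -log(sigma) + (sigma^2 + mu_l^2)/2 - 1/2 ]."""
    return sum(-math.log(sigma) + 0.5 * (sigma * sigma + m * m) - 0.5
               for m in mu) / n


def kl_term_grad(mu, sigma, n):
    """Gradient of d(theta) with respect to each mu_l and to the scalar sigma."""
    grad_mu = [m / n for m in mu]
    grad_sigma = len(mu) * (sigma - 1.0 / sigma) / n
    return grad_mu, grad_sigma


def reparametrize(mu, sigma, z):
    """Map standard normal noise z to weights w = mu + sigma * z."""
    return [m + sigma * zi for m, zi in zip(mu, z)]


# The term vanishes, together with its gradient, when q equals the N(0, I) prior.
kl_at_prior = kl_term([0.0, 0.0, 0.0], 1.0, n=10)
grad_at_prior = kl_term_grad([0.0, 0.0, 0.0], 1.0, n=10)
weights = reparametrize([1.0, 2.0], 0.5, [2.0, -2.0])
```

Because `kl_term_grad` is available in closed form, only the likelihood part of the gradient needs Monte Carlo samples, which is what the $\hat{\delta}$ update expresses.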
Standard hyperparameter values found in the literature, such as the annealing constant or the number of MC samples, were used for the benchmark methods. For efficiency and lower variance, the Flipout estimator (Wen et al., 2018) is used.

Experiment Results: We compare the convergence of the Monte Carlo variants of the following state-of-the-art optimization algorithms: the ADAM (Kingma & Ba, 2015), the Momentum (Sutskever et al., 2013) and the SAG (Schmidt et al., 2017) methods versus the Bayes by Backprop (BBB) (Blundell et al., 2015) and our proposed MISSO method. For all these methods, the loss function (10) and its gradients were computed by Monte Carlo integration based on the re-parametrization described above. The mini-batch size of indices and the number of MC samples are set to 128 and $M_{(k)} = k$, respectively. The learning rates are set to $10^{-3}$ for LeNet-5 and $10^{-4}$ for ResNet-18. Figure 2(a) shows the convergence of the negated evidence lower bound against the number of passes over data (one pass represents an epoch). As observed, the proposed MISSO method outperforms Bayes by Backprop and Momentum, while similar convergence rates are observed with the MISSO, ADAM and SAG methods for our experiment on the MNIST dataset using a Bayesian variant of LeNet-5. On the other hand, the experiment conducted on CIFAR-10 (Figure 2(b)) using a much larger network, i.e., a Bayesian variant of ResNet-18, showcases the need for a well-tuned adaptive method to reach a lower training loss (and to reach it faster). Our MISSO method is similar to the Monte Carlo variant of ADAM but slower than the Adagrad optimizer. Recall that the purpose of this paper is to provide a common class of optimizers, such as VI, in order to study their convergence behaviors, and not to introduce a novel method outperforming the baseline methods. We report wall clock times for all methods in the appendix for completeness.

5. CONCLUSION

We present a unifying framework for minimizing a nonconvex and nonsmooth finite-sum objective function using incremental surrogates when the latter functions are expressed as an expectation and are intractable. Our approach covers a large class of nonconvex applications in machine learning such as logistic regression with missing values and variational inference. We provide both finite-time and asymptotic guarantees for our incremental stochastic surrogate optimization technique and illustrate our findings by training a binary logistic regression with missing covariates to predict hemorrhagic shock, and Bayesian variants of two Convolutional Neural Networks on benchmark datasets.

A.1 PROOF OF THEOREM 1

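Theorem. Under H1-H4. For any $K_{\max} \in \mathbb{N}$, let $K$ be an independent discrete r.v. drawn uniformly from $\{0, \dots, K_{\max} - 1\}$ and define the following quantity:

$$\Delta_{(K_{\max})} := 2nL\, \mathbb{E}\big[\widetilde{\mathcal{L}}^{(0)}(\theta^{(0)}) - \widetilde{\mathcal{L}}^{(K_{\max})}(\theta^{(K_{\max})})\big] + 4 L C_r \overline{M}_{(k)}.$$

Then we have the following non-asymptotic bounds:

$$\mathbb{E}\big[\|\nabla \widehat{e}^{(K)}(\theta^{(K)})\|^2\big] \le \frac{\Delta_{(K_{\max})}}{K_{\max}} \quad \text{and} \quad \mathbb{E}\big[g_-(\theta^{(K)})\big] \le \sqrt{\frac{\Delta_{(K_{\max})}}{K_{\max}}} + \frac{C_{gr}}{K_{\max}} \overline{M}_{(k)}.$$

Proof. We begin by recalling the definition $\widetilde{\mathcal{L}}^{(k)}(\theta) := \frac{1}{n} \sum_{i=1}^n \widetilde{A}_i^k(\theta)$. Notice that

$$\widetilde{\mathcal{L}}^{(k+1)}(\theta) = \frac{1}{n} \sum_{i=1}^n \widetilde{\mathcal{L}}_i\Big(\theta; \theta^{(\tau_i^{k+1})}, \{z_{i,m}^{(\tau_i^{k+1})}\}_{m=1}^{M_{(\tau_i^{k+1})}}\Big) = \widetilde{\mathcal{L}}^{(k)}(\theta) + \frac{1}{n} \Big( \widetilde{\mathcal{L}}_{i_k}\big(\theta; \theta^{(k)}, \{z_{i_k,m}^{(k)}\}_{m=1}^{M_{(k)}}\big) - \widetilde{\mathcal{L}}_{i_k}\big(\theta; \theta^{(\tau_{i_k}^k)}, \{z_{i_k,m}^{(\tau_{i_k}^k)}\}_{m=1}^{M_{(\tau_{i_k}^k)}}\big) \Big).$$

Furthermore, we recall that

$$\widehat{\mathcal{L}}^{(k)}(\theta) := \frac{1}{n} \sum_{i=1}^n \widehat{\mathcal{L}}_i(\theta; \theta^{(\tau_i^k)}), \qquad \widehat{e}^{(k)}(\theta) := \widehat{\mathcal{L}}^{(k)}(\theta) - \mathcal{L}(\theta).$$

Due to H2, we have

$$\|\nabla \widehat{e}^{(k)}(\theta^{(k)})\|^2 \le 2L\, \widehat{e}^{(k)}(\theta^{(k)}). \tag{18}$$

To prove the first bound in (16), using the optimality of $\theta^{(k+1)}$, one has

$$\widetilde{\mathcal{L}}^{(k+1)}(\theta^{(k+1)}) \le \widetilde{\mathcal{L}}^{(k+1)}(\theta^{(k)}) = \widetilde{\mathcal{L}}^{(k)}(\theta^{(k)}) + \frac{1}{n} \Big( \widetilde{\mathcal{L}}_{i_k}\big(\theta^{(k)}; \theta^{(k)}, \{z_{i_k,m}^{(k)}\}_{m=1}^{M_{(k)}}\big) - \widetilde{\mathcal{L}}_{i_k}\big(\theta^{(k)}; \theta^{(\tau_{i_k}^k)}, \{z_{i_k,m}^{(\tau_{i_k}^k)}\}_{m=1}^{M_{(\tau_{i_k}^k)}}\big) \Big). \tag{19}$$

Let $\mathcal{F}_k$ be the filtration of random variables up to iteration $k$, i.e., $\{i_{\ell-1}, \{z_{i_{\ell-1},m}^{(\ell-1)}\}_{m=1}^{M_{(\ell-1)}}, \theta^{(\ell)}\}_{\ell=1}^k$. We observe that the conditional expectation evaluates to

$$\mathbb{E}_{i_k}\Big[ \mathbb{E}\big[ \widetilde{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(k)}, \{z_{i_k,m}^{(k)}\}_{m=1}^{M_{(k)}}) \,\big|\, \mathcal{F}_k, i_k \big] \,\Big|\, \mathcal{F}_k \Big] = \mathcal{L}(\theta^{(k)}) + \mathbb{E}_{i_k}\Big[ \mathbb{E}\Big[ \frac{1}{M_{(k)}} \sum_{m=1}^{M_{(k)}} r_{i_k}(\theta^{(k)}; \theta^{(k)}, z_{i_k,m}^{(k)}) - \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(k)}) \,\Big|\, \mathcal{F}_k, i_k \Big] \,\Big|\, \mathcal{F}_k \Big] \le \mathcal{L}(\theta^{(k)}) + \frac{C_r}{\sqrt{M_{(k)}}},$$

where the last inequality is due to H4. Moreover,

$$\mathbb{E}\Big[ \widetilde{\mathcal{L}}_{i_k}\big(\theta^{(k)}; \theta^{(\tau_{i_k}^k)}, \{z_{i_k,m}^{(\tau_{i_k}^k)}\}_{m=1}^{M_{(\tau_{i_k}^k)}}\big) \,\Big|\, \mathcal{F}_k \Big] = \frac{1}{n} \sum_{i=1}^n \widetilde{\mathcal{L}}_i\big(\theta^{(k)}; \theta^{(\tau_i^k)}, \{z_{i,m}^{(\tau_i^k)}\}_{m=1}^{M_{(\tau_i^k)}}\big) = \widetilde{\mathcal{L}}^{(k)}(\theta^{(k)}).$$

Taking the conditional expectations on both sides of (19) and re-arranging terms give:

$$\widetilde{\mathcal{L}}^{(k)}(\theta^{(k)}) - \mathcal{L}(\theta^{(k)}) \le n\, \mathbb{E}\big[ \widetilde{\mathcal{L}}^{(k)}(\theta^{(k)}) - \widetilde{\mathcal{L}}^{(k+1)}(\theta^{(k+1)}) \,\big|\, \mathcal{F}_k \big] + \frac{C_r}{\sqrt{M_{(k)}}}. \tag{20}$$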
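Proceeding from (20), we observe the following lower bound for the left hand side:

$$\widetilde{\mathcal{L}}^{(k)}(\theta^{(k)}) - \mathcal{L}(\theta^{(k)}) \overset{(a)}{=} \widetilde{\mathcal{L}}^{(k)}(\theta^{(k)}) - \widehat{\mathcal{L}}^{(k)}(\theta^{(k)}) + \widehat{e}^{(k)}(\theta^{(k)}) \overset{(b)}{\ge} \widetilde{\mathcal{L}}^{(k)}(\theta^{(k)}) - \widehat{\mathcal{L}}^{(k)}(\theta^{(k)}) + \frac{1}{2L} \|\nabla \widehat{e}^{(k)}(\theta^{(k)})\|^2$$

$$= \underbrace{\frac{1}{n} \sum_{i=1}^n \Big\{ \frac{1}{M_{(\tau_i^k)}} \sum_{m=1}^{M_{(\tau_i^k)}} r_i(\theta^{(k)}; \theta^{(\tau_i^k)}, z_{i,m}^{(\tau_i^k)}) - \widehat{\mathcal{L}}_i(\theta^{(k)}; \theta^{(\tau_i^k)}) \Big\}}_{:= -\delta^{(k)}(\theta^{(k)})} + \frac{1}{2L} \|\nabla \widehat{e}^{(k)}(\theta^{(k)})\|^2,$$

where (a) follows from the definition $\widehat{e}^{(k)} = \widehat{\mathcal{L}}^{(k)} - \mathcal{L}$, (b) is due to (18), and we have defined the summation in the last equality as $-\delta^{(k)}(\theta^{(k)})$. Substituting the above into (20) yields

$$\frac{\|\nabla \widehat{e}^{(k)}(\theta^{(k)})\|^2}{2L} \le n\, \mathbb{E}\big[ \widetilde{\mathcal{L}}^{(k)}(\theta^{(k)}) - \widetilde{\mathcal{L}}^{(k+1)}(\theta^{(k+1)}) \,\big|\, \mathcal{F}_k \big] + \frac{C_r}{\sqrt{M_{(k)}}} + \delta^{(k)}(\theta^{(k)}). \tag{21}$$

Observe the following upper bound on the total expectations:

$$\mathbb{E}\big[\delta^{(k)}(\theta^{(k)})\big] \le \mathbb{E}\Big[ \frac{1}{n} \sum_{i=1}^n \frac{C_r}{\sqrt{M_{(\tau_i^k)}}} \Big],$$

which is due to H4. It yields

$$\mathbb{E}\big[\|\nabla \widehat{e}^{(k)}(\theta^{(k)})\|^2\big] \le 2nL\, \mathbb{E}\big[ \widetilde{\mathcal{L}}^{(k)}(\theta^{(k)}) - \widetilde{\mathcal{L}}^{(k+1)}(\theta^{(k+1)}) \big] + \frac{2 L C_r}{\sqrt{M_{(k)}}} + \frac{1}{n} \sum_{i=1}^n \mathbb{E}\Big[ \frac{2 L C_r}{\sqrt{M_{(\tau_i^k)}}} \Big].$$

Finally, for any $K_{\max} \in \mathbb{N}$, we let $K$ be a discrete r.v. that is uniformly drawn from $\{0, 1, \dots, K_{\max} - 1\}$. Using H4 and taking total expectations lead to

$$\mathbb{E}\big[\|\nabla \widehat{e}^{(K)}(\theta^{(K)})\|^2\big] = \frac{1}{K_{\max}} \sum_{k=0}^{K_{\max}-1} \mathbb{E}\big[\|\nabla \widehat{e}^{(k)}(\theta^{(k)})\|^2\big] \le \frac{2nL\, \mathbb{E}\big[\widetilde{\mathcal{L}}^{(0)}(\theta^{(0)}) - \widetilde{\mathcal{L}}^{(K_{\max})}(\theta^{(K_{\max})})\big]}{K_{\max}} + \frac{2 L C_r}{K_{\max}} \sum_{k=0}^{K_{\max}-1} \mathbb{E}\Big[ \frac{1}{\sqrt{M_{(k)}}} + \frac{1}{n} \sum_{i=1}^n \frac{1}{\sqrt{M_{(\tau_i^k)}}} \Big]. \tag{22}$$

For all $i \in \llbracket 1, n \rrbracket$, the index $i$ is selected with a probability equal to $\frac{1}{n}$ when conditioned independently on the past. We observe:

$$\mathbb{E}\big[M_{(\tau_i^k)}^{-1/2}\big] = \sum_{j=1}^{k} \frac{1}{n} \Big(1 - \frac{1}{n}\Big)^{j-1} M_{(k-j)}^{-1/2}.$$

Taking the sum yields:

$$\sum_{k=0}^{K_{\max}-1} \mathbb{E}\big[M_{(\tau_i^k)}^{-1/2}\big] = \sum_{k=0}^{K_{\max}-1} \sum_{j=1}^{k} \frac{1}{n} \Big(1 - \frac{1}{n}\Big)^{j-1} M_{(k-j)}^{-1/2} = \sum_{k=0}^{K_{\max}-1} \sum_{l=0}^{k-1} \frac{1}{n} \Big(1 - \frac{1}{n}\Big)^{k-(l+1)} M_{(l)}^{-1/2} = \sum_{l=0}^{K_{\max}-1} M_{(l)}^{-1/2} \sum_{k=l+1}^{K_{\max}-1} \frac{1}{n} \Big(1 - \frac{1}{n}\Big)^{k-(l+1)} \le \sum_{l=0}^{K_{\max}-1} M_{(l)}^{-1/2}, \tag{24}$$

where the last inequality is due to upper bounding the geometric series. Plugging this back into (22) yields

$$\mathbb{E}\big[\|\nabla \widehat{e}^{(K)}(\theta^{(K)})\|^2\big] = \frac{1}{K_{\max}} \sum_{k=0}^{K_{\max}-1} \mathbb{E}\big[\|\nabla \widehat{e}^{(k)}(\theta^{(k)})\|^2\big] \le \frac{2nL\, \mathbb{E}\big[\widetilde{\mathcal{L}}^{(0)}(\theta^{(0)}) - \widetilde{\mathcal{L}}^{(K_{\max})}(\theta^{(K_{\max})})\big]}{K_{\max}} + \frac{1}{K_{\max}} \sum_{k=0}^{K_{\max}-1} \frac{4 L C_r}{\sqrt{M_{(k)}}} = \frac{\Delta_{(K_{\max})}}{K_{\max}}.$$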
This concludes our proof for the first inequality in (16). To prove the second inequality of ( 16), we define the shorthand notations g (k) := g(θ (k) ), g -:= -min{0, g (k) }, g (k) + := max{0, g (k) }. We observe that g (k) = inf θ∈Θ L (θ (k) , θ -θ (k) ) θ (k) -θ = inf θ∈Θ 1 n n i=1 L i (θ (k) , θ -θ (k) ; θ (τ k i ) ) θ (k) -θ - ∇ e (k) (θ (k) ) | θ -θ (k) θ (k) -θ ≥ -∇ e (k) (θ (k) ) + inf θ∈Θ 1 n n i=1 L i (θ (k) , θ -θ (k) ; θ (τ k i ) ) θ (k) -θ , where the last inequality is due to the Cauchy-Schwarz inequality and we have defined L i (θ, d; θ (τ k i ) ) as the directional derivative of L i (•; θ (τ k i ) ) at θ along the direction d. Moreover, for any θ ∈ Θ, 1 n n i=1 L i (θ (k) , θ -θ (k) ; θ (τ k i ) ) = L (k) (θ (k) , θ -θ (k) ) ≥0 -L (k) (θ (k) , θ -θ (k) ) + 1 n n i=1 L i (θ (k) , θ -θ (k) ; θ (τ k i ) ) ≥ 1 n n i=1 L i (θ (k) , θ -θ (k) ; θ (τ k i ) ) - 1 M (τ k i ) M (τ k i ) m=1 r i (θ (k) , θ -θ (k) ; θ (τ k i ) , z (τ k i ) i,m ) , where the inequality is due to the optimality of θ (k) and the convexity of L (k) (θ) [cf. H3]. Denoting a scaled version of the above term as: (k) (θ) := 1 n n i=1 1 M (τ k i ) M (τ k i ) m=1 r i (θ (k) , θ -θ (k) ; θ (τ k i ) , z (τ k i ) i,m ) -L i (θ (k) , θ -θ (k) ; θ (τ k i ) ) θ (k) -θ . We have g (k) ≥ -∇ e (k) (θ (k) ) + inf θ∈Θ (-(k) (θ)) ≥ -∇ e (k) (θ (k) ) -sup θ∈Θ | (k) (θ)| . ( ) Since g (k) = g (k) + -g (k) -and g (k) + g (k) -= 0, this implies g (k) -≤ ∇ e (k) (θ (k) ) + sup θ∈Θ | (k) (θ)| . ( ) Consider the above inequality when k = K, i.e., the random index, and taking total expectations on both sides gives E[g (K) -] ≤ E[ ∇ e (K) (θ (K) ) ] + E[sup θ∈Θ (K) (θ)] . 
We note that E[ ∇ e (K) (θ (K) ) ] 2 ≤ E[ ∇ e (K) (θ (K) ) 2 ] ≤ ∆(K max ) K max , where the first inequality is due to the convexity of (•) 2 and the Jensen's inequality, and E[sup θ∈Θ (K) (θ)] = 1 K max Kmax k=0 E[sup θ∈Θ (k) (θ)] (a) ≤ C gr K max Kmax-1 k=0 E 1 n n i=1 M -1/2 (τ k i ) (b) ≤ C gr K max Kmax-1 k=0 M -1/2 (k) , where (a) is due to H4 and (b) is due to (24). This implies E[g (K) -] ≤ ∆ (Kmax) K max + C gr K max Kmax-1 k=0 M -1/2 (k) , and concludes the proof of the theorem.
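The geometric-series step behind (24) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `expected_stale_inv_sqrt`, the values of `n` and `K_max`, and the schedule $M_{(k)} = (k+1)^2$ are all illustrative assumptions. It evaluates the expected stale batch-size term with the geometric weights used in the proof and compares its sum over $k$ with $\sum_l M_{(l)}^{-1/2}$.

```python
import numpy as np

def expected_stale_inv_sqrt(M, n, k):
    """E[M_(tau_i^k)^(-1/2)] using the geometric weights from the proof.

    M holds the batch sizes M_(0), ..., M_(K_max - 1) and n is the number
    of component functions. The weight of "last drawn j steps ago" is
    (1/n) * (1 - 1/n)**(j - 1), for j = 1, ..., k.
    """
    j = np.arange(1, k + 1)
    weights = (1.0 / n) * (1.0 - 1.0 / n) ** (j - 1)
    return float(np.sum(weights * M[k - j] ** -0.5))

n, K_max = 10, 200                      # hypothetical problem sizes
M = (np.arange(K_max) + 1.0) ** 2       # hypothetical schedule M_(k) = (k + 1)^2

# Left- and right-hand sides of the bound (24).
lhs = sum(expected_stale_inv_sqrt(M, n, k) for k in range(K_max))
rhs = float(np.sum(M ** -0.5))
print(f"sum_k E[M_tau^(-1/2)] = {lhs:.4f} <= sum_l M_l^(-1/2) = {rhs:.4f}")
```

The gap between the two sides comes from truncating each inner geometric series at $K_{\max}-1$, which is why the bound is loose for the last few indices $l$.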

A.2 PROOF OF THEOREM 2

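Theorem. Under H1-H4. In addition, assume that $\{M_{(k)}\}_{k \ge 0}$ is a non-decreasing sequence of integers which satisfies $\sum_{k=0}^{\infty} M_{(k)}^{-1/2} < \infty$. Then:

1. the negative part of the stationarity measure converges a.s. to zero, i.e., $\lim_{k \to \infty} g_-(\theta^{(k)}) \overset{\text{a.s.}}{=} 0$.

2. the objective value $\mathcal{L}(\theta^{(k)})$ converges a.s. to a finite number $\underline{\mathcal{L}}$, i.e., $\lim_{k \to \infty} \mathcal{L}(\theta^{(k)}) \overset{\text{a.s.}}{=} \underline{\mathcal{L}}$.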

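Proof. We apply the following auxiliary lemma, whose proof is deferred to Appendix A.3 for readability:

Lemma 1. Let $(V_k)_{k \ge 0}$ be a non-negative sequence of random variables such that $\mathbb{E}[V_0] < \infty$. Let $(X_k)_{k \ge 0}$ be a non-negative sequence of random variables and $(E_k)_{k \ge 0}$ be a sequence of random variables such that $\sum_{k=0}^{\infty} \mathbb{E}[|E_k|] < \infty$. If for any $k \ge 1$:
\[
V_k \le V_{k-1} - X_{k-1} + E_{k-1} ,
\]
then:
(i) for all $k \ge 0$, $\mathbb{E}[V_k] < \infty$ and the sequence $(V_k)_{k \ge 0}$ converges a.s. to a finite limit $V_\infty$.
(ii) the sequence $(\mathbb{E}[V_k])_{k \ge 0}$ converges and $\lim_{k \to \infty} \mathbb{E}[V_k] = \mathbb{E}[V_\infty]$.
(iii) the series $\sum_{k=0}^{\infty} X_k$ converges almost surely and $\sum_{k=0}^{\infty} \mathbb{E}[X_k] < \infty$.

We proceed from (19) by re-arranging terms and observing that
\[
\begin{aligned}
\widehat{\mathcal{L}}^{(k+1)}(\theta^{(k+1)}) \le{}& \widehat{\mathcal{L}}^{(k)}(\theta^{(k)}) - \frac{1}{n} \Big( \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(\tau^k_{i_k})}) - \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(k)}) \Big) \\
& - \Big( \widetilde{\mathcal{L}}^{(k+1)}(\theta^{(k+1)}) - \widehat{\mathcal{L}}^{(k+1)}(\theta^{(k+1)}) \Big) + \Big( \widetilde{\mathcal{L}}^{(k)}(\theta^{(k)}) - \widehat{\mathcal{L}}^{(k)}(\theta^{(k)}) \Big) \\
& + \frac{1}{n} \Big( \widetilde{\mathcal{L}}_{i_k}\big(\theta^{(k)}; \theta^{(k)}, \{z^{(k)}_{i_k,m}\}_{m=1}^{M_{(k)}}\big) - \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(k)}) \Big) \\
& + \frac{1}{n} \Big( \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(\tau^k_{i_k})}) - \widetilde{\mathcal{L}}_{i_k}\big(\theta^{(k)}; \theta^{(\tau^k_{i_k})}, \{z^{(\tau^k_{i_k})}_{i_k,m}\}_{m=1}^{M_{(\tau^k_{i_k})}}\big) \Big) .
\end{aligned}
\]
Our idea is to apply Lemma 1. Under H1, the finite sum of surrogate functions $\widehat{\mathcal{L}}^{(k)}(\theta)$, defined in (15), is lower bounded by a constant $c_k > -\infty$ for any $\theta$. To this end, we observe that
\[
V_k := \widehat{\mathcal{L}}^{(k)}(\theta^{(k)}) - \inf_{k \ge 0} c_k \ge 0 \tag{28}
\]
is a non-negative random variable. Secondly, under H1, the following random variable is non-negative:
\[
X_k := \frac{1}{n} \Big( \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(\tau^k_{i_k})}) - \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(k)}) \Big) \ge 0 . \tag{29}
\]
Thirdly, we define
\[
\begin{aligned}
E_k ={}& - \Big( \widetilde{\mathcal{L}}^{(k+1)}(\theta^{(k+1)}) - \widehat{\mathcal{L}}^{(k+1)}(\theta^{(k+1)}) \Big) + \Big( \widetilde{\mathcal{L}}^{(k)}(\theta^{(k)}) - \widehat{\mathcal{L}}^{(k)}(\theta^{(k)}) \Big) \\
& + \frac{1}{n} \Big( \widetilde{\mathcal{L}}_{i_k}\big(\theta^{(k)}; \theta^{(k)}, \{z^{(k)}_{i_k,m}\}_{m=1}^{M_{(k)}}\big) - \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(k)}) \Big) \\
& + \frac{1}{n} \Big( \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(\tau^k_{i_k})}) - \widetilde{\mathcal{L}}_{i_k}\big(\theta^{(k)}; \theta^{(\tau^k_{i_k})}, \{z^{(\tau^k_{i_k})}_{i_k,m}\}_{m=1}^{M_{(\tau^k_{i_k})}}\big) \Big) .
\end{aligned} \tag{30}
\]
Note that from the definitions (28), (29), (30), we have $V_{k+1} \le V_k - X_k + E_k$ for any $k \ge 0$, which is the recursion required by Lemma 1.

Under H4, we observe that
\[
\begin{aligned}
\mathbb{E}\Big[ \big| \widetilde{\mathcal{L}}_{i_k}\big(\theta^{(k)}; \theta^{(k)}, \{z^{(k)}_{i_k,m}\}_{m=1}^{M_{(k)}}\big) - \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(k)}) \big| \Big] &\le C_r M_{(k)}^{-1/2} , \\
\mathbb{E}\Big[ \big| \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(\tau^k_{i_k})}) - \widetilde{\mathcal{L}}_{i_k}\big(\theta^{(k)}; \theta^{(\tau^k_{i_k})}, \{z^{(\tau^k_{i_k})}_{i_k,m}\}_{m=1}^{M_{(\tau^k_{i_k})}}\big) \big| \Big] &\le C_r\, \mathbb{E}\big[ M_{(\tau^k_{i_k})}^{-1/2} \big] , \\
\mathbb{E}\Big[ \big| \widetilde{\mathcal{L}}^{(k)}(\theta^{(k)}) - \widehat{\mathcal{L}}^{(k)}(\theta^{(k)}) \big| \Big] &\le \frac{1}{n} \sum_{i=1}^n C_r\, \mathbb{E}\big[ M_{(\tau_i^k)}^{-1/2} \big] .
\end{aligned}
\]
Therefore,
\[
\mathbb{E}\big[ |E_k| \big] \le \frac{C_r}{n} \Big( M_{(k)}^{-1/2} + \mathbb{E}\Big[ M_{(\tau^k_{i_k})}^{-1/2} + \sum_{i=1}^n \big\{ M_{(\tau_i^k)}^{-1/2} + M_{(\tau_i^{k+1})}^{-1/2} \big\} \Big] \Big) .
\]
Using (24) and the assumption on the sequence $\{M_{(k)}\}_{k \ge 0}$, we obtain that
\[
\sum_{k=0}^{\infty} \mathbb{E}\big[ |E_k| \big] < \frac{C_r}{n} (2 + 2n) \sum_{k=0}^{\infty} M_{(k)}^{-1/2} < \infty .
\]
Therefore, the conclusions in Lemma 1 hold. Precisely, we have $\sum_{k=0}^{\infty} X_k < \infty$ almost surely and $\sum_{k=0}^{\infty} \mathbb{E}[X_k] < \infty$. Note that this implies
\[
\infty > \sum_{k=0}^{\infty} \mathbb{E}[X_k]
= \frac{1}{n} \sum_{k=0}^{\infty} \mathbb{E}\Big[ \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(\tau^k_{i_k})}) - \widehat{\mathcal{L}}_{i_k}(\theta^{(k)}; \theta^{(k)}) \Big]
= \frac{1}{n} \sum_{k=0}^{\infty} \mathbb{E}\Big[ \widehat{\mathcal{L}}^{(k)}(\theta^{(k)}) - \mathcal{L}(\theta^{(k)}) \Big]
= \frac{1}{n} \sum_{k=0}^{\infty} \mathbb{E}\big[ \widehat{e}^{(k)}(\theta^{(k)}) \big] .
\]
Since $\widehat{e}^{(k)}(\theta^{(k)}) \ge 0$, the above implies
\[
\lim_{k \to \infty} \widehat{e}^{(k)}(\theta^{(k)}) = 0 \quad \text{a.s.}
\]
and subsequently applying (18), we have $\lim_{k \to \infty} \| \nabla \widehat{e}^{(k)}(\theta^{(k)}) \| = 0$ almost surely. Finally, it follows from (18) and (26) that
\[
\lim_{k \to \infty} g_-^{(k)} \le \lim_{k \to \infty} \sqrt{2L\, \widehat{e}^{(k)}(\theta^{(k)})} + \lim_{k \to \infty} \sup_{\theta \in \Theta} | \epsilon^{(k)}(\theta) | = 0 ,
\]
where the last equality holds almost surely due to the fact that $\sum_{k=0}^{\infty} \mathbb{E}[\sup_{\theta \in \Theta} | \epsilon^{(k)}(\theta) |] < \infty$. This concludes the asymptotic convergence of the MISSO method.

Finally, we prove that $\mathcal{L}(\theta^{(k)})$ converges almost surely. As a consequence of Lemma 1, it is clear that $\{V_k\}_{k \ge 0}$ converges almost surely and so does $\{\widehat{\mathcal{L}}^{(k)}(\theta^{(k)})\}_{k \ge 0}$, i.e., we have $\lim_{k \to \infty} \widehat{\mathcal{L}}^{(k)}(\theta^{(k)}) = \underline{\mathcal{L}}$. Applying (31) implies that
\[
\underline{\mathcal{L}} = \lim_{k \to \infty} \widehat{\mathcal{L}}^{(k)}(\theta^{(k)}) = \lim_{k \to \infty} \mathcal{L}(\theta^{(k)}) \quad \text{a.s.}
\]
This shows that $\mathcal{L}(\theta^{(k)})$ converges almost surely to $\underline{\mathcal{L}}$.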
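To make the batch-size condition of Theorem 2 concrete, the sketch below compares partial sums of $M_{(k)}^{-1/2}$ for two hypothetical schedules that are not taken from the paper: a quadratic one, $M_{(k)} = (k+1)^2$, for which the series is the harmonic series and diverges, and a cubic one, $M_{(k)} = (k+1)^3$, for which it converges. Partial sums only illustrate the behaviour and do not prove summability. All names are illustrative.

```python
import math

def partial_sum_inv_sqrt(schedule, K):
    """Partial sum of M_(k)^(-1/2) for k = 0, ..., K - 1."""
    return sum(schedule(k) ** -0.5 for k in range(K))

def quadratic(k):
    # Hypothetical schedule: sum of 1/(k+1) diverges, so Theorem 2 does not apply.
    return (k + 1) ** 2

def cubic(k):
    # Hypothetical schedule: sum of (k+1)^(-3/2) converges, so the condition holds.
    return math.ceil((k + 1) ** 3)

for K in (10**2, 10**3, 10**4):
    print(K, partial_sum_inv_sqrt(quadratic, K), partial_sum_inv_sqrt(cubic, K))
```

Under these assumptions the quadratic column keeps growing like $\log K$ while the cubic column levels off. In practice such a fast-growing batch size is costly, so the schedule is a trade-off between the assumption of the theorem and the per-iteration sampling budget.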

A.3 PROOF OF LEMMA 1

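Lemma. Let $(V_k)_{k \ge 0}$ be a non-negative sequence of random variables such that $\mathbb{E}[V_0] < \infty$. Let $(X_k)_{k \ge 0}$ be a non-negative sequence of random variables and $(E_k)_{k \ge 0}$ be a sequence of random variables such that $\sum_{k=0}^{\infty} \mathbb{E}[|E_k|] < \infty$. If for any $k \ge 1$:
\[
V_k \le V_{k-1} - X_{k-1} + E_{k-1} ,
\]
then:
(i) for all $k \ge 0$, $\mathbb{E}[V_k] < \infty$ and the sequence $(V_k)_{k \ge 0}$ converges a.s. to a finite limit $V_\infty$.
(ii) the sequence $(\mathbb{E}[V_k])_{k \ge 0}$ converges and $\lim_{k \to \infty} \mathbb{E}[V_k] = \mathbb{E}[V_\infty]$.
(iii) the series $\sum_{k=0}^{\infty} X_k$ converges almost surely and $\sum_{k=0}^{\infty} \mathbb{E}[X_k] < \infty$.

Proof. We first show that for all $k \ge 0$, $\mathbb{E}[V_k] < \infty$. Note indeed that:
\[
0 \le V_k \le V_0 - \sum_{j=1}^{k} X_j + \sum_{j=1}^{k} E_j \le V_0 + \sum_{j=1}^{k} E_j ,
\]
showing that $\mathbb{E}[V_k] \le \mathbb{E}[V_0] + \mathbb{E}\big[ \sum_{j=1}^{k} E_j \big] < \infty$. Since $0 \le X_k \le V_{k-1} - V_k + E_k$ we also obtain for all $k \ge 0$, $\mathbb{E}[X_k] < \infty$. Moreover, since $\mathbb{E}\big[ \sum_{j=1}^{\infty} |E_j| \big] < \infty$, the series $\sum_{j=1}^{\infty} E_j$ converges a.s. We may therefore define:
\[
W_k = V_k + \sum_{j=k+1}^{\infty} E_j . \tag{34}
\]
Note that $\mathbb{E}[|W_k|] \le \mathbb{E}[V_k] + \mathbb{E}\big[ \sum_{j=k+1}^{\infty} |E_j| \big] < \infty$. For all $k \ge 1$, we get:
\[
W_k \le V_{k-1} - X_k + \sum_{j=k}^{\infty} E_j \le W_{k-1} - X_k \le W_{k-1} , \qquad
\mathbb{E}[W_k] \le \mathbb{E}[W_{k-1}] - \mathbb{E}[X_k] . \tag{35}
\]
Hence the sequences $(W_k)_{k \ge 0}$ and $(\mathbb{E}[W_k])_{k \ge 0}$ are non-increasing. Since for all $k \ge 0$, $W_k \ge - \sum_{j=1}^{\infty} |E_j| > -\infty$ and $\mathbb{E}[W_k] \ge - \sum_{j=1}^{\infty} \mathbb{E}[|E_j|] > -\infty$, the (random) sequence $(W_k)_{k \ge 0}$ converges a.s. to a limit $W_\infty$ and the (deterministic) sequence $(\mathbb{E}[W_k])_{k \ge 0}$ converges to a limit $w_\infty$. Since $|W_k| \le V_0 + \sum_{j=1}^{\infty} |E_j|$, the Fatou lemma implies that:
\[
\mathbb{E}\big[ \liminf_{k \to \infty} |W_k| \big] = \mathbb{E}[|W_\infty|] \le \liminf_{k \to \infty} \mathbb{E}[|W_k|] \le \mathbb{E}[V_0] + \sum_{j=1}^{\infty} \mathbb{E}[|E_j|] < \infty ,
\]
showing that the random variable $W_\infty$ is integrable.

In the sequel, set $U_k := W_0 - W_k$. By construction we have for all $k \ge 0$, $U_k \ge 0$, $U_k \le U_{k+1}$ and $\mathbb{E}[U_k] \le \mathbb{E}[|W_0|] + \mathbb{E}[|W_k|] < \infty$, and by the monotone convergence theorem, we get:
\[
\lim_{k \to \infty} \mathbb{E}[U_k] = \mathbb{E}\big[ \lim_{k \to \infty} U_k \big] .
\]
Finally, we have:
\[
\lim_{k \to \infty} \mathbb{E}[U_k] = \mathbb{E}[W_0] - w_\infty \quad \text{and} \quad \mathbb{E}\big[ \lim_{k \to \infty} U_k \big] = \mathbb{E}[W_0] - \mathbb{E}[W_\infty] ,
\]
showing that $\mathbb{E}[W_\infty] = w_\infty$ and concluding the proof of (ii).

Moreover, using (35) we have that $W_k \le W_{k-1} - X_k$, which yields:
\[
\sum_{j=1}^{\infty} X_j \le W_0 - W_\infty < \infty , \qquad \sum_{j=1}^{\infty} \mathbb{E}[X_j] \le \mathbb{E}[W_0] - w_\infty < \infty ,
\]
and concludes the proof of the lemma.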

B PRACTICAL DETAILS FOR THE BINARY LOGISTIC REGRESSION ON THE TRAUMABASE B.1 TRAUMABASE DATASET QUANTITATIVE VARIABLES

The 16 quantitative variables used in our experiments are as follows: age, weight, height, BMI (Body Mass Index), the Glasgow Coma Scale, the Glasgow Coma Scale motor component, the minimum systolic blood pressure, the minimum diastolic blood pressure, the maximum heart rate (or pulse) per unit time (usually a minute), the systolic blood pressure at arrival of ambulance, the diastolic blood pressure at arrival of ambulance, the heart rate at arrival of ambulance, the capillary hemoglobin concentration, the oxygen saturation, the fluid expansion colloids, the fluid expansion crystalloids, the pulse pressure for the minimum value of diastolic and systolic blood pressure, the pulse pressure at arrival of ambulance.

B.2 METROPOLIS-HASTINGS ALGORITHM

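During the simulation step of the MISSO method, the sampling from the target distribution $\pi(z_{i,\mathrm{mis}}; \theta) := p(z_{i,\mathrm{mis}} | z_{i,\mathrm{obs}}, y_i; \theta)$ is performed using a Metropolis-Hastings (MH) algorithm (Meyn & Tweedie, 2012) with proposal distribution $q(z_{i,\mathrm{mis}}; \delta) := p(z_{i,\mathrm{mis}} | z_{i,\mathrm{obs}}; \delta)$ where $\theta = (\beta, \Omega)$ and $\delta = (\xi, \Sigma)$. The parameters of the Gaussian conditional distribution of $z_{i,\mathrm{mis}} | z_{i,\mathrm{obs}}$ read:
\[
\xi = \beta_{\mathrm{mis}} + \Omega_{\mathrm{mis,obs}} \Omega_{\mathrm{obs,obs}}^{-1} (z_{i,\mathrm{obs}} - \beta_{\mathrm{obs}}) , \qquad
\Sigma = \Omega_{\mathrm{mis,mis}} - \Omega_{\mathrm{mis,obs}} \Omega_{\mathrm{obs,obs}}^{-1} \Omega_{\mathrm{obs,mis}} ,
\]
where we have used the Schur complement of $\Omega_{\mathrm{obs,obs}}$ in $\Omega$ and denoted by $\beta_{\mathrm{mis}}$ (resp. $\beta_{\mathrm{obs}}$) the missing (resp. observed) elements of $\beta$. The MH algorithm is summarized in Algorithm 3.

Algorithm 3 MH algorithm
1: Input: initialization $z_{i,\mathrm{mis},0} \sim q(z_{i,\mathrm{mis}}; \delta)$
2: for $m = 1, \dots, M$ do
3:   Sample $z_{i,\mathrm{mis},m} \sim q(z_{i,\mathrm{mis}}; \delta)$
4:   Sample $u \sim \mathcal{U}([0, 1])$
5:   Calculate the ratio $r = \dfrac{\pi(z_{i,\mathrm{mis},m}; \theta) / q(z_{i,\mathrm{mis},m}; \delta)}{\pi(z_{i,\mathrm{mis},m-1}; \theta) / q(z_{i,\mathrm{mis},m-1}; \delta)}$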

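6:   if $u < r$ then

\[
\widehat{\mathcal{L}}_i(\theta; \bar{\theta}) = \int_{\mathsf{Z}} \log \Big( \frac{p_i(z_{i,\mathrm{mis}}, \bar{\theta})}{f_i(z_{i,\mathrm{mis}}, \theta)} \Big)\, p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(dz_i) ,
\]
where $\theta = (\delta, \beta, \Omega)$ and $\bar{\theta} = (\bar{\delta}, \bar{\beta}, \bar{\Omega})$. We adapt it to our missing covariates problem and decompose the surrogate function defined above into an observed and a missing part.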
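The Metropolis-Hastings step of Section B.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' R implementation: the function names, the argument `coef` (standing for the logistic regression coefficients), and the toy values in the usage lines are all hypothetical. Because the proposal is the Gaussian conditional $p(z_{\mathrm{mis}} | z_{\mathrm{obs}})$, the ratio $\pi / q$ is proportional to the logistic likelihood $p(y | z)$, so the acceptance ratio reduces to a likelihood ratio.

```python
import numpy as np

def conditional_gaussian(beta, Omega, obs, mis, z_obs):
    """Mean and covariance of z_mis | z_obs when z ~ N(beta, Omega)."""
    O_oo = Omega[np.ix_(obs, obs)]
    O_mo = Omega[np.ix_(mis, obs)]
    O_mm = Omega[np.ix_(mis, mis)]
    gain = np.linalg.solve(O_oo, O_mo.T).T      # Omega_mis,obs @ inv(Omega_obs,obs)
    xi = beta[mis] + gain @ (z_obs - beta[obs])
    Sigma = O_mm - gain @ O_mo.T                # Schur complement of Omega_obs,obs
    return xi, Sigma

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def independence_mh(y, z_obs, obs, mis, coef, beta, Omega, M, rng):
    """Draw M samples of z_mis targeting p(z_mis | z_obs, y)."""
    xi, Sigma = conditional_gaussian(beta, Omega, obs, mis, z_obs)

    def likelihood(z_mis):
        z = np.empty(len(obs) + len(mis))
        z[obs], z[mis] = z_obs, z_mis
        p = sigmoid(coef @ z)
        return p if y == 1 else 1.0 - p

    current = rng.multivariate_normal(xi, Sigma)    # initialization from the proposal
    samples = []
    for _ in range(M):
        proposal = rng.multivariate_normal(xi, Sigma)
        # pi/q is proportional to p(y | z), so r is a likelihood ratio.
        r = likelihood(proposal) / likelihood(current)
        if rng.uniform() < r:
            current = proposal                      # accept; otherwise keep the previous state
        samples.append(current)
    return np.array(samples)

# Toy usage with hypothetical values: coordinate 1 is missing, 0 and 2 are observed.
rng = np.random.default_rng(0)
beta = np.zeros(3)
Omega = np.array([[2.0, 0.5, 0.3], [0.5, 1.0, 0.2], [0.3, 0.2, 1.5]])
samples = independence_mh(y=1, z_obs=np.array([1.0, -0.5]), obs=[0, 2], mis=[1],
                          coef=np.array([0.5, -1.0, 0.25]), beta=beta, Omega=Omega,
                          M=50, rng=rng)
```

Keeping the previous state on rejection is the standard Metropolis-Hastings rule; it is stated here as an assumption because the corresponding algorithm lines are not legible in this section.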

Surrogate function decomposition

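We adapt it to our missing covariates problem and decompose the term depending on $\theta$, while $\bar{\theta}$ is fixed, into the following two parts, leading to
\[
\begin{aligned}
\widehat{\mathcal{L}}_i(\theta; \bar{\theta})
&= - \int_{\mathsf{Z}} \log f_i(z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \theta)\, p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(dz_{i,\mathrm{mis}}) \\
&= - \int_{\mathsf{Z}} \log \big[ p_i(y_i | z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta)\, p_i(z_{i,\mathrm{mis}}, \beta, \Omega) \big]\, p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(dz_{i,\mathrm{mis}}) \\
&= \underbrace{- \int_{\mathsf{Z}} \log p_i(y_i | z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta)\, p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(dz_{i,\mathrm{mis}})}_{= \widehat{\mathcal{L}}^{(1)}_i(\delta, \bar{\theta})}
\underbrace{- \int_{\mathsf{Z}} \log p_i(z_{i,\mathrm{mis}}, \beta, \Omega)\, p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(dz_{i,\mathrm{mis}})}_{= \widehat{\mathcal{L}}^{(2)}_i(\beta, \Omega, \bar{\theta})} .
\end{aligned} \tag{40}
\]
The mean $\beta$ and the covariance $\Omega$ of the latent structure can be estimated in closed form by minimizing the sum of MISSO surrogates $\widetilde{\mathcal{L}}^{(2)}_i(\beta, \Omega, \bar{\theta}, \{z_m\}_{m=1}^{M})$, defined as the MC approximation of $\widehat{\mathcal{L}}^{(2)}_i(\beta, \Omega, \bar{\theta})$, for all $i \in \llbracket n \rrbracket$. We thus keep the surrogate $\widehat{\mathcal{L}}^{(2)}_i(\beta, \Omega, \bar{\theta})$ as it is, and consider the following quadratic approximation of $\widehat{\mathcal{L}}^{(1)}_i(\delta, \bar{\theta})$ to estimate the vector of logistic parameters $\delta$:
\[
\widehat{\mathcal{L}}^{(1)}_i(\bar{\delta}, \bar{\theta})
- \Big( \int_{\mathsf{Z}} \nabla \log p_i(y_i | z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta) \big|_{\delta = \bar{\delta}}\, p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(dz_{i,\mathrm{mis}}) \Big)^\top (\delta - \bar{\delta})
- \frac{(\delta - \bar{\delta})^\top}{2} \Big( \int_{\mathsf{Z}} \nabla^2 \log p_i(y_i | z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta)\, p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(dz_{i,\mathrm{mis}}) \Big) (\delta - \bar{\delta}) .
\]
Recall that:
\[
\nabla \log p_i(y_i | z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta) = z_i \big( y_i - S(\delta^\top z_i) \big) , \qquad
\nabla^2 \log p_i(y_i | z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta) = - z_i z_i^\top \dot{S}(\delta^\top z_i) ,
\]
where $\dot{S}(u)$ is the derivative of $S(u)$. Note that $\dot{S}(u) \le 1/4$ and since, for all $i \in \llbracket n \rrbracket$, the $p \times p$ matrix $z_i z_i^\top$ is positive semi-definite, we can assume that:

L1. For all $i \in \llbracket n \rrbracket$ and $\epsilon > 0$, there exists, for all $z_i \in \mathsf{Z}$, a positive definite matrix $H_i(z_i) := \frac{1}{4} (z_i z_i^\top + \epsilon I_d)$ such that for all $\delta \in \mathbb{R}^p$, $- z_i z_i^\top \dot{S}(\delta^\top z_i) \preceq H_i(z_i)$.

Then, we use, for all $i \in \llbracket n \rrbracket$, the following surrogate function to estimate $\delta$:
\[
\bar{\mathcal{L}}^{(1)}_i(\delta, \bar{\theta}) = \widehat{\mathcal{L}}^{(1)}_i(\bar{\delta}, \bar{\theta}) - \bar{D}_i^\top (\delta - \bar{\delta}) + \frac{1}{2} (\delta - \bar{\delta})^\top \bar{H}_i (\delta - \bar{\delta}) , \tag{41}
\]
where:
\[
\bar{D}_i = \int_{\mathsf{Z}} \nabla \log p_i(y_i | z_{i,\mathrm{mis}}, z_{i,\mathrm{obs}}, \delta) \big|_{\delta = \bar{\delta}}\, p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(dz_{i,\mathrm{mis}}) , \qquad
\bar{H}_i = \int_{\mathsf{Z}} H_i(z_{i,\mathrm{mis}})\, p_i(z_{i,\mathrm{mis}}, \bar{\theta})\, \mu_i(dz_{i,\mathrm{mis}}) .
\]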
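Finally, at iteration $k$, the total surrogate is:
\[
\begin{aligned}
\widetilde{\mathcal{L}}^{(k)}(\theta)
&= \frac{1}{n} \sum_{i=1}^n \widetilde{\mathcal{L}}_i\big(\theta, \theta^{(\tau_i^k)}, \{z_{i,m}\}_{m=1}^{M_{(\tau_i^k)}}\big) \\
&= \frac{1}{n} \sum_{i=1}^n \widetilde{\mathcal{L}}^{(2)}_i\big(\beta, \Omega, \theta^{(\tau_i^k)}, \{z_{i,m}\}_{m=1}^{M_{(\tau_i^k)}}\big)
- \frac{1}{n} \sum_{i=1}^n \big( \widetilde{D}^{(\tau_i^k)}_i \big)^\top (\delta - \delta^{(\tau_i^k)})
+ \frac{1}{2n} \sum_{i=1}^n (\delta - \delta^{(\tau_i^k)})^\top \widetilde{H}^{(\tau_i^k)}_i (\delta - \delta^{(\tau_i^k)}) ,
\end{aligned} \tag{42}
\]
where for all $i \in \llbracket n \rrbracket$:
\[
\widetilde{D}^{(\tau_i^k)}_i = \frac{1}{M_{(\tau_i^k)}} \sum_{m=1}^{M_{(\tau_i^k)}} z^{(\tau_i^k)}_{i,m} \Big( y_i - S\big( (\delta^{(\tau_i^k)})^\top z^{(\tau_i^k)}_{i,m} \big) \Big) , \qquad
\widetilde{H}^{(\tau_i^k)}_i = \frac{1}{4 M_{(\tau_i^k)}} \sum_{m=1}^{M_{(\tau_i^k)}} z^{(\tau_i^k)}_{i,m} \big( z^{(\tau_i^k)}_{i,m} \big)^\top .
\]
Minimizing the total surrogate (42) boils down to performing a quasi-Newton step. It may be sensible to apply some diagonal loading, which is fully compatible with the surrogate interpretation given above. The logistic parameters are estimated as follows:
\[
\delta^{(k)} = \arg\min_{\delta \in \Theta} \frac{1}{n} \sum_{i=1}^n \widetilde{\mathcal{L}}^{(1)}_i\big(\delta, \theta^{(\tau_i^k)}, \{z_{i,m}\}_{m=1}^{M_{(\tau_i^k)}}\big) ,
\]
where $\widetilde{\mathcal{L}}^{(1)}_i(\delta, \theta^{(\tau_i^k)}, \{z_{i,m}\}_{m=1}^{M_{(\tau_i^k)}})$ is the MC approximation of the MISO surrogate defined in (41), which leads to the following quasi-Newton step:
\[
\delta^{(k)} = \frac{1}{n} \sum_{i=1}^n \delta^{(\tau_i^k)} - \big( \widetilde{H}^{(k)} \big)^{-1} \widetilde{D}^{(k)} ,
\]
with $\widetilde{D}^{(k)} = \frac{1}{n} \sum_{i=1}^n \widetilde{D}^{(\tau_i^k)}_i$ and $\widetilde{H}^{(k)} = \frac{1}{n} \sum_{i=1}^n \widetilde{H}^{(\tau_i^k)}_i$.

MISSO updates: At the $k$-th iteration, and after the initialization, for all $i \in \llbracket n \rrbracket$, of the latent variables $(z_i^{(0)})$, the MISSO algorithm consists in picking an index $i_k$ uniformly on $\llbracket n \rrbracket$, completing the observations by sampling a Monte Carlo batch $\{z^{(k)}_{i_k,\mathrm{mis},m}\}_{m=1}^{M_{(k)}}$ of missing values from the conditional distribution $p(z_{i_k,\mathrm{mis}} | z_{i_k,\mathrm{obs}}, y_{i_k}; \theta^{(k-1)})$ using an MCMC sampler, and computing the estimated parameters as follows:
\[
\begin{aligned}
\beta^{(k)} &= \arg\min_{\beta \in \Theta} \frac{1}{n} \sum_{i=1}^n \widetilde{\mathcal{L}}^{(2)}_i\big(\beta, \Omega^{(k)}, \theta^{(\tau_i^k)}, \{z_{i,m}\}_{m=1}^{M_{(\tau_i^k)}}\big) = \frac{1}{n} \sum_{i=1}^n \frac{1}{M_{(\tau_i^k)}} \sum_{m=1}^{M_{(\tau_i^k)}} z^{(k)}_{i,m} , \\
\Omega^{(k)} &= \arg\min_{\Omega \in \Theta} \frac{1}{n} \sum_{i=1}^n \widetilde{\mathcal{L}}^{(2)}_i\big(\beta^{(k)}, \Omega, \theta^{(\tau_i^k)}, \{z_{i,m}\}_{m=1}^{M_{(\tau_i^k)}}\big) = \frac{1}{n} \sum_{i=1}^n \frac{1}{M_{(\tau_i^k)}} \sum_{m=1}^{M_{(\tau_i^k)}} w^{(k)}_{i,m} , \\
\delta^{(k)} &= \frac{1}{n} \sum_{i=1}^n \delta^{(\tau_i^k)} - \big( \widetilde{H}^{(k)} \big)^{-1} \widetilde{D}^{(k)} ,
\end{aligned} \tag{44}
\]
where z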
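The per-sample Monte Carlo quantities entering the total surrogate can be computed as in the sketch below. It is an illustration only: the function name, the argument names, and the `eps` diagonal-loading parameter (mirroring the $\epsilon I_d$ term of assumption L1) are assumptions, and the sketch covers only the gradient and curvature estimates for one index $i$, not the full aggregated update.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mc_gradient_and_curvature(z_samples, y, delta_ref, eps=0.0):
    """Monte Carlo estimates of the gradient and curvature terms for one index i.

    z_samples : (M, p) array of completed covariate vectors (simulated + observed parts)
    y         : binary response in {0, 1}
    delta_ref : (p,) logistic parameters at which the surrogate was built
    eps       : optional diagonal loading (hypothetical knob, cf. assumption L1)
    """
    M, p = z_samples.shape
    residual = y - sigmoid(z_samples @ delta_ref)            # y - S(delta^T z_m), shape (M,)
    D = (z_samples * residual[:, None]).mean(axis=0)         # (1/M) sum_m z_m (y - S(.))
    H = (z_samples.T @ z_samples / M + eps * np.eye(p)) / 4  # (1/4M) sum_m z_m z_m^T + (eps/4) I
    return D, H

# Toy usage with hypothetical data.
rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 3))
D, H = mc_gradient_and_curvature(Z, y=1, delta_ref=np.zeros(3), eps=1e-3)
```

With `eps > 0` the matrix `H` is positive definite even when fewer than `p` samples are drawn, which keeps the linear solve in the quasi-Newton step well posed.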

B.4 WALL CLOCK TIME

We provide in Table 1 the running time of each method plotted in Figure 1, when used to train a logistic regression with missing values on the TraumaBase dataset (p = 16 influential quantitative measurements, on n = 6384 patients). The running times are roughly the same since the computational complexity per epoch is similar for each method. We observe a slight delay for the MISSO method with a batch size of 1, as our code, implemented in R, is not fully optimized or parallelized. Yet, when the batch size tends to 100%, we recover the running time of MCEM, which is consistent with the fact that MISSO with a full batch update boils down to the MCEM algorithm. We plot in Figure 3 the updated parameters for the logistic regression example against the time elapsed (in seconds).

(0) for $\ell \in \llbracket d \rrbracket$ and variance estimates $\sigma^{(0)}$. At iteration $k$, minimizing the sum of stochastic surrogates defined as in (6) and (13) yields the following MISSO update:

C.3 WALL CLOCK TIME

We provide in Table 4 the running time of each method plotted in Figure 2, when used to train a Bayesian variant of LeNet-5 on MNIST. The incremental methods, MISSO and MC-SAG, display similar wall clock times, which are slightly higher than those of the other methods because of (a) the initialization, which requires computing a vector of n gradients that is kept in memory and updated through the iterations, and (b) the averaging operation performed at each parameter update to compute the aggregated drift term, see (44).



Figure 1: Convergence of parameters δ and β for the SAEM, the MCEM and the MISSO methods. The convergence is plotted against the number of passes over the data.

We compare in Figure 1 the convergence behavior of the estimated parameters δ and β using SAEM (Delyon et al., 1999) (with stepsize $\gamma_k = 1/k^{\alpha}$ where $\alpha = 0.6$ after tuning), MCEM (Wei

end for
12: Output: $z_{i,\mathrm{mis},M}$

B.3 MISSO UPDATE

Choice of surrogate function for MISO: We recall the MISO deterministic surrogate defined in (7):

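,m , z i,obs ) is composed of a simulated and an observed part, D(k) = ) -β (k) (β (k) ) . Besides, $\widetilde{\mathcal{L}}^{(1)}_i(\beta, \Omega, \bar{\theta}, \{z_m\}_{m=1}^{M})$ and $\widetilde{\mathcal{L}}^{(2)}_i(\beta, \Omega, \bar{\theta}, \{z_m\}_{m=1}^{M})$ are defined as the MC approximations of $\widehat{\mathcal{L}}^{(1)}_i(\beta, \Omega, \bar{\theta})$ and $\widehat{\mathcal{L}}^{(2)}_i(\beta, \Omega, \bar{\theta})$, for all $i \in \llbracket n \rrbracket$, as components of the surrogate function (40).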

Figure 3: Convergence of parameters δ and β for the SAEM, the MCEM and the MISSO methods. The convergence is plotted against time elapsed (in seconds).

Bayesian Deep Neural Network: running time in seconds for 100 epochs. MC-Adam MC-Momentum BBB MC-4, the learning curves for the MNIST example against the time elapsed (in seconds).

Figure 4: Negated ELBO versus wall clock time for fitting a Bayesian LeNet-5 on MNIST. Plotted on the average of the 5 repetitions.

Logistic Regression with missing values: running time in seconds for 10 epochs.

Table 2 the architecture of the Convolutional Neural Network introduced in (LeCun et al., 1998) and trained on MNIST: LeNet-5 architecture Bayesian ResNet-18 Architecture: We describe in Table 3 the architecture of the Resnet-18 we train on CIFAR-10:


step (i) pick a function index $i_k$ uniformly on $\llbracket n \rrbracket$; step (ii) sample a Monte Carlo batch {z from $\mathcal{N}(0, I)$; and step (iii) update the parameters as where we define the following gradient terms for all $i \in \llbracket 1, n \rrbracket$:

Note that our analysis in the main text does require the parameters to lie in a compact set. For the estimation problem considered here, this can be enforced in practice by restricting the parameters to a ball. In our simulations for the BNN example, for illustrative purposes, we did not implement algorithms that strictly enforce the compactness requirement. However, we observe empirically that the parameters always remain bounded. The update rules can easily be modified to respect the requirement. For the considered VI problem, we recall that the surrogate functions (11) are quadratic, so a simple projection step suffices to ensure boundedness of the iterates.

For all benchmark algorithms, we pick, at iteration $k$, a function index $i_k$ uniformly on $\llbracket n \rrbracket$ and sample a Monte Carlo batch {z m=1 from the standard Gaussian distribution. The updates of the parameters $\mu_\ell$ for all $\ell \in \llbracket d \rrbracket$ and $\sigma$ break down as follows:

Monte Carlo SAG update: Set for i = i k and are defined by (45) for i = i k . The learning rate is set to $\gamma = 10^{-3}$.

Bayes By Backprop update: Set where the learning rate is $\gamma = 10^{-3}$.

Monte Carlo Momentum update: Set σ,i k , where $\alpha$ and $\gamma$, respectively the momentum and the learning rate, are set to $10^{-3}$.

Monte Carlo ADAM update: Set where The hyperparameters are set as follows: $\gamma = 10^{-3}$, $\rho_1 = 0.9$, $\rho_2 = 0.999$, $\epsilon = 10^{-8}$.
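The projection step mentioned above, which keeps the iterates in a compact set, can be sketched as a Euclidean projection onto a ball. This is a generic illustration, not the authors' code: the function name, the `radius` and `center` arguments, and the toy values are all assumptions.

```python
import numpy as np

def project_onto_ball(theta, radius, center=None):
    """Euclidean projection of theta onto the ball {x : ||x - center|| <= radius}."""
    center = np.zeros_like(theta) if center is None else center
    offset = theta - center
    norm = np.linalg.norm(offset)
    if norm <= radius:
        return theta.copy()                      # already feasible, nothing to do
    return center + offset * (radius / norm)     # rescale onto the boundary

# Toy usage: apply after each parameter update to keep the iterate bounded.
theta = np.array([3.0, 4.0])
theta_proj = project_onto_ball(theta, radius=1.0)
```

Applying this map after each update leaves iterates inside the ball untouched, so it only affects the trajectory when an update would leave the constraint set.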

