OPTIMAL NEURAL NETWORK APPROXIMATION OF WASSERSTEIN GRADIENT DIRECTION VIA CONVEX OPTIMIZATION

Abstract

The computation of the Wasserstein gradient direction is essential for posterior sampling problems and scientific computing. Approximating the Wasserstein gradient with finite samples requires solving a variational problem. We study this variational problem over the family of two-layer networks with squared-ReLU activations, for which we derive a semi-definite programming (SDP) relaxation. This SDP can be viewed as an approximation of the Wasserstein gradient in a broader function family including two-layer networks. By solving the convex SDP, we obtain the optimal approximation of the Wasserstein gradient direction in this class of functions. We also propose practical algorithms based on subsampling and dimension reduction. Numerical experiments, including PDE-constrained Bayesian inference and parameter estimation in COVID-19 modeling, demonstrate the effectiveness and efficiency of the proposed method.

1. INTRODUCTION

Bayesian inference plays an essential role in learning model parameters from observational data, with applications in inverse problems, scientific computing, information science, and machine learning (Stuart, 2010). The central problem in Bayesian inference is to draw samples from a posterior distribution, which characterizes the parameter distribution given data and a prior distribution. The Wasserstein gradient flow (Otto, 2001; Ambrosio et al., 2005; Junge et al., 2017) has been shown to be effective in drawing samples from a posterior distribution and has attracted increasing attention in recent years. For instance, the Wasserstein gradient flow of the Kullback-Leibler (KL) divergence connects to the overdamped Langevin dynamics. The time-discretization of the overdamped Langevin dynamics yields the classical Langevin Markov chain Monte Carlo (MCMC) algorithm. In this sense, the computation of the Wasserstein gradient flow offers a different viewpoint on sampling algorithms. In particular, the Wasserstein gradient direction also provides a deterministic update of the particle system (Carrillo et al., 2021b). Based on approximations or generalizations of the Wasserstein gradient direction, many efficient sampling algorithms have been developed, including Wasserstein gradient descent (WGD) with kernel density estimation (KDE) (Liu et al., 2019), Stein variational gradient descent (SVGD) (Liu & Wang, 2016), and neural variational gradient descent (di Langosco et al., 2021). Meanwhile, neural networks exhibit tremendous optimization and generalization performance in learning complicated functions from data. They also have wide applications in Bayesian inverse problems (Rezende & Mohamed, 2015; Onken et al., 2020; Kruse et al., 2019; Lan et al., 2021).
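To make the time-discretization above concrete, the resulting update is the unadjusted Langevin algorithm. The sketch below is a standard illustration rather than the method proposed in this paper; `grad_log_pi` is a placeholder for the score function of the target density, and the step size is an illustrative choice.

```python
import numpy as np

def langevin_mcmc(grad_log_pi, x0, step=0.01, n_iters=1000, rng=None):
    """Unadjusted Langevin algorithm: the Euler-Maruyama discretization
    x_{k+1} = x_k + step * grad log pi(x_k) + sqrt(2 * step) * Gaussian noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iters):
        x = x + step * grad_log_pi(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x

# Target: standard Gaussian, so grad log pi(x) = -x; run 200 independent chains
# and keep the final state of each as an (approximate) posterior sample.
samples = np.stack([langevin_mcmc(lambda x: -x, np.zeros(1),
                                  rng=np.random.default_rng(seed))
                    for seed in range(200)])
```

For small step sizes the chain's stationary distribution is close to the target; a Metropolis-Hastings correction (as in MALA) removes the residual discretization bias.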
According to the universal approximation theorem of neural networks (Hornik et al., 1989; Lu et al., 2017), arbitrarily complicated functions can be learned by a two-layer neural network with nonlinear activations and a sufficient number of neurons. Functions represented by neural networks therefore naturally provide an approximation of the Wasserstein gradient direction. However, due to the nonlinear and nonconvex structure of neural networks, optimization algorithms such as stochastic gradient descent may not find the global optima of the training problem. Recently, based on a line of works (Pilanci & Ergen, 2020; Sahiner et al., 2020; Bartan & Pilanci, 2021a), the regularized training problem of two-layer neural networks with ReLU/polynomial activations can be formulated as a convex program. Indeed, by solving the convex program, one can construct the entire set of global optima of the nonconvex training problem (Wang et al., 2020). Theoretical analysis (Wang et al., 2022) shows that global optima of the training problem correspond to the simplest models with good generalization properties. Moreover, numerical results (Pilanci & Ergen, 2020) show that neural networks found by solving the convex program can achieve higher training and test accuracy than neural networks trained by SGD with the same number of parameters. In this paper, we study a variational problem whose optimal solution corresponds to the Wasserstein gradient direction. Focusing on the family of two-layer neural networks with squared-ReLU activation, we formulate the regularized variational problem in terms of samples. Directly training the neural network to minimize the loss may leave the network stuck at local minima or saddle points, which often leads to samples biased away from the posterior. Instead, we analyze the convex dual of the training problem and study its semi-definite programming (SDP) relaxation by analyzing the geometry of the dual constraints.
The resulting SDP can be efficiently solved by convex optimization solvers such as CVXPY (Diamond & Boyd, 2016). We then derive the corresponding relaxed bidual problem (the dual of the relaxed dual problem). Thus, the optimal solution of the dual problem yields an optimal approximation of the Wasserstein gradient direction in a broader function family. We also analyze the choice of the regularization parameter and present a practical implementation using subsampling and parameter dimension reduction to improve computational efficiency. Numerical results for experiments including PDE-constrained inference problems and COVID-19 parameter estimation problems illustrate the effectiveness and efficiency of our method.

1.1. RELATED WORKS

The time and spatial discretizations of Wasserstein gradient flows are extensively studied in the literature (Jordan et al., 1998; Junge et al., 2017; Carrillo et al., 2021a;b; Bonet et al., 2021; Liutkus et al., 2019; Frogner & Poggio, 2020). Recently, neural networks have been applied to solve or approximate Wasserstein gradient flows (Mokrov et al., 2021; Lin et al., 2021b;a; Alvarez-Melis et al., 2021; Bunne et al., 2021; Hwang et al., 2021; Fan et al., 2021). For sampling algorithms, di Langosco et al. (2021) learn the transport map by solving an unregularized variational problem in the family of vector-output deep neural networks. Compared to these studies, we focus on a convex SDP relaxation of the variational problem induced by the Wasserstein gradient direction. Meanwhile, Feng et al. (2021) formulate the Wasserstein gradient direction as the minimizer of the Bregman score and apply deep neural networks to solve the induced variational problem. Previous works on convex optimization formulations of neural networks using SDPs (Bartan & Pilanci, 2021a;b) focus on polynomial activations and give an exact convex optimization formulation (instead of a convex relaxation). In comparison, we focus on neural networks with the squared-ReLU activation, which has not been considered before. Our method also applies to the analysis of supervised learning problems using squared-ReLU-activated neural networks.

2. BACKGROUND

In this section, we briefly review Wasserstein gradient descent and present its variational formulation. In particular, we focus on the Wasserstein gradient descent direction of the KL divergence functional. Later on, we design a convex neural network optimization problem to approximate the Wasserstein gradient from samples.

2.1. WASSERSTEIN GRADIENT DESCENT

Consider an optimization problem in the probability space: 



$$\inf_{\rho\in\mathcal{P}} D_{\mathrm{KL}}(\rho\,\|\,\pi)=\int_{\mathbb{R}^d}\rho(x)\bigl(\log\rho(x)-\log\pi(x)\bigr)\,dx. \tag{1}$$

Here the objective functional $D_{\mathrm{KL}}(\rho\,\|\,\pi)$ is the KL divergence from $\rho$ to $\pi$. The variable is the density function $\rho$ in the space $\mathcal{P}=\{\rho\in C^{\infty}(\mathbb{R}^d)\mid \int\rho\,dx=1,\ \rho>0\}$. The function $\pi\in C^{\infty}(\mathbb{R}^d)$ is a known probability density function of the posterior distribution. By solving the optimization problem (1), we can generate samples from the posterior distribution.
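As a sketch of how (1) leads to a particle method: the Wasserstein gradient direction of the KL divergence is $\nabla\log(\rho/\pi)=\nabla\log\rho-\nabla\log\pi$, and WGD with KDE (mentioned in the introduction) moves particles along the negative of this direction, estimating $\nabla\log\rho$ with a Gaussian kernel density estimator. The bandwidth and step size below are illustrative choices, not values from this paper.

```python
import numpy as np

def wgd_step(particles, grad_log_pi, bandwidth=0.3, step=0.05):
    """One Wasserstein gradient descent step on the KL divergence:
    each particle moves along grad log pi - grad log rho, where rho is
    estimated by a Gaussian kernel density estimator over the particles."""
    diffs = particles[:, None, :] - particles[None, :, :]   # (n, n, d)
    sq = (diffs ** 2).sum(-1)
    w = np.exp(-sq / (2 * bandwidth ** 2))                  # kernel weights
    # grad log rho_hat(x_i) = sum_j w_ij (x_j - x_i) / (h^2 sum_j w_ij)
    grad_log_rho = (w[:, :, None] * (-diffs)).sum(1) / (
        bandwidth ** 2 * w.sum(1, keepdims=True))
    return particles + step * (grad_log_pi(particles) - grad_log_rho)

rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, size=(200, 1))   # particles start far from the target
for _ in range(300):
    x = wgd_step(x, lambda p: -p)         # target: standard Gaussian, grad log pi = -p
```

After these deterministic updates the particle cloud should concentrate near the standard Gaussian target; unlike Langevin MCMC, no noise is injected, and the KDE term plays the role of the entropy gradient.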

