NOVAS: NON-CONVEX OPTIMIZATION VIA ADAPTIVE STOCHASTIC SEARCH FOR END-TO-END LEARNING AND CONTROL

Abstract

In this work we propose the use of adaptive stochastic search as a building block for general, non-convex optimization operations within deep neural network architectures. Specifically, for an objective function located at some layer in the network and parameterized by some network parameters, we employ adaptive stochastic search to perform optimization over its output. This operation is differentiable and does not obstruct the passing of gradients during backpropagation, thus enabling us to incorporate it as a component in end-to-end learning. We study the proposed optimization module's properties and benchmark it against two existing alternatives on a synthetic energy-based structured prediction task, and further showcase its use in stochastic optimal control applications.

1. INTRODUCTION

Deep learning has experienced a drastic increase in the diversity of neural network architectures, both in terms of proposed structure, as well as in the repertoire of operations that define the interdependencies of their elements. With respect to the latter, a significant amount of attention has been devoted to incorporating optimization blocks or modules operating at some part of the network. This has been motivated by a large number of applications, including meta-learning (Finn et al., 2017; Rusu et al., 2018; Bartunov et al., 2019), differentiable physics simulators (de Avila Belbute-Peres et al., 2018), classification (Amos et al., 2019), GANs (Metz et al., 2016), reinforcement learning with constraints, latent spaces, or safety (Amos & Kolter, 2017; Srinivas et al., 2018; Amos & Yarats, 2019; Cheng et al., 2019; Pereira et al., 2020), model predictive control (Amos et al., 2018; Pereira et al., 2018), as well as tasks relying on the use of energy networks (Belanger et al., 2017; Bartunov et al., 2019), among many others. Local¹ optimization modules lead to nested optimization operations, as they interact with the global, end-to-end training of the network that contains them.

Consider some component within the neural network architecture, e.g. a single layer, whose input and output are $x_i \in \mathbb{R}^n$ and $x_{i+1} \in \mathbb{R}^m$, respectively. Within that layer, the input and output are linked via the solution of the following optimization problem:

$$x_{i+1} = \arg\min_x F(x; x_i, \theta), \qquad (1)$$

that is, the output $x_{i+1}$ is defined as the solution of an optimization problem in which the input $x_i$ remains temporarily fixed, i.e., acts as a parameter. Here, $F(x; x_i, \theta) : \mathbb{R}^m \times \mathbb{R}^n \times \Theta \to \mathbb{R}$ is a function possibly further parameterized by some subset of the neural network parameters $\theta \in \Theta$. Note that $x$ here is an independent variable which is free to vary.
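To make the input/output contract of such a layer concrete, the toy sketch below implements eq. (1) in plain NumPy for a one-dimensional case. The particular objective $F$, its parameterization, and the use of a coarse grid search as the inner optimizer are all illustrative assumptions for this sketch, not the method proposed in this paper.

```python
import numpy as np

# Hypothetical inner objective F(x; x_i, theta): non-convex in the free
# variable x; the layer input x_i and the parameter theta are held fixed
# during the inner optimization. Purely illustrative.
def F(x, x_i, theta):
    return (x - theta * x_i) ** 2 + 0.1 * np.sin(5.0 * x)

# The layer maps x_i to x_{i+1} = argmin_x F(x; x_i, theta).
# A coarse grid search stands in for a real inner optimizer here
# (m = n = 1), just to make the contract concrete.
def argmin_layer(x_i, theta):
    grid = np.linspace(-3.0, 3.0, 6001)
    return grid[np.argmin(F(grid, x_i, theta))]

x_next = argmin_layer(x_i=1.0, theta=0.5)  # approximate minimizer of F
```

Note that `x_next` depends on both the layer input `x_i` and the parameter `theta`; end-to-end training needs exactly the gradients of this output with respect to those quantities.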
The result of this optimization could potentially also be subject to a set of (input-dependent) constraints, though in this paper we consider only unconstrained optimization. It is also important to note that, depending on the problem, $F$ can be a given function, or it can itself be represented by a multi-layer neural network (trained by the outer loop), in which case the aforementioned optimization layer consists of multiple sub-layers and is more accurately described as a module rather than a single layer. Examples of this type of optimization are structured prediction energy networks (e.g. Belanger et al. (2017)); another such example is Amos & Kolter (2017), which treats the case of convex $F(\cdot\,; x_i, \theta)$. To facilitate end-to-end learning over the entire network, computing the gradient of its loss function $L$ with respect to $\theta$ requires, during backpropagation, passing the gradient of the module's output $x_{i+1}$ with respect to the parameters $\theta$ and the input $x_i$. Depending on the nature of the optimization problem under consideration, several procedures have been suggested; among them, particularly appealing is the case of convex optimization (Gould et al., 2016; Johnson et al., 2016; Amos et al., 2017; Amos & Kolter, 2017), in which the aforementioned gradients can be computed efficiently through an application of the implicit function theorem to a set of optimality conditions, such as the KKT conditions. In the case of non-convex functions, however, obtaining such gradients is not as straightforward; solutions involve either forming and solving a locally convex approximation of the problem, or unrolling gradient descent (Domke, 2012; Metz et al., 2016; Belanger et al., 2017; Finn et al., 2017; Srinivas et al., 2018; Rusu et al., 2018; Foerster et al., 2018; Amos et al., 2018).
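The unrolling strategy can be sketched as follows: the arg min is replaced by a fixed number of gradient steps, which a deep learning framework would record as a compute graph and differentiate through. The objective, the closed-form gradient, and the hyper-parameter values below are illustrative assumptions; in practice a framework such as PyTorch would supply the gradient via autograd.

```python
import numpy as np

# Illustrative inner objective and its gradient (closed form here;
# an autodiff framework would compute this automatically).
def F(x, x_i, theta):
    return (x - theta * x_i) ** 2

def grad_F(x, x_i, theta):
    return 2.0 * (x - theta * x_i)

# Unrolled inner loop: K fixed gradient steps approximate the argmin.
# The learning rate lr and iteration count K become part of the
# unrolled graph, which is why the result can over-fit to them.
def unrolled_argmin(x0, x_i, theta, lr=0.1, K=50):
    x = x0
    for _ in range(K):
        x = x - lr * grad_F(x, x_i, theta)
    return x

x_approx = unrolled_argmin(x0=0.0, x_i=1.0, theta=2.0)  # -> close to 2.0
```

Because all K steps are retained in the graph, memory and backpropagation cost grow linearly in K, which is the bottleneck discussed next.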
Unrolling gradient descent approximates the arg min operator with a fixed number of gradient descent iterations during the forward pass and interprets these as an unrolled compute graph that can be differentiated through during the backward pass. One drawback of this unrolled gradient descent operation, however, is that it can lead to over-fitting to the selected gradient descent hyper-parameters, such as the learning rate and the number of iterations. Recently, Amos & Yarats (2019) demonstrated promising results in alleviating this phenomenon by replacing the iterations of gradient descent with iterations of sampling-based optimization, in particular a differentiable approximation of the cross-entropy method. While still unrolling the graph created by the fixed number of iterations, they showed empirically that no over-fitting to the hyper-parameters occurred when performing inference on the trained network with altered inner-loop optimization hyper-parameters. Another significant bottleneck in all methods involving graph unrolling is the number of iterations, which must be kept low to prevent a prohibitively large graph during backpropagation and the training issues that come with it.

Note that in eq. (1) the variable of optimization is free to vary independently of the network. This is in contrast to many applications involving nested optimization, mainly in the field of meta-learning, in which the inner loop, rather than optimizing a free variable, performs adaptation to an initial value which is supplied to the inner loop by the outer part of the network. For example, MAML (Finn et al., 2017) performs the inner-loop adaptation $\theta \to \theta'$, in which the starting point $\theta$ is not arbitrary (as $x$ is in eq. (1)) but is supplied by the network. Thus, in the context of adaptation, unrolling the inner-loop graph during backpropagation is generally necessary to trace the adaptation back to the particular network-supplied initial value.
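A minimal sketch of the sampling-based alternative, in the spirit of the cross-entropy method and of adaptive stochastic search: candidates are drawn from a Gaussian, weighted by an exponential shape function (a soft-min over objective values), and the sampling distribution is moved toward the weighted sample moments. The objective and all hyper-parameter values here are illustrative assumptions, not the exact update used by any of the cited methods.

```python
import numpy as np

rng = np.random.default_rng(0)

def F(x):
    # toy non-convex objective with its global minimum near x = 0.72
    return (x - 0.5) ** 2 + 0.1 * np.sin(5.0 * x)

# Each iteration: sample candidates, weight them with a soft-min of
# the objective, then move the Gaussian toward the weighted sample
# mean and standard deviation (smoothed by a step size).
def sampling_based_argmin(mu=0.0, sigma=2.0, iters=30,
                          n_samples=256, temp=10.0, step=0.7):
    for _ in range(iters):
        xs = rng.normal(mu, sigma, n_samples)
        w = np.exp(-temp * (F(xs) - F(xs).min()))  # shape function
        w /= w.sum()                               # normalized weights
        new_mu = np.sum(w * xs)
        new_sigma = np.sqrt(np.sum(w * (xs - new_mu) ** 2)) + 1e-6
        mu = (1.0 - step) * mu + step * new_mu
        sigma = (1.0 - step) * sigma + step * new_sigma
    return mu

x_star = sampling_based_argmin()
```

Unlike a gradient step, each iteration touches the objective only through function evaluations, which is what makes such updates attractive for non-convex or non-smooth inner problems.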
Two notable exceptions are first-order MAML (Finn et al., 2017; Nichol et al., 2018), which ignores second-derivative terms, and implicit MAML (Rajeswaran et al., 2019), which relies on local curvature estimation. In this paper we propose Non-convex Optimization Via Adaptive Stochastic Search (NOVAS), a module for differentiable, non-convex optimization. The backbone of this module is adaptive stochastic search (Zhou & Hu, 2014), a sampling-based method within the field of stochastic optimization. The contributions of our work are as follows:

(A) We demonstrate that the NOVAS module does not over-fit to optimization hyper-parameters and offers improved speed and convergence rate over its alternative (Amos & Yarats, 2019).

(B) If the inner-loop variable of optimization is free to vary (i.e., the problem fits the definition given by eq. (1)), we show that there is no need to unroll the graph during the back-propagation of gradients. This advantage is critical, as it drastically reduces the size of the overall end-to-end computation graph, thus facilitating improved ability to learn with higher convergence rates, improved speed, and reduced memory requirements. Furthermore, it allows us to use a higher number of inner-loop iterations.

(C) If the inner loop represents an adaptation to a network-supplied value, as is the case in meta-learning applications, NOVAS may still be used in lieu of the gradient descent rule (though unrolling the graph may be necessary here). Testing NOVAS in such a setting is left for future work.

(D) We combine the NOVAS module with the framework of deep FBSDEs, a neural network-based approach to solving nonlinear partial differential equations (PDEs). This combination allows us to solve Hamilton-Jacobi-Bellman (HJB) PDEs of the most general form, i.e., those in which the min operator does not have a closed-form solution, a class of problems that was previously impossible to address due to the non-convexity of



* Equal contribution.
¹ To distinguish the optimization of the entire network from that of the optimization module, we frequently refer to the former as global or outer-loop optimization and to the latter as local or inner-loop optimization.

