NOVAS: NON-CONVEX OPTIMIZATION VIA ADAPTIVE STOCHASTIC SEARCH FOR END-TO-END LEARNING AND CONTROL

Abstract

In this work we propose the use of adaptive stochastic search as a building block for general, non-convex optimization operations within deep neural network architectures. Specifically, for an objective function located at some layer in the network and parameterized by some network parameters, we employ adaptive stochastic search to perform optimization over its output. This operation is differentiable and does not obstruct the passing of gradients during backpropagation, thus enabling us to incorporate it as a component in end-to-end learning. We study the proposed optimization module's properties and benchmark it against two existing alternatives on a synthetic energy-based structured prediction task, and further showcase its use in stochastic optimal control applications.

1. INTRODUCTION

Deep learning has experienced a drastic increase in the diversity of neural network architectures, both in terms of proposed structure, as well as in the repertoire of operations that define the interdependencies of their elements. With respect to the latter, a significant amount of attention has been devoted to incorporating optimization blocks or modules operating at some part of the network. This has been motivated by a large number of applications, including meta-learning (Finn et al., 2017; Rusu et al., 2018; Bartunov et al., 2019), differentiable physics simulators (de Avila Belbute-Peres et al., 2018), classification (Amos et al., 2019), GANs (Metz et al., 2016), reinforcement learning with constraints, latent spaces, or safety (Amos & Kolter, 2017; Srinivas et al., 2018; Amos & Yarats, 2019; Cheng et al., 2019; Pereira et al., 2020), model predictive control (Amos et al., 2018; Pereira et al., 2018), as well as tasks relying on the use of energy networks (Belanger et al., 2017; Bartunov et al., 2019), among many others. Local¹ optimization modules lead to nested optimization operations, as they interact with the global, end-to-end training of the network that contains them.

Consider some component within the neural network architecture, e.g., a single layer, whose input and output are x_i ∈ R^n and x_{i+1} ∈ R^m, respectively. Within that layer, the input and output are linked via the solution of the following optimization problem: x_{i+1} = arg min_x F(x; x_i, θ), that is, the output x_{i+1} is defined as the solution to an optimization problem for which the input x_i remains temporarily fixed, i.e., acts as a parameter. Here, F(x; x_i, θ) : R^m × R^n × Θ → R is a function possibly further parameterized by some subset of the neural network parameters θ ∈ Θ. Note that x here is an independent variable which is free to vary. The result of this optimization could potentially also be subject to a set of (input-dependent) constraints, though in this paper we
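To make the inner-loop operation concrete, the following is a minimal numerical sketch of an adaptive stochastic search update for arg min_x F(x): a Gaussian sampling distribution is iteratively reshaped toward low values of F using exponentially weighted samples. This is an illustrative toy, not the authors' exact algorithm; the function name, hyperparameters, and update rule (exponential weighting with smoothed mean/deviation updates) are assumptions chosen for clarity, and the differentiable, end-to-end-trainable version is the subject of the paper itself.

```python
import numpy as np

def adaptive_stochastic_search(F, mu0, sigma0, n_samples=100, n_iters=50,
                               temperature=1.0, lr=0.5, seed=0):
    """Illustrative adaptive stochastic search minimizing F over R^m.

    Maintains a diagonal Gaussian N(mu, sigma^2) over candidate solutions and
    shifts it toward regions where F is low. All hyperparameters are
    hypothetical defaults for this sketch.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu0, dtype=float)
    sigma = np.asarray(sigma0, dtype=float)
    for _ in range(n_iters):
        # Draw candidate solutions from the current sampling distribution.
        x = rng.normal(mu, sigma, size=(n_samples, mu.size))
        # Exponential (softmax-style) weights favoring low objective values;
        # subtracting the max keeps the exponentials numerically stable.
        s = -np.array([F(xi) for xi in x]) / temperature
        w = np.exp(s - s.max())
        w /= w.sum()
        # Smoothed updates of the sampling distribution toward the
        # weighted sample mean and weighted sample deviation.
        mu = (1.0 - lr) * mu + lr * (w @ x)
        sigma = (1.0 - lr) * sigma + lr * np.sqrt(w @ (x - mu) ** 2)
    return mu

# Usage: minimize a simple quadratic F(x) = ||x - x*||^2 with x* = (1, 2).
x_star = adaptive_stochastic_search(
    lambda x: np.sum((x - np.array([1.0, 2.0])) ** 2),
    mu0=[0.0, 0.0], sigma0=[1.0, 1.0])
```

Because the update is a composition of sampling and differentiable averaging operations, updates of this form can be made compatible with backpropagation (e.g., via the reparameterization of the Gaussian samples), which is what allows such a module to sit inside an end-to-end trained network.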



* Equal contribution.

¹ To distinguish between the optimization of the entire network as opposed to that of the optimization module, we frequently refer to the former as global or outer-loop optimization and to the latter as local or inner-loop optimization.

