DDPNOPT: DIFFERENTIAL DYNAMIC PROGRAMMING NEURAL OPTIMIZER

Abstract

Interpreting the training of Deep Neural Networks (DNNs) as an optimal control problem over nonlinear dynamical systems has received considerable attention recently, yet the algorithmic development remains relatively limited. In this work, we make an attempt along this line by reformulating the training procedure from a trajectory optimization perspective. We first show that most widely-used algorithms for training DNNs can be linked to Differential Dynamic Programming (DDP), a celebrated second-order method rooted in Approximate Dynamic Programming. In this vein, we propose a new class of optimizer, the DDP Neural Optimizer (DDPNOpt), for training feedforward and convolution networks. DDPNOpt features layer-wise feedback policies which improve convergence and reduce sensitivity to hyper-parameters compared with existing methods. It outperforms other optimal-control-inspired training methods in both convergence and complexity, and is competitive against state-of-the-art first- and second-order methods. We also observe that DDPNOpt has a surprising benefit in preventing gradient vanishing. Our work opens up new avenues for principled algorithmic design built upon optimal control theory.

1. INTRODUCTION

In this work, we consider the following optimal control problem (OCP) in the discrete-time setting:

    min_ū J(ū; x_0) := φ(x_T) + Σ_{t=0}^{T-1} ℓ_t(x_t, u_t),   s.t. x_{t+1} = f_t(x_t, u_t),    (OCP)

where x_t ∈ R^n and u_t ∈ R^m represent the state and control at each time step t, and f_t(·,·), ℓ_t(·,·) and φ(·) respectively denote the nonlinear dynamics, intermediate cost, and terminal cost functions. OCP aims to find a control trajectory, ū ≡ {u_t}_{t=0}^{T-1}, such that the accumulated cost J over the finite horizon t ∈ {0, 1, ..., T} is minimized. Problems of the form of OCP appear in multidisciplinary areas, since it describes a generic multi-stage decision-making problem (Gamkrelidze, 2013), and have gained commensurate interest recently in deep learning (Weinan, 2017; Liu & Theodorou, 2019). Central to the research along this line is the interpretation of DNNs as discrete-time nonlinear dynamical systems, where each layer is viewed as a distinct time step (Weinan, 2017). The dynamical-system perspective provides a mathematically sound explanation for existing DNN models (Lu et al., 2019). It also leads to new architectures inspired by numerical differential equations and physics (Lu et al., 2017; Chen et al., 2018; Greydanus et al., 2019). In this vein, one may interpret training as the parameter identification (PI) of nonlinear dynamics. However, PI typically involves (i) searching for time-independent parameters (ii) given trajectory measurements at each time step (Voss et al., 2004; Peifer & Timmer, 2007). Neither setup holds in practical DNN training, which instead optimizes time- (i.e. layer-) varying parameters given target measurements only at the final stage. An alternative perspective, which often leads to a richer analysis, is to recast the network weights as control variables. Through this lens, OCP describes w.l.o.g. the training objective composed of layer-wise losses (e.g. weight decay) and a terminal loss (e.g. cross-entropy).
This perspective (see Table 1) has been explored recently to provide theoretical statements on convergence and generalization (Weinan et al., 2018; Seidman et al., 2020). On the algorithmic side, while OCP has motivated new architectures (Benning et al., 2019) and methods for breaking sequential computation (Gunther et al., 2020; Zhang et al., 2019), OCP-inspired optimizers remain relatively limited, often restricted to either a specific network class (e.g. discrete weights) (Li & Hao, 2018) or small-size datasets (Li et al., 2017). The aforementioned works are primarily inspired by the Pontryagin Maximum Principle (PMP, Boltyanskii et al. (1960)), which characterizes the first-order optimality conditions of OCP. Another parallel methodology which has received little attention is Approximate Dynamic Programming (ADP, Bertsekas et al. (1995)). Although both originate from optimal control theory, ADP differs from PMP in that at each time step a locally optimal feedback policy (as a function of the state x_t) is computed. These policies, as opposed to the vector updates from PMP, are known to enhance the numerical stability of the optimization process when models admit chain structures (e.g. in DNNs) (Liao & Shoemaker, 1992; Tassa et al., 2012). Practical ADP algorithms such as Differential Dynamic Programming (DDP, Jacobson & Mayne (1970)) appear extensively in modern autonomous systems for complex trajectory optimization (Tassa et al., 2014; Gu, 2017). However, whether they can be lifted to large-scale stochastic optimization, as in DNN training, remains unclear. In this work, we make a significant advance toward optimal-control-theoretic training algorithms inspired by ADP. We first show that most existing first- and second-order optimizers can be derived from DDP as special cases. Built upon this intriguing connection, we present a new class of optimizer which marries the best of both.
The proposed method, the DDP Neural Optimizer (DDPNOpt), features layer-wise feedback policies which, as we will show through experiments, improve convergence and robustness. To enable efficient training, DDPNOpt adapts key components including (i) curvature adaptation from existing methods, (ii) stabilization techniques used in trajectory optimization, and (iii) an efficient factorization of OCP. These reduce the complexity by orders of magnitude compared with other OCP-inspired baselines, without sacrificing performance. In summary, we present the following contributions.

2. PRELIMINARIES

We will start with the Bellman principle for OCP and leave discussions of PMP to Appendix A.1.

Theorem 1 (Dynamic Programming (DP) (Bellman, 1954)). Define a value function V_t : R^n → R at each time step, computed backward in time using the Bellman equation

    V_t(x_t) = min_{u_t(x_t) ∈ Γ_{x_t}} Q_t(x_t, u_t),  Q_t(x_t, u_t) ≡ ℓ_t(x_t, u_t) + V_{t+1}(f_t(x_t, u_t)),  V_T(x_T) = φ(x_T),    (1)

where Γ_{x_t} : R^n → R^m denotes a set of mappings from state to control space. Then V_0(x_0) = J*(x_0) is the optimal objective value of OCP. Further, let μ*_t(x_t) ∈ Γ_{x_t} be the minimizer of Eq. 1 for each t; then the policy π* = {μ*_t(x_t)}_{t=0}^{T-1} is globally optimal in the closed-loop system.

Notation: We will always use t as the time step of the dynamics and denote a subsequence trajectory up to time s as x̄_s ≡ {x_t}_{t=0}^{s}, with x̄ ≡ {x_t}_{t=0}^{T} as the whole trajectory. For any real-valued time-dependent function F_t, we denote its derivatives evaluated at a given state-control pair (x_t ∈ R^n and u_t ∈ R^m) as ∇_{x_t} F_t ∈ R^n, ∇²_{x_t} F_t ∈ R^{n×n}, ∇_{x_t u_t} F_t ∈ R^{n×m}, or simply F_x^t, F_xx^t, and F_xu^t for brevity. The vector-tensor product, i.e. the contraction mapping on the dimension of the vector space, is denoted by V_x • f_xx ≡ Σ_{i=1}^n [V_x]_i [f_xx]_i, where [V_x]_i is the i-th element of the vector V_x and [f_xx]_i is the Hessian matrix corresponding to that element.
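As a concrete (and deliberately tiny) illustration of the Bellman recursion in Theorem 1, the following sketch runs the backward pass on a made-up 1-D grid problem with three discrete controls; all costs and dynamics here are illustrative and not from the paper:

```python
import numpy as np

# Toy illustration of Theorem 1: a 1-D grid with 3 discrete controls,
# horizon T = 4. All problem data below is made up for illustration.
T, n_states = 4, 5
controls = [-1, 0, 1]                      # move left / stay / move right
phi = np.array([4., 3., 0., 3., 4.])       # terminal cost: reach the middle
step_cost = 0.1                            # intermediate cost per unit control

def f(x, u):                               # dynamics: clipped move on the grid
    return int(np.clip(x + u, 0, n_states - 1))

V = [None] * (T + 1)
V[T] = phi                                 # V_T(x_T) = phi(x_T)
policy = [None] * T
for t in reversed(range(T)):               # backward Bellman pass
    V[t] = np.zeros(n_states)
    policy[t] = np.zeros(n_states, dtype=int)
    for x in range(n_states):
        q = [step_cost * abs(u) + V[t + 1][f(x, u)] for u in controls]
        policy[t][x] = int(np.argmin(q))   # minimizer mu*_t(x_t)
        V[t][x] = min(q)                   # Bellman equation

print(V[0])   # V_0(x_0): optimal cost-to-go from each initial state
```

The returned `policy` is the layer-wise (here: time-wise) feedback policy π*; the exponential cost of enumerating all states is exactly the curse of dimensionality discussed next.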

Algorithm 1 Differential Dynamic Programming

1: Input: ū ≡ {u_t}_{t=0}^{T-1}, x̄ ≡ {x_t}_{t=0}^{T}
2: Set V_x^T = ∇_x φ and V_xx^T = ∇²_x φ
3: for t = T-1 to 0 do
4:   Compute δu*_t(δx_t) using V_x^{t+1}, V_xx^{t+1} (Eq. 3, 4)
5:   Compute V_x^t and V_xx^t using Eq. 5
6: end for
7: Set x̂_0 = x_0
8: for t = 0 to T-1 do
9:   u*_t = u_t + δu*_t(δx_t), where δx_t = x̂_t - x_t
10:  x̂_{t+1} = f_t(x̂_t, u*_t)
11: end for
12: ū ← {u*_t}_{t=0}^{T-1}

Algorithm 2 Back-propagation (BP) with GD

1: Input: ū ≡ {u_t}_{t=0}^{T-1}, x̄ ≡ {x_t}_{t=0}^{T}, learning rate η
2: Set p_T ≡ ∇_{x_T} J_T = ∇_x φ
3: for t = T-1 to 0 do
4:   δu*_t = -η ∇_{u_t} J_t = -η (ℓ_u^t + f_u^{tT} p_{t+1})
5:   p_t ≡ ∇_{x_t} J_t = f_x^{tT} p_{t+1}
6: end for
7: for t = 0 to T-1 do
8:   u*_t = u_t + δu*_t
9: end for
10: ū ← {u*_t}_{t=0}^{T-1}

Hereafter we refer to Q_t(x_t, u_t) as the Bellman objective. The Bellman principle recasts minimization over a control sequence as a sequence of minimizations over each control. The value function V_t summarizes the optimal cost-to-go at each stage, provided all subsequent stages are also minimized.

Differential Dynamic Programming (DDP). Despite providing sufficient conditions for OCP, solving Eq. 1 for high-dimensional problems is infeasible, a difficulty well known as the Bellman curse of dimensionality. To mitigate the computational burden of the minimization involved at each stage, one can approximate the Bellman objective in Eq. 1 with its second-order Taylor expansion. Such an approximation is central to DDP, a second-order trajectory optimization method that inherits a similar Bellman optimality structure while being computationally efficient. Alg. 1 summarizes the DDP algorithm. Given a nominal trajectory (x̄, ū) with its initial cost J, DDP iteratively optimizes the objective value, where each iteration consists of a backward pass (lines 2-6) and a forward pass (lines 7-11).
During the backward pass, DDP performs a second-order expansion of the Bellman objective Q_t at each stage and computes the update through the following minimization:

    δu*_t(δx_t) = argmin_{δu_t(δx_t) ∈ Γ_{δx_t}}  (1/2) [1, δx_t, δu_t]^T [ [0, Q_x^{tT}, Q_u^{tT}]; [Q_x^t, Q_xx^t, Q_xu^t]; [Q_u^t, Q_ux^t, Q_uu^t] ] [1, δx_t, δu_t],    (2)

where

    Q_x^t = ℓ_x^t + f_x^{tT} V_x^{t+1},
    Q_u^t = ℓ_u^t + f_u^{tT} V_x^{t+1},
    Q_xx^t = ℓ_xx^t + f_x^{tT} V_xx^{t+1} f_x^t + V_x^{t+1} • f_xx^t,
    Q_uu^t = ℓ_uu^t + f_u^{tT} V_xx^{t+1} f_u^t + V_x^{t+1} • f_uu^t,
    Q_ux^t = ℓ_ux^t + f_u^{tT} V_xx^{t+1} f_x^t + V_x^{t+1} • f_ux^t.    (3)

We note that all derivatives in Eq. 3 are evaluated at the state-control pair (x_t, u_t) at time t along the nominal trajectory. The derivatives of Q_t follow the standard chain rule, and the dot notation represents the product of a vector with a 3D tensor. Γ_{δx_t} = {b_t + A_t δx_t : b_t ∈ R^m, A_t ∈ R^{m×n}} denotes the set of all affine mappings of δx_t. The analytic solution to Eq. 2 admits a linear form given by

    δu*_t(δx_t) = k_t + K_t δx_t, where k_t ≡ -(Q_uu^t)^{-1} Q_u^t and K_t ≡ -(Q_uu^t)^{-1} Q_ux^t    (4)

denote the open-loop and feedback gains, respectively. δx_t is called the state differential, which will play an important role later in our analysis. Note that this policy is only optimal locally around the nominal trajectory, where the second-order approximation remains valid. Substituting Eq. 4 back into Eq. 2 gives the backward updates for V_x and V_xx:

    V_x^t = Q_x^t - Q_ux^{tT} (Q_uu^t)^{-1} Q_u^t, and V_xx^t = Q_xx^t - Q_ux^{tT} (Q_uu^t)^{-1} Q_ux^t.    (5)

In the forward pass, DDP applies the feedback policy sequentially from the initial time step while keeping track of the state differential between the newly simulated trajectory and the nominal trajectory.

3. DIFFERENTIAL DYNAMIC PROGRAMMING NEURAL OPTIMIZER
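To make the backward/forward structure of Alg. 1 and Eqs. 3-5 concrete, below is a minimal sketch of one DDP iteration on an assumed linear-quadratic problem (where the quadratic expansion is exact, so a single backward/forward pass already recovers the LQR solution); all problem data is illustrative:

```python
import numpy as np

# One DDP iteration (Alg. 1, Eqs. 3-5) on an illustrative linear-quadratic
# problem: f(x,u) = A x + B u, running cost 0.5(x'Qx + u'Ru), terminal 0.5 x'Qf x.
n, m, T = 2, 1, 10
A, B = np.array([[1., .1], [0., 1.]]), np.array([[0.], [.1]])
Q, R, Qf = np.eye(n), np.eye(m), 10 * np.eye(n)

def f(x, u): return A @ x + B @ u

# nominal trajectory: zero controls
x0 = np.array([1., 0.])
us = [np.zeros(m) for _ in range(T)]
xs = [x0]
for t in range(T):
    xs.append(f(xs[t], us[t]))

# backward pass (Eq. 3, 4, 5 with linear dynamics, so f_xx = f_uu = 0)
Vx, Vxx = Qf @ xs[T], Qf
ks, Ks = [None] * T, [None] * T
for t in reversed(range(T)):
    Qx, Qu = Q @ xs[t] + A.T @ Vx, R @ us[t] + B.T @ Vx
    Qxx, Quu, Qux = Q + A.T @ Vxx @ A, R + B.T @ Vxx @ B, B.T @ Vxx @ A
    ks[t] = -np.linalg.solve(Quu, Qu)       # open-loop gain k_t
    Ks[t] = -np.linalg.solve(Quu, Qux)      # feedback gain  K_t
    Vx = Qx + Qux.T @ ks[t]                 # Eq. 5 (equivalent form)
    Vxx = Qxx + Qux.T @ Ks[t]

# forward pass with the feedback policy applied to the state differential
x_new, cost = x0.copy(), 0.0
for t in range(T):
    u = us[t] + ks[t] + Ks[t] @ (x_new - xs[t])
    cost += 0.5 * (x_new @ Q @ x_new + u @ R @ u)
    x_new = f(x_new, u)
cost += 0.5 * x_new @ Qf @ x_new
print(round(cost, 4))   # lower than the zero-control nominal cost of 10.0
```

For nonlinear dynamics, the same two passes are simply repeated until the trajectory converges.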

3.1. TRAINING DNNS AS TRAJECTORY OPTIMIZATION

Recall that DNNs can be interpreted as dynamical systems in which each layer is viewed as a distinct time step. Consider e.g. the propagation rule in feedforward layers,

    x_{t+1} = σ_t(h_t),  h_t = g_t(x_t, u_t) = W_t x_t + b_t.    (6)

x_t ∈ R^{n_t} and x_{t+1} ∈ R^{n_{t+1}} represent the activation vectors at layers t and t+1, with h_t ∈ R^{n_{t+1}} being the pre-activation vector. σ_t and g_t respectively denote the nonlinear activation function and the affine transform parametrized by the vectorized weight u_t ≡ [vec(W_t), b_t]^T. Eq. 6 can be seen as a dynamical system (by setting f_t ≡ σ_t ∘ g_t in OCP) propagating the activation vector x_t using u_t. Next, notice that the gradient descent (GD) update, denoted δū* ≡ -η∇_ū J with η being the learning rate, can be broken down layer by layer, i.e. δū* ≡ {δu*_t}_{t=0}^{T-1}, and computed backward by

    δu*_t = argmin_{δu_t ∈ R^{m_t}} { J_t + ∇_{u_t}J_t^T δu_t + (1/2) δu_t^T ((1/η) I_t) δu_t },    (7)

where

    J_t(x_t, u_t) ≡ ℓ_t(u_t) + J_{t+1}(f_t(x_t, u_t), u_{t+1}),  J_T(x_T) ≡ φ(x_T)    (8)

is the per-layer objective at layer t. It can be readily verified that p_t ≡ ∇_{x_t} J_t gives the exact Back-propagation dynamics. Eq. 8 suggests that GD minimizes the quadratic expansion of J_t with the Hessian ∇²_{u_t} J_t replaced by (1/η) I_t. Similarly, adaptive first-order methods, such as RMSprop and Adam, approximate the Hessian with the diagonal of the covariance matrix. Second-order methods, such as KFAC and EKFAC (Martens & Grosse, 2015; George et al., 2018), compute full matrices using the Gauss-Newton (GN) approximation:

    ∇²_u J_t ≈ E[J_{u_t} J_{u_t}^T] = E[(x_t ⊗ J_{h_t})(x_t ⊗ J_{h_t})^T] ≈ E[x_t x_t^T] ⊗ E[J_{h_t} J_{h_t}^T].    (9)

We now draw a novel connection between the training procedure of DNNs and DDP. Let us first summarize Back-propagation (BP) with gradient descent in Alg. 2 and compare it with DDP (Alg. 1). At each training iteration, we treat the current weights as the controls ū that simulate the activation sequence x̄.
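The Kronecker step in Eq. 9 rests on an identity that holds exactly per sample; only the factorization of the expectation is an approximation. A quick numerical check with illustrative dimensions:

```python
import numpy as np

# Per-sample identity behind the KFAC approximation in Eq. 9: for one sample,
# (x (kron) J_h)(x (kron) J_h)^T = (x x^T) (kron) (J_h J_h^T) holds exactly.
# Only the step E[. (kron) .] ~= E[.] (kron) E[.] is an approximation.
rng = np.random.default_rng(0)
x = rng.standard_normal(3)       # layer input (activation)
g = rng.standard_normal(2)       # pre-activation gradient J_h

Ju = np.kron(x, g)               # per-sample weight gradient in vec convention
lhs = np.outer(Ju, Ju)           # (x (kron) J_h)(x (kron) J_h)^T
rhs = np.kron(np.outer(x, x), np.outer(g, g))
print(np.allclose(lhs, rhs))     # True: the identity is exact per sample
```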
Starting from this nominal trajectory (x̄, ū), both algorithms recursively define layer-wise objectives (J_t in Eq. 8 vs V_t in Eq. 1), compute the weight/control update from quadratic expansions (Eq. 7 vs Eq. 2), and then carry certain information (∇_{x_t}J_t vs (V_x^t, V_xx^t)) backward to the preceding layer. The computation graphs of the two approaches are summarized in Fig. 1. In the following proposition, we make this connection formal and provide conditions under which the two algorithms become equivalent.

Proposition 2. Assume Q_ux^t = 0 at all stages; then the backward dynamics of the value derivatives can be described by Back-propagation:

    ∀t,  V_x^t = ∇_{x_t}J,  Q_u^t = ∇_{u_t}J,  Q_uu^t = ∇²_{u_t}J.    (10)

In this case, the DDP policy is equivalent to stage-wise Newton, in which the gradient is preconditioned by the block-wise inverse Hessian at each layer:

    k_t + K_t δx_t = -(∇²_{u_t}J)^{-1} ∇_{u_t}J.    (11)

If further we have Q_uu^t ≈ (1/η) I, then DDP degenerates to Back-propagation with gradient descent.

Table 2: Precondition matrix and update direction for each method.

    Method        | Precondition M_t                                   | Update direction
    SGD           | I_t                                                | E[J_{u_t}]
    RMSprop       | diag(√(E[J_{u_t} J_{u_t}]) + ε)                     | E[J_{u_t}]
    KFAC & EKFAC  | E[x_t x_t^T] ⊗ E[J_{h_t} J_{h_t}^T]                 | E[J_{u_t}]
    vanilla DDP   | E[Q_uu^t]                                          | E[Q_u^t + Q_ux^t δx_t]
    DDPNOpt       | M_t ∈ { I_t, diag(√(E[Q_u^t Q_u^t]) + ε), E[x_t x_t^T] ⊗ E[V_h^t V_h^{tT}] } | E[Q_u^t + Q_ux^t δx_t]

Evaluating derivatives of Q_t with layer dynamics. The primary computation in DDPNOpt comes from constructing the derivatives of Q_t at each layer. When the dynamics is represented by the layer propagation (recall Eq. 6, where we set f_t ≡ σ_t ∘ g_t), we can rewrite Eq. 3 as:

    Q_x^t = g_x^{tT} V_h^t,  Q_u^t = ℓ_u^t + g_u^{tT} V_h^t,  Q_xx^t = g_x^{tT} V_hh^t g_x^t,  Q_ux^t = g_u^{tT} V_hh^t g_x^t,    (12)

where V_h^t ≡ σ_h^{tT} V_x^{t+1} and V_hh^t ≡ σ_h^{tT} V_xx^{t+1} σ_h^t absorb the computation of the non-parametrized activation function σ. Note that Eq. 12 expands the dynamics only up to first order, i.e.
we omit the tensor products that involve second-order expansions of the dynamics, as the stability obtained by keeping only the linearized dynamics is thoroughly discussed and widely adopted in practical DDP usage (Todorov & Li, 2005). The matrix-vector products with the Jacobians of the affine transform (i.e. g_u^t, g_x^t) can be evaluated efficiently for both feedforward (FF) and convolution (Conv) layers:

    h_t^FF = W_t x_t + b_t  ⇒  g_x^{tT} V_h^t = W_t^T V_h^t,  g_u^{tT} V_h^t = x_t ⊗ V_h^t,    (13)
    h_t^Conv = ω_t * x_t    ⇒  g_x^{tT} V_h^t = ω_t^T ⋆ V_h^t,  g_u^{tT} V_h^t = x_t ⋆ V_h^t,

where ⊗, *, and ⋆ respectively denote the Kronecker product and the (de-)convolution operators.

Curvature approximation. Next, since DNNs are highly over-parametrized models, u_t (i.e. the layer weight) lives in a high-dimensional space. This makes Q_uu^t and (Q_uu^t)^{-1} computationally intractable, and approximation is thus required. Recall the interpretation drawn in Eq. 8, where existing optimizers differ in how they approximate the Hessian ∇²_{u_t}J_t. DDPNOpt adapts the same curvature approximations to Q_uu^t. For instance, we can approximate Q_uu^t simply with an identity matrix I_t, an adaptive diagonal matrix diag(√E[Q_u^t Q_u^t]), or the GN matrix:

    Q_uu^t ≈ E[Q_u^t Q_u^{tT}] = E[(x_t ⊗ V_h^t)(x_t ⊗ V_h^t)^T] ≈ E[x_t x_t^T] ⊗ E[V_h^t V_h^{tT}].    (14)

Table 2 summarizes the differences in curvature approximation (i.e. the precondition M_t) across methods. Note that DDPNOpt constructs these approximations using (V, Q) rather than J, since the two frameworks consider different layer-wise objectives. As a direct implication of Proposition 2, DDPNOpt degenerates to the optimizer whose curvature approximation it adapts whenever all Q_ux^t vanish.

Outer-product factorization.
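The feedforward shortcut in Eq. 13 can be sanity-checked against an explicitly built Jacobian; the sketch below assumes a column-major vec(W) convention and computes the Jacobian by finite differences (exact here, since h is linear in u):

```python
import numpy as np

# Checking the FF case of Eq. 13: with h = W x + b and u = [vec(W); b]
# (column-major vec assumed), g_u^T V_h reduces to [x (kron) V_h; V_h],
# with no explicit Jacobian ever formed.
rng = np.random.default_rng(1)
n_in, n_out = 4, 3
W, b = rng.standard_normal((n_out, n_in)), rng.standard_normal(n_out)
x, Vh = rng.standard_normal(n_in), rng.standard_normal(n_out)

def h(u):                                   # layer map as a function of u
    Wm = u[:n_out * n_in].reshape(n_in, n_out).T   # undo column-major vec
    return Wm @ x + u[n_out * n_in:]

u0 = np.concatenate([W.T.ravel(), b])       # column-major vec(W), then b
eps = 1e-6                                  # finite-difference Jacobian g_u
gu = np.stack([(h(u0 + eps * e) - h(u0)) / eps
               for e in np.eye(u0.size)], axis=1)

lhs = gu.T @ Vh                             # explicit Jacobian-vector product
rhs = np.concatenate([np.kron(x, Vh), Vh])  # Eq. 13 shortcut (plus bias block)
print(np.allclose(lhs, rhs, atol=1e-4))     # True
```

The shortcut turns an O(n_in n_out) × n_out Jacobian product into a single Kronecker product, which is what keeps the per-layer cost of Eq. 12 manageable.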
When memory efficiency becomes non-negligible as the problem scales, we make a GN approximation to ∇²φ, since a low-rank structure at the prediction layer has been observed for the problems concerned in this work (Nar et al., 2019; Lezama et al., 2018). In the following proposition, we show that for a specific type of OCP, which happens to be the case of DNN training, such a low-rank structure is preserved throughout the DDP backward pass.

Proposition 3 (Outer-product factorization in DDPNOpt). Consider the OCP where ℓ_t ≡ ℓ_t(u_t) is independent of x_t. If the terminal-stage Hessian can be expressed as the outer product of a vector z_x^T, i.e. ∇²φ(x_T) = z_x^T ⊗ z_x^T (for instance, z_x^T = ∇φ for GN), then we have the factorization for all t:

    Q_ux^t = q_u^t ⊗ q_x^t,  Q_xx^t = q_x^t ⊗ q_x^t,  V_xx^t = z_x^t ⊗ z_x^t.    (16)

q_u^t, q_x^t, and z_x^t are outer-product vectors which are also computed along the backward pass:

    q_u^t = f_u^{tT} z_x^{t+1},  q_x^t = f_x^{tT} z_x^{t+1},  z_x^t = √(1 - q_u^{tT} (Q_uu^t)^{-1} q_u^t) q_x^t.    (17)

The derivation is left to Appendix A.3. In other words, the outer-product factorization at the final layer can be propagated backward to all preceding layers. Thus, large matrices such as Q_ux^t, Q_xx^t, V_xx^t, and even the feedback policies K_t, can be factorized accordingly, greatly reducing the complexity.

Algorithm 3 DDPNOpt

4:   Sample a batch of initial states from the dataset, X_0 ≡ {x_0^(i)}_{i=1}^B ∼ D
5:   Forward propagate to generate the nominal batch trajectory X_t    ▷ Forward simulation
6:   Set V_x^T(i) = ∇_{x^(i)} Φ(x_T^(i)) and V_xx^T(i) = ∇²_{x^(i)} Φ(x_T^(i))
7:   for t = T-1 to 0 do    ▷ Backward Bellman pass
8:     Compute Q_u^t, Q_x^t, Q_xx^t, Q_ux^t with Eq. 12 (or Eq. 16-17 if factorization is used)
9:     Compute E[Q_uu^t] with one of the precondition matrices in Table 2
10:    Store the layer-wise feedback policy δu*_t(δX_t) = (1/B) Σ_{i=1}^B (k_t^(i) + K_t^(i) δx_t^(i))
11:    Compute V_x^t(i) and V_xx^t(i) with Eq. 5 (or Eq. 16-17 if factorization is used)
12:    V_xx^t(i) ← V_xx^t(i) + ε_Vxx I_t if regularization is used
13:  end for
14:  Set x̂_0^(i) = x_0^(i)
15:  for t = 0 to T-1 do    ▷ Additional forward pass
16:    u*_t = u_t + δu*_t(δX_t), where δX_t = {x̂_t^(i) - x_t^(i)}_{i=1}^B
17:    x̂_{t+1}^(i) = f_t(x̂_t^(i), u*_t)
18:  end for
19:  ū^(k+1) ← {u*_t}_{t=0}^{T-1}
20: end for

Regularization on V_xx. Finally, we apply Tikhonov regularization to the value Hessian V_xx^t (line 12 in Alg. 3). This can be seen as placing a quadratic state cost and has been shown to improve stability when optimizing complex humanoid behaviors (Tassa et al., 2012). In the DNN application, where the dimension of the state (i.e. the vectorized activation) varies across the forward/backward pass, the Tikhonov regularization prevents the value Hessian from becoming low-rank (through g_u^{tT} V_hh^t g_x^t); hence we also observe a similar stabilization effect in practice.
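The rank-one propagation in Proposition 3 can be verified numerically; the sketch below compares the full backward update of V_xx (Eq. 5) against the vector recursion of Eq. 17 on random linearized dynamics with ℓ_uu = λI (all data illustrative):

```python
import numpy as np

# Numerical check of Proposition 3: when l_t depends only on u_t and the
# terminal Hessian is a rank-one outer product, V_xx stays rank-one through
# Eq. 5. Random linearized dynamics and l_uu = lam*I are made up here.
rng = np.random.default_rng(2)
n, m, T, lam = 5, 4, 6, 0.1
z = rng.standard_normal(n)               # V_xx^T = z z^T at the final layer
Vxx = np.outer(z, z)
for t in reversed(range(T)):
    fx = rng.standard_normal((n, n))     # linearized dynamics f_x, f_u
    fu = rng.standard_normal((n, m))
    # full backward update (Eq. 3 and Eq. 5, first-order dynamics expansion)
    Quu = lam * np.eye(m) + fu.T @ Vxx @ fu
    Qxx = fx.T @ Vxx @ fx
    Qux = fu.T @ Vxx @ fx
    Vxx_next = Qxx - Qux.T @ np.linalg.solve(Quu, Qux)
    # factorized update (Eq. 17): only n- and m-dimensional vectors propagate
    qx, qu = fx.T @ z, fu.T @ z
    s = qu @ np.linalg.solve(lam * np.eye(m) + np.outer(qu, qu), qu)
    z = np.sqrt(1.0 - s) * qx            # z_x^t = sqrt(1 - q_u' Quu^-1 q_u) q_x
    assert np.allclose(np.outer(z, z), Vxx_next)
    Vxx = Vxx_next
print("rank-one structure of V_xx preserved across all layers")
```

Note that s = q_u^T Q_uu^{-1} q_u < 1 whenever ℓ_uu is positive definite, so the square root in Eq. 17 is well defined.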

4. THE ROLE OF FEEDBACK POLICIES

DDPNOpt differs from existing methods in its use of the feedback K_t and the state differential δx_t. The presence of these terms results in a distinct backward pass that inherits the Bellman optimality. As shown in Table 2, the two frameworks differ in computing the update directions d_t, where the Bellman formulation applies the feedback policy through an additional forward pass with δx_t. We built the connection between these two d_t in Proposition 2. In this section, we further characterize the role of the feedback policy K_t and the state differential δx_t during optimization. First, we discuss the relation of DDPNOpt to other second-order methods and highlight the role of feedback during training. To do so, let us consider the example in Fig. 2a. Given an objective L expanded at (x_0, u_0), standard second-order methods compute the Hessian w.r.t. u and then apply the update δu = -L_uu^{-1} L_u (shown as green arrows). DDPNOpt differs in that it also computes the mixed partial derivatives, i.e. L_ux. The resulting update law has the same intercept but carries an additional feedback term linear in δx (shown as red arrows). Thus, DDPNOpt searches for an update in the space of affine mappings Γ_{δx_t} (Eq. 2), rather than the vector space R^{m_t} (Eq. 7). Next, to show how the state differential δx_t arises during optimization, notice from Alg. 1 that x̂_t can be compactly expressed as x̂_t = F_t(x_0, ū + δū*(δx̄)). Therefore, δx_t = x̂_t - x_t captures the state difference once the new updates δū*(δx̄) have been applied up to layer t-1. Now, consider the 2D example in Fig. 2b. Back-propagation proposes update directions (shown as blue arrows) from the first-order derivatives expanded along the nominal trajectory (x̄, ū). However, as the weights of all layers are coupled, parameter updates from previous layers δū*_s affect the succeeding states {x_t : t > s}, and thus the trustworthiness of their descent directions.
As shown in Fig. 2c, cascading these (green) updates may cause an over-shoot w.r.t. the designed target. From the trajectory optimization perspective, a much stabler direction is instead ∇_{u_t}J_t(x̂_t, u_t) (shown as orange), where the derivative is evaluated at the updated state x̂_t. The feedback term approximates this correction after observing δx_t:

    K_t δx_t ≈ argmin_{δu_t(δx_t) ∈ Γ_{δx_t}} ‖ ∇_{u_t}J(x̂_t, u_t + δu_t(δx_t)) - ∇_{u_t}J(x_t, u_t) ‖.    (18)

Thus, the feedback direction compensates for the over-shoot by steering the GD update toward ∇_{u_t}J_t(x̂_t, u_t) after observing δx_t. The difference between ∇_{u_t}J(x̂_t, u_t) and ∇_{u_t}J(x_t, u_t) cannot be neglected, especially during early training, when the loss landscape contains nontrivial curvature everywhere (Alain et al., 2019). In short, the use of the feedback K_t and the state differential δx_t arises from the fact that deep nets exhibit chain structures. DDPNOpt's feedback policies thus have a stabilization effect, robustifying the training dynamics against e.g. improper hyper-parameters which may otherwise cause unstable training. This perspective (i.e. optimizing chained parameters) is explored rigorously in trajectory optimization, where DDP is shown to be numerically stabler than direct optimization such as Newton's method (Liao & Shoemaker, 1992).

Remarks on other optimizers. Our discussion so far rigorously explores the connection between DDP and stage/layer-wise Newton, and thus covers many popular second-order training methods. The general Newton method coincides with DDP only for linear dynamics (Murray & Yakowitz, 1984), although the two share the same convergence rate when the dynamics is fully expanded to second order. We note that computing layer-wise value Hessians with only a first-order expansion of the dynamics (Eq. 12) resembles the computation in the Gauss-Newton method (Botev et al., 2017). Other control-theoretic methods, e.g. PID optimizers (An et al., 2018), mostly consider the dynamics over training iterations; DDPNOpt instead focuses on the dynamics inherent in the DNN architecture.
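The over-shoot argument can be made concrete on a toy quadratic: for a fixed state shift δx, the feedback-corrected step -L_uu^{-1}(L_u + L_ux δx) is the exact minimizer over δu, whereas the open-loop Newton step is optimal only at δx = 0. A sketch (random quadratic, not the paper's loss):

```python
import numpy as np

# Toy version of the Fig. 2a discussion: for a quadratic L(x, u), the Newton
# step du = -Luu^{-1} Lu is optimal only if x stays put. When x shifts by dx,
# the feedback term -Luu^{-1} Lux dx recovers the exact minimizer over du.
rng = np.random.default_rng(3)
n = 3
M = rng.standard_normal((2 * n, 2 * n))
H = M @ M.T + 0.5 * np.eye(2 * n)       # PD Hessian of L over (x, u)
g = rng.standard_normal(2 * n)          # gradient at the expansion point
Lx, Lu = g[:n], g[n:]
Luu, Lux = H[n:, n:], H[n:, :n]

def L(dx, du):                          # quadratic model of the objective
    d = np.concatenate([dx, du])
    return g @ d + 0.5 * d @ H @ d

dx = 0.3 * rng.standard_normal(n)       # state differential from earlier layers
du_newton = -np.linalg.solve(Luu, Lu)               # open-loop Newton
du_ddp = -np.linalg.solve(Luu, Lu + Lux @ dx)       # k + K dx (Eq. 4 analogue)
print(L(dx, du_ddp) <= L(dx, du_newton))            # True: feedback never hurts
```

Since du_ddp zeroes the gradient of L(dx, ·) exactly, the feedback update is never worse than the open-loop one for any realized δx.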

5.1. PERFORMANCE ON CLASSIFICATION DATASET

Networks & Baselines Setup. We first validate the performance of DDPNOpt on training fully-connected (FCN) and convolution networks (CNN) on classification datasets. The FCN consists of 5 fully-connected layers with hidden dimensions ranging from 10 to 32, depending on the size of the dataset. The CNN consists of 4 convolution layers (with 3×3 kernels and 32 channels), followed by 2 fully-connected layers. We use ReLU activations on all datasets except Tanh for WINE and DIGITS, to better distinguish the differences between optimizers. The batch size is set to 8-32 for datasets trained with the FCN, and 128 for datasets trained with the CNN. As DDPNOpt combines strengths from both standard training methods and the OCP framework, we select baselines from both sides. These include first-order methods, i.e. SGD (with tuned momentum), RMSprop, and Adam, and the second-order method EKFAC (George et al., 2018), a recent extension of the popular KFAC (Martens & Grosse, 2015). For OCP-inspired methods, we compare DDPNOpt with vanilla DDP and E-MSA (Li et al., 2017), which is also a second-order method but built upon the PMP framework. Regarding the curvature approximation used in DDPNOpt (M_t in Table 2), we found that using adaptive diagonal and GN matrices, respectively for FCNs and CNNs, gives the best performance in practice. We leave the complete experiment setup and additional results to Appendix A.6. Training Results. Table 3 presents the results over 10 random trials. It is clear that DDPNOpt outperforms the two OCP baselines on all datasets and network types. In practice, both baselines suffer from unstable training and require careful tuning of hyper-parameters. In fact, we were not able to obtain results for vanilla DDP with any reasonable amount of computational resources once the problem size went beyond FC networks.
This is in contrast to DDPNOpt, which adapts amortized curvature estimation from widely-used methods and thus exhibits much stabler training dynamics with superior convergence. In Table 4, we provide the analytic runtime and memory complexity of the different methods:

    Table 4: Complexity comparison.

              | BP-based methods | vanilla DDP | DDPNOpt
    Memory    | O(X²L)           | O(BX³L)     | O(X²L + BX)
    Speed     | O(BX²L)          | O(B³X³L)    | O(BX²L)

While vanilla DDP grows cubically w.r.t. BX, DDPNOpt reduces the computation by orders of magnitude with the efficient approximations presented in Sec. 3. As a result, when measuring actual computational performance with GPU parallelism, DDPNOpt runs nearly as fast as standard methods and outperforms E-MSA by a large margin. The additional memory complexity of DDP-inspired methods over Back-propagation methods comes from the layer-wise feedback policies; however, DDPNOpt is much more memory-efficient than vanilla DDP by exploiting the factorization in Proposition 3. Ablation Analysis. On the other hand, the performance gain of DDPNOpt over standard methods appears comparatively small. We conjecture this is due to the inevitable use of similar curvature adaptation, as the local geometry of the landscape directly affects the convergence behavior. To identify scenarios where DDPNOpt best shows its effectiveness, we conduct an ablation analysis on the feedback mechanism. This is done by recalling Proposition 2: when Q_ux^t vanishes, DDPNOpt degenerates to the method associated with each precondition matrix. For instance, DDPNOpt with identity (resp. adaptive diagonal and GN) precondition M_t generates the same updates as SGD (resp. RMSprop and EKFAC) when all Q_ux^t are zeroed out. In other words, these DDPNOpt variants can be viewed as DDP-extensions of the existing baselines. In Fig. 4a we report the performance difference between each baseline and its associated DDPNOpt variant.
Each grid cell corresponds to a distinct training configuration averaged over 10 random trials, and we keep all hyper-parameters (e.g. learning rate and weight decay) the same between the baselines and their DDPNOpt variants. Thus, the performance gap comes only from the feedback policies, or equivalently the update directions in Table 2. Blue (resp. red) indicates an improvement (resp. degradation) when the feedback policies are present. Clearly, the improvement over the baselines remains consistent across most hyper-parameter setups, and the performance gap tends to become more pronounced as the learning rate increases. This aligns with the previous study on numerical stability (Liao & Shoemaker, 1992), which suggests that feedback can stabilize the optimization when e.g. larger control updates are taken. Since a larger control corresponds to a larger step size in DNN training, one should expect DDPNOpt to show its robustness as the learning rate increases. As shown in Fig. 4b, such stabilization can also lead to smaller variance and faster convergence. This sheds light on the benefit gained by bridging two seemingly disconnected methodologies, DNN training and trajectory optimization.

5.2. DISCUSSION ON FEEDBACK POLICIES

Visualization of Feedback Policies. To understand the effect of the feedback policies more intuitively, in Fig. 5 we visualize the feedback policies learned when training CNNs. This is done by first conducting a singular-value decomposition of the feedback matrices K_t, then projecting the leading right-singular vector back to image space (see Alg. 4 and Fig. 7 in the Appendix for the pseudo-code). These feature maps, denoted δx_max in Fig. 5, correspond to the dominant differential image that the policy responds to during the weight update. Fig. 5 shows that the feedback policies indeed capture non-trivial visual features related to the pixel-wise differences between spatially similar classes, e.g. (8, 3) or (7, 1). These differential maps differ from adversarial perturbations (Goodfellow et al., 2014), as the former directly link the parameter update to changes in the activation and are thus more interpretable. Vanishing Gradient. Lastly, we present an interesting finding on how the feedback policies help mitigate vanishing gradients (VG), a notorious effect in which DNNs become impossible to train as gradients vanish along Back-propagation. Fig. 6a reports results on training a sigmoid-activated DNN on DIGITS. We select SGD-VGR, which imposes a specific regularization to mitigate VG (Pascanu et al., 2013), and EKFAC as our baselines. While both baselines fail to make any progress, DDPNOpt continues to generate non-trivial updates, as the state-dependent feedback, i.e. K_t δx_t, remains active. The effect becomes significant when the dynamics is fully expanded to second order. As shown in Fig. 6b, the update norm from DDPNOpt is typically 5-10 times larger. We note that in this experiment, we replace the cross-entropy (CE) loss with the Max-Mahalanobis center (MMC) loss, a new classification objective that improves robustness on standard vision datasets (Pang et al., 2019).
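A sketch of this visualization procedure (the paper's Alg. 4 is not reproduced here; the random K_t and the dimensions below are made up for illustration):

```python
import numpy as np

# Sketch of the visualization described above: SVD the feedback matrix K_t
# and take the leading right-singular vector as the dominant differential
# image dx_max the policy responds to. K and the sizes here are illustrative.
rng = np.random.default_rng(4)
m, side = 8, 28                          # weight dim x (assumed) 28x28 images
K = rng.standard_normal((m, side * side))
U, S, Vt = np.linalg.svd(K, full_matrices=False)
dx_max = Vt[0]                           # leading right-singular vector
# K responds most strongly along dx_max: ||K dx|| over unit dx peaks at S[0]
assert np.isclose(np.linalg.norm(K @ dx_max), S[0])
img = dx_max.reshape(side, side)         # project back to image space
print(img.shape)
```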
MMC casts classification as distributional regression, providing a denser Hessian and making the problem more similar to original trajectory optimization; none of the algorithms escapes VG when using CE. We highlight that while VG is typically mitigated at the architecture level, by having either unbounded activation functions or residual blocks, DDPNOpt provides an alternative from the algorithmic perspective.

6. CONCLUSION

A.1 PONTRYAGIN MAXIMUM PRINCIPLE

Theorem 4 (PMP (Pontryagin et al., 1962)). Let ū* be the optimal control trajectory for OCP and x̄* be the corresponding state trajectory. Then, there exists a co-state trajectory p̄* ≡ {p*_t}_{t=1}^T, such that

    x*_{t+1} = ∇_p H_t(x*_t, p*_{t+1}, u*_t),  x*_0 = x_0,    (19a)
    p*_t = ∇_x H_t(x*_t, p*_{t+1}, u*_t),  p*_T = ∇_x φ(x*_T),    (19b)
    u*_t = argmin_{v ∈ R^m} H_t(x*_t, p*_{t+1}, v),    (19c)

where H_t : R^n × R^n × R^m → R is the discrete-time Hamiltonian given by

    H_t(x_t, p_{t+1}, u_t) ≡ ℓ_t(x_t, u_t) + p_{t+1}^T f_t(x_t, u_t),    (20)

and Eq. 19b is called the adjoint equation. The discrete-time PMP theorem can be derived using the KKT conditions, in which the co-state p_t is equivalent to the Lagrange multiplier. Note that the solution to Eq. 19c is an open-loop process in the sense that it does not depend on state variables. This is in contrast to the Dynamic Programming principle, in which a feedback policy is considered. It is natural to ask whether the necessary conditions in the PMP theorem relate to first-order optimization methods in DNN training. This is indeed the case, as pointed out in Li et al. (2017):

Lemma 5 (Li et al. (2017)). Back-propagation satisfies Eq. 19b, and gradient descent iteratively solves Eq. 19c.

Lemma 5 follows by first expanding the derivative of the Hamiltonian w.r.t. x_t:

    ∇_{x_t} H_t(x_t, p_{t+1}, u_t) = ∇_{x_t} ℓ_t(x_t, u_t) + ∇_{x_t} f_t(x_t, u_t)^T p_{t+1} = ∇_{x_t} J(ū; x_0).

Thus, Eq. 19b is simply the chain rule used in Back-propagation. When H_t is differentiable w.r.t. u_t, one can attempt to solve Eq. 19c by iteratively taking gradient descent steps. This leads to

    u_t^(k+1) = u_t^(k) - η ∇_{u_t} H_t(x_t, p_{t+1}, u_t) = u_t^(k) - η ∇_{u_t} J(ū; x_0),    (21)

where k and η denote the update iteration and step size. Thus, existing optimization methods can be interpreted as iterative processes that match the PMP optimality conditions. Inspired by Lemma 5, Li et al. (2017) solve the Hamiltonian minimization in Eq. 19c with L-BFGS per layer and per training iteration.
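Lemma 5 can be checked numerically on a toy two-layer linear chain: the adjoint recursion (Eq. 19b) combined with ∇_u H_t reproduces the gradient of J, verified here against finite differences (all problem data illustrative):

```python
import numpy as np

# Numerical check of Lemma 5 on a two-layer linear "network": the adjoint
# recursion (Eq. 19b) plus dH_t/du_t reproduces dJ/du_t, verified against
# a finite-difference derivative. All data below is made up.
rng = np.random.default_rng(5)
n, T = 3, 2
Ws = [rng.standard_normal((n, n)) for _ in range(T)]   # u_t = W_t, f_t linear
x0 = rng.standard_normal(n)
target = rng.standard_normal(n)

def forward(Ws):
    xs = [x0]
    for W in Ws:
        xs.append(W @ xs[-1])
    return xs

def J(Ws):                               # l_t = 0, terminal phi only
    return 0.5 * np.sum((forward(Ws)[-1] - target) ** 2)

xs = forward(Ws)
p = xs[-1] - target                      # p_T = grad phi(x_T)
grads = [None] * T
for t in reversed(range(T)):
    grads[t] = np.outer(p, xs[t])        # dH_t/dW_t = p_{t+1} x_t^T
    p = Ws[t].T @ p                      # adjoint equation (Eq. 19b)

# finite-difference check on one entry of W_0
eps, (i, j) = 1e-6, (1, 2)
Wp = [W.copy() for W in Ws]; Wp[0][i, j] += eps
fd = (J(Wp) - J(Ws)) / eps
print(np.isclose(grads[0][i, j], fd, atol=1e-4))   # True
```

A gradient step on each `grads[t]` is then exactly the update of Eq. 21.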
As a result, we also regard E-MSA as a second-order method.

A.2 PROOF OF PROPOSITION 2

Proof. We first prove the following lemma, which connects the backward passes of the two frameworks in the degenerate case.

Lemma 6. Assume Q_ux^t = 0 at all stages. Then we have

V_x^t = ∇_{x_t} J and V_xx^t = ∇²_{x_t} J, ∀t. (25)

Proof. It is obvious that Eq. 25 holds at t = T. Now, assume the relation holds at t + 1 and observe that at time t, the backward pass takes the form

V_x^t = Q_x^t − Q_ux^{tT} (Q_uu^t)^{-1} Q_u^t = ℓ_x^t + f_x^{tT} ∇_{x_{t+1}} J = ∇_{x_t} J,
V_xx^t = Q_xx^t − Q_ux^{tT} (Q_uu^t)^{-1} Q_ux^t = ∇_{x_t} { ℓ_x^t + f_x^{tT} ∇_{x_{t+1}} J } = ∇²_{x_t} J,

where we recall J_t = ℓ_t + J_{t+1}(f_t) from Eq. 8. Now, Eq. 11 follows by substituting Eq. 25 into the definitions of Q_u^t and Q_uu^t:

Q_u^t = ℓ_u^t + f_u^{tT} V_x^{t+1} = ℓ_u^t + f_u^{tT} ∇_{x_{t+1}} J = ∇_{u_t} J,
Q_uu^t = ℓ_uu^t + f_u^{tT} V_xx^{t+1} f_u^t + V_x^{t+1} • f_uu^t = ℓ_uu^t + f_u^{tT} (∇²_{x_{t+1}} J) f_u^t + ∇_{x_{t+1}} J • f_uu^t = ∇_{u_t} { ℓ_u^t + f_u^{tT} ∇_{x_{t+1}} J } = ∇²_{u_t} J.

Consequently, the DDP feedback policy degenerates to a layer-wise Newton update.
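Lemma 6 admits a quick numerical check. The sketch below uses an assumed scalar dynamics x_{t+1} = tanh(u_t x_t) with terminal cost φ(x_T) = x_T² and no intermediate cost, and verifies that the degenerate backward pass (all Q_ux dropped) reproduces ∇_{x_0} J and ∇²_{x_0} J along the nominal trajectory:

```python
import numpy as np

def f(x, u):                 # toy scalar dynamics
    return np.tanh(u * x)

def J_from(x0, us):          # terminal cost only: phi(x_T) = x_T^2
    x = x0
    for u in us:
        x = f(x, u)
    return x ** 2

x0, us = 0.4, np.array([0.9, -0.6, 1.1])

xs = [x0]                    # forward pass along the nominal trajectory
for u in us:
    xs.append(f(xs[-1], u))

# degenerate DDP backward pass (Q_ux dropped): the value derivatives
# collapse to the plain state derivatives of J (Lemma 6)
Vx, Vxx = 2.0 * xs[-1], 2.0  # grad / Hessian of phi at x_T
for t in reversed(range(len(us))):
    th = np.tanh(us[t] * xs[t])
    fx = (1.0 - th ** 2) * us[t]                    # df/dx
    fxx = -2.0 * th * (1.0 - th ** 2) * us[t] ** 2  # d2f/dx2
    Vxx = fx * Vxx * fx + Vx * fxx                  # uses the old Vx
    Vx = fx * Vx

eps = 1e-4                   # finite-difference check at x_0
Jp, Jm, J0 = J_from(x0 + eps, us), J_from(x0 - eps, us), J_from(x0, us)
print(np.isclose(Vx, (Jp - Jm) / (2 * eps), atol=1e-6),
      np.isclose(Vxx, (Jp - 2 * J0 + Jm) / eps ** 2, atol=1e-4))
```

Both checks pass, confirming that with Q_ux = 0 the DDP value recursion is just the first- and second-order chain rule on J.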

A.3 PROOF OF PROPOSITION 3

Proof. We prove Proposition 3 by backward induction. Suppose at layer t + 1 we have V_xx^{t+1} = z_x^{t+1} ⊗ z_x^{t+1} and ℓ_t ≡ ℓ_t(u_t). Then Eq. 3 becomes

Q_xx^t = f_x^{tT} V_xx^{t+1} f_x^t = f_x^{tT} (z_x^{t+1} ⊗ z_x^{t+1}) f_x^t = (f_x^{tT} z_x^{t+1}) ⊗ (f_x^{tT} z_x^{t+1}),
Q_ux^t = f_u^{tT} V_xx^{t+1} f_x^t = f_u^{tT} (z_x^{t+1} ⊗ z_x^{t+1}) f_x^t = (f_u^{tT} z_x^{t+1}) ⊗ (f_x^{tT} z_x^{t+1}).

Setting q_x^t ≜ f_x^{tT} z_x^{t+1} and q_u^t ≜ f_u^{tT} z_x^{t+1} gives the first part of Proposition 3. Next, to show that the same factorization structure is preserved through the preceding layer, it suffices to show that V_xx^t = z_x^t ⊗ z_x^t for some vector z_x^t. This is indeed the case:

V_xx^t = Q_xx^t − Q_ux^{tT} (Q_uu^t)^{-1} Q_ux^t
= q_x^t ⊗ q_x^t − (q_u^t ⊗ q_x^t)^T (Q_uu^t)^{-1} (q_u^t ⊗ q_x^t)
= q_x^t ⊗ q_x^t − (q_u^{tT} (Q_uu^t)^{-1} q_u^t)(q_x^t ⊗ q_x^t),

where the last equality follows by observing that q_u^{tT} (Q_uu^t)^{-1} q_u^t is a scalar. Setting z_x^t = (1 − q_u^{tT} (Q_uu^t)^{-1} q_u^t)^{1/2} q_x^t gives the desired factorization.

A.4 DERIVATION OF EQ. 12

For notational simplicity, we drop the superscript t and denote V_x ≜ ∇_x V^{t+1}(x_{t+1}) as the derivative of the value function at the next state. Then

Q_u = ℓ_u + f_u^T V_x = ℓ_u + g_u^T σ_h^T V_x,
Q_uu = ℓ_uu + ∂/∂u { g_u^T σ_h^T V_x }
= ℓ_uu + g_u^T σ_h^T ∂/∂u{ V_x } + g_u^T (∂/∂u{ σ_h })^T V_x + (∂/∂u{ g_u })^T σ_h^T V_x
= ℓ_uu + g_u^T σ_h^T V_xx σ_h g_u + g_u^T (V_x • σ_hh) g_u + g_uu^T σ_h^T V_x
= ℓ_uu + g_u^T (V_hh + V_x • σ_hh) g_u + V_h • g_uu,

where the last equality follows by recalling V_h ≜ σ_h^T V_x and V_hh ≜ σ_h^T V_xx σ_h. Following a similar derivation, we have

Q_x = ℓ_x + g_x^T V_h,
Q_xx = ℓ_xx + g_x^T (V_hh + V_x • σ_hh) g_x + V_h • g_xx,
Q_ux = ℓ_ux + g_u^T (V_hh + V_x • σ_hh) g_x + V_h • g_ux.

Remarks. For feedforward networks, the computational overhead in Eq. 12 and Eq. 26 can be mitigated by leveraging their affine structure. Since g is bilinear in x_t and u_t, the tensors g_xx^t and g_uu^t vanish.
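The rank-1 propagation in the proof of Proposition 3 admits a quick numerical check. The matrices below are random stand-ins for f_x^t, f_u^t, and a positive-definite Q_uu^t (illustrative assumptions only, chosen so the scalar q_u^T (Q_uu)^{-1} q_u lies in (0, 1)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 4
z_next = rng.standard_normal(n)            # V^{t+1}_xx = z (x) z  (rank-1)
fx = rng.standard_normal((n, n))           # stand-in for f^t_x
fu = rng.standard_normal((n, m))           # stand-in for f^t_u

qx = fx.T @ z_next                         # q^t_x = f_x^T z
qu = fu.T @ z_next                         # q^t_u = f_u^T z
Quu = 0.1 * np.eye(m) + np.outer(qu, qu)   # l_uu + f_u^T V_xx f_u (SPD)

# full Riccati-style update vs. its rank-1 factorization
Qxx = np.outer(qx, qx)
Qux = np.outer(qu, qx)
Vxx_full = Qxx - Qux.T @ np.linalg.solve(Quu, Qux)

c = qu @ np.linalg.solve(Quu, qu)          # scalar q_u^T Quu^{-1} q_u
z = np.sqrt(1.0 - c) * qx                  # factor z^t_x of the next V_xx
print(np.allclose(np.outer(z, z), Vxx_full))   # True: rank-1 is preserved
```

With this Q_uu (identity plus a rank-1 term), Sherman-Morrison gives c = ‖q_u‖²/(0.1 + ‖q_u‖²) ∈ (0, 1), so the square root is well defined.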
The tensor g_ux^t admits a sparse structure, whose computation simplifies to

[g_ux^t]_{(i,j,k)} = 1 iff j = (k − 1) n_{t+1} + i, [V_h^t • g_ux^t]_{((k−1)n_{t+1} : k n_{t+1}, k)} = V_h^t.

For the coordinate-wise nonlinear transform, σ_h^t and σ_hh^t are a diagonal matrix and a diagonal tensor, respectively. In most learning instances, the stage-wise loss typically involves weight decay alone; thus the terms ℓ_x^t, ℓ_xx^t, ℓ_ux^t also vanish.

A.5 DERIVATION OF EQ. 18

Eq. 18 follows from the observation that the feedback policy K_t δx_t = −(Q_uu^t)^{-1} Q_ux^t δx_t is the minimizer of the following objective:

K_t δx_t = arg min_{δu_t(δx_t) ∈ Γ(δx_t)} ‖∇_{u_t} Q(x_t + δx_t, u_t + δu_t(δx_t)) − ∇_{u_t} Q(x_t, u_t)‖, (28)

where Γ(δx_t) denotes all affine mappings from δx_t to δu_t, and ‖•‖ can be any proper norm in Euclidean space. Eq. 28 follows from the first-order Taylor expansion of ∇_{u_t} Q(x_t + δx_t, u_t + δu_t):

∇_{u_t} Q(x_t + δx_t, u_t + δu_t) ≈ ∇_{u_t} Q(x_t, u_t) + Q_ux^t δx_t + Q_uu^t δu_t.

When Q = J, we arrive at Eq. 18. From Proposition 2, we know the equality holds when all Q_xu^s vanish for s > t. In other words, the approximation in Eq. 18 becomes an equality when all subsequent layer-wise objectives (s > t) are expanded only w.r.t. u_s.

A.6 EXPERIMENT DETAIL

A.6.1 SETUP

Classification Datasets. All networks in the classification experiments are composed of 5-6 layers. For the intermediate layers, we use the ReLU activation on all datasets, except Tanh on WINE and DIGITS. We use the identity mapping at the last prediction layer on all datasets except WINE, where we instead use a sigmoid to help distinguish the performance among optimizers. For feedforward networks, the dimension of the hidden state is set to 10-32. For CNNs, we use standard 3 × 3 convolution kernels. The batch size is set to 8-32 for datasets trained with feedforward networks, and to 128 for datasets trained with convolution networks.
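The sparse structure of g_ux^t described in the remarks of Appendix A.4 can be made concrete for a single affine layer g(x, u) = Wx with u = vec(W) in column-major order; 0-indexed, the condition j = (k − 1)n_{t+1} + i becomes j = k·n_out + i. The layer sizes below are arbitrary choices for illustration:

```python
import numpy as np

n_in, n_out = 3, 2
rng = np.random.default_rng(1)
x = rng.standard_normal(n_in)

# g(x, u) = W @ x with u = vec(W) column-major: u[k*n_out + i] = W[i, k],
# so d^2 g_i / (du_j dx_k) = 1  iff  j = k*n_out + i  (0-indexed)
g_ux = np.zeros((n_out, n_out * n_in, n_in))
for i in range(n_out):
    for k in range(n_in):
        g_ux[i, k * n_out + i, k] = 1.0

def g_u(xv):  # Jacobian dg/du: block k equals x[k] * I
    gu = np.zeros((n_out, n_out * n_in))
    for k in range(n_in):
        gu[:, k * n_out:(k + 1) * n_out] = xv[k] * np.eye(n_out)
    return gu

eps = 1e-6   # finite-difference check of the sparse tensor
ok = all(
    np.allclose((g_u(x + eps * np.eye(n_in)[k]) - g_u(x - eps * np.eye(n_in)[k]))
                / (2 * eps), g_ux[:, :, k])
    for k in range(n_in)
)

# V_h . g_ux stacks V_h into column k, rows k*n_out : (k+1)*n_out
V_h = rng.standard_normal(n_out)
VhG = np.einsum('i,ijk->jk', V_h, g_ux)
ok2 = all(np.allclose(VhG[k * n_out:(k + 1) * n_out, k], V_h) for k in range(n_in))
print(ok, ok2)
```

Both checks pass, matching the indicator formula and the block pattern of V_h^t • g_ux^t stated above.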
For each baseline, we select its hyper-parameters from an appropriate search space, detailed in Table 5. We use the implementation at https://github.com/Thrandis/EKFAC-pytorch for EKFAC and implement our own E-MSA in PyTorch, since the official code released by Li et al. (2017) does not support GPUs. We impose the GN factorization presented in Proposition 3 for all CNN training. Regarding machine information, we conduct our experiments on GTX 1080 TI, RTX TITAN, and four Tesla V100 SXM2 16GB GPUs.

Procedure to Generate Fig. 5. First, we perform standard DDPNOpt steps to compute the layer-wise policies. Next, we conduct singular-value decomposition on the feedback matrix (k_t, K_t). The leading right-singular vector corresponds to the dominant state direction that the feedback policy responds to. Since this vector has the same dimension as the hidden state, which is generally not the same as the image space, we project it back to image space using the techniques proposed in Zeiler & Fergus (2014). The pseudo-code and computation diagram are included in Alg. 4 and Fig. 7.

Batch trajectory optimization on synthetic datasets. One of the differences between DNN training and trajectory optimization is that, for the former, we aim to find a single control law that drives every data point in the training set, or sampled batch, to its designated target. Despite seeming trivial from the ML perspective, this is a distinct formulation from OCP, since the optimal policy typically varies with the initial state. As such, we validate the performance of DDPNOpt on batch trajectory optimization over a synthetic dataset, where we sample data from k ∈ {5, 8, 12, 15} Gaussian clusters in R^30.
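As an idealized sanity check for this synthetic setup, one can ask what spectrum the prediction-layer feedback policy would have if it were an exact sum of k class-wise rank-1 policies. This toy construction (random vectors, not a trained network) yields exactly k nontrivial singular values, matching the number of clusters:

```python
import numpy as np

rng = np.random.default_rng(3)
hidden, k = 30, 5                 # hidden dim matches R^30; k clusters
# idealized feedback matrix: a sum of k class-wise rank-1 policies
K = sum(np.outer(rng.standard_normal(hidden), rng.standard_normal(hidden))
        for _ in range(k))

spec = np.linalg.svd(K, compute_uv=False)   # returned in descending order
n_nontrivial = int((spec > 1e-8 * spec[0]).sum())
print(n_nontrivial)               # 5 -- matches the number of classes
```

A sum of k generic rank-1 matrices has rank exactly k (for k ≤ 30), so only k singular values survive; the same counting argument motivates reading the spectra in Fig. 8.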
Since a DNN classifier can conceptually be thought of as a dynamical system guiding the trajectories of samples toward target regions belonging to their classes, we hypothesize that, for DDPNOpt to be effective in batch training, the feedback policy must act as an ensemble policy that combines the locally optimal policies of each class. Fig. 8 shows the spectrum distribution, sorted in descending order, of the feedback policy at the prediction layer. The result shows that the number of nontrivial eigenvalues matches exactly the number of classes in each setup (indicated by the vertical dashed line). As the distribution at the prediction layer concentrates into k bulks through training, the eigenvalues also increase, providing stronger feedback to the weight update.

Ablation analysis on Adam. Fig. 9 reports the ablation analysis on Adam using the same setup as in Fig. 4a, i.e. we keep all hyper-parameters the same for each experiment so that the performance

More experiments on vanishing gradient. Recall that Fig. 6 reports the training performance using the MMC loss on sigmoid-activated networks. In Fig. 10a, we report the result when training the same networks with the CE loss instead (note the numerical differences in the y axis for the different objectives). None of the presented optimizers was able to escape from the vanishing gradient, as evidenced by the vanishing update magnitude. On the other hand, changing to ReLU-activated networks eliminates the vanishing gradient, as shown in Fig. 10b. Fig. 11 reports the performance with other first-order adaptive optimizers, including Adam and RMSprop. In general, adaptive first-order optimizers are more likely to escape from the vanishing gradient, since the diagonal preconditioning matrix (recall M_t = E[J_{u_t} J_{u_t}] in Table 2) rescales the vanishing update to a fixed norm. However, as shown in Fig.
11, DDPNOpt* (the variant of DDPNOpt that utilizes a similar adaptive first-order preconditioning matrix) converges faster than these adaptive baselines. Fig. 12 illustrates the learning-rate selection process behind Fig. 6. The training performance of both SGD-VGR and EKFAC remains unchanged when tuning the learning rate.



Hereafter we drop x_t in all ℓ_t(•), as the layer-wise loss typically involves weight regularization alone. F_t ≜ f_t ∘ ⋯ ∘ f_0 denotes the compositional dynamics propagating x_0 with the control sequence {u_s}_{s=0}^{t}.



Figure 1: DDP backward propagates the value derivatives (V_x, V_xx) instead of ∇_{x_t} J and updates the weights using the layer-wise feedback policy, δu_t*(δx_t), with an additional forward propagation. The proof is left to Appendix A.2. Proposition 2 states that the backward pass in DDP collapses to BP when Q_ux vanishes at all stages. In other words, existing training methods can be seen as special cases of DDP in which the mixed derivatives (i.e. ∇_{x_t u_t}) of the layer-wise objective are discarded.
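For reference, the backward pass summarized in Figure 1 can be sketched in plain NumPy under simplifying assumptions: the dynamics-Hessian contraction terms (V_x • f_xx etc.) are dropped, a small Tikhonov term keeps Q_uu invertible, and all dimensions and inputs are toy choices:

```python
import numpy as np

def ddp_backward(fx, fu, lx, lu, lxx, luu, lux, phi_x, phi_xx, reg=1e-6):
    """One DDP backward sweep; returns layer-wise policies (k_t, K_t) with
    du*_t = k_t + K_t dx_t. Dynamics-Hessian terms are omitted for brevity."""
    Vx, Vxx = phi_x, phi_xx
    policies = []
    for t in reversed(range(len(fx))):
        Qx = lx[t] + fx[t].T @ Vx
        Qu = lu[t] + fu[t].T @ Vx
        Qxx = lxx[t] + fx[t].T @ Vxx @ fx[t]
        Quu = luu[t] + fu[t].T @ Vxx @ fu[t] + reg * np.eye(len(lu[t]))
        Qux = lux[t] + fu[t].T @ Vxx @ fx[t]
        k = -np.linalg.solve(Quu, Qu)          # open-loop update
        K = -np.linalg.solve(Quu, Qux)         # feedback gain
        policies.append((k, K))
        # value-function recursion (standard DDP form)
        Vx = Qx + K.T @ Quu @ k + K.T @ Qu + Qux.T @ k
        Vxx = Qxx + K.T @ Quu @ K + K.T @ Qux + Qux.T @ K
    return policies[::-1]

# toy usage: T = 2 layers, state dim 3, control dim 2
rng = np.random.default_rng(4)
T, n, m = 2, 3, 2
fx = [rng.standard_normal((n, n)) for _ in range(T)]
fu = [rng.standard_normal((n, m)) for _ in range(T)]
pols = ddp_backward(fx, fu,
                    [np.zeros(n)] * T, [np.zeros(m)] * T,
                    [np.eye(n)] * T, [np.eye(m)] * T,
                    [np.zeros((m, n))] * T,
                    np.ones(n), np.eye(n))
print(len(pols), pols[0][1].shape)   # 2 (2, 3)
```

Setting Q_ux = 0 in this sweep zeroes the feedback gains K_t and reduces the recursion to the back-propagation form of Proposition 2.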

Figure 2: (a) A toy illustration of the standard update (green) and the DDP feedback (red). The DDP policy in this case is a line lying at the valley of objective L. (bc) Trajectory optimization viewpoint of DNN training. Green and orange arrows represent the proposed updates from GD and DDP.

Figure 3: Runtime comparison on MNIST.

Figure 4: (a) Performance difference between DDPNOpt and baselines on DIGITS across hyperparameter grid. Blue (resp. red) indicates an improvement (resp. degradation) over baselines. We observe similar behaviors on other datasets. (b) Examples of the actual training dynamics.

Figure 5: Visualization of the feedback policies on MNIST.

Figure 7: Pictorial illustration for Alg. 4.

Figure 8: Spectrum distribution on synthetic dataset.

Terminology mapping

Update rule at each layer t: u_t ← u_t − η M_t^{-1} d_t. (Expectation taken over batch data.)

Algorithm 3 Differential Dynamic Programming Neural Optimizer (DDPNOpt)

Performance comparison on accuracy (%). All values averaged over 10 seeds.

Computational complexity in backward pass. (B: batch size, X: hidden state dim., L: # of layers)

Weinan E, Jiequn Han, and Qianxiao Li. A mean-field optimal control formulation of deep learning. arXiv preprint arXiv:1807.01083, 2018.

Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818-833. Springer, 2014.

A.1 CONNECTION BETWEEN PONTRYAGIN MAXIMUM PRINCIPLE AND DNN TRAINING

The development of optimality conditions for OCP dates back to the 1960s, characterized by both the Pontryagin Maximum Principle (PMP) and Dynamic Programming (DP). Here we review the PMP theorem and its connection to training DNNs.

Theorem 4 (Discrete-time PMP (Pontryagin

proposed a PMP-inspired method, named Extended Method of Successive Approximations (E-MSA), which solves the following augmented Hamiltonian:

H̃_t(x_t, p_{t+1}, u_t, x_{t+1}, p_t) ≜ H_t(x_t, p_{t+1}, u_t) + (ρ/2) ‖x_{t+1} − f_t(x_t, u_t)‖² + (ρ/2) ‖p_t − ∇_x H_t(x_t, p_{t+1}, u_t)‖², (24)

i.e., H̃_t is the original Hamiltonian augmented with feasibility penalties on both the forward states and backward co-states. E-MSA solves the minimization min_{u_t} H̃_t(x_t, p_{t+1}, u_t, x_{t+1}, p_t)
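A toy sketch of this augmentation, with the penalty form assumed quadratic with weight ρ following Li et al. (2017) and a hypothetical scalar dynamics: on any feasible pair (forward state satisfying the dynamics, co-state satisfying the adjoint equation) the penalties vanish and the augmented Hamiltonian reduces to the original one.

```python
import numpy as np

rho = 1.0                    # penalty weight (assumed)

def f(x, u):                 # toy scalar dynamics
    return np.tanh(u * x)

def H(x, p_next, u):         # Hamiltonian with no intermediate cost
    return p_next * f(x, u)

def dH_dx(x, p_next, u, eps=1e-6):
    return (H(x + eps, p_next, u) - H(x - eps, p_next, u)) / (2 * eps)

def H_aug(x, p_next, u, x_next, p):
    """Hamiltonian plus quadratic feasibility penalties on forward state
    and backward co-state (form assumed, after Li et al. (2017))."""
    return (H(x, p_next, u)
            + 0.5 * rho * (x_next - f(x, u)) ** 2
            + 0.5 * rho * (p - dH_dx(x, p_next, u)) ** 2)

# on a feasible pair, both penalties are zero and H_aug == H
x, u, p_next = 0.3, 0.7, 1.2
val = H_aug(x, p_next, u, f(x, u), dH_dx(x, p_next, u))
print(np.isclose(val, H(x, p_next, u)))   # True
```

Minimizing H̃_t over u_t therefore trades off decreasing the Hamiltonian against staying close to trajectories that satisfy Eqs. 19a and 19b.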

Hyper-parameter search

Accuracy (%) under different V_xx regularization values ε_Vxx (ε_Vxx denotes the V_xx regularization), comparing SGD against DDPNOpt with M_t = I_t; ε_Vxx is swept over {5 × 10⁻⁵, 1 × 10⁻⁴, 5 × 10⁻⁴, 1 × 10⁻³}, {1 × 10⁻⁹, 1 × 10⁻⁸, 5 × 10⁻⁶, 1 × 10⁻⁵}, and {1 × 10⁻⁷, 5 × 10⁻⁷, 5 × 10⁻⁶, 1 × 10⁻⁵}.

ACKNOWLEDGMENTS

The authors would like to thank Chen-Hsuan Lin, Yunpeng Pan, Yen-Cheng Liu, and Chia-Wen Kuo for many helpful discussions on the paper. This research was supported by NSF Award Number 1932288.


difference only comes from the existence of the feedback policies. It is clear that the improvements from the feedback policies remain consistent for the Adam optimizer.

Ablation analysis on DIGITS compared with best-tuned baselines. Fig. 4 reports the performance difference between the baselines and DDPNOpt under different hyper-parameter setups. Here, we report the numerical values when each baseline uses its best-tuned learning rate (the values reported in Table 3) and compare with its DDPNOpt counterpart using the same learning rate. As shown in Tables 6, 7, and 8, in most cases extending a baseline to the Bellman framework improves its performance.

Numerical absolute values in the ablation analysis (DIGITS). Fig. 4a reports the relative performance between each baseline and its DDPNOpt counterpart under different learning-rate and regularization setups. In Tables 9 and 10, we report the absolute numerical values of this experiment. For instance, the upper-left-most grid cell in Fig. 4a, i.e. the training-loss difference between DDPNOpt and SGD with learning rate 0.4 and V_xx regularization 5 × 10⁻⁵, corresponds to 0.1974 − 0.1662 in Table 9. All values in these tables are averaged over 10 seeds.

