A ROBUST FUEL OPTIMIZATION STRATEGY FOR HYBRID ELECTRIC VEHICLES: A DEEP REINFORCEMENT LEARNING BASED CONTINUOUS TIME DESIGN APPROACH

Abstract

This paper deals with the fuel optimization problem for hybrid electric vehicles in a reinforcement learning framework. First, considering the hybrid electric vehicle as a completely observable non-linear system with uncertain dynamics, we solve an open-loop deterministic optimization problem to determine a nominal optimal state trajectory. We then design a deep reinforcement learning based optimal controller for the non-linear system using a concurrent learning based system identifier, such that the actual states and the control policy track the optimal state and optimal policy autonomously, even in the presence of external disturbances, modeling errors, uncertainties and noise, while significantly reducing the computational complexity. This is in sharp contrast to conventional methods like PID and Model Predictive Control (MPC), as well as traditional RL approaches like ADP, DDP and DQN, which mostly depend on a set of pre-defined rules and provide sub-optimal solutions under similar conditions. The low value of the H-infinity (H∞) performance index of the proposed optimization algorithm addresses the robustness issue. The proposed optimization technique is compared with traditional fuel optimization strategies for hybrid electric vehicles to illustrate the efficacy of the proposed method.

1. INTRODUCTION

Hybrid electric vehicles powered by fuel cells and batteries have attracted great enthusiasm in recent years, as they have the potential to eliminate emissions from the transport sector. However, both fuel cells and batteries present several operational challenges that make the separate use of each of them in automotive systems quite impractical. HEVs and PHEVs powered by conventional diesel engines and batteries merely reduce emissions, but cannot eliminate them completely. The drawbacks include carbon emission causing environmental pollution, long charging times, limited driving distance per charge, and non-availability of charging stations along the driving route for the batteries. Fuel Cell powered Hybrid Electric Vehicles (FCHEVs), powered by fuel cells and batteries, offer emission-free operation while overcoming the limitations of driving distance per charge and long charging times, and so FCHEVs have gained significant attention in recent years. Much of the existing research has studied and developed several types of Fuel and Energy Management Systems (FEMS) for transport applications: Sulaiman et al. (2018) presented a critical review of different energy and fuel management strategies for FCHEVs, and Li et al. (2017) presented an extensive review of FMS objectives and strategies for FCHEVs. These strategies can be divided into two groups, i.e., model-based and model-free. The model-based methods mostly depend on the discretization of the state space and therefore suffer from the inherent curse of dimensionality: the computational complexity increases exponentially with the dimension of the state space.
This is quite evident in methods like state-based EMS (Jan et al., 2014; Zadeh et al., 2014; 2016), rule-based fuzzy logic strategies (Motapon et al., 2014), classical PI and PID strategies (Segura et al., 2012), Pontryagin's minimum principle (PMP) (Zheng et al., 2013; 2014), model predictive control (MPC) (Kim et al., 2007; Torreglosa et al., 2014) and differential dynamic programming (DDP) (Kim et al., 2007). Of all these methods, differential dynamic programming is considered the most computationally efficient, as it relies on the linearization of the non-linear system equations about a nominal state trajectory followed by a policy iteration to improve the policy. In this approach, the control policy for fuel optimization is used to compute the optimal trajectory, and the policy is updated until convergence is achieved. The model-free methods mostly involve Adaptive Dynamic Programming (Bithmead et al., 1991; Zhong et al., 2014) and Reinforcement Learning (RL) based strategies (Mitrovic et al., 2010; Khan et al., 2012), including DDP (Mayne et al., 1970). These methods compute the control policy for fuel optimization by continuous engagement with the environment and measurement of the system response, thus arriving at a solution of the DP equation recursively in an online fashion. In deep reinforcement learning, multi-layer neural networks are used to represent the learning function in a non-linear parameterized approximation form. Although a compact parameterized form of the learning function does exist, the inability to know it a priori leaves the method subject to the curse of dimensionality (O(d²), where d is the dimension of the state space), making it infeasible to apply to a high-dimensional fuel management system.
The computational complexity of traditional RL methods like policy iteration (PI) and value iteration (VI) (Bellman et al., 1954; 2003; Barto et al., 1983; Bertsekas, 2007) can be overcome by a simulation-based approach (Sutton et al., 1998) in which the policy or the value function is parameterized with sufficient accuracy using a small number of parameters. This transforms the optimal control problem into an approximation problem in the parameter space (Bertsekas et al., 1996; Tsitsiklis et al., 2003; Konda et al., 2004), side-stepping the need for model knowledge and excessive computation. However, convergence requires sufficient exploration of the state-action space, and the optimality of the obtained policy depends primarily on the accuracy of the parameterization scheme. A good approximation of the value function is therefore of utmost importance for the stability of the closed-loop system, and it requires convergence of the unknown parameters to their optimal values. This sufficient-exploration condition manifests itself as a persistence of excitation (PE) condition when RL is implemented online (Mehta et al., 2009; Bhasin et al., 2013; Vrabie, 2010), which is impossible to guarantee a priori. Moreover, most traditional approaches for fuel optimization are unable to address the robustness issue. The methods described in the literature, including PID (Segura et al., 2012), Model Predictive Control (MPC) (Kim et al., 2007; Torreglosa et al., 2014) and Adaptive Dynamic Programming (Bithmead et al., 1991; Zhong et al., 2014), as well as the simulation-based RL strategies (Bertsekas et al., 1996; Tsitsiklis et al., 2003; Konda et al., 2004), suffer from the drawback of providing a sub-optimal solution in the presence of external disturbances and noise.
As a result, applying these methods to fuel optimization for hybrid electric vehicles, which are plagued by various disturbances in the form of sudden charge and fuel depletion, changes in the environment, and changes in the values of parameters such as the remaining useful life, internal resistance, voltage and temperature of the battery, is quite impractical. The fuel optimization problem for the hybrid electric vehicle has therefore been formulated as a fully observed stochastic Markov Decision Process (MDP). Instead of using Trajectory-optimized LQG (T-LQG) or Model Predictive Control (MPC), which provide a sub-optimal solution in the presence of disturbances and noise, we propose a deep reinforcement learning-based optimization strategy using concurrent learning (CL) that uses state-derivative-action-reward tuples to obtain a robust optimal solution. The convergence of the weight estimates of the policy and the value function to their optimal values justifies our claim. The two major contributions of the proposed approach can be summarized as follows: 1) The popular methods in the RL literature, including policy iteration and value iteration, suffer from the curse of dimensionality owing to the use of a simulation-based technique that requires sufficient exploration of the state space (the PE condition). The proposed model-based RL scheme relaxes the PE condition by using a concurrent learning (CL)-based system identifier, thereby reducing the computational complexity. Generally, an estimate of the true controller designed using the CL-based method introduces an approximation error that makes the stability analysis of the system quite intractable. The proposed method, however, establishes the stability of the closed-loop system by introducing the estimation error and analyzing the augmented system trajectory obtained under the influence of the control signal.
2) The proposed optimization algorithm, implemented for fuel management in hybrid electric vehicles, overcomes the limitations of the conventional fuel management approaches (PID, Model Predictive Control, ECMS, PMP) and traditional RL approaches (Adaptive Dynamic Programming, DDP, DQN), all of which suffer from sub-optimal behaviour in the presence of external disturbances, model uncertainties, frequent charging and discharging, changes of environment and other noise. The H-infinity (H∞) performance index, which bounds the ratio of the output energy to the disturbance energy, has been established for the RL-based optimization technique and compared with the traditional strategies to address the robustness of the proposed design scheme. The rest of the paper is organised as follows: Section 2 presents the problem formulation, including the open-loop optimization and the reinforcement learning-based optimal controller design, described in subsections 2.1 and 2.2 respectively. The parametric system identification and value function approximation are detailed in subsections 2.2.1 and 2.2.2. This is followed by the stability and robustness analysis (using the H-infinity (H∞) performance index) of the closed-loop system in subsection 2.2.4. Section 3 provides the simulation results and discussion, followed by the conclusion in Section 4.

2. PROBLEM FORMULATION

Considering the fuel management system of a hybrid electric vehicle as a continuous-time affine non-linear dynamical system:

ẋ = f(x, w) + g(x)u,  y = h(x, v)  (1)

where x ∈ R^nx, y ∈ R^ny and u ∈ R^nu are the state, output and control vectors respectively, f(·) denotes the drift dynamics and g(·) denotes the control effectiveness matrix. The functions f and h are assumed to be locally Lipschitz continuous, with f(0) = 0 and ∇f(x) continuous for every bounded x ∈ R^nx. The process noise w and measurement noise v are assumed to be zero-mean, uncorrelated Gaussian white noise with covariances W and V, respectively.

Assumption 1: We consider the system to be fully observed: y = h(x, v) = x  (2)

Remark 1: This assumption provides a tractable formulation of the fuel management problem and side-steps the complex treatment that is required when a stochastic control problem is treated as a partially observed MDP (POMDP).

Optimal Control Problem: For a continuous-time system with unknown non-linear dynamics f(·), we need to find an optimal control policy π_t over a finite time horizon [0, t], where π_t is the control policy at time t such that π_t = u(t), to minimize the cost function

J = ∫₀ᵗ (x^T Q x + u^T R u) dτ + x(t)^T F x(t)

where Q, F > 0 and R > 0.
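As a concrete illustration, the finite-horizon quadratic cost above can be evaluated numerically from sampled state and control trajectories. The sketch below (the function name and the Riemann-sum discretization are our own, not from the paper) approximates J for given Q, R and F:

```python
import numpy as np

def quadratic_cost(xs, us, Q, R, F, dt):
    """Riemann-sum approximation of J = ∫ (x^T Q x + u^T R u) dτ + x(t)^T F x(t).

    xs: (N+1, nx) sampled states, us: (N, nu) sampled controls, dt: step size."""
    running = sum(x @ Q @ x + u @ R @ u for x, u in zip(xs[:-1], us)) * dt
    terminal = xs[-1] @ F @ xs[-1]
    return running + terminal

# a trajectory held at the origin with zero control incurs zero cost
Q, R, F = np.eye(2), np.eye(1), np.eye(2)
print(quadratic_cost(np.zeros((11, 2)), np.zeros((10, 1)), Q, R, F, dt=0.1))  # → 0.0
```
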

2.1. OPEN LOOP OPTIMIZATION

Considering a noise-free non-linear dynamical system with unknown dynamics:

ẋ = f(x, 0) + g(x)u,  y = h(x, v) = x  (3)

where x₀ ∈ R^nx is the initial state, y ∈ R^ny and u ∈ R^nu are the output and control vectors respectively and f(·) has its usual meaning, the corresponding cost function is given by J_d(x₀, u_t) = ∫₀ᵗ (x^T Q x + u^T R u) dτ + x(t)^T F x(t).

Remark: We use a piecewise convex function to approximate the non-convex fuel function globally, and this approximation is used to formulate the cost function for fuel optimization.

The open-loop optimization problem is to find the control sequence ū_t such that, for a given initial state x₀, ū_t = arg min J_d(x₀, u_t), subject to ẋ = f(x, 0) + g(x)u, y = h(x, v) = x. The problem is solved using a gradient descent approach (Bryson et al., 1962; Gosavi et al., 2003), as follows. Starting from a random initial control sequence U⁽⁰⁾ = [u₀⁽⁰⁾, ..., u_t⁽⁰⁾], the control sequence is updated iteratively as

U⁽ⁿ⁺¹⁾ = U⁽ⁿ⁾ − α ∇_U J_d(x₀, U⁽ⁿ⁾)

until convergence is achieved up to a certain degree of accuracy, where U⁽ⁿ⁾ denotes the control sequence at the n-th iteration and α is the step-size parameter. The gradient vector is given by

∇_U J_d(x₀, U⁽ⁿ⁾) = (∂J_d/∂u₀, ∂J_d/∂u₁, ∂J_d/∂u₂, ..., ∂J_d/∂u_t)|₍x₀, u_t₎

The Gradient Descent Algorithm showing this approach is detailed in Appendix A.1.

Remark 2:

The open-loop optimization problem is thus solved using the gradient descent approach, treating the underlying system dynamics as a black box probed by a sequence of input-output tests, without perfect knowledge of the non-linearities in the model at design time. This method proves to be a simple and useful strategy for complex dynamical systems with complicated cost-to-go functions, and it is well suited for parallelization.

2.2. REINFORCEMENT LEARNING BASED OPTIMAL CONTROLLER DESIGN

Considering the affine non-linear dynamical system given by equation (1), our objective is to design a control law to track the optimal time-varying trajectory x_d(t) ∈ R^nx. A novel cost function is formulated in terms of the tracking error e = x(t) − x_d(t) and the control error, defined as the difference between the actual control signal and the desired optimal control signal. This formulation overcomes the infinite-cost problem that arises when the cost function is defined in terms of the tracking error e(t) and the actual control signal u(t) alone (Zhang et al., 2011; Kamalapurkar et al., 2015). The following assumptions are made to determine the desired steady-state control.

Assumption 2: (Kamalapurkar et al., 2015) The function g(x) in equation (1) is bounded, the matrix g(x) has full column rank for all x ∈ R^nx, and the function g⁺: R^n → R^{m×n} defined as g⁺ = (g^T g)⁻¹ g^T is bounded and locally Lipschitz.

Assumption 3: (Kamalapurkar et al., 2015) The desired trajectory is bounded by a known positive constant b ∈ R such that ||x_d|| ≤ b, and there exists a locally Lipschitz function h_d such that ẋ_d = h_d(x_d) and g(x_d) g⁺(x_d)(h_d(x_d) − f(x_d)) = h_d(x_d) − f(x_d).

Using Assumptions 2 and 3, the control signal u_d required to track the desired trajectory x_d(t) is given as u_d(x_d) = g_d⁺(h_d(x_d) − f_d), where f_d = f(x_d) and g_d⁺ = g⁺(x_d). The control error is given by µ = u(t) − u_d(x_d). The system dynamics can now be expressed as

ζ̇ = F(ζ) + G(ζ)µ  (7)

where the merged state ζ(t) ∈ R^{2n} is given by ζ = [e^T, x_d^T]^T and the functions F(ζ) and G(ζ) are defined as

F(ζ) = [f^T(e + x_d) − h_d^T + u_d^T(x_d) g^T(e + x_d), h_d^T]^T,  G(ζ) = [g^T(e + x_d), 0_{m×n}]^T

where 0_{m×n} denotes a matrix of zeros. The control error µ is hereafter treated as the design variable.
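The construction of the merged state and the steady-state control can be sketched in a few lines. The helper below is illustrative only: `f`, `g` and `h_d` are placeholder callables standing in for the (unknown) dynamics, and g⁺ is the left pseudoinverse from Assumption 2:

```python
import numpy as np

def tracking_setup(x, x_d, f, g, h_d):
    """Form the tracking error e, merged state ζ = [e^T, x_d^T]^T and the
    steady-state control u_d = g⁺(x_d)(h_d(x_d) − f(x_d)) from Assumptions 2-3.

    f, g and h_d are placeholder callables for the (unknown) dynamics."""
    e = x - x_d
    zeta = np.concatenate([e, x_d])
    g_d = g(x_d)                                 # full column rank by Assumption 2
    g_plus = np.linalg.inv(g_d.T @ g_d) @ g_d.T  # left pseudoinverse g⁺ = (gᵀg)⁻¹gᵀ
    u_d = g_plus @ (h_d(x_d) - f(x_d))
    return e, zeta, u_d
```
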
The control objective is to solve a finite-horizon optimal tracking problem online, i.e., to design a control signal µ that minimizes the cost-to-go function J(ζ, µ) = ∫₀ᵗ r(ζ(τ), µ(τ)) dτ while tracking the desired trajectory. Based on the assumption that an optimal policy exists, it can be characterized in terms of the value function V*: R^{2n} → R, defined as

V*(ζ) = min over µ(τ) ∈ U, τ ∈ [t, ∞) of ∫ r(φ^µ(τ; t, ζ), µ(τ)) dτ

where U ⊂ R^m is the action space and φ^µ(t; t₀, ζ₀) is the trajectory of the system defined by equation (7) under the control µ: R≥0 → R^m with initial condition ζ₀ ∈ R^{2n} at initial time t₀ ∈ R≥0. Assuming that an optimal policy exists and that V* is continuously differentiable everywhere, the closed-form solution (Kirk, 2004) is given as

µ*(ζ) = −½ R⁻¹ G^T(ζ) (∇_ζ V*(ζ))^T

where ∇_ζ(·) = ∂(·)/∂ζ. This satisfies the Hamilton-Jacobi-Bellman (HJB) equation (Kirk, 2004)

∇_ζ V*(ζ)(F(ζ) + G(ζ)µ*(ζ)) + Q̄(ζ) + µ*^T(ζ) R µ*(ζ) = 0  (8)

with initial condition V*(0) = 0, where the function Q̄: R^{2n} → R is defined as Q̄([e^T, x_d^T]^T) = Q(e) for (e(t), x_d(t)) ∈ R^n. Since a closed-form solution of the HJB equation is generally infeasible to obtain, we seek an approximate solution. An actor-critic based method is therefore used to obtain parametric estimates of the optimal value function and the optimal policy, V̂(ζ, Ŵ_c) and µ̂(ζ, Ŵ_a), where Ŵ_c ∈ R^L and Ŵ_a ∈ R^L are the parameter estimate vectors. The task of the actor and critic is to learn the corresponding parameters. Replacing the estimates V̂ and µ̂ for V* and µ* in the HJB equation, we obtain the residual error, known as the Bellman Error (BE):

δ(ζ, Ŵ_c, Ŵ_a) = Q̄(ζ) + µ̂^T(ζ, Ŵ_a) R µ̂(ζ, Ŵ_a) + ∇_ζ V̂(ζ, Ŵ_c)(F(ζ) + G(ζ)µ̂(ζ, Ŵ_a))

where δ: R^{2n} × R^L × R^L → R.
The solution of the problem requires the actor and the critic to find sets of parameters Ŵ_a and Ŵ_c such that δ(ζ, Ŵ_c, Ŵ_a) = 0 and µ̂(ζ, Ŵ_a) = −½ R⁻¹ G^T(ζ)(∇_ζ V̂(ζ, Ŵ_c))^T for all ζ ∈ R^{2n}. As the exact basis functions for the approximation are not known a priori, we instead seek a set of approximate parameters that minimizes the BE. However, a uniform approximation of the value function and the optimal control policy over the entire operating domain requires finding parameters that minimize the error E_s: R^L × R^L → R defined as E_s(Ŵ_c, Ŵ_a) = sup_ζ |δ(ζ, Ŵ_c, Ŵ_a)|, which makes exact knowledge of the system model necessary. Two of the most popular methods used to render the control design robust to system uncertainties in this context are integral RL (Lewis et al., 2012; Modares et al., 2014) and state-derivative estimation (Bhasin et al., 2013; Kamalapurkar et al., 2014). Both of these methods suffer from the persistence of excitation (PE) condition, which requires the state trajectory φ^µ̂(t; t₀, ζ₀) to cover the entire operating domain for the parameters to converge to their optimal values. We relax this condition by augmenting the integral technique with experience replay: every evaluation of the BE is intuitively formalized as a gained experience, and these experiences are kept in a history stack so that the learning algorithm can reuse them iteratively to improve data efficiency. To relax the PE condition, we have therefore developed a CL-based system identifier that models a parametric estimate of the system drift dynamics and simulates experience by extrapolating the Bellman Error (BE) over the unexplored territory in the operating domain, thereby promoting exponential convergence of the parameters to their optimal values.
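The history-stack idea can be captured by a minimal buffer that records each BE evaluation and replays it to the learner. This is only a sketch of the bookkeeping, not the paper's exact data-selection rule:

```python
from collections import deque

class HistoryStack:
    """Minimal experience buffer: each Bellman-error evaluation is stored as
    a (ζ, dζ/dt, µ) tuple and replayed to the learner, so recorded data can
    substitute for persistent excitation (a sketch of the bookkeeping only)."""
    def __init__(self, maxlen=50):
        self.data = deque(maxlen=maxlen)   # oldest experiences are evicted first

    def record(self, zeta, zeta_dot, mu):
        self.data.append((zeta, zeta_dot, mu))

    def replay(self):
        # iterate over the stored tuples so each experience is reused offline
        yield from self.data
```
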

2.2.1. PARAMETRIC SYSTEM IDENTIFICATION

On any compact set C ⊂ R^n, the drift dynamics f can be represented using a neural network (NN) as f(x) = θ^T σ_f(Y^T x₁) + ε_θ(x), where x₁ = [1, x^T]^T ∈ R^{n+1}, θ ∈ R^{(p+1)×n} and Y ∈ R^{(n+1)×p} denote the constant unknown output-layer and hidden-layer NN weights, σ_f: R^p → R^{p+1} denotes a bounded NN activation function, ε_θ: R^n → R^n is the function reconstruction error, and p ∈ N denotes the number of NN neurons. By the universal function approximation property of single-layer NNs, given a constant matrix Y such that the rows of σ_f(Y^T x₁) form a proper basis, there exist constant ideal weights θ and known constants θ̄, ε̄_θ, ε̄'_θ ∈ R such that ||θ|| ≤ θ̄ < ∞, sup_{x∈C} ||ε_θ(x)|| ≤ ε̄_θ and sup_{x∈C} ||∇_x ε_θ(x)|| ≤ ε̄'_θ, where ||·|| denotes the Euclidean norm for vectors and the Frobenius norm for matrices (Lewis et al., 1998). Given an estimate θ̂ ∈ R^{(p+1)×n} of the weight matrix θ, the function f can be approximated by the function f̂: R^{2n} × R^{(p+1)×n} → R^n defined as f̂(ζ, θ̂) = θ̂^T σ_θ(ζ), where σ_θ: R^{2n} → R^{p+1} is defined as σ_θ(ζ) = σ_f(Y^T [1, e^T + x_d^T]^T). An estimator for online identification of the drift dynamics is developed as

x̂̇ = θ̂^T σ_θ(ζ) + g(x)u + k x̃

where x̃ = x − x̂ and k ∈ R is a positive constant learning gain.

Assumption 4: A history stack containing recorded state-action pairs {x_j, u_j}, j = 1, ..., M, along with numerically computed state derivatives {ẋ̄_j}, that satisfies λ_min(Σ_{j=1}^M σ_fj σ_fj^T) = σ̲_θ > 0 and ||ẋ̄_j − ẋ_j|| < d̄ for all j, is available a priori, where σ_fj = σ_f(Y^T [1, x_j^T]^T), d̄ ∈ R is a known positive constant, ẋ_j = f(x_j) + g(x_j)u_j, and λ_min(·) denotes the minimum eigenvalue. The weight estimates θ̂ are updated using the following CL-based update law:

θ̂̇ = Γ_θ σ_f(Y^T x₁) x̃^T + k_θ Γ_θ Σ_{j=1}^M σ_fj (ẋ̄_j − g_j u_j − θ̂^T σ_fj)^T

where k_θ ∈ R is a constant positive CL gain, and Γ_θ ∈ R^{(p+1)×(p+1)} is a constant, diagonal, positive definite adaptation gain matrix.
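A discrete-time (Euler) sketch of the CL update law may make its structure clearer. All names here are illustrative; the history stack holds the tuples (σ_fj, ẋ̄_j, g_j u_j) assumed in Assumption 4:

```python
import numpy as np

def cl_identifier_step(theta_hat, sigma, x_tilde, stack, Gamma, k_theta, dt):
    """One Euler step of the CL update law
    θ̂ ← θ̂ + dt · Γ_θ [ σ_f x̃^T + k_θ Σ_j σ_fj (ẋ̄_j − g_j u_j − θ̂^T σ_fj)^T ].

    stack holds the recorded tuples (σ_fj, ẋ̄_j, g_j·u_j) of Assumption 4."""
    update = np.outer(sigma, x_tilde)              # instantaneous identification term
    for sigma_fj, xdot_j, gj_uj in stack:          # concurrent-learning replay term
        resid = xdot_j - gj_uj - theta_hat.T @ sigma_fj
        update += k_theta * np.outer(sigma_fj, resid)
    return theta_hat + dt * Gamma @ update
```

When the recorded regressors span the weight space (the rank condition of Assumption 4), repeated application of this step drives θ̂ to the true weights even with no instantaneous excitation (σ_f = 0, x̃ = 0).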
Using the identifier, the BE can be approximated as

δ̂(ζ, θ̂, Ŵ_c, Ŵ_a) = Q̄(ζ) + µ̂^T(ζ, Ŵ_a) R µ̂(ζ, Ŵ_a) + ∇_ζ V̂(ζ, Ŵ_c)(F_θ(ζ, θ̂) + F₁(ζ) + G(ζ)µ̂(ζ, Ŵ_a))  (12)

In equation (12),

F_θ(ζ, θ̂) = [(θ̂^T σ_θ(ζ) − g(x) g⁺(x_d) θ̂^T σ_θd)^T, 0_{1×n}]^T,  F₁(ζ) = [(−h_d + g(e + x_d) g⁺(x_d) h_d)^T, h_d^T]^T.

2.2.2. VALUE FUNCTION APPROXIMATION

As V* and µ* are functions of the state ζ, the optimization problem defined in Section 2.2 is intractable in its exact form, so the optimal value function is represented on a compact set C ⊂ R^{2n} using a NN as

V*(ζ) = W^T σ(ζ) + ε(ζ)

where W ∈ R^L is a vector of unknown NN weights, σ: R^{2n} → R^L is a bounded NN activation (basis) function and ε: R^{2n} → R is the function reconstruction error (Lewis et al., 1998). A NN representation of the optimal policy is obtained as

µ*(ζ) = −½ R⁻¹ G^T(ζ)(∇_ζ σ^T(ζ) W + ∇_ζ ε^T(ζ)).

Taking the estimates Ŵ_c and Ŵ_a for the ideal weights W, the optimal value function and the optimal policy are approximated as V̂(ζ, Ŵ_c) = Ŵ_c^T σ(ζ) and µ̂(ζ, Ŵ_a) = −½ R⁻¹ G^T(ζ) ∇_ζ σ^T(ζ) Ŵ_a. The optimal control problem is therefore recast as finding a set of weights Ŵ_c and Ŵ_a online that minimizes the error Ê_θ(Ŵ_c, Ŵ_a) = sup_{ζ∈χ} |δ̂(ζ, θ̂, Ŵ_c, Ŵ_a)| for a given θ̂, while simultaneously improving θ̂ using the CL-based update law and ensuring stability of the system using the control law

u = µ̂(ζ, Ŵ_a) + û_d(ζ, θ̂)

where û_d(ζ, θ̂) = g_d⁺(h_d − θ̂^T σ_θd) and σ_θd = σ_θ([0_{1×n}, x_d^T]^T). The error between u_d and û_d is included in the stability analysis, based on the fact that the error trajectories generated by the system ė = f(x) + g(x)u − ẋ_d under this controller are identical to the error trajectories generated by the system ζ̇ = F(ζ) + G(ζ)µ under the control law µ = µ̂(ζ, Ŵ_a) + g_d⁺ θ̃^T σ_θd + g_d⁺ ε_θd, where θ̃ = θ − θ̂ and ε_θd = ε_θ(x_d).
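Given a basis σ and its gradient, the approximate policy and the identifier-based Bellman error can be evaluated directly. The following is a sketch (the function name and argument layout are ours); `F_hat` stands in for F_θ(ζ, θ̂) + F₁(ζ):

```python
import numpy as np

def approx_policy_and_be(zeta, Wc, Wa, grad_sigma, F_hat, G, Q_e, R):
    """Evaluate µ̂ = −½ R⁻¹ G^T ∇σ^T Ŵ_a and the approximate Bellman error
    δ̂ = Q(e) + µ̂^T R µ̂ + Ŵ_c^T ∇σ (F̂ + G µ̂).

    grad_sigma(ζ) returns ∇_ζσ as an (L × 2n) matrix; F_hat stands in for
    F_θ(ζ, θ̂) + F₁(ζ)."""
    gs = grad_sigma(zeta)
    mu = -0.5 * np.linalg.solve(R, G.T @ gs.T @ Wa)
    delta = Q_e + mu @ R @ mu + Wc @ gs @ (F_hat + G @ mu)
    return mu, delta
```
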

2.2.3. EXPERIENCE SIMULATION

The simulation of experience is implemented by minimizing a squared sum of BEs over finitely many points in the state-space domain, since the calculation of the extremum (supremum) in Ê_θ is not tractable. The details of the analysis that facilitates this approximation are given in Appendix A.2.

2.2.4. STABILITY AND ROBUSTNESS ANALYSIS

To perform the stability analysis, we use the non-autonomous form of the value function (Kamalapurkar et al., 2015), V*_t: R^n × R → R, defined as V*_t(e, t) = V*([e^T, x_d^T(t)]^T) for all e ∈ R^n and t ∈ R, which is positive definite and decrescent. Now V*_t(0, t) = 0 for all t ∈ R, and there exist class K functions v̲: R → R and v̄: R → R such that v̲(||e||) ≤ V*_t(e, t) ≤ v̄(||e||) for all e ∈ R^n and all t ∈ R. An augmented state Z ∈ R^{2n+2L+n(p+1)} is defined as

Z = [e^T, W̃_c^T, W̃_a^T, x̃^T, (vec(θ̃))^T]^T

and a candidate Lyapunov function is defined as

V_L(Z, t) = V*_t(e, t) + ½ W̃_c^T Γ⁻¹ W̃_c + ½ W̃_a^T W̃_a + ½ x̃^T x̃ + ½ tr(θ̃^T Γ_θ⁻¹ θ̃)

where vec(·) denotes the vectorization operator. From the weight update law in Appendix A.2, there exist positive constants γ̲, γ̄ ∈ R such that γ̲ ≤ ||Γ⁻¹(t)|| ≤ γ̄ for all t ∈ R. Using the bounds on Γ and V*_t and the fact that tr(θ̃^T Γ_θ⁻¹ θ̃) = (vec(θ̃))^T (Γ_θ⁻¹ ⊗ I_{p+1}) (vec(θ̃)), the candidate Lyapunov function can be bounded as

v_l(||Z||) ≤ V_L(Z, t) ≤ v̄_l(||Z||)

for all Z ∈ R^{2n+2L+n(p+1)} and all t ∈ R, where v_l: R → R and v̄_l: R → R are class K functions. Now, using (1) and the fact that V*_t(e(t), t) = V*(ζ(t)) for all t ∈ R, the time derivative of the candidate Lyapunov function is given by

V̇_L = ∇_ζ V*(F + Gµ*) − W̃_c^T Γ⁻¹ Ŵ̇_c − ½ W̃_c^T Γ⁻¹ Γ̇ Γ⁻¹ W̃_c − W̃_a^T Ŵ̇_a + V̇₀ + ∇_ζ V* Gµ − ∇_ζ V* Gµ*

where V̇₀ collects the contribution of the x̃ and θ̃ terms. Under sufficient gain conditions (Kamalapurkar et al., 2014), using (9), (10)-(13), and the update laws for Ŵ_c, Γ and Ŵ_a, the time derivative of the candidate Lyapunov function can be bounded as

V̇_L ≤ −v_l(||Z||), ∀||Z|| ≥ v_l⁻¹(ι), ∀Z ∈ χ  (19)

where ι is a positive constant and χ ⊂ R^{2n+2L+n(p+1)} is a compact set. Considering (13) and (15), Theorem 4.18 in (Khalil, 2002) can be used to establish that every trajectory Z(t) satisfying ||Z(t₀)|| ≤ v̄_l⁻¹(v_l(ρ)), where ρ is a positive constant, is bounded for all t ∈ R and satisfies lim sup_{t→∞} ||Z(t)|| ≤ v_l⁻¹(v̄_l(v_l⁻¹(ι))).
The preceding analysis addresses the stability of the closed-loop system. The robustness criterion requires the algorithm to satisfy the following inequality (Gao et al., 2014) in the presence of external disturbances, with a pre-specified performance index γ known as the H-infinity (H∞) performance index:

∫₀ᵗ ||y(t)||² dt < γ² ∫₀ᵗ ||w(t)||² dt

where y(t) is the output of the system, w(t) accounts for the modeling errors, parameter uncertainties and external disturbances, and γ bounds the ratio of the output energy to the disturbance energy in the system.

Under review as a conference paper at ICLR 2021

Gao et al. (2014) have shown that if (22) and (23) are satisfied, then

0 < V_L(T) = ∫₀ᵗ V̇_L(t) dt ≤ −∫₀ᵗ y^T(t) y(t) dt + γ² ∫₀ᵗ w^T(t) w(t) dt  (22)

Thus, the performance inequality constraint in terms of γ is satisfied.
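In simulation, the H∞ index can be estimated empirically as the ratio of accumulated output energy to disturbance energy; a value below the prescribed γ indicates the inequality above held on the recorded run. A minimal sketch (our own helper, not from the paper):

```python
import numpy as np

def empirical_hinf_index(ys, ws, dt):
    """Empirical L2-gain estimate γ̂ = sqrt(∫‖y‖²dt / ∫‖w‖²dt) from logged
    output samples ys and disturbance samples ws; γ̂ < γ indicates the
    H∞ inequality held on this run."""
    num = sum(float(y @ y) for y in ys) * dt
    den = sum(float(w @ w) for w in ws) * dt
    return np.sqrt(num / den)
```
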

3. SIMULATION RESULTS AND DISCUSSION

Here we present simulation results to demonstrate the performance of the proposed method on the fuel management system of the hybrid electric vehicle. The proposed concurrent-learning-based RL optimization architecture is shown in Figure 1. In this architecture, the simulated state-action-derivative triplets drive the concurrent learning that approximates the value function weight estimates to minimize the Bellman error (BE). The history stack stores each evaluation of the Bellman error, carried out by a dynamic system identifier, as a gained experience so that it can be reused iteratively to reduce the computational burden. A simple two-dimensional model of the fuel management system is considered for the simulation, to provide a generalized solution that can be extended to high-dimensional systems. We consider the two-dimensional non-linear model

f = [x₁  x₂  0  0; 0  0  x₁  x₂(1 − (cos(2x₁ + 2))²)] · [a, b, c, d]^T,  g = [0, cos(2x₁ + 2)]^T,  w(t) = sin(t)

i.e., ẋ₁ = a x₁ + b x₂ and ẋ₂ = c x₁ + d x₂(1 − (cos(2x₁ + 2))²) + cos(2x₁ + 2)(u + w), with Q(x) = ½x₁² + ½x₂² and u*(x) = −cos(2x₁ + 2) x₂. The basis function σ: R² → R³ for value function approximation is σ = [x₁², x₁x₂, x₂²]; the ideal weights are W = [0.5, 0, 1]^T. The initial policy and value function weight estimates are Ŵ_c = Ŵ_a = [1, 1, 1]^T, the least-squares gain is Γ(0) = 100 I₃ₓ₃, and the initial system state is x(0) = [−1, −1]^T. The state estimate x̂ and the weight estimate θ̂ are initialized to 0 and 1 respectively, while the history stack for the CL is updated online. Figure 2 and Figure 3 show the state trajectories obtained by the proposed method.

4. CONCLUSION

In this paper, we have proposed a robust concurrent learning based deep RL optimization strategy for hybrid electric vehicles. The uniqueness of this method lies in the use of a concurrent learning based RL optimization strategy that significantly reduces computational complexity in comparison with the traditional RL approaches used for fuel management systems in the literature. Also, the use of the H-infinity (H∞) performance index for RL optimization, for the first time, addresses the robustness problems from which most fuel optimization methods suffer. The simulation results validate the efficacy of the method over conventional PID and MPC as well as traditional RL-based optimization techniques. Future work will generalize the approach to large-scale partially observed uncertain systems and will also incorporate the movement of neighbouring RL agents.

A APPENDIX

A.1 THE GRADIENT DESCENT ALGORITHM

Algorithm: Gradient Descent
Input: design parameters U⁽⁰⁾ = u_t⁽⁰⁾, α, h, ε ∈ R
Output: optimal control sequence {ū_t}
1. n ← 0, ∇_U J_d(x₀, U⁽⁰⁾) ← ε
2. while ||∇_U J_d(x₀, U⁽ⁿ⁾)|| ≥ ε do
3.   Evaluate the cost function with control U⁽ⁿ⁾
4.   Perturb each control variable u_i⁽ⁿ⁾ by h, i = 0, ..., t, and calculate the gradient vector ∇_U J_d(x₀, U⁽ⁿ⁾) using (7) and (8)
5.   Update the control policy: U⁽ⁿ⁺¹⁾ ← U⁽ⁿ⁾ − α ∇_U J_d(x₀, U⁽ⁿ⁾)
6.   n ← n + 1
7. end while; {ū_t} ← U⁽ⁿ⁾
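A minimal runnable version of this algorithm, using forward finite differences for step 4 (parameter names are illustrative):

```python
import numpy as np

def gradient_descent(J, U0, alpha=1e-2, h=1e-5, tol=1e-6, max_iter=500):
    """Finite-difference gradient descent on a control sequence (Algorithm A.1).
    J maps a control sequence U to the scalar cost J_d(x0, U)."""
    U = np.array(U0, dtype=float)
    for _ in range(max_iter):
        grad = np.zeros_like(U)
        JU = J(U)
        for i in range(U.size):            # step 4: perturb each u_i by h
            Up = U.copy()
            Up.flat[i] += h
            grad.flat[i] = (J(Up) - JU) / h
        if np.linalg.norm(grad) < tol:     # step 2: convergence test
            break
        U = U - alpha * grad               # step 5: U(n+1) = U(n) − α ∇J
    return U
```

For example, with the convex surrogate J(U) = Σ(uᵢ − 3)², the iterates converge to uᵢ ≈ 3.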



The cost-to-go is J(ζ, µ) = ∫₀ᵗ r(ζ(τ), µ(τ)) dτ, where the local cost r: R^{2n} × R^m → R is given as r(ζ, µ) = Q(e) + µ^T R µ, R ∈ R^{m×m} is a positive definite symmetric matrix, and Q: R^n → R is a continuous positive definite function.

where W ∈ R^L denotes a vector of unknown NN weights, σ: R^{2n} → R^L indicates a bounded NN activation function, ε: R^{2n} → R defines the function reconstruction error, and L ∈ N denotes the number of NN neurons. By the universal function approximation property of single-layer NNs, for any compact set C ⊂ R^{2n} there exist constant ideal weights W and known positive constants W̄, ε̄ and ε̄' ∈ R such that ||W|| ≤ W̄ < ∞, sup_{ζ∈C} ||ε(ζ)|| ≤ ε̄, and sup_{ζ∈C} ||∇_ζ ε(ζ)|| ≤ ε̄'.

Figure 1: Reinforcement Learning-based Optimization Architecture

(23) where a, b, c, d ∈ R are unknown constant parameters whose values are selected as a = −1, b = 1, c = −0.5, d = −0.5; x₁ and x₂ are the two states of the hybrid electric vehicle, given by the charge present in the battery and the amount of fuel in the car respectively; and w(t) = sin(t) is a sinusoidal signal used to model the external disturbance. The control objective is to minimize the cost function J(ζ, µ) = ∫₀ᵗ r(ζ(τ), µ(τ)) dτ, where the local cost r: R^{2n} × R^m → R is given as r(ζ, µ) = Q(e) + µ^T R µ, R ∈ R^{m×m} is a positive definite symmetric matrix and Q: R^n → R is a continuous positive definite function, while following the desired trajectory x_d. We choose Q = I₂ₓ₂ and R = 1. The optimal value function and optimal control for the system (15) are V*(x) = 0.5x₁² + x₂² and u*(x) = −cos(2x₁ + 2) x₂.
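For reference, the closed-loop behaviour of this two-state model under the stated optimal policy and the sinusoidal disturbance can be reproduced with a simple Euler simulation. This sketch makes one assumption not stated explicitly above: the disturbance w(t) enters through the control channel g(x):

```python
import numpy as np

# Euler simulation of the two-state model (x1: battery charge, x2: fuel level)
# under the stated optimal policy u*(x) = −cos(2x1+2)·x2, with the sinusoidal
# disturbance w(t) = sin t injected through the control channel (an assumption
# made here for illustration).
a, b, c, d = -1.0, 1.0, -0.5, -0.5

def f(x):
    x1, x2 = x
    return np.array([a * x1 + b * x2,
                     c * x1 + d * x2 * (1.0 - np.cos(2 * x1 + 2) ** 2)])

def g(x):
    return np.array([0.0, np.cos(2 * x[0] + 2)])

x, dt = np.array([-1.0, -1.0]), 1e-3
for k in range(20000):                       # 20 s horizon
    u = -np.cos(2 * x[0] + 2) * x[1]         # closed-form optimal policy
    w = np.sin(k * dt)                       # external disturbance
    x = x + dt * (f(x) + g(x) * (u + w))
print(np.round(x, 3))                        # state remains bounded near the origin
```
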

Figure 2: State Trajectories Figure 3: State Trajectories Figure 4: Control Input

A.2 EXPERIENCE SIMULATION

Assumption 5: (Kamalapurkar et al., 2014) There exists a finite set of points {ζᵢ ∈ C | i = 1, ..., N} and a constant c̲ ∈ R such that 0 < c̲ = (1/N) inf_{t≥t₀} λ_min(Σᵢ ωᵢ ωᵢ^T / ρᵢ), where ρᵢ = 1 + ν ωᵢ^T Γ ωᵢ ∈ R and ωᵢ = ∇_ζ σ(ζᵢ)(F_θ(ζᵢ, θ̂) + F₁(ζᵢ) + G(ζᵢ) µ̂(ζᵢ, Ŵ_a)).

Using Assumption 5, simulation of experience is implemented by the weight update laws

Ŵ̇_c = −η_c1 Γ (ω/ρ) δ̂_t − (η_c2/N) Γ Σᵢ (ωᵢ/ρᵢ) δ̂_ti,
Γ̇ = (β Γ − η_c1 Γ (ω ω^T / ρ²) Γ) 1{||Γ|| ≤ Γ̄}, Γ(t₀) ≤ Γ̄,  (25)
Ŵ̇_a = −η_a1 (Ŵ_a − Ŵ_c) − η_a2 Ŵ_a + (η_c1 G_σ^T Ŵ_a ω^T / (4ρ) + Σᵢ η_c2 G_σi^T Ŵ_a ωᵢ^T / (4Nρᵢ)) Ŵ_c  (26)

where ω = ∇_ζ σ(ζ)(F_θ(ζ, θ̂) + F₁(ζ) + G(ζ)µ̂(ζ, Ŵ_a)), Γ ∈ R^{L×L} is the least-squares gain matrix, Γ̄ ∈ R denotes a positive saturation constant, β ∈ R indicates a constant forgetting factor, η_c1, η_c2, η_a1, η_a2 ∈ R are constant positive adaptation gains, 1{·} denotes the indicator function of the set {·}, G_σ = ∇_ζ σ(ζ) G(ζ) R⁻¹ G^T(ζ) ∇_ζ σ^T(ζ), and ρ = 1 + ν ω^T Γ ω, where ν ∈ R is a positive normalization constant. In the above weight update laws, for any function ξ(ζ, ·), the notation ξᵢ is defined as ξᵢ = ξ(ζᵢ, ·), and the instantaneous BEs δ̂_t and δ̂_ti are given as δ̂_t = δ̂(ζ, Ŵ_c, Ŵ_a, θ̂) and δ̂_ti = δ̂(ζᵢ, Ŵ_c, Ŵ_a, θ̂).
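A simplified discrete-time sketch of the update laws (25)-(26) is given below. For readability it drops the least-squares gain Γ, the saturation and forgetting-factor terms, and the G_σ correction, keeping only the normalized critic descent on the instantaneous BE and the actor's pull toward the critic weights:

```python
import numpy as np

def critic_actor_step(Wc, Wa, omega, delta, eta_c, eta_a1, eta_a2, nu, dt):
    """One Euler step of a simplified form of (25)-(26): a normalized
    gradient-descent critic update driven by the instantaneous BE δ̂_t, and an
    actor update pulling Ŵ_a toward Ŵ_c (Γ, saturation, forgetting factor and
    the G_σ correction terms are dropped for brevity)."""
    rho = 1.0 + nu * (omega @ omega)                    # normalization ρ
    Wc_new = Wc - dt * eta_c * delta * omega / rho**2   # critic: descend the BE
    Wa_new = Wa - dt * (eta_a1 * (Wa - Wc) + eta_a2 * Wa)
    return Wc_new, Wa_new
```
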

cations. In IEEE Applied Power Electronics Conference and Exposition (APEC), pp. 3397-3402, 2016.
X. Zhang, C. C. Mi, A. Masrur, and D. Daniszewski. Wavelet transform-based power management of hybrid vehicles with multiple on-board energy sources including fuel cell, battery and ultracapacitor. Journal of Power Sources, 185(2): 1533-1543, 2008.
C. Zheng, S. W. Cha, Y. Park, W. S. Lim, and G. Xu. PMP-based power management strategy of fuel cell hybrid vehicles considering multi-objective optimization. International Journal of Precision Engineering and Manufacturing, 14(5): 845-853, 2013.
C. H. Zheng, G. Q. Xu, Y. I. Park, W. S. Lim, and S. W. Cha. Prolonging fuel cell stack lifetime based on Pontryagin's Minimum Principle in fuel cell hybrid vehicles and its economic influence evaluation. Journal of Power Sources, 248: 533-544, 2014.
X. Zhong, H. He, H. Zhang, and Z. Wang. Optimal Control for Unknown Discrete-Time Nonlinear Markov Jump Systems Using Adaptive Dynamic Programming. IEEE Transactions on Neural Networks and Learning Systems, 25(12): 2141-2155, 2014.

