A ROBUST FUEL OPTIMIZATION STRATEGY FOR HYBRID ELECTRIC VEHICLES: A DEEP REINFORCEMENT LEARNING BASED CONTINUOUS TIME DESIGN APPROACH

Abstract

This paper deals with the fuel optimization problem for hybrid electric vehicles in a reinforcement learning framework. First, treating the hybrid electric vehicle as a completely observable non-linear system with uncertain dynamics, we solve an open-loop deterministic optimization problem to determine a nominal optimal state trajectory. We then design a deep reinforcement learning based optimal controller for the non-linear system using a concurrent learning based system identifier, such that the actual states and the control policy autonomously track the optimal state and optimal policy, even in the presence of external disturbances, modeling errors, uncertainties, and noise, while significantly reducing the computational complexity. This is in sharp contrast to conventional methods such as PID and Model Predictive Control (MPC), as well as traditional RL approaches such as ADP, DDP, and DQN, which mostly depend on a set of pre-defined rules and provide sub-optimal solutions under similar conditions. The low value of the H-infinity (H∞) performance index of the proposed optimization algorithm addresses the robustness issue. The proposed optimization technique is compared with traditional fuel optimization strategies for hybrid electric vehicles to illustrate the efficacy of the proposed method.

1. INTRODUCTION

Hybrid electric vehicles powered by fuel cells and batteries have attracted great enthusiasm in recent years, as they have the potential to eliminate emissions from the transport sector. However, both fuel cells and batteries face several operational challenges that make the separate use of each of them in automotive systems quite impractical. HEVs and PHEVs powered by conventional diesel engines and batteries merely reduce emissions but cannot eliminate them completely. Some of the drawbacks include carbon emissions causing environmental pollution on the fuel side, as well as long charging times, limited driving range per charge, and the unavailability of charging stations along the route for the batteries. Fuel cell powered hybrid electric vehicles (FCHEVs), powered by fuel cells and batteries, offer emission-free operation while overcoming the limitations of driving range per charge and long charging times; consequently, FCHEVs have gained significant attention in recent years. Much of the existing research has studied and developed several types of fuel and energy management systems (FEMS) for transport applications: Sulaiman et al. (2018) presented a critical review of different energy and fuel management strategies for FCHEVs, and Li et al. (2017) presented an extensive review of FMS objectives and strategies for FCHEVs. These strategies can be divided into two groups, i.e., model-based and model-free.

The model-based methods mostly depend on the discretization of the state space and therefore suffer from the inherent curse of dimensionality: the computational complexity increases exponentially with the dimension of the state space. This is quite evident in methods such as state-based EMS (Jan et al., 2014; Zadeh et al., 2014; 2016), rule-based fuzzy logic strategies (Motapon et al., 2014), classical PI and PID strategies (Segura et al., 2012), Pontryagin's minimum principle (PMP) (Zheng et al., 2013; 2014), model predictive control (MPC) (Kim et al., 2007; Torreglosa et al., 2014), and differential dynamic programming (DDP) (Kim et al., 2007). Among these methods, differential dynamic programming is considered computationally quite efficient; it relies on the linearization of the non-linear system equations about a nominal state trajectory, followed by a policy iteration to improve the policy. In this approach, the control policy for fuel optimization is used to compute the optimal trajectory, and the policy is updated until convergence is achieved. The model-free methods mostly deal with adaptive dynamic programming (ADP) (Bithmead et al., 1991; Zhong et al., 2014) and reinforcement learning (RL) based strategies (Mitrovic et al., 2010; Khan et al., 2012), including DDP (Mayne et al., 1970). These methods compute the control policy for fuel optimization through continuous engagement with the environment and measurement of the system response, thus arriving at a solution of the DP equation recursively in an online fashion. In deep reinforcement learning, multi-layer neural networks are used to represent the learning function in a non-linear parameterized approximation form. Although a compact parameterized form of the learning function does exist, the inability to know it a priori makes the method suffer from the curse of dimensionality (O(d²), where d is the dimension of the state space), rendering it infeasible for a high-dimensional fuel management system.
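To make the trajectory-based scheme concrete, the following is a minimal discrete-time iLQR-style sketch of the DDP idea described above: the dynamics are linearized about a nominal trajectory, a Riccati-like backward pass computes a local feedback policy, and a forward pass rolls the improved policy out until convergence. The toy dynamics f, the quadratic cost, the horizon, and all gains are illustrative assumptions only, not the vehicle model considered in this paper.

```python
import numpy as np

def f(x, u):
    """Toy mildly non-linear dynamics (illustrative stand-in for an HEV model)."""
    A = np.array([[1.0, 0.05], [0.0, 0.95]])
    B = np.array([[0.01], [0.10]])
    return A @ x + B @ u + 0.01 * np.array([0.0, x[0] ** 2])

def finite_diff_jacobians(x, u, eps=1e-5):
    """Numerical Jacobians fx, fu of the dynamics about the nominal (x, u)."""
    n, m = x.size, u.size
    fx, fu = np.zeros((n, n)), np.zeros((n, m))
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        fx[:, i] = (f(x + dx, u) - f(x - dx, u)) / (2 * eps)
    for j in range(m):
        du = np.zeros(m); du[j] = eps
        fu[:, j] = (f(x, u + du) - f(x, u - du)) / (2 * eps)
    return fx, fu

def ilqr(x0, U, iters=20):
    """Running cost 0.5*x'x + 0.05*u'u, terminal cost 0.5*x'x (derivatives inlined)."""
    T, n = len(U), x0.size
    for _ in range(iters):
        # Forward rollout along the current nominal control sequence.
        X = [x0]
        for t in range(T):
            X.append(f(X[t], U[t]))
        # Backward pass: Riccati-like recursion on the linearized system.
        Vx, Vxx = X[-1], np.eye(n)
        k, K = [None] * T, [None] * T
        for t in reversed(range(T)):
            fx, fu = finite_diff_jacobians(X[t], U[t])
            Qx = X[t] + fx.T @ Vx
            Qu = 0.1 * U[t] + fu.T @ Vx
            Qxx = np.eye(n) + fx.T @ Vxx @ fx
            Quu = 0.1 * np.eye(U[t].size) + fu.T @ Vxx @ fu
            Qux = fu.T @ Vxx @ fx
            k[t] = -np.linalg.solve(Quu, Qu)
            K[t] = -np.linalg.solve(Quu, Qux)
            Vx = Qx + K[t].T @ Quu @ k[t] + K[t].T @ Qu + Qux.T @ k[t]
            Vxx = Qxx + K[t].T @ Quu @ K[t] + K[t].T @ Qux + Qux.T @ K[t]
        # Forward pass: apply the improved local policy to update the controls.
        x, U = x0, []
        for t in range(T):
            u = (k[t] if t >= len(K) else k[t]) + K[t] @ (x - X[t]) + np.zeros(1)
            U.append(u)
            x = f(x, u)
    return X, U

X, U = ilqr(np.array([1.0, 0.5]), [np.zeros(1) for _ in range(30)])
```

Because the backward pass only requires Jacobians along a single nominal trajectory, the per-iteration cost grows linearly with the horizon rather than exponentially with the state dimension, which is why DDP-type schemes are regarded as computationally efficient among the model-based methods.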
The computational complexity of traditional RL methods such as policy iteration (PI) and value iteration (VI) (Bellman et al., 1954; 2003; Barto et al., 1983; Bertsekas, 2007) can be overcome by a simulation-based approach (Sutton et al., 1998) in which the policy or the value function is parameterized with sufficient accuracy using a small number of parameters. This transforms the optimal control problem into an approximation problem in the parameter space (Bertsekas et al., 1996; Tsitsiklis et al., 2003; Konda et al., 2004), sidestepping the need for model knowledge and excessive computation. However, convergence requires sufficient exploration of the state-action space, and the optimality of the obtained policy depends primarily on the accuracy of the parameterization scheme. A good approximation of the value function is therefore of utmost importance for the stability of the closed-loop system, and it requires convergence of the unknown parameters to their optimal values. This sufficient-exploration condition manifests itself as a persistence of excitation (PE) condition when RL is implemented online (Mehta et al., 2009; Bhasin et al., 2013; Vrabie, 2010), which is impossible to guarantee a priori.

Most traditional approaches to fuel optimization are also unable to address the robustness issue. The methods described in the literature, including PID (Segura et al., 2012), model predictive control (MPC) (Kim et al., 2007; Torreglosa et al., 2014), and adaptive dynamic programming (Bithmead et al., 1991; Zhong et al., 2014), as well as the simulation-based RL strategies (Bertsekas et al., 1996; Tsitsiklis et al., 2003; Konda et al., 2004), suffer from the drawback of providing a sub-optimal solution in the presence of external disturbances and noise. Consequently, applying these methods to fuel optimization for hybrid electric vehicles, which are plagued by various disturbances in the form of sudden charge and fuel depletion, changes in the environment, and changes in parameters such as the remaining useful life, internal resistance, voltage, and temperature of the battery, is quite impractical. The fuel optimization problem for the hybrid electric vehicle has therefore been formulated as a fully observed stochastic Markov decision process (MDP). Instead of using trajectory-optimized LQG (T-LQG) or model predictive control (MPC) to provide a sub-optimal solution in the presence of disturbances and noise, we propose a deep reinforcement learning based optimization strategy using concurrent learning (CL) that uses state-derivative-action-reward tuples to produce a robust optimal solution. The convergence of the weight estimates of the policy and the value function to their optimal values justifies this claim.

The two major contributions of the proposed approach can therefore be summarized as follows: 1) The popular methods in the RL literature, including policy iteration and value iteration, suffer from the curse of dimensionality owing to their use of a simulation-based technique that requires sufficient exploration of the state space (the PE condition). The proposed model-based RL scheme aims to relax the PE condition by using a concurrent learning (CL) based system identifier to reduce the computational complexity. Generally, an estimate of the true controller designed using the CL-based method introduces an approximate estimation error, which makes the stability analysis of the system quite intractable.
The proposed method, however, establishes the stability of the closed-loop system by explicitly introducing the estimation error and analyzing the augmented system trajectory obtained under the influence of the control signal; a minimal sketch of the concurrent-learning identification idea is given below.
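As an illustration of the concurrent-learning idea, the following sketch identifies the parameters θ of a system of the form ẋ = Y(x, u)θ by combining the instantaneous gradient term with a replay over a stored history stack, which replaces the persistence-of-excitation requirement with a rank condition on the recorded data. The regressor Y, the toy dynamics, and all gains are hypothetical placeholders, not the identifier actually used in this paper.

```python
import numpy as np

def Y(x, u):
    """Hypothetical regressor for the model x_dot = Y(x, u) @ theta."""
    return np.array([[x[0], x[1],         u[0]],
                     [x[1], np.sin(x[0]), u[0] ** 2]])

theta_true = np.array([-0.5, 0.2, 1.0])  # unknown plant parameters
theta_hat = np.zeros(3)                  # identifier's estimate
Gamma = 0.2 * np.eye(3)                  # adaptation gain
stack = []                               # recorded (Y_j, x_dot_j) pairs

dt, x = 0.01, np.array([1.0, -0.5])
for step in range(5000):
    u = np.array([np.sin(0.05 * step)])  # exploratory input signal
    Yx = Y(x, u)
    x_dot = Yx @ theta_true              # measured state derivative
    # Record a few informative data points; concurrent learning replaces
    # persistent excitation with a rank condition on this history stack.
    if len(stack) < 10 and step % 50 == 0:
        stack.append((Yx, x_dot))
    # CL update: instantaneous gradient term plus replay of stored data.
    grad = Yx.T @ (x_dot - Yx @ theta_hat)
    for Yj, xdj in stack:
        grad += Yj.T @ (xdj - Yj @ theta_hat)
    theta_hat = theta_hat + dt * (Gamma @ grad)
    x = x + dt * x_dot                   # integrate the plant forward

print("parameter estimation error:", np.linalg.norm(theta_hat - theta_true))
```

Once the stored regressors span the parameter space, the summed replay term keeps the adaptation error contracting even when the instantaneous signal is no longer exciting, which is the mechanism that lets the proposed scheme relax the PE condition.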

