TEAC: INTEGRATING TRUST REGION AND MAX ENTROPY ACTOR CRITIC FOR CONTINUOUS CONTROL

Abstract

Trust region methods and maximum entropy methods are two state-of-the-art branches of reinforcement learning (RL) for continuous environments, offering the benefits of stability and exploration, respectively. This paper proposes to integrate both branches in a unified framework, thus benefiting from both sides. We first transform the original RL objective into a constrained optimization problem and then propose trust entropy actor-critic (TEAC), an off-policy algorithm that learns stable and sufficiently explored policies for continuous states and actions. TEAC trains the critic by minimizing a refined Bellman error and updates the actor by minimizing a KL-divergence loss derived from the closed-form solution to the Lagrangian. We prove that the policy evaluation and policy improvement steps in TEAC are guaranteed to converge. We compare TEAC with four state-of-the-art solutions on six tasks in the MuJoCo environment. The results show that TEAC with optimized parameters achieves similar performance on half of the tasks and notable improvements on the others in terms of efficiency and effectiveness.

1. INTRODUCTION

With the use of high-capacity function approximators, such as neural networks, reinforcement learning (RL) has become practical in a wide range of real-world applications, including game playing (Mnih et al., 2013; Silver et al., 2016) and robotic control (Levine et al., 2016; Haarnoja et al., 2018a). However, when dealing with environments with continuous state and/or action spaces, most existing deep reinforcement learning (DRL) algorithms still suffer from unstable learning processes and fail to converge to the optimal policy. The instability of the training process can be traced back to the use of greedy or ε-greedy policy updates in most algorithms. With a greedy update, a small error in the value function may lead to abrupt policy changes across learning iterations. Unfortunately, this lack of stability makes DRL impractical for many real-world tasks (Peters et al., 2010; Schulman et al., 2015; Tangkaratt et al., 2018). Therefore, many policy-based methods have been proposed to improve the stability of policy improvement (Kakade, 2002; Peters & Schaal, 2008; Schulman et al., 2015; 2017). Kakade (2002) proposed a natural-policy-gradient method that inspired the design of trust region policy optimization (TRPO). The trust region, defined by a bound on the Kullback-Leibler (KL) divergence between the new and old policies, was formally introduced in Schulman et al. (2015) to keep the natural-gradient policy update within the trusted region. An alternative to enforcing a KL-divergence constraint is the clipped surrogate objective, used in Proximal Policy Optimization (PPO) (Schulman et al., 2017) to simplify the objective of TRPO while maintaining similar performance. TRPO and PPO have shown significant performance improvements on a set of benchmark tasks.
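For concreteness, the two surrogate objectives mentioned above can be written in their standard forms (as given in the cited papers), where $\hat{A}_t$ denotes an advantage estimate and $r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the probability ratio:

```latex
% TRPO: surrogate maximization under a KL trust-region constraint
\max_{\theta}\; \mathbb{E}_t\!\left[\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\text{old}}}(a_t\mid s_t)}\,\hat{A}_t\right]
\quad \text{s.t.}\quad
\mathbb{E}_t\!\left[D_{\mathrm{KL}}\!\left(\pi_{\theta_{\text{old}}}(\cdot\mid s_t)\,\big\|\,\pi_\theta(\cdot\mid s_t)\right)\right] \le \delta

% PPO: clipped surrogate objective, replacing the explicit KL constraint
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
\mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right]
```

The clip removes the incentive for $r_t(\theta)$ to move outside $[1-\epsilon, 1+\epsilon]$, which is how PPO approximates the trust region without a constrained optimization step.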
However, these methods are all on-policy, requiring a large number of on-policy interactions with the environment for each gradient step. Moreover, they focus more on the policy update than on exploration, which is not conducive to finding the globally optimal policy. Globally optimal behavior is known to be difficult to learn due to sparse rewards and insufficient exploration. Instead of simply maximizing the expected reward, maximum entropy RL (MERL) (Ziebart et al., 2008; Toussaint, 2009; Haarnoja et al., 2017; Levine, 2018) extends the conventional RL objective with an additional "entropy bonus" term, resulting in a preference for policies with higher entropy. The high entropy of the policy explicitly encourages exploration, thus diversifying the collected transition pairs, allowing the policy to capture multiple modes of good behavior, and preventing premature convergence to local optima. MERL recasts the reinforcement learning problem in a probabilistic framework, learning energy-based policies that maintain stochasticity and seek the global optimum. The most representative methods in this category are soft Q-learning (SQL) (Haarnoja et al., 2017) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018b;c). SQL defines a soft Bellman equation and implements it in a practical off-policy algorithm that incorporates the entropy of the policy into the reward to encourage exploration. However, the actor network in SQL is treated as an approximate sampler, and the convergence of the method depends on how well the actor network approximates the true posterior. To address this issue, SAC extends soft Q-learning to an actor-critic architecture and proves that a given policy class can converge to the optimal policy in the maximum entropy framework.
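The maximum entropy objective described above takes the standard form used by SQL and SAC, where $\mathcal{H}(\pi(\cdot \mid s_t)) = -\mathbb{E}_{a \sim \pi}[\log \pi(a \mid s_t)]$ is the policy entropy and $\alpha$ is the temperature trading off reward against entropy:

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
\!\left[\, r(s_t, a_t) + \alpha\, \mathcal{H}\!\left(\pi(\cdot \mid s_t)\right) \right]
```

Setting $\alpha \to 0$ recovers the conventional expected-reward objective, which makes explicit how the entropy bonus is a strict generalization rather than a different problem.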
However, off-policy DRL is difficult to stabilize in the policy improvement procedure (Sutton & Barto, 1998; van Hasselt et al., 2018; Ciosek et al., 2019), which may lead to catastrophic actions, such as ending the episode and preventing further learning. Several models have been proposed that benefit from considering both a trust region constraint and an entropy constraint, such as MOTO (Akrour et al., 2016), GAC (Tangkaratt et al., 2018), and Trust-PCL (Nachum et al., 2018). However, MOTO and GAC cannot efficiently deal with high-dimensional action spaces because they rely on second-order computation, and Trust-PCL suffers in efficiency because it requires trajectory/sub-trajectory samples to satisfy pathwise soft consistency. Therefore, in this paper, we propose to further explore the research line of unifying trust region policy-based methods and maximum entropy methods. Specifically, we first transform the RL problem into a primal optimization problem with additional constraints that 1) set an upper bound on the KL divergence between the new policy and the old policy to ensure policy changes remain within the region of trust, 2) provide a lower bound on the policy entropy to prevent premature convergence and encourage sufficient exploration, and 3) constrain the optimization problem to a Markov decision process (MDP). We then apply Lagrangian duality to the optimization problem to redefine the Bellman equation, which is used to verify the policy evaluation and guarantee the policy improvement. Thereafter, we propose a practical trust entropy actor-critic (TEAC) algorithm, which trains the critic by minimizing the refined Bellman error and updates the actor by minimizing a KL-divergence loss derived from the closed-form solution to the Lagrangian. The update procedure of the actor involves two dual variables associated with the KL constraint and the entropy constraint in the Lagrangian.
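A schematic form of the constrained problem just described is the following (a sketch of the two behavioral constraints only; the paper's exact primal formulation, including its MDP consistency constraints, may differ in detail), with $\delta$ the KL bound and $h$ the entropy floor:

```latex
\max_{\pi}\;\; \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\!\left(\pi(\cdot \mid s)\,\big\|\,\pi_{\text{old}}(\cdot \mid s)\right) \le \delta,
\qquad
\mathcal{H}\!\left(\pi(\cdot \mid s)\right) \ge h, \qquad \forall s
```

The Lagrangian of this problem introduces one multiplier per constraint, which is where the two dual variables in the actor update originate.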
Based on the Lagrange dual form of the primal optimization problem, we develop a gradient-based method to regulate the dual variables with respect to the optimization constraints. The key contribution of this paper is a novel off-policy trust-entropy actor-critic (TEAC) algorithm for continuous control in DRL. In comparison with existing methods, the actor of TEAC updates the policy using information from the old policy and the exponential of the current Q function, and the critic of TEAC updates the Q function with the new Bellman equation. Moreover, we prove that the policy evaluation and policy improvement in the trust entropy framework are guaranteed to converge. A detailed comparison with similar work, including MOTO (Akrour et al., 2016), GAC (Tangkaratt et al., 2018), and Trust-PCL (Nachum et al., 2018), is provided in Sec. 4 to show that TEAC is the most effective and most theoretically complete of these methods. We compare TEAC with four state-of-the-art solutions on tasks in the MuJoCo environment. The results show that TEAC is comparable with the state of the art in terms of stability and sufficient exploration.
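As an illustration of the dual-variable regulation described above, the sketch below performs projected dual ascent on two multipliers, one for a KL upper bound and one for an entropy lower bound. The function name, update rule, and step size are illustrative assumptions for exposition, not the paper's exact scheme:

```python
def update_duals(alpha, beta, kl, entropy, kl_bound, entropy_min, lr=1e-3):
    """Projected dual ascent for the two Lagrange multipliers.

    Illustrative sketch: alpha corresponds to the constraint
    KL(pi_new || pi_old) <= kl_bound, beta to H(pi) >= entropy_min.
    Each multiplier grows when its constraint is violated and shrinks
    (never below zero) when it is satisfied.
    """
    alpha = max(0.0, alpha + lr * (kl - kl_bound))          # KL constraint
    beta = max(0.0, beta + lr * (entropy_min - entropy))    # entropy constraint
    return alpha, beta
```

In a full training loop these updates would alternate with actor and critic gradient steps, so that the multipliers track how strongly each constraint currently binds.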

2. PRELIMINARIES

An RL problem can be modeled as a standard Markov decision process (MDP), represented as a tuple $(\mathcal{S}, \mathcal{A}, r, p, p_0, \gamma)$. $\mathcal{S}$ and $\mathcal{A}$ denote the state space and the action space, respectively, and $p_0(s)$ denotes the initial state distribution. At time $t$, the agent in state $s_t$ selects an action $a_t$ according to the policy $\pi(a \mid s)$; the quality of the state-action pair is quantified by the reward function $r(s_t, a_t)$, and the next state of the agent is determined by the transition probability $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$. The goal of the agent is to find the optimal policy $\pi(a \mid s)$ maximizing the expected discounted return $\mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]$, where $s_0 \sim p_0(s)$ and $s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)$. The discount factor $\gamma \in (0, 1)$ quantifies how much importance we give to future rewards.
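The discounted-return objective above can be evaluated on a finite sampled trajectory by the usual backward recursion (a minimal sketch for illustration, not tied to any particular implementation):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma**t * rewards[t] for a finite trajectory.

    Uses the backward recursion G_t = r_t + gamma * G_{t+1}, which
    avoids explicitly forming the powers gamma**t.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` evaluates to $1 + 0.5 + 0.25 = 1.75$, matching the direct sum $\sum_{t=0}^{2} \gamma^t r_t$.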

