TEAC: INTEGRATING TRUST REGION AND MAX ENTROPY ACTOR CRITIC FOR CONTINUOUS CONTROL

Abstract

Trust region methods and maximum entropy methods are two state-of-the-art branches of reinforcement learning (RL), used for the benefits of stability and exploration in continuous environments, respectively. This paper proposes to integrate both branches in a unified framework, thus benefiting from both sides. We first transform the original RL objective into a constrained optimization problem and then propose trust entropy actor-critic (TEAC), an off-policy algorithm to learn stable and sufficiently explored policies for continuous states and actions. TEAC trains the critic by minimizing the refined Bellman error and updates the actor by minimizing a KL-divergence loss derived from the closed-form solution to the Lagrangian. We prove that the policy evaluation and policy improvement in TEAC are guaranteed to converge. We compare TEAC with 4 state-of-the-art solutions on 6 tasks in the MuJoCo environment. The results show that TEAC with optimized parameters achieves similar performance in half of the tasks and notable improvements in the others in terms of efficiency and effectiveness.

1. INTRODUCTION

With the use of high-capacity function approximators such as neural networks, reinforcement learning (RL) becomes practical in a wide range of real-world applications, including game playing (Mnih et al., 2013; Silver et al., 2016) and robotic control (Levine et al., 2016; Haarnoja et al., 2018a). However, when dealing with environments with continuous state and/or action spaces, most existing deep reinforcement learning (DRL) algorithms still suffer from unstable learning processes and are impeded from converging to the optimal policy. The reason for the unstable training process can be traced back to the use of greedy or ε-greedy policy updates in most algorithms. With greedy updates, a small error in the value function may lead to abrupt policy changes during the learning iterations. Unfortunately, the lack of stability in the training process makes DRL impractical for many real-world tasks (Peters et al., 2010; Schulman et al., 2015; Tangkaratt et al., 2018). Therefore, many policy-based methods have been proposed to improve the stability of policy improvement (Kakade, 2002; Peters & Schaal, 2008; Schulman et al., 2015; 2017). Kakade (2002) proposed a natural policy gradient-based method, which inspired the design of trust region policy optimization (TRPO). The trust region, defined by a bound on the Kullback-Leibler (KL) divergence between the new and old policies, was formally introduced in Schulman et al. (2015) to constrain the natural-gradient policy change within the region of trust. An alternative to enforcing a KL divergence constraint is the clipped surrogate objective, used in Proximal Policy Optimization (PPO) (Schulman et al., 2017) to simplify the objective of TRPO while maintaining similar performance. TRPO and PPO have shown significant performance improvements on a set of benchmark tasks.
However, these methods are all on-policy methods requiring a large number of on-policy interactions with the environment for each gradient step. Besides, these methods focus more on the policy update than on exploration, which is not conducive to finding the globally optimal policy. Globally optimal behavior is known to be difficult to learn due to sparse rewards and insufficient exploration. Instead of simply maximizing the expected reward, maximum entropy RL (MERL) (Ziebart et al., 2008; Toussaint, 2009; Haarnoja et al., 2017; Levine, 2018) extends the conventional RL objective with an additional "entropy bonus" term, resulting in a preference for policies with higher entropy. The high entropy of the policy explicitly encourages exploration, thus improving the diversity of collected transition pairs, allowing the policy to capture multiple modes of good policies, and preventing premature convergence to local optima. MERL reformulates the reinforcement learning problem in a probabilistic framework to learn energy-based policies that maintain stochasticity and seek the global optimum. The most representative methods in this category are soft Q-learning (SQL) (Haarnoja et al., 2017) and Soft Actor-Critic (SAC) (Haarnoja et al., 2018b; c). SQL defines a soft Bellman equation and implements it in a practical off-policy algorithm that incorporates the entropy of the policy into the reward to encourage exploration. However, the actor network in SQL is treated as an approximate sampler, and the convergence of the method depends on how well the actor network approximates the true posterior. To address this issue, SAC extends soft Q-learning to the actor-critic architecture and proves that a given policy class can converge to the optimal policy in the maximum entropy framework.
However, off-policy DRL is difficult to stabilize in the policy improvement procedure (Sutton & Barto, 1998; van Hasselt et al., 2018; Ciosek et al., 2019), which may lead to catastrophic actions, such as ending the episode and preventing further learning. Several models have been proposed to benefit from considering both the trust region constraint and the entropy constraint, such as MOTO (Akrour et al., 2016), GAC (Tangkaratt et al., 2018), and Trust-PCL (Nachum et al., 2018). However, MOTO and GAC cannot efficiently deal with high-dimensional action spaces because they rely on second-order computation, and Trust-PCL suffers from low algorithmic efficiency due to its requirement of trajectory/sub-trajectory samples to satisfy pathwise soft consistency. Therefore, in this paper, we propose to further explore the research line of unifying trust region policy-based methods and maximum entropy methods. Specifically, we first transform the RL problem into a primal optimization problem with four additional constraints that 1) set an upper bound on the KL divergence between the new policy and the old policy to ensure the policy changes stay within the region of trust, 2) provide a lower bound on the policy entropy to prevent premature convergence and encourage sufficient exploration, and 3) restrain the optimization problem within a Markov Decision Process (MDP) via two normalization and consistency constraints. We then apply Lagrangian duality to the optimization problem to redefine the Bellman equation, which is used to verify the policy evaluation and guarantee the policy improvement. Thereafter, we propose a practical trust entropy actor-critic (TEAC) algorithm, which trains the critic by minimizing the refined Bellman error and updates the actor by minimizing a KL-divergence loss derived from the closed-form solution to the Lagrangian. The update procedure of the actor involves two dual variables w.r.t. the KL constraint and the entropy constraint in the Lagrangian.
Based on the Lagrange dual form of the primal optimization problem, we develop a gradient-based method to regulate the dual variables regarding the optimization constraints. The key contribution of this paper is a novel off-policy trust-entropy actor-critic (TEAC) algorithm for continuous control in DRL. In comparison with existing methods, the actor of TEAC updates the policy with information from the old policy and the exponential of the current Q function, and the critic of TEAC updates the Q function with the new Bellman equation. Moreover, we prove that the policy evaluation and policy improvement in the trust entropy framework are guaranteed to converge. A detailed comparison with similar work, including MOTO (Akrour et al., 2016), GAC (Tangkaratt et al., 2018), and Trust-PCL (Nachum et al., 2018), is provided in Sec. 4 to explain why TEAC is the most effective and most theoretically complete method. We compare TEAC with 4 state-of-the-art solutions on tasks in the MuJoCo environment. The results show that TEAC is comparable with the state-of-the-art solutions regarding stability and sufficient exploration.

2. PRELIMINARIES

An RL problem can be modeled as a standard Markov decision process (MDP), represented as a tuple $\langle S, A, r, p, p_0, \gamma \rangle$. $S$ and $A$ denote the state space and the action space, respectively. $p_0(s)$ denotes the initial state distribution. At time $t$, the agent in state $s_t$ selects an action $a_t$ according to the policy $\pi(a|s)$; the quality of the state-action pair is quantified by the reward function $r(s_t, a_t)$, and the next state of the agent is determined by the transition probability as $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$. The goal of the agent is to find the optimal policy $\pi(a|s)$ that maximizes the expected reward $\mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where $s_0 \sim p_0(s)$ and $s_{t+1} \sim p(s_{t+1}|s_t, a_t)$. $\gamma \in (0, 1)$ is a discount factor that quantifies how much importance we give to future rewards. The state-action value function $Q^\pi(s_t, a_t)$ and the value function $V^\pi(s_t)$ are then defined as
$$Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l})\right], \quad V^\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l}, a_{t+l})\right].$$
For the continuous environments that are the focus of this paper, $S$ and $A$ denote finite-dimensional real-valued vector spaces, $s$ denotes the real-valued state vector, and $a$ denotes the real-valued action vector. The expected reward can be defined as
$$J(\pi) = \mathbb{E}_{(s,a) \sim \rho_\pi(s,a)}[Q^\pi(s, a)] = \mathbb{E}_{\rho_\pi(s)\pi(a|s)}[Q^\pi(s, a)],$$
where $\rho_\pi(s)$ and $\rho_\pi(s, a)$ denote the (discounted) state and state-action marginals of the trajectory distribution induced by a policy $\pi(a|s)$.
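As a concrete illustration of the discounted objective above, the following minimal Python sketch (ours, not from the paper; the function name is illustrative) computes the Monte Carlo return $\sum_t \gamma^t r(s_t, a_t)$ for one sampled reward sequence:

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum_{t=0}^{T-1} gamma^t * rewards[t] for one trajectory."""
    g = 0.0
    # Iterate backwards so each step applies g = r_t + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```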

3. OUR METHOD

This section explains the details and features of the TEAC framework, with a focus on the mathematical deductions and proofs of guaranteed policy improvement and convergence in an actor-critic architecture.

3.1. PRIMAL AND DUAL OPTIMIZATION PROBLEM

To stabilize the training process and steer the exploration, instead of simply maximizing the expected reward with (ε-)greedy policy updates, we propose to 1) confine the KL-divergence between neighboring policies in the training procedure to avoid large-step policy updates, and 2) favor a stochastic policy with relatively larger entropy to avoid premature convergence due to insufficient exploration. Therefore, we define the RL problem as a primal optimization problem with additional constraints:
$$\max_\pi \; \mathbb{E}_{\rho_\pi(s)\pi(a|s)}[\hat{Q}(s,a)], \tag{2}$$
subject to
$$\mathbb{E}_{\rho_\pi(s)}[\mathrm{KL}(\pi(\cdot|s) \,\|\, \pi_{old}(\cdot|s))] \le \tau,$$
$$\mathbb{E}_{\rho_\pi(s)}[\mathcal{H}(\pi(\cdot|s))] \ge \eta,$$
$$\mathbb{E}_{\rho_\pi(s)}\left[\int \pi(a|s)\,da\right] = 1,$$
$$\mathbb{E}_{\rho_\pi(s)\pi(a|s)p(s'|s,a)}[\hat{V}(s')] = \mathbb{E}_{\rho_\pi(s')}[\hat{V}(s')],$$
where $\hat{Q}(s,a)$ is a critic estimating the state-action value function, whose parameters are learned such that $\hat{Q}(s,a) \approx Q^\pi(s,a)$; $\pi(\cdot|s)$ is the policy distribution to be learned; $\pi_{old}(\cdot|s)$ is the prior policy distribution; and $\hat{V}(s')$ is a state feature function estimating the state value of the next state. The term $\mathrm{KL}(\pi(\cdot|s) \,\|\, \pi_{old}(\cdot|s)) = \mathbb{E}_{\pi(a|s)}[\log \pi(a|s) - \log \pi_{old}(a|s)]$ confines the KL-divergence between the distributions of the new and old policies. The third constraint ensures that the state-action marginal of the trajectory distribution is a proper probability density function. The state marginal of the trajectory distribution needs to comply with the policy $\pi(a|s)$ and the system dynamics $p(s'|s,a)$, i.e., $\rho_\pi(s)\pi(a|s)p(s'|s,a) = \rho_\pi(s')$; since directly matching the state probabilities is not feasible in continuous state spaces, the use of $\hat{V}(s')$ in the fourth constraint, which can also be considered a state feature, helps to focus on matching feature averages. These last two constraints formally restrain the optimization problem within an MDP framework. The objective is to maximize the expected reward of a policy while ensuring it satisfies the lower bound on entropy and the upper bound on distance from the previous policy.
The constraint on the KL-divergence term helps to avoid abrupt differences between the new and old policies, while the constraint on the entropy term helps to promote policy exploration. The entropy constraint is crucial in our optimization problem for two reasons: 1) prior studies show that the use of a KL-bound leads to a rapid decrease of the entropy, so bounding the entropy helps to lower the risk of premature convergence induced by the KL-bound; 2) each iteration of policy update modifies the critic $\hat{Q}(s,a)$ and the state distribution $\rho_\pi(s)$, thus changing the optimization landscape of the policy parameters. The entropy constraint ensures exploration in the action space in the face of evolving optimization landscapes. The Lagrangian of this optimization problem is
$$\begin{aligned} L(\pi, \alpha, \beta, \lambda, \nu) = \;& \mathbb{E}_{\rho(s)\pi(a|s)}[\hat{Q}(s,a)] + \alpha\left(\tau - \mathbb{E}_{\rho(s)}[\mathrm{KL}(\pi(\cdot|s)\,\|\,\pi_{old}(\cdot|s))]\right) + \beta\left(\mathbb{E}_{\rho(s)}[\mathcal{H}(\pi(\cdot|s))] - \eta\right) \\ &+ \lambda\left(\mathbb{E}_{\rho(s)}\left[\int \pi(a|s)\,da\right] - 1\right) + \nu\left(\mathbb{E}_{\rho(s)\pi(a|s)p(s'|s,a)}[\hat{V}(s')] - \mathbb{E}_{\rho(s')}[\hat{V}(s')]\right), \end{aligned} \tag{3}$$
where $\alpha, \beta, \lambda, \nu$ are the dual variables, and, for brevity, we use $\rho(s)$ to represent $\rho_\pi(s)$. Eq. 3 is a superset of trust region and maximum entropy methods: $\beta = 0$ leads to an objective function equivalent to the standard trust region, while $\alpha = 0$, which indicates that the KL-divergence bound is not active, leads to the maximum entropy RL objective that SAC tries to solve. Taking the derivative of $L$ w.r.t. $\pi$ and setting it to zero:
$$\begin{aligned} \partial_\pi L &= \mathbb{E}_{\rho(s)}\left[\int \hat{Q}(s,a) - (\alpha+\beta)\log\pi(a|s) + \alpha\log\pi_{old}(a|s) - \nu\hat{V}(s) + \mathbb{E}_{p(s'|s,a)}[\nu\hat{V}(s')]\,da\right] - (\alpha+\beta+\lambda) \\ &= \hat{Q}(s,a) - (\alpha+\beta)\log\pi(a|s) + \alpha\log\pi_{old}(a|s) - \nu\hat{V}(s) + \mathbb{E}_{p(s'|s,a)}[\nu\hat{V}(s')] - (\alpha+\beta+\lambda) = 0. \end{aligned} \tag{4}$$
Continuous problem domains require a practical approximation to the policy update function. We use neural networks as function approximators to parameterize the policy and the Q function. Specifically, the Q function, known as the critic, is modeled as an expressive neural network $Q_\phi(s,a)$, and we follow Lillicrap et al.
(2016) to build a target critic network $Q_{\bar\phi}$, which mitigates the challenge of overestimation. Meanwhile, the policy, known as the actor, is parameterized by $\pi_\theta(\cdot|s)$ as a Gaussian with mean and covariance given by neural networks, and we also build another neural network $\pi_{\bar\theta}(\cdot|s)$ with the same architecture as $\pi_\theta$, which enables us to facilitate policy learning by leveraging the "old" policy within our framework.

3.2. CRITIC UPDATE

Given that we sample actions from the actor network as a parameterized Gaussian distribution, and that the value function should satisfy the Bellman equation, we have
$$\hat{V}(s) = \hat{Q}(s,a) - (\alpha+\beta)\log\pi(a|s) + \alpha\log\pi_{old}(a|s) + \mathbb{E}_{p(s'|s,a)}[\hat{V}(s')] - (\alpha+\beta+\lambda). \tag{5}$$
The last constant term in Eq. 5 can be ignored, as it does not affect the Bellman iteration when neural networks are used to approximate the value function. Therefore, the Bellman equation can be redefined in our framework. According to Eq. 5, we can compute the value of a fixed policy $\pi$. Starting from any function $Q: S \times A \to \mathbb{R}$, we define our modified Bellman backup operator.

Definition 3.1 (Bellman Equation). A modified Bellman backup operator $\mathcal{T}^\pi$ is defined as
$$\mathcal{T}^\pi Q(s,a) \triangleq r + \gamma\,\mathbb{E}_{p(s'|s,a)}[V(s')], \tag{6}$$
where
$$V(s) = \mathbb{E}_{\pi(a|s)}\left[Q(s,a) - (\alpha+\beta)\log\pi(a|s) + \alpha\log\pi_{old}(a|s)\right] \tag{7}$$
is the trust entropy state-value function in our framework. In the sequel, $Q(s,a)$ stands for the state-action value function obtained by iteratively applying the modified Bellman backup operator in Eq. 6 and Eq. 7, which is the trust entropy Q-value in our framework. Meanwhile, the policy evaluation can be verified accordingly.

Lemma 1 (Trust Entropy Policy Evaluation). Let $Q^{k+1} = \mathcal{T}^\pi Q^k$. The sequence $Q^k$ will converge to the trust entropy Q-value of $\pi$ as $k \to \infty$, considering a mapping $Q^0: S \times A \to \mathbb{R}$ with $|A| < \infty$ and the Bellman backup operator $\mathcal{T}^\pi$.

Proof. See Appendix A.1.

The learning of the Q function can utilize off-policy methods to achieve high sample efficiency. Hence, the parameters can be trained by minimizing the squared Bellman error:
$$L_Q(\phi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[\frac{1}{2}\left(Q_\phi(s,a) - y\right)^2\right],$$
where $\mathcal{D}$ is the replay buffer, $Q_\phi(s,a)$ represents the Q network (also known as the critic network) parameterized by $\phi$, and $y = r(s,a) + \gamma\,\mathbb{E}_{s'\sim p}[V_{\bar\phi}(s')]$, where $V_{\bar\phi}$ denotes the value function obtained from the target critic network $Q_{\bar\phi}$ with Eq. 7. The inclusion of the target critic network helps to stabilize the training. As suggested in Lillicrap et al. (2016), $\bar\phi$ is updated via $\bar\phi \leftarrow \kappa\phi + (1-\kappa)\bar\phi$, where $0 \le \kappa \le 1$; we set $\kappa = 0.005$ in the experiments. The expected squared Bellman error is computed with mini-batches of samples drawn from the replay buffer. Thus, the approximate gradient of the squared Bellman error $L_Q(\phi)$ w.r.t. $\phi$ is
$$\hat\nabla_\phi L_Q(\phi) = \nabla_\phi Q_\phi(s,a)\Big(Q_\phi(s,a) - \big(r(s,a) + \gamma\,(Q_{\bar\phi}(s',a') - (\alpha+\beta)\log\pi_\theta(a'|s') + \alpha\log\pi_{\bar\theta}(a'|s'))\big)\Big),$$
where $\theta$ and $\bar\theta$ are the parameters of the current policy and the old policy, respectively, and $a'$ is sampled from the current policy $\pi_\theta$ given $s'$.
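To make the critic update concrete, here is a minimal scalar Python sketch (our illustration; `critic_target`, `bellman_loss`, and `polyak_update` are hypothetical names standing in for the paper's networks and mini-batch computation):

```python
def critic_target(r, q_target_next, logp_next, logp_old_next,
                  alpha, beta, gamma=0.99):
    """Bellman target y = r + gamma * V(s'), with the trust entropy value
    V(s') = Q_target(s', a') - (alpha + beta) * log pi(a'|s')
            + alpha * log pi_old(a'|s')."""
    v_next = q_target_next - (alpha + beta) * logp_next + alpha * logp_old_next
    return r + gamma * v_next

def bellman_loss(q_value, y):
    # 0.5 * squared Bellman error for a single transition.
    return 0.5 * (q_value - y) ** 2

def polyak_update(phi_target, phi, kappa=0.005):
    # Soft target update: phi_target <- kappa * phi + (1 - kappa) * phi_target.
    return kappa * phi + (1 - kappa) * phi_target
```

In an actual implementation the scalars above would be tensors, and the gradient of `bellman_loss` w.r.t. the critic parameters would be taken by an autodiff framework.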

3.3. ACTOR UPDATE

Setting $\nu = 0$ in Eq. 4 does not in fact change the optimization problem. Therefore, a closed-form solution for the policy is given as
$$\pi(a|s) = \pi_{old}(a|s)^{\frac{\alpha}{\alpha+\beta}} \exp\left(\frac{\hat{Q}(s,a)}{\alpha+\beta}\right)\exp\left(-\frac{\alpha+\beta+\lambda}{\alpha+\beta}\right) \propto \pi_{old}(a|s)^{\frac{\alpha}{\alpha+\beta}} \exp\left(\frac{\hat{Q}(s,a)}{\alpha+\beta}\right), \tag{10}$$
where $\exp\left(-\frac{\alpha+\beta+\lambda}{\alpha+\beta}\right)$ is the normalization term of $\pi(a|s)$ (the detailed derivation is provided in Appendix A.2). It should be noted that MORE (Daniel et al., 2016), MOTO (Akrour et al., 2016), GAC (Tangkaratt et al., 2018), and Trust-PCL (Nachum et al., 2018) can also be viewed as prior work stemming from Eq. 10. However, it is infeasible to use Eq. 10 to directly update the policy, since we cannot guarantee that the resulting policy remains in the same policy class when weighing the old policy with the exponential of the Q function without any assumption. Different strategies have been applied in prior work to address this issue, and a detailed discussion is provided in Sec. 4. To improve the tractability of policies, as in Haarnoja et al. (2018c), we require that the policy be selected from a set of policies $\pi \in \Pi$, a parameterized Gaussian distribution family. This is guaranteed by using the Kullback-Leibler divergence to ensure the improved policy stays in the same policy set. Since the normalization term $\exp\left(-\frac{\alpha+\beta+\lambda}{\alpha+\beta}\right)$ is intractable and does not contribute to the gradient of the new policy, it can be ignored. Therefore, the policy is updated by
$$L_\pi(\theta) = \mathbb{E}_{s\sim\mathcal{D}}\left[D_{\mathrm{KL}}\left(\pi_\theta(a|s) \,\Big\|\, \pi_{\bar\theta}(a|s)^{\frac{\alpha}{\alpha+\beta}} \exp\left(\frac{\hat{Q}(s,a)}{\alpha+\beta}\right)\right)\right], \tag{11}$$
where $\mathcal{D}$ is the replay buffer, $\pi_\theta$ represents the parameterized policy, $\pi_{\bar\theta}$ represents the parameterized old policy, and $a$ in $\hat{Q}$ is sampled from the current policy $\pi_\theta$. In practice, the old policy network equals the policy network from the last iteration.
Therefore, we can use another actor network to keep the old policy, copying $\theta$ to $\bar\theta$ after computing the loss function of the policy and before the back-propagation in each iteration (the detailed algorithm is provided in Appendix C). With the assumption of Gaussian policies, the policy improvement can be guaranteed in our framework.

Lemma 2 (Trust Entropy Policy Improvement). Given a policy $\pi$ and an old policy $\bar\pi$, define a new policy
$$\tilde\pi(a|s) \propto \bar\pi(a|s)^{\frac{\alpha}{\alpha+\beta}} \exp\left(\frac{Q^\pi(s,a)}{\alpha+\beta}\right), \quad \forall s. \tag{12}$$
If $Q$ is bounded, $\int \bar\pi(a|s)^{\frac{\alpha}{\alpha+\beta}} \exp\left(\frac{Q^\pi(s,a)}{\alpha+\beta}\right) da$ is bounded for any $s$ (for all $\pi$, $\bar\pi$, and $\tilde\pi$), and the policies are parameterized Gaussian networks, then $Q^{\tilde\pi}(s,a) \ge Q^\pi(s,a)$, $\forall s, a$.

Proof. See Appendix A.3.

In other words, in the policy improvement, we use the information from the old policy and the exponential of the Q function induced by the current policy to derive the policy of the next iteration. Because the Q function is a non-linear function approximator parameterized by neural networks and can be differentiated, the reparameterization trick $a = f_\theta(\xi; s)$, where $\xi$ is an input noise vector sampled from a standard normal distribution, can be applied. Then, the approximate gradient of $L_\pi(\theta)$ w.r.t. $\theta$ is given as
$$\hat\nabla_\theta L_\pi(\theta) = \nabla_\theta (\alpha+\beta)\log\pi_\theta(a|s) - \nabla_a Q_\phi(s,a)\,\nabla_\theta f_\theta(\xi; s).$$
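A minimal sketch of the reparameterization trick $a = f_\theta(\xi; s)$ for a one-dimensional Gaussian policy follows (our illustration with hypothetical names, not the paper's implementation; in an autodiff framework, writing the sample this way keeps it differentiable w.r.t. the mean and log-std outputs of the actor network):

```python
import math
import random

def reparameterized_action(mean, log_std, xi=None):
    """a = mean + exp(log_std) * xi, with xi ~ N(0, 1)."""
    if xi is None:
        xi = random.gauss(0.0, 1.0)
    return mean + math.exp(log_std) * xi

def gaussian_log_prob(a, mean, log_std):
    # log N(a | mean, exp(log_std)^2), used for the log pi terms above.
    std = math.exp(log_std)
    return -0.5 * ((a - mean) / std) ** 2 - log_std - 0.5 * math.log(2 * math.pi)
```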

3.4. DUAL VARIABLES UPDATE

This section explains the updates of the dual variables ($\alpha$ and $\beta$) throughout the entire framework. The dual function, derived by substituting $\pi(a|s)$ in the Lagrangian (Eq. 3) with its form in Eq. 10, is
$$g(\alpha, \beta) = \alpha\tau - \beta\eta - (\alpha+\beta)\,\mathbb{E}_{\rho(s)}\left[-\frac{\alpha+\beta+\lambda}{\alpha+\beta}\right].$$
Since $\exp\left(\frac{\alpha+\beta+\lambda}{\alpha+\beta}\right)$ is the normalization term of $\pi(a|s)$, the dual function can be represented as
$$g(\alpha, \beta) = \alpha\tau - \beta\eta + \mathbb{E}_{\rho(s)}\left[\alpha\log\pi_{old}(a|s) + \hat{Q}(s,a) - (\alpha+\beta)\log\pi(a|s)\right].$$
The approximate gradients of $g(\alpha,\beta)$ w.r.t. $\alpha$ and $\beta$ are
$$\hat\nabla_\alpha g(\alpha) = \tau - \log\pi_\theta(a|s) + \log\pi_{\bar\theta}(a|s),$$
$$\hat\nabla_\beta g(\beta) = -\eta - \log\pi_\theta(a|s),$$
which enable us to find the "proper" $\alpha$ and $\beta$ via gradient descent, satisfying the KL and entropy constraints in Eq. 2. The dual variable updates, along with the trust entropy Q function updates (Sec. 3.2) and trust entropy policy updates (Sec. 3.3), constitute the main components of our framework.
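The dual-variable updates can be sketched as a single-sample, scalar Python fragment (our illustration with hypothetical names; the actual update uses mini-batches, and the non-negativity clip is a standard safeguard we add here, not something specified in the paper):

```python
def update_duals(alpha, beta, logp, logp_old, tau, eta, lr=1e-3):
    """One gradient-descent step on the dual function g(alpha, beta):
       grad_alpha = tau - (log pi_theta - log pi_old)   # KL-constraint slack
       grad_beta  = -eta - log pi_theta                 # entropy-constraint slack
    """
    grad_alpha = tau - (logp - logp_old)
    grad_beta = -eta - logp
    # Keep the Lagrange multipliers non-negative (added safeguard).
    alpha = max(0.0, alpha - lr * grad_alpha)
    beta = max(0.0, beta - lr * grad_beta)
    return alpha, beta
```

Intuitively, when the sampled KL estimate exceeds τ, `grad_alpha` is negative and α grows, tightening the trust region penalty; when the sampled entropy estimate (−log π) exceeds η, `grad_beta` is positive and β shrinks.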

4. CONNECTION WITH PREVIOUS WORK

The methods most related to our work are MOTO (Akrour et al., 2016), GAC (Tangkaratt et al., 2018), and Trust-PCL (Nachum et al., 2018), as they also consider both the trust region constraint and the entropy constraint. MORE (Daniel et al., 2016) considers the two constraints in the domain of stochastic search optimization. MOTO (Akrour et al., 2016) extends MORE to the sequential decision-making domain. In MOTO, the Q function is estimated by a quadratic surrogate function of the state and action space, and a policy of log-linear Gaussian form is updated according to the KL-divergence bounding constraint and a variable lower bound of entropy determined by the policy of each iteration. GAC (Tangkaratt et al., 2018) further extends MOTO by 1) approximating the Q function with a truncated Taylor series expansion parameterized by a deep neural network, and 2) learning policies of log-nonlinear Gaussian form. In some sense, MOTO, GAC, and our method can all be viewed as solving the optimization problem with the same constraints as stated in Eq. 2. After leveraging the Lagrangian of the optimization problem, the corresponding closed-form solution for the policy update is shown in Eq. 10. However, Eq. 10 also indicates that the new policy is derived by weighing the old policy with the exponential of the Q function (see the R.H.S. of Eq. 10), which may cause the updated policy to deviate from the expected policy distribution class. Consequently, the KL constraint will no longer be preserved (Akrour et al., 2018). These methods differ from each other in the strategies used to circumvent this issue. MOTO utilizes a quadratic Q function and assumes the policy is of log-linear Gaussian form; consequently, MOTO can update the policy in a non-parameterized way. GAC adopts a similar strategy as MOTO to learn a non-parameterized Gaussian actor, and then uses this actor to guide a parameterized actor with supervised learning.
However, it is hard for MOTO and GAC to deal with high-dimensional action spaces because they rely on second-order computation. In comparison, we redefine the Bellman equation and guarantee the policy improvement by updating policies with Eq. 11. Therefore, our method resolves the challenge simply, with a more general assumption of a Gaussian policy class. Moreover, when dealing with the dual function
$$g(\alpha,\beta) = \alpha\tau - \beta\eta + (\alpha+\beta)\,\mathbb{E}_{\rho(s)}\left[\log\int \pi_{old}(a|s)^{\frac{\alpha}{\alpha+\beta}}\exp\left(\frac{\hat{Q}(s,a)}{\alpha+\beta}\right)da\right],$$
where the integral term is intractable, MOTO and GAC rely on complex second-order computation to make it tractable. In contrast, we resolve the challenge by leveraging the policy to transform the dual function into a simpler form that can still be optimized with first-order computation, e.g., stochastic gradient descent, with the policy improvement guaranteed. Trust-PCL (Nachum et al., 2018) addresses the challenge from a different perspective by integrating path consistency learning (PCL) (Nachum et al., 2017), which was developed in the maximum entropy framework, with the trust region policy optimization method. PCL, the base algorithm of Trust-PCL, suggests that the optimal policy and state values should satisfy a pathwise soft consistency property along any sampled trajectory, thus allowing the use of off-policy data. Consequently, the single-step temporal consistency of the state-value function in Trust-PCL is
$$V^*(s_t) = \mathbb{E}_{r_t, s_{t+1}}\left[r_t - (\tau+\lambda)\log\pi^*(a_t|s_t) + \lambda\log\tilde\pi(a_t|s_t) + \gamma V^*(s_{t+1})\right],$$
which is similar to our state-value function. However, our method differs from Trust-PCL in three ways: 1) Trust-PCL focuses on updating the state-value function, while our method focuses on updating the Q function. 2) Trust-PCL updates the policy directly with the temporal consistency squared error, while we use Eq. 11.
3) As each update iteration in Trust-PCL requires trajectory/sub-trajectory samples to satisfy the pathwise soft consistency, its algorithmic efficiency is significantly compromised. In comparison, our method requires only state-action pairs for each update iteration and is capable of finding a (sub-)optimal value for every dual variable in each update iteration. It should be noted that proximal policy optimization (PPO) (Schulman et al., 2017) can also achieve a trust region constraint and encourage exploratory actions by adding an entropy bonus to its loss function. However, the entropy bonus in PPO has a one-off effect that considers only the current state, not future states of the agent. In comparison, our method can be considered as maximizing long-term entropy with the policy constrained within the trust region.

5. EXPERIMENTS

We experimented to investigate the following questions: (1) In comparison with state-of-the-art algorithms, does TEAC have better performance in terms of sample efficiency and computational efficiency? (2) How should we choose the hyperparameters τ and η, and how do these two variables affect the performance?

Figure 1: Performance comparisons on six MuJoCo tasks trained for 3 million timesteps. The horizontal axis indicates the number of environment steps; the vertical axis indicates the average return. We trained three instances of each algorithm with different random seeds, with each instance performing an evaluation every 4,000 environment steps. The solid lines represent the mean, and the shaded regions mark the minimum and maximum returns over the three trials. We set η to the negative of the action space dimension of the task, and τ = 0.005 for all tasks.

5.1. SETUP

We experimented on the continuous control tasks available in the MuJoCo environment (Todorov et al., 2012). We compared our method TEAC with 1) proximal policy optimization (PPO) (Schulman et al., 2017), a stable and effective on-policy policy gradient algorithm; 2) SAC (Haarnoja et al., 2018c), the state-of-the-art off-policy algorithm for learning maximum entropy policies, whose temperature is adjusted automatically; 3) Trust-PCL (Nachum et al., 2018), an off-policy method optimizing the maximum entropy RL objective with a trust region; and 4) GAC (Tangkaratt et al., 2018), an off-policy method utilizing second-order information of the critic. For Trust-PCL and GAC, we used the original implementations provided by their authors. For PPO and SAC, we used the implementations publicly provided by OpenAI, and adapted SAC to its automatically adjusting version in Haarnoja et al. (2018c). For convenience, we developed our algorithm based on the Spinning Up version of SAC. The pseudo-code of our method is provided in Appendix C, and the source code is available at https://github.com/ICLR2021papersub/TEAC. TEAC requires specifying the hyperparameters τ, which represents the desired maximum KL-divergence, and η, which specifies the desired minimum entropy, before training. As the requirements of stability and exploration vary across tasks, we set η to the negative of the action space dimension of the task, and τ = 0.005 for all tasks. The settings of effective hyperparameters are provided in Appendix D.
Under review as a conference paper at ICLR 2021

5.2. RESULTS

Fig. 1 illustrates the training curve for each algorithm. In general, when τ = 0.005, TEAC has performance similar to SAC on simpler tasks with lower-dimensional actions, such as Hopper-v3, Walker2d-v3, and Ant-v3. However, on complex tasks with higher-dimensional actions, such as Humanoid-v3, TEAC gains a significant performance improvement. We also experimented with and compared different settings of τ and η for their impact on the performance. As there is no principled approach to setting τ, we simply compared seven different value levels across the tasks. The results show that selecting a proper τ value for each task can significantly boost the performance of TEAC (see more details in Appendix B), and generally τ should be smaller for tasks with higher complexity. For example, when we set τ = 0.001 in Humanoid-v3, TEAC achieved more than a 10% performance gain over τ = 0.005. The reason can be attributed to the stability of the algorithm: for complex tasks, the policy needs more exploration to find the global optimum, resulting in larger update steps. Without the help of the trust region constraint, the policy will explore arbitrarily in the policy space, losing its bearings and getting trapped at poor stationary points. For η, as we are dealing with continuous distributions, the entropy can be negative (Abdolmaleki et al., 2016); hence, η should be a small value. We investigated several heuristic approaches for setting η provided in MORE, GAC, and SAC, but none of them serves as a general and effective solution (see more details in Appendix B). Thus, in our experiments, we simply set η to the negative of the action space dimension, similar to SAC.

6. CONCLUSION

In this paper, we propose to integrate two branches of research in RL, trust region methods for better stability and maximum entropy methods for better policy exploration during learning, to benefit from both sides. We first transform the original RL objective into a constrained optimization problem, with an upper bound on the KL-divergence to avoid abrupt differences between the new and old policies and a lower bound on the entropy to promote policy exploration. The Bellman equation is then redefined accordingly to guide the evaluation of the system loss. Consequently, we introduce TEAC, an off-policy algorithm to learn stable and sufficiently explored policies for continuous states and actions. TEAC utilizes two actor networks to achieve the policy improvement by leveraging the information from the old policy and the exponential of the current Q function represented in the critic network. The results show that TEAC with optimized parameters achieves similar performance in half of the tasks and notable improvements in the others in terms of efficiency and effectiveness.

APPENDIX A DERIVATIONS AND PROOFS

A.1 TRUST ENTROPY POLICY EVALUATION

Lemma A.1 (Trust Entropy Policy Evaluation). Let $Q^{k+1} = \mathcal{T}^\pi Q^k$. The sequence $Q^k$ will converge to the trust entropy Q-value of $\pi$ as $k \to \infty$, considering a mapping $Q^0: S \times A \to \mathbb{R}$ with $|A| < \infty$ and the Bellman backup operator $\mathcal{T}^\pi$.

Proof. We define the reward function in the trust entropy framework as
$$r_\pi(s,a) \triangleq r(s,a) - (\alpha+\beta)\,\gamma\,\mathbb{E}_{s'\sim p}\left[\mathbb{E}_{a'\sim\pi}[\log\pi(a'|s')]\right] + \alpha\,\gamma\,\mathbb{E}_{s'\sim p}\left[\mathbb{E}_{a'\sim\pi}[\log\pi_{old}(a'|s')]\right].$$
We rewrite the update rule as
$$Q(s,a) \leftarrow r_\pi(s,a) + \gamma\,\mathbb{E}_{s'\sim p,\,a'\sim\pi}[Q(s',a')].$$
Following Sutton & Barto (1998), we obtain the standard convergence result for policy evaluation.

A.2 DERIVATION OF THE SOLUTION OF LAGRANGIAN

By taking the derivative of $L$ w.r.t. $\pi$ and setting it to zero,
$$\begin{aligned} \partial_\pi L &= \mathbb{E}_{\rho(s)}\left[\int \hat{Q}(s,a) - (\alpha+\beta)\log\pi(a|s) + \alpha\log\pi_{old}(a|s) - \nu\hat{V}(s) + \mathbb{E}_{p(s'|s,a)}[\nu\hat{V}(s')]\,da\right] - (\alpha+\beta+\lambda) \\ &= \hat{Q}(s,a) - (\alpha+\beta)\log\pi(a|s) + \alpha\log\pi_{old}(a|s) - \nu\hat{V}(s) + \mathbb{E}_{p(s'|s,a)}[\nu\hat{V}(s')] - (\alpha+\beta+\lambda) = 0. \end{aligned} \tag{22}$$
Given that we sample actions from the actor network as a parameterized Gaussian distribution and the value function should satisfy the Bellman equation, if we set $\nu = 0$, the equation can be rewritten as
$$0 = \hat{Q}(s,a) - (\alpha+\beta)\log\pi(a|s) + \alpha\log\pi_{old}(a|s) - (\alpha+\beta+\lambda).$$
The solution for $\pi(a|s)$ is
$$\pi(a|s) = \pi_{old}(a|s)^{\frac{\alpha}{\alpha+\beta}} \cdot \exp\left(\frac{\hat{Q}(s,a)}{\alpha+\beta}\right) \cdot \exp\left(-\frac{\alpha+\beta+\lambda}{\alpha+\beta}\right). \tag{24}$$
Combining the third constraint in Eq. 2 with the solution Eq. 24:
$$1 = \mathbb{E}_{\rho(s)}\left[\int \pi(a|s)\,da\right] = \mathbb{E}_{\rho(s)}\left[\int \pi_{old}(a|s)^{\frac{\alpha}{\alpha+\beta}} \cdot \exp\left(\frac{\hat{Q}(s,a)}{\alpha+\beta}\right) \cdot \exp\left(-\frac{\alpha+\beta+\lambda}{\alpha+\beta}\right) da\right].$$
Since $\alpha$, $\beta$, and $\lambda$ are constants independent of $s$ and $a$, we get
$$1 = \mathbb{E}_{\rho(s)}\left[\int \pi_{old}(a|s)^{\frac{\alpha}{\alpha+\beta}} \cdot \exp\left(\frac{\hat{Q}(s,a)}{\alpha+\beta}\right) da\right] \cdot \exp\left(-\frac{\alpha+\beta+\lambda}{\alpha+\beta}\right).$$
Thus,
$$\left(\exp\left(-\frac{\alpha+\beta+\lambda}{\alpha+\beta}\right)\right)^{-1} = \mathbb{E}_{\rho(s)}\left[\int \pi_{old}(a|s)^{\frac{\alpha}{\alpha+\beta}} \cdot \exp\left(\frac{\hat{Q}(s,a)}{\alpha+\beta}\right) da\right].$$
Hence, the term $\exp\left(\frac{\alpha+\beta+\lambda}{\alpha+\beta}\right)$ acts as a normalization term. Therefore,
$$\pi(a|s) \propto \pi_{old}(a|s)^{\frac{\alpha}{\alpha+\beta}} \exp\left(\frac{\hat{Q}(s,a)}{\alpha+\beta}\right).$$

A.3 TRUST ENTROPY POLICY IMPROVEMENT

Lemma A.2 (Trust Entropy Policy Improvement). Given a policy $\pi$ and an old policy $\bar\pi$, define a new policy
$$\tilde\pi(a|s) \propto \bar\pi(a|s)^{\frac{\alpha}{\alpha+\beta}} \exp\left(\frac{Q^\pi(s,a)}{\alpha+\beta}\right), \quad \forall s.$$
If $Q$ is bounded, $\int \bar\pi(a|s)^{\frac{\alpha}{\alpha+\beta}} \exp\left(\frac{Q^\pi(s,a)}{\alpha+\beta}\right) da$ is bounded for any $s$ (for all $\pi$, $\bar\pi$, and $\tilde\pi$), and the policies are parameterized Gaussian networks, then $Q^{\tilde\pi}(s,a) \ge Q^\pi(s,a)$, $\forall s, a$.

Proof. With the definition of the V function, we get
$$V^{\tilde\pi}(s) = \mathbb{E}_{\tau\sim\tilde\pi,\,s_0=s,\,a_0=a}\left[\sum_{t=0}^{\infty} \gamma^t \big(r(s_t,a_t) - (\alpha+\beta)\log\tilde\pi(a_t|s_t) + \alpha\log\bar\pi(a_t|s_t)\big)\right],$$
where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ denotes the trajectory originating at $(s, a)$.
Using a telescoping argument, we have
\[
\begin{aligned}
V^{\tilde\pi}(s) - V^{\pi}(s)
&= \mathbb{E}_{\tau \sim \tilde\pi,\, s_0 = s}\Big[ \sum_{t=0}^{\infty} \gamma^t \big( r(s_t, a_t) - (\alpha + \beta) \log \tilde\pi(a_t|s_t) + \alpha \log \pi(a_t|s_t) \big) \Big] - V^{\pi}(s) \\
&= \mathbb{E}_{\tau \sim \tilde\pi,\, s_0 = s}\Big[ \sum_{t=0}^{\infty} \gamma^t \big( r(s_t, a_t) - (\alpha + \beta) \log \tilde\pi(a_t|s_t) + \alpha \log \pi(a_t|s_t) + V^{\pi}(s_t) - V^{\pi}(s_t) \big) \Big] - V^{\pi}(s) \\
&\overset{(a)}{=} \mathbb{E}_{\tau \sim \tilde\pi,\, s_0 = s}\Big[ \sum_{t=0}^{\infty} \gamma^t \big( r(s_t, a_t) - (\alpha + \beta) \log \tilde\pi(a_t|s_t) + \alpha \log \pi(a_t|s_t) + \gamma V^{\pi}(s_{t+1}) - V^{\pi}(s_t) \big) \Big] \\
&\overset{(b)}{=} \mathbb{E}_{\tau \sim \tilde\pi,\, s_0 = s}\Big[ \sum_{t=0}^{\infty} \gamma^t \big( r(s_t, a_t) - (\alpha + \beta) \log \tilde\pi(a_t|s_t) + \alpha \log \pi(a_t|s_t) + \gamma \mathbb{E}[V^{\pi}(s_{t+1}) \,|\, s_t, a_t] - V^{\pi}(s_t) \big) \Big] \\
&= \mathbb{E}_{\tau \sim \tilde\pi,\, s_0 = s}\Big[ \sum_{t=0}^{\infty} \gamma^t \big( Q^{\pi}(s_t, a_t) - V^{\pi}(s_t) - (\alpha + \beta) \log \tilde\pi(a_t|s_t) + \alpha \log \pi(a_t|s_t) \big) \Big] \\
&= \frac{1}{1 - \gamma}\, \mathbb{E}_{s' \sim \rho_{\tilde\pi}(s)}\, \mathbb{E}_{a \sim \tilde\pi(\cdot|s')}\big[ Q^{\pi}(s', a) - V^{\pi}(s') - (\alpha + \beta) \log \tilde\pi(a|s') + \alpha \log \pi(a|s') \big],
\end{aligned}
\]
where (a) rearranges terms in the summation and cancels the $V^{\pi}(s_0)$ term with the $-V^{\pi}(s)$ outside the summation, (b) uses the tower property of conditional expectations, and the final equality follows from the definition of $\rho_{\tilde\pi}(s)$.

Eq. 24 can be rewritten as
\[
\tilde\pi(a|s) = \exp\Big( \frac{\alpha}{\alpha + \beta} \log \pi(a|s) + \frac{Q^{\pi}(s, a)}{\alpha + \beta} \Big) \exp\Big( -\frac{\alpha + \beta + \lambda}{\alpha + \beta} \Big),
\]
where $\exp\big( -\frac{\alpha + \beta + \lambda}{\alpha + \beta} \big)$ is the normalization term. Assume we follow the gradient ascent update rule and that the distribution $\rho(s)$ is strictly positive, i.e., $\rho(s) > 0$ for all states $s$. Following Agarwal et al. (2020), with the help of the gradient of the softmax policy class, we get
\[
\sum_{a \in \mathcal{A}} \tilde\pi(a|s) \big( Q^{\pi}(s, a) - V^{\pi}(s) - (\alpha + \beta) \log \tilde\pi(a|s) + \alpha \log \pi(a|s) \big) \geq 0.
\]
Hence $V^{\tilde\pi}(s) \geq V^{\pi}(s)$, and therefore $Q^{\tilde\pi}(s, a) \geq Q^{\pi}(s, a)$ holds for all states $s$ and actions $a$.
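The closed-form update of A.2 and the improvement guarantee of Lemma A.2 can be checked numerically on a single state with discrete actions. The Q-values, policy, and coefficients below are synthetic placeholders for illustration only.

```python
import numpy as np

# One synthetic state with 5 discrete actions; alpha, beta are arbitrary.
rng = np.random.default_rng(1)
nA, alpha, beta = 5, 0.2, 0.3
Q = rng.normal(size=nA)                            # Q^pi(s, .)
pi_old = rng.dirichlet(np.ones(nA))                # old policy pi(.|s)

# pi_new(a|s) ∝ pi_old(a|s)^(alpha/(alpha+beta)) * exp(Q(s,a)/(alpha+beta))
logits = alpha / (alpha + beta) * np.log(pi_old) + Q / (alpha + beta)
pi_new = np.exp(logits - logits.max())
pi_new /= pi_new.sum()                             # normalization term

assert np.isclose(pi_new.sum(), 1.0)               # third constraint in Eq. 2

def soft_value(p):
    # E_a[Q - (alpha+beta) log p + alpha log pi_old], as in the proof
    return (p * (Q - (alpha + beta) * np.log(p) + alpha * np.log(pi_old))).sum()

# the inequality used to conclude the soft value does not decrease
assert soft_value(pi_new) >= soft_value(pi_old) - 1e-12
```

The last assertion holds because the new policy is exactly the maximizer of the soft value functional over the probability simplex.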

A.4 DERIVATION OF THE DUAL FUNCTION

To obtain the dual function, we substitute the solution for $\pi(a|s)$ into Eq. 3:
\[
\begin{aligned}
L(\pi, \alpha, \beta, \lambda) &= \mathbb{E}_{\rho(s)\pi(a|s)}[Q(s, a)] + \alpha \Big( \tau - \mathbb{E}_{\rho(s)}\big[\mathrm{KL}\big(\pi(\cdot|s) \,\|\, \pi_{\mathrm{old}}(\cdot|s)\big)\big] \Big) + \beta \Big( \mathbb{E}_{\rho(s)}[\mathcal{H}(\pi(\cdot|s))] - \eta \Big) + \lambda \Big( \mathbb{E}_{\rho(s)}\Big[ \int \pi(a|s)\, da \Big] - 1 \Big) \\
&= \mathbb{E}_{\rho(s)\pi(a|s)}[Q(s, a)] - (\alpha + \beta)\, \mathbb{E}_{\rho(s)\pi(a|s)}\Big[ \frac{Q(s, a)}{\alpha + \beta} + \frac{\alpha}{\alpha + \beta} \log \pi_{\mathrm{old}}(a|s) - \frac{\alpha + \beta + \lambda}{\alpha + \beta} \Big] \\
&\quad + \alpha\, \mathbb{E}_{\rho(s)\pi(a|s)}[\log \pi_{\mathrm{old}}(a|s)] + \lambda \Big( \mathbb{E}_{\rho(s)}\Big[ \int \pi(a|s)\, da \Big] - 1 \Big) + \alpha \tau - \beta \eta \\
&= \alpha \tau - \beta \eta + (\alpha + \beta) \cdot \mathbb{E}_{\rho(s)}\Big[ \frac{\alpha + \beta + \lambda}{\alpha + \beta} \Big].
\end{aligned}
\]
This loss function can be rewritten as
\[
\begin{aligned}
L(\alpha, \beta) &= \alpha \tau - \beta \eta + (\alpha + \beta)\, \mathbb{E}_{\rho(s)}\Big[ \log \exp\Big( \frac{\alpha + \beta + \lambda}{\alpha + \beta} \Big) \Big] \\
&= \alpha \tau - \beta \eta + (\alpha + \beta)\, \mathbb{E}_{\rho(s)}\Big[ \log \int \pi_{\mathrm{old}}(a|s)^{\frac{\alpha}{\alpha+\beta}} \exp\Big( \frac{Q(s, a)}{\alpha + \beta} \Big)\, da \Big] \\
&= g(\alpha, \beta).
\end{aligned}
\]
Meanwhile, we can rewrite Eq. 24 as
\[
\exp\Big( \frac{\alpha + \beta + \lambda}{\alpha + \beta} \Big) = \frac{\pi_{\mathrm{old}}(a|s)^{\frac{\alpha}{\alpha+\beta}} \cdot \exp\big( \frac{Q(s, a)}{\alpha + \beta} \big)}{\pi(a|s)}. \qquad (36)
\]
With Eq. 36, the loss function becomes
\[
\begin{aligned}
L(\alpha, \beta) &= \alpha \tau - \beta \eta + (\alpha + \beta)\, \mathbb{E}_{\rho(s)}\Bigg[ \log \frac{\pi_{\mathrm{old}}(a|s)^{\frac{\alpha}{\alpha+\beta}} \cdot \exp\big( \frac{Q(s, a)}{\alpha + \beta} \big)}{\pi(a|s)} \Bigg] \\
&= \alpha \tau - \beta \eta + (\alpha + \beta)\, \mathbb{E}_{\rho(s)}\Big[ \frac{\alpha}{\alpha + \beta} \log \pi_{\mathrm{old}}(a|s) + \frac{Q(s, a)}{\alpha + \beta} - \log \pi(a|s) \Big] \\
&= \alpha \tau - \beta \eta + \mathbb{E}_{\rho(s)}\big[ \alpha \cdot \log \pi_{\mathrm{old}}(a|s) + Q(s, a) - (\alpha + \beta) \cdot \log \pi(a|s) \big].
\end{aligned}
\]

Figure 2: Performance with different τ on six MuJoCo tasks. τ = 0.001 achieves a return of about 8000 on Humanoid-v3, and τ = 0.005 achieves a return of about 130 on Swimmer-v3. These two results surpass all benchmark algorithms significantly.

The impact of τ on the performance of TEAC was evaluated with τ ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1}. Fig. 2 shows that tuning τ for different tasks may achieve significant performance improvement in TEAC. Similar to SAC (Haarnoja et al., 2018c), we set η to the negative of the action-space dimension in our experiments. Other techniques also exist in prior methods. MORE (Abdolmaleki et al., 2016) changes the entropy constraint to
\[
E - E_0 \geq \gamma (E_{\mathrm{old}} - E_0) \;\Rightarrow\; \eta = \gamma (E_{\mathrm{old}} - E_0) + E_0,
\]
where $E \approx \mathbb{E}_{\rho(s)}[\mathcal{H}(\pi_\theta(a|s))]$ denotes the expected entropy of the current policy, $E_{\mathrm{old}} \approx \mathbb{E}_{\rho(s)}[\mathcal{H}(\pi_{\bar\theta}(a|s))]$ denotes the expected entropy of the old policy, and $E_0$ denotes the entropy of a base policy $\mathcal{N}(a|0,\, 0.01 I)$.
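As a concrete illustration of the MORE-style bound, diagonal-Gaussian entropies have a closed form, so η can be computed directly. The action dimension and standard deviations below are illustrative assumptions.

```python
import numpy as np

def gaussian_entropy(std):
    # entropy of a diagonal Gaussian N(mu, diag(std^2)), summed over dimensions
    return np.sum(0.5 * np.log(2.0 * np.pi * np.e * std ** 2))

d, gamma_e = 3, 0.99                               # action dim, decay rate (assumed)
E_old = gaussian_entropy(np.full(d, 0.5))          # entropy of the old policy
E_0 = gaussian_entropy(np.full(d, 0.1))            # base policy N(a|0, 0.01 I)

# E - E_0 >= gamma (E_old - E_0)  =>  eta = gamma (E_old - E_0) + E_0
eta = gamma_e * (E_old - E_0) + E_0
```

As the old policy's entropy decays toward the base policy's, the bound η anneals toward $E_0$.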
GAC improves this technique by adjusting η heuristically via
\[
\eta = \max\big( \gamma (E - E_0) + E_0,\; E_0 \big).
\]
We compared these three techniques on Hopper-v3, Humanoid-v3, and Ant-v3. Fig. 3 shows no notable difference among them. Therefore, we simply set η to the negative of the action-space dimension, as in SAC.

Figure 5: Performance comparisons on six MuJoCo tasks. We trained six different instances of each algorithm with different random seeds. In this case, for TEAC, we set η to the negative of the action-space dimension of the task and τ = 0.005 for all tasks.
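Putting A.4 together, the final form of the dual loss lends itself to a direct sample-based estimate. The batch quantities below (log-probabilities and Q-values) are synthetic placeholders, not outputs of trained networks.

```python
import numpy as np

# Sample-based estimate of the dual loss
#   L(alpha, beta) = alpha*tau - beta*eta
#                    + E[alpha*log pi_old(a|s) + Q(s,a) - (alpha+beta)*log pi(a|s)]
rng = np.random.default_rng(2)
N, tau, eta = 256, 0.005, -3.0                     # eta = -(action dim), as in SAC
log_pi = rng.normal(-1.5, 0.3, N)                  # log pi(a_i|s_i)      (synthetic)
log_pi_old = rng.normal(-1.5, 0.3, N)              # log pi_old(a_i|s_i)  (synthetic)
Q = rng.normal(0.0, 1.0, N)                        # Q(s_i, a_i)          (synthetic)

def dual_loss(alpha, beta):
    inner = alpha * log_pi_old + Q - (alpha + beta) * log_pi
    return alpha * tau - beta * eta + inner.mean()

# gradient steps on (alpha, beta) would minimize this loss, keeping both positive
loss = dual_loss(alpha=0.2, beta=0.3)
```

Note the loss is linear in α and β, so each gradient step has a simple closed form given the batch statistics.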



Footnotes:

1. Following Sutton et al. (2000), we use $\rho_\pi$ in the paper to indicate that $\rho_\pi$ is the stationary distribution of states under $\pi$ and independent of $s_0$ for all policies.
2. We could utilize any form of state features $V(s')$; thus, $\nu V(s')$ can be seen as another form of state features, and $\nu$ can therefore be arbitrary.
3. Trust-PCL code: https://github.com/tensorflow/models/tree/master/research/pcl_rl
4. GAC code: https://github.com/voot-t/guide-actor-critic
5. Due to its second-order computational complexity, we had only finished testing GAC on HalfCheetah-v3 and Hopper-v3 at the time of paper submission. The experimental results show that we achieve better performance than GAC in much shorter running time. We will complete the comparison before the rebuttal begins.
6. OpenAI Spinning Up code: https://github.com/openai/spinningup; the SAC implementation is at https://github.com/openai/spinningup/tree/master/spinup/algos/pytorch/sac
7. Our implementation also makes use of two Q-functions (critic networks) to mitigate positive bias in the policy improvement step, following Fujimoto et al. (2018) and Haarnoja et al. (2018b).



Figure 3: Performance with different η on three MuJoCo tasks with τ = 0.1.

Figure 4: Performance comparisons on six MuJoCo tasks. Notice that the blue line is the performance of our model, which sets a different τ for each task. In this figure, we set τ = 0.5 for HalfCheetah-v3, τ = 0.05 for Ant-v3, τ = 0.1 for Hopper-v3, τ = 0.001 for Humanoid-v3, τ = 0.005 for Swimmer-v3, and τ = 0.1 for Walker2d-v3.

The accompanying hyperparameter table lists the effective hyperparameters of TEAC used in the experiments, of which the results are shown in Fig. 1.

Algorithm 1: TEAC
Require: initial actor $\pi_\theta(a|s)$, old actor $\pi_{\bar\theta}(a|s)$, critics $Q_{\phi_1}$ and $Q_{\phi_2}$, target critic networks $\bar\phi_1 \leftarrow \phi_1$, $\bar\phi_2 \leftarrow \phi_2$, KL-divergence bound τ, entropy bound η, learning rates $\omega_{ac}$, $\omega_\alpha$, $\omega_\beta$, an empty replay pool $\mathcal{D} = \emptyset$
1: for each iteration do
2:   for each environment step do
3:     Observe state $s_t$ and sample action $a_t \sim \pi_\theta(\cdot|s_t)$
4:     Execute $a_t$; receive reward $r(s_t, a_t)$ and next state $s_{t+1} \sim p(\cdot|s_t, a_t)$
5:     Add the transition to the replay buffer: $\mathcal{D} \leftarrow \mathcal{D} \cup \{(s_t, a_t, r(s_t, a_t), s_{t+1})\}$
6:   end for
7:   for each gradient step do
8:     Sample $N$ mini-batch samples $\{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$ uniformly from $\mathcal{D}$
9:     Sample actions $a'_i \sim \pi_\theta(\cdot|s'_i)$ and compute $Q_{\mathrm{tar}}(s'_i, a'_i)$
10:    Compute $y_i$; update $\phi_1, \phi_2$ by, e.g., Adam; update $\bar\phi_1, \bar\phi_2$ by moving average
11:    Sample actions $\tilde a \sim \pi_\theta(\cdot|s_i)$ and compute $Q(s_i, \tilde a)$
12:    Compute the loss function of $\theta$
13:    Update $\bar\theta$ using $\bar\theta \leftarrow \theta$
14:    Update $\theta$ by, e.g., Adam
15:    Compute the loss function of the dual variables α and β
16:    Update the dual variables α and β
17:  end for
18: end for
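The quantities manipulated in the gradient step can be sketched numerically, assuming the Bellman target $y_i$ follows the trust-entropy reward of A.1 and the actor loss follows the closed-form solution of A.2. All arrays below are synthetic stand-ins for network outputs, not the paper's implementation.

```python
import numpy as np

# Synthetic batch of size N; log-probs and Q-values are placeholders.
rng = np.random.default_rng(3)
N, gamma, alpha, beta = 32, 0.99, 0.2, 0.3
r = rng.normal(size=N)                             # r(s_i, a_i)
q_targ = rng.normal(size=N)                        # min_j Q_tar_j(s'_i, a'_i)
logp2 = rng.normal(-1.0, 0.2, N)                   # log pi_theta(a'_i|s'_i)
logp2_old = rng.normal(-1.0, 0.2, N)               # log pi_theta_old(a'_i|s'_i)

# Bellman target y_i with the trust-entropy reward (assumed form)
y = r + gamma * (q_targ - (alpha + beta) * logp2 + alpha * logp2_old)

q = rng.normal(size=N)                             # Q_phi(s_i, a_i)
critic_loss = np.mean((q - y) ** 2)                # refined Bellman error

logp = rng.normal(-1.0, 0.2, N)                    # log pi_theta(a~|s_i)
logp_old = rng.normal(-1.0, 0.2, N)                # log pi_theta_old(a~|s_i)
q_new = rng.normal(size=N)                         # Q(s_i, a~)
# KL-derived actor loss from the closed-form solution in A.2 (assumed form)
actor_loss = np.mean((alpha + beta) * logp - alpha * logp_old - q_new)
```

In practice these would be computed with automatic differentiation and the minimum over the two critics, as noted in the footnotes.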

