ERROR CONTROLLED ACTOR-CRITIC METHOD TO REINFORCEMENT LEARNING

Abstract

In reinforcement learning (RL) algorithms that incorporate function approximation, the approximation error of the value function inevitably causes overestimation and degrades algorithm performance. To mitigate the negative effects of this approximation error, we propose a new actor-critic algorithm, called Error Controlled Actor-Critic, which confines the approximation error in the value function. In this paper, we derive an upper bound on the approximation error of the Q-function approximator in actor-critic methods, and find that the error can be lowered by keeping each new policy close to the previous one during policy training. The results of experiments on a range of continuous control tasks from the OpenAI Gym suite demonstrate that the proposed actor-critic algorithm markedly reduces the approximation error and significantly outperforms other model-free RL algorithms.

1. INTRODUCTION

Reinforcement learning (RL) algorithms are combined with function approximation methods to handle application scenarios whose state spaces are combinatorial, large, or even continuous. Many function approximation methods, including the Fourier basis (Konidaris et al., 2011), kernel regression (Xu, 2006; Barreto et al., 2011; Bhat et al., 2012), and neural networks (Barto et al., 1982; Tesauro, 1992; Boyan et al., 1992; Gullapalli, 1992), have been used to learn value functions. In recent years, many deep reinforcement learning (DRL) methods have been implemented by incorporating deep learning into RL. Deep Q-Network (DQN) (Mnih et al., 2013), reported by Mnih in 2013, is a representative work that uses a deep convolutional neural network (CNN) to represent an action-value function estimating future rewards (returns); it successfully learned end-to-end control policies for seven Atari 2600 games directly from large state spaces. Thereafter, deep RL methods such as Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), Proximal Policy Optimization (PPO) (Schulman et al., 2017), Twin Delayed Deep Deterministic policy gradient (TD3) (Fujimoto et al., 2018), and Soft Actor-Critic (SAC) (Haarnoja et al., 2018) became mainstream in the field of RL. Although function approximation has helped RL algorithms perform well on complex problems by providing great representational power, it also causes an overestimation phenomenon that jeopardizes the optimization process of RL algorithms. Thrun & Schwartz (1993) presented a theoretical analysis of this systematic overestimation in Q-learning methods that use function approximation. A similar problem persists in actor-critic methods that employ function approximation.
Thomas (2014) reported that several natural actor-critic algorithms use biased estimates of the policy gradient to update parameters when the action-value function is approximated. Fujimoto et al. (2018) proved that value estimation in the deterministic policy gradient method also leads to overestimation. In brief, the approximation error of the value function makes the estimated values inaccurate, and this inaccuracy induces overestimation of the value function, so that high values may be assigned to poorly performing behaviors. As a result, policies with poor performance may be obtained. Previous works sought direct strategies to reduce overestimation. Hasselt (2010) proposed Double Q-learning, in which the samples are divided into two sets to train two independent Q-function estimators; to diminish overestimation, one estimator is used to select actions and the other to estimate their values. Fujimoto et al. (2018) proposed mechanisms, including clipped double Q-learning and delayed policy updates, to minimize overestimation. In contrast to these methods, we focus on the actor-critic setting and manage to reduce the approximation error of the value function, which is the source of the overestimation, in an indirect but effective way. We use concepts from domain adaptation (Ben-David et al., 2010) to derive an upper bound on the approximation error of the Q-function approximator. We then find that the least upper bound of this error can be obtained by minimizing the Kullback-Leibler (KL) divergence between each new policy and its predecessor. This means that minimizing the KL divergence when training the policy stabilizes the critic and thereby confines the approximation error in the Q-function. Interestingly, we arrive at a conclusion similar to that of Geist et al. (2019) and Vieillard et al. (2020) by a somewhat different route.
In their works, the authors directly studied the effect of KL and entropy regularization in RL and proved that KL regularization indeed leads to averaging of the errors made at each iteration of the value-function update. Our idea is quite different: since it is impracticable to minimize the approximation error directly, we instead minimize an upper bound on it. This is analogous to the Expectation-Maximization algorithm (Bishop, 2006), which maximizes a lower bound of the log-likelihood instead of the log-likelihood itself. We derive an upper bound on the approximation error of the Q-function approximator in actor-critic methods, and arrive at a more general conclusion: the approximation error can be reduced by keeping each new policy close to the previous one. Note that a KL penalty is an effective way to achieve this, but not the only way. Furthermore, this indirect operation (the KL penalty) can work together with the direct strategies for reducing overestimation mentioned above, for example, clipped double Q-learning. We then establish a new actor-critic method, called Error Controlled Actor-Critic (ECAC), which minimizes the KL divergence to keep the upper bound as low as possible. In other words, the method ensures the similarity between every two consecutive policies during training and reduces the optimization difficulty of the value function, so that the error in the Q-function approximators decreases. Ablation studies were performed to examine the effectiveness of the proposed strategy for decreasing the approximation error, and comparative evaluations were conducted to verify that our method outperforms other mainstream RL algorithms.
The main contributions of this paper are summarized as follows: (1) we present an upper bound on the approximation error of the Q-function approximator; (2) we propose a practical actor-critic method, ECAC, which decreases the approximation error by restricting the KL divergence between every two consecutive policies, and adopt a mechanism that automatically adjusts the coefficient of the KL term.

2. PRELIMINARIES

2.1. REINFORCEMENT LEARNING

Reinforcement learning (RL) algorithms are modeled by a mathematical framework called the Markov Decision Process (MDP). At each time step of an MDP, an agent generates an action based on the current state of its environment, then receives a reward and a new state from the environment. The environmental state and the agent's action at time $t$ are denoted $s_t \in \mathcal{S}$ and $a_t \in \mathcal{A}$, respectively; $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, which may be either discrete or continuous. The environment is described by a reward function, $r(s_t, a_t)$, and a transition probability distribution, $Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$, which specifies the probability that the environment transitions to the next state. The initial state distribution is denoted $Pr_0(s)$. Let $\pi$ denote a policy and $\eta(\pi)$ its expected discounted reward: $\eta(\pi) = \mathbb{E}_\pi[R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots] = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right]$, where $\gamma$ is a discount rate with $0 \le \gamma \le 1$. The goal of RL is to find a policy, $\pi^*$, that maximizes a performance function over policies, $J(\pi)$: $\pi^* = \arg\max_\pi J(\pi)$. A natural form of $J(\pi)$ is $\eta(\pi)$. Different interpretations of this optimization goal lead to different routes to its solution. Almost all reinforcement learning algorithms involve estimating value functions, including state-value and action-value functions. The state-value function, $V^\pi(s)$, gives the expected sum of discounted rewards when starting in $s$ and following a given policy $\pi$: $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid s_t = s\right]$. Similarly, the action-value function, $Q^\pi(s, a)$, is given by $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid s_t = s, a_t = a\right]$.
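The discounted-return definitions above can be computed directly for a finite trajectory; the following is a minimal numerical sketch (the function name and the use of a backward recursion are ours, not part of the paper):

```python
def discounted_return(rewards, gamma=0.99):
    """Finite-horizon estimate of sum_{k>=0} gamma^k * R_{t+k+1}.

    Accumulating backward, g = r + gamma * g, reproduces the
    discounted sum in a single pass over the reward sequence.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
g = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

Averaging such returns over trajectories sampled from a policy gives Monte Carlo estimates of $V^\pi(s)$ or $Q^\pi(s,a)$.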

2.2. ACTOR-CRITIC ARCHITECTURE

To avoid confusion, by default we discuss only RL methods with function approximation in this section. RL methods can be roughly divided into three categories: 1) value-based, 2) policy-based, and 3) actor-critic methods. Value-based methods learn only value functions (state-value or action-value functions) and have the advantage of fast convergence. Policy-based methods primarily learn parameterized policies. A parameterized policy (with parameter vector $\theta$) is either a distribution over actions given a state, $\pi_\theta(a|s)$, or a deterministic function, $a = \pi_\theta(s)$. Their basic update is $\theta_{n+1} = \theta_n + \alpha \nabla J(\theta_n)$, where $\alpha$ is the learning rate. Policy-based methods enjoy better convergence guarantees but suffer from high variance in gradient estimates. Actor-critic methods learn both value functions and policies, and use the value functions to improve the policies; in this way, they trade a small bias in gradient estimates for low variance. The actor-critic architecture (Peters & Schaal, 2008; Degris et al., 2013; Sutton & Barto, 2018) consists of two components: the actor and critic modules. The critic module learns a state-value function, $V_\phi(s)$, an action-value function, $Q_\phi(s, a)$, or both, usually by temporal-difference (TD) methods. The actor module learns a stochastic policy, $\pi_\theta(a|s)$, or a deterministic policy, $a = \pi_\theta(s)$, and utilizes the value function to improve that policy. For example, in the actor module of DDPG (Lillicrap et al., 2016), the policy is updated using the performance function $J(\theta) = \mathbb{E}_{\pi_\theta}[Q_\phi(s_t, \pi_\theta(s_t))]$, where $\pi_\theta(s_t)$ is a deterministic policy.
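The basic update $\theta_{n+1} = \theta_n + \alpha \nabla J(\theta_n)$ can be illustrated on a toy objective; in the sketch below the gradient is estimated by central finite differences purely for illustration (real policy-gradient methods estimate $\nabla J$ from sampled trajectories, and the objective here is a concave stand-in of our own choosing):

```python
import numpy as np

def gradient_ascent_step(J, theta, lr=0.1, eps=1e-6):
    """One update theta <- theta + lr * grad J(theta), with grad J
    approximated by central finite differences per coordinate."""
    grad = np.array([(J(theta + d) - J(theta - d)) / (2 * eps)
                     for d in np.eye(len(theta)) * eps])
    return theta + lr * grad

# Toy concave objective with its maximum at theta = [1, -2]
J = lambda th: -np.sum((th - np.array([1.0, -2.0])) ** 2)

theta = np.zeros(2)
for _ in range(200):
    theta = gradient_ascent_step(J, theta)
# theta converges to approximately [1, -2]
```

The geometric convergence seen here (each step shrinks the error by a constant factor) is what "fast" looks like without gradient noise; the variance issues mentioned above arise precisely because sampled estimates of $\nabla J$ are noisy.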

2.3. DOMAIN ADAPTATION

Domain adaptation aims at adapting a well-performing model from a source domain to a different target domain. It is used to describe the task of the critic module in Section 3.2: the learning task of the critic module is viewed as adapting a learned Q-function approximator to the next one, and the target error equals the approximation error at the current iteration of the critic update. Here, we present some concepts from domain adaptation, including domain, source error, and target error. A domain is defined as a pair consisting of a distribution, $P$, on an input space, $\mathcal{X}$, and a labeling function, $f: \mathcal{X} \to \mathbb{R}$. The source and target domains are denoted $\langle P_S, f_S \rangle$ and $\langle P_T, f_T \rangle$, respectively. A function $h: \mathcal{X} \to \mathbb{R}$ is called a hypothesis. The source error is the difference between a hypothesis, $h(x)$, and the labeling function of the source domain, $f_S(x)$, under the source distribution: $e_S(h, f_S) = \mathbb{E}_{x \sim P_S}[|h(x) - f_S(x)|]$. The target error is the difference between a hypothesis, $h(x)$, and the labeling function of the target domain, $f_T(x)$, under the target distribution: $e_T(h, f_T) = \mathbb{E}_{x \sim P_T}[|h(x) - f_T(x)|]$. For convenience, we use the shorthand $e_S(h) = e_S(h, f_S)$ and $e_T(h) = e_T(h, f_T)$.
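The source- and target-error definitions can be illustrated with a small Monte Carlo sketch; the domains, labeling functions, and hypothesis below are toy choices of ours, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_error(h, f, xs):
    """Monte Carlo estimate of e(h, f) = E_{x~P}[|h(x) - f(x)|]."""
    return np.mean(np.abs(h(xs) - f(xs)))

f_S = lambda x: 2.0 * x          # labeling function of the source domain
f_T = lambda x: 2.0 * x + 0.5    # labeling function of the target domain
h   = lambda x: 2.0 * x          # hypothesis fitted perfectly on the source

xs_source = rng.normal(0.0, 1.0, 10_000)   # draws from P_S
xs_target = rng.normal(1.0, 1.0, 10_000)   # draws from P_T

e_S = empirical_error(h, f_S, xs_source)   # zero: h matches f_S everywhere
e_T = empirical_error(h, f_T, xs_target)   # about 0.5: constant label shift
```

Even a hypothesis with zero source error can have nonzero target error when the labeling function changes, which is exactly the gap the bound in Section 3.2 controls.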

3. ERROR CONTROLLED ACTOR-CRITIC

To reduce the impact of the approximation error, we propose a new actor-critic algorithm called Error Controlled Actor-Critic (ECAC). We present the details of a version of ECAC for continuous control tasks and, more importantly, explain the rationale for confining the KL divergence. In Section 3.2, we show that, at each iteration of the critic update, the target error equals the approximation error of the Q-function approximator. We then derive an upper bound on the error and find that it can be reduced by limiting the KL divergence between every two consecutive policies. Although this operation is performed when training the policy, it indirectly reduces the optimization difficulty of the Q-function. Moreover, this indirect operation can work together with strategies for diminishing the overestimation phenomenon; we incorporate clipped double Q-learning into the critic module.

3.1. CRITIC MODULE-LEARNING ACTION VALUE FUNCTIONS

The learning task of the critic module is to approximate Q-functions. In the critic module of ECAC, two Q-functions are approximated by two neural networks with weights $\phi^{(1)}$ and $\phi^{(2)}$, respectively. As noted previously, we adopt the clipped double-Q strategy (Fujimoto et al., 2018) to directly reduce overestimation. Furthermore, we adopt the experience replay mechanism (Lin, 1992): the agent's experience at each time step, $(s_t, a_t, r_{t+1}, s_{t+1})$, is stored in a replay buffer, $\mathcal{D}$, and training samples are drawn uniformly from this buffer. The two Q-networks, $Q_{\phi^{(1)}}$ and $Q_{\phi^{(2)}}$, are trained by temporal-difference learning, i.e., by minimizing the following two TD errors:
$$\delta_t^{(j)} = Q_{\phi^{(j)}}(s_t, a_t) - \left(R_{t+1} + \gamma \min_{i=1,2} Q_{\phi^{(i)}}(s_{t+1}, a_{t+1})\right), \quad j = 1, 2.$$
Notice that clipped double Q-learning uses the smaller of the two Q-values to form the TD target; with the minimum operator, it decreases the likelihood of overestimation by increasing the likelihood of underestimation. The two Q-networks are trained by minimizing the following two loss functions:
$$L(\phi^{(1)}) = \mathbb{E}_{(s,a,r) \sim P_{\mathcal{D}},\, s' \sim Pr(\cdot|s,a),\, a' \sim \pi(\cdot|s')}\left[Q_{\phi^{(1)}}(s, a) - \left(r + \gamma \min_{i=1,2} Q_{\phi^{(i)}}(s', a')\right)\right]^2 \qquad (9)$$
$$L(\phi^{(2)}) = \mathbb{E}_{(s,a,r) \sim P_{\mathcal{D}},\, s' \sim Pr(\cdot|s,a),\, a' \sim \pi(\cdot|s')}\left[Q_{\phi^{(2)}}(s, a) - \left(r + \gamma \min_{i=1,2} Q_{\phi^{(i)}}(s', a')\right)\right]^2 \qquad (10)$$
where $\mathcal{D}$ denotes the replay buffer, $P_{\mathcal{D}}$ denotes the distribution describing the likelihood of samples drawn uniformly from $\mathcal{D}$, $Pr(\cdot|s, a)$ denotes the transition probability distribution, and $\pi$ denotes the target policy.
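The clipped double-Q target shared by Eqs. (9) and (10) can be sketched in batch form as follows; this is a numpy illustration with toy values of ours, whereas in the actual method the Q-values come from the two networks and the losses are minimized with an optimizer:

```python
import numpy as np

def clipped_double_q_targets(r, gamma, q1_next, q2_next):
    """TD targets r + gamma * min(Q1(s', a'), Q2(s', a')), a' ~ pi(.|s').
    The elementwise minimum over the two critics curbs overestimation
    by preferring the more pessimistic value estimate."""
    return r + gamma * np.minimum(q1_next, q2_next)

def critic_losses(q1, q2, targets):
    """Mean squared TD errors for the two Q networks; the shared target
    is treated as a constant (no gradient flows through it)."""
    return np.mean((q1 - targets) ** 2), np.mean((q2 - targets) ** 2)

# Toy batch of two transitions
r = np.array([1.0, 0.0])
q1_next = np.array([2.0, 4.0])
q2_next = np.array([3.0, 1.0])
targets = clipped_double_q_targets(r, 0.9, q1_next, q2_next)  # [2.8, 0.9]
```

Both networks regress onto the same pessimistic target, which is the mechanism the min operator in Eqs. (9) and (10) expresses.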

3.2. AN UPPER BOUND OF THE APPROXIMATION ERROR OF Q FUNCTION

For convenience, we analyze the setting with only one Q-function. The concepts of domain adaptation are used to describe the task of the critic module: its learning task can be viewed as adapting the learned Q-network to a new Q-function for the newly learned policy. Thus, naturally, the target error, Eq. (7), equals the approximation error at the current iteration of the critic update. In this section, we derive an upper bound on the approximation error of the Q-function and find that the bound becomes smaller the more similar the two consecutive policies are. Figure 1 illustrates the training process of an actor-critic method, which alternates between value-function and policy updates. At the $(n+1)$-th iteration, the critic module tries to fit the value function $Q^{\pi_{\theta_n}}$ induced by $\pi_{\theta_n}$. Because of approximation error, the actually obtained Q-network (approximator), $Q_{\phi_{n+1}}$, is not equal to $Q^{\pi_{\theta_n}}$. This can be expressed by the following equation:
$$Q_{\phi_{n+1}}(s, a) = Q^{\pi_{\theta_n}}(s, a) + \epsilon^{s,a}_{n+1},$$
where $\epsilon^{s,a}_{n+1}$ denotes the approximation error in the Q-function at state $s$ and action $a$. This process can be viewed as adapting the learned Q-network, $Q_{\phi_n}$, to the value function $Q^{\pi_{\theta_n}}$. Hence, the source distribution is $P_{\mathcal{D}_n}$ and the target distribution is $P_{\mathcal{D}_{n+1}}$, where, as in the discussion following Eqs. (9) and (10), $P_{\mathcal{D}}$ denotes the distribution describing the likelihood of samples drawn uniformly from the replay buffer $\mathcal{D}$; $\mathcal{D}_n$ and $\mathcal{D}_{n+1}$ are the replay buffers at the $n$-th and $(n+1)$-th iterations, respectively, and $Q^{\pi_{\theta_n}}$ is the labeling function of the target domain. Clearly, the target error here equals the approximation error at the current iteration of the critic update:
$$e_T(Q_\phi) = \mathbb{E}_{s,a \sim P_{\mathcal{D}_{n+1}}}\left[\left|Q_\phi(s,a) - Q^{\pi_{\theta_n}}(s,a)\right|\right] = \mathbb{E}_{s,a \sim P_{\mathcal{D}_{n+1}}}\left[\left|\epsilon^{s,a}_{n+1}\right|\right]. \qquad (14)$$
Furthermore, two errors are used to derive the upper bound on the approximation error: the source error $e_S(Q_\phi)$ and the error $e_S(Q_\phi, y^{\pi_{\theta_n}})$.
The source error here is the difference between the Q-network, $Q_\phi(s, a)$, and its target (labeling function) at the $n$-th iteration, $y^{\pi_{\theta_{n-1}}}$, under the source distribution:
$$e_S(Q_\phi) = \mathbb{E}_{s,a \sim P_{\mathcal{D}_n}}\left[\left|Q_\phi(s,a) - y^{\pi_{\theta_{n-1}}}(s,a)\right|\right] = \mathbb{E}_{s,a \sim P_{\mathcal{D}_n},\, s' \sim Pr(\cdot|s,a),\, a' \sim \pi_{\theta_{n-1}}}\left[\left|Q_\phi(s,a) - \left[R_{t+1} + \gamma Q_{\phi_{n-1}}(s', a')\right]\right|\right].$$
The error $e_S(Q_\phi, y^{\pi_{\theta_n}})$ is the difference between the Q-network, $Q_\phi(s, a)$, and its target at the $(n+1)$-th iteration, $y^{\pi_{\theta_n}}$, under the source distribution:
$$e_S(Q_\phi, y^{\pi_{\theta_n}}) = \mathbb{E}_{s,a \sim P_{\mathcal{D}_n}}\left[\left|Q_\phi(s,a) - y^{\pi_{\theta_n}}(s,a)\right|\right] = \mathbb{E}_{s,a \sim P_{\mathcal{D}_n},\, s' \sim Pr(\cdot|s,a),\, a' \sim \pi_{\theta_n}}\left[\left|Q_\phi(s,a) - \left[R_{t+1} + \gamma Q_{\phi_n}(s', a')\right]\right|\right].$$
The target error can then be bounded as follows:
$$\begin{aligned} e_T(Q_\phi) &= e_T(Q_\phi) + e_S(Q_\phi) - e_S(Q_\phi) + e_S(Q_\phi, y^{\pi_{\theta_n}}) - e_S(Q_\phi, y^{\pi_{\theta_n}}) \\ &\le e_S(Q_\phi) + \left|e_S(Q_\phi, y^{\pi_{\theta_n}}) - e_S(Q_\phi)\right| + \left|e_T(Q_\phi) - e_S(Q_\phi, y^{\pi_{\theta_n}})\right| \\ &\le e_S(Q_\phi) + \mathbb{E}_{s,a \sim P_{\mathcal{D}_n}}\left[\left|y^{\pi_{\theta_n}}(s,a) - y^{\pi_{\theta_{n-1}}}(s,a)\right|\right] + \left|e_T(Q_\phi) - e_S(Q_\phi, y^{\pi_{\theta_n}})\right|, \end{aligned}$$
where the third term in the last line, $|e_T(Q_\phi) - e_S(Q_\phi, y^{\pi_{\theta_n}})|$, is transformed further as
$$\begin{aligned} \left|e_T(Q_\phi) - e_S(Q_\phi, y^{\pi_{\theta_n}})\right| &= \left|\mathbb{E}_{s,a \sim P_{\mathcal{D}_{n+1}}}\left[\left|Q_\phi(s,a) - y^{\pi_{\theta_n}}(s,a)\right|\right] - \mathbb{E}_{s,a \sim P_{\mathcal{D}_n}}\left[\left|Q_\phi(s,a) - y^{\pi_{\theta_n}}(s,a)\right|\right]\right| \\ &\le \gamma\, \left|\mathbb{E}_{s,a \sim P_{\mathcal{D}_n}}\left[\mathbb{E}_{s' \sim Pr(\cdot|s,a),\, a' \sim \pi_{\theta_n}(\cdot|s')}\left[Q_{\phi_n}(s', a')\right] - \mathbb{E}_{s' \sim Pr(\cdot|s,a),\, a' \sim \pi_{\theta_{n-1}}(\cdot|s')}\left[Q_{\phi_n}(s', a')\right]\right]\right|. \end{aligned}$$
Recall that $\mathcal{D}$ is the replay buffer and $P_{\mathcal{D}}$ the distribution of samples drawn uniformly from it. Because, with the experience replay mechanism (Lin, 1992), the number of samples in $\mathcal{D}_{n+1}$ is only slightly larger than in $\mathcal{D}_n$, the difference between $P_{\mathcal{D}_n}$ and $P_{\mathcal{D}_{n+1}}$ can be ignored.
Finally, the upper bound on the error of the Q-network is
$$e_T(Q_\phi) \le e_S(Q_\phi) + \mathbb{E}_{s,a \sim P_{\mathcal{D}_n}}\left[\left|y^{\pi_{\theta_n}}(s,a) - y^{\pi_{\theta_{n-1}}}(s,a)\right|\right] + \gamma\, \left|\mathbb{E}_{s,a \sim P_{\mathcal{D}_n}}\left[\mathbb{E}_{s' \sim Pr(\cdot|s,a),\, a' \sim \pi_{\theta_n}(\cdot|s')}\left[Q_{\phi_n}(s', a')\right] - \mathbb{E}_{s' \sim Pr(\cdot|s,a),\, a' \sim \pi_{\theta_{n-1}}(\cdot|s')}\left[Q_{\phi_n}(s', a')\right]\right]\right|.$$
It is noticeable that the third term in the upper bound becomes smaller the more similar the two consecutive policies, $\pi_{\theta_{n-1}}$ and $\pi_{\theta_n}$, are. Hence, we can conclude that confining the KL divergence between every two consecutive policies helps limit the approximation error during the optimization process of actor-critic. This conclusion is used to design the learning method of the policy.
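The key inequality step behind this bound is the triangle inequality applied pointwise inside the expectations: the error against the new targets is at most the error against the old targets plus the gap between the two target functions. A quick numeric check on synthetic values (toy data of ours, not from the paper's experiments) confirms this behavior:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
q_phi = rng.normal(size=n)                 # Q-network outputs on samples from D_n
y_old = rng.normal(size=n)                 # targets built from pi_{theta_{n-1}}
y_new = y_old + 0.1 * rng.normal(size=n)   # targets from a nearby (similar) policy

e_S = np.mean(np.abs(q_phi - y_old))         # source error against old targets
target_gap = np.mean(np.abs(y_new - y_old))  # second term of the bound
e_S_new = np.mean(np.abs(q_phi - y_new))     # error against the new targets

# Pointwise |q - y_new| <= |q - y_old| + |y_new - y_old| holds in expectation:
assert e_S_new <= e_S + target_gap
```

When consecutive policies are similar, `target_gap` is small, so fitting the new targets is nearly as easy as fitting the old ones, which is the intuition the bound formalizes.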

3.3. ACTOR MODULE-LEARNING A POLICY

The learning task of the actor module is to learn a parameterized stochastic policy, $\pi_\theta(a|s)$. In order to lower the upper bound on the approximation error of the Q-function (i.e., to reduce the optimization difficulty of the Q-functions), the goal of the policy is converted from maximizing the expected discounted return into two parts: maximizing the estimated Q-value (the minimum of the two Q approximators) and, concurrently, minimizing the KL divergence between two successive policies. The optimization objective is specified by
$$\max_\theta\; \mathbb{E}_{s \sim P_{\mathcal{D}}}\left[\min_{i=1,2} Q_{\phi_i}(s, a_\theta(s)) - D_{KL}\left(\pi_\theta(\cdot|s)\,\|\,\pi_{\theta_{old}}(\cdot|s)\right)\right], \qquad (20)$$
where $\mathcal{D}$ is the replay buffer; $P_{\mathcal{D}}$ denotes the distribution of samples drawn uniformly from $\mathcal{D}$; $\theta$ are the parameters of the policy network; $\theta_{old}$ are the parameters of the policy updated in the last iteration; and $a_\theta(s)$ is a sample drawn from the stochastic policy $\pi_\theta(a|s)$. Note that, in order to back-propagate the error through this sampling operation, we use a diagonal Gaussian policy and the reparameterization trick, i.e., samples are obtained according to $a_\theta(s) = \mu_\theta(s) + \sigma_\theta(s) \odot \xi$, $\xi \sim \mathcal{N}(0, I)$, where $\mu_\theta(s)$ and $\sigma_\theta(s)$ are the outputs of the policy network and denote the mean and the diagonal elements of the covariance matrix, respectively. The KL divergence between two distributions, say $p(x)$ and $q(x)$, can be viewed as the difference between the cross entropy and the entropy: $D_{KL}(p\,\|\,q) = H(p, q) - H(p)$, where $H(p, q)$ denotes the cross entropy between $p(x)$ and $q(x)$, and $H(p)$ denotes the entropy of $p(x)$. In practice, we find it more effective to minimize the cross entropy between two successive policies and to maximize the entropy of the current policy separately than to minimize the KL divergence directly. Hence, we expand the original objective, Eq. (20), into the following one:
$$\max_\theta\; \mathbb{E}_{s \sim P_{\mathcal{D}}}\left[\min_{i=1,2} Q_{\phi_i}(s, a_\theta(s)) - \alpha H\left(\pi_\theta(\cdot|s), \pi_{\theta_{old}}(\cdot|s)\right) + \beta H\left(\pi_\theta(\cdot|s)\right)\right],$$
where $\alpha$ and $\beta$ denote the coefficients of the cross entropy and the entropy, respectively. Moreover, we adopt a mechanism to adjust $\alpha$ and $\beta$ automatically. $\alpha$ is adjusted by keeping the current cross entropy close to a target value, which is specified by the following optimization objective:
$$\min_\alpha\; \mathbb{E}_{s \sim P_{\mathcal{D}}}\left[\log \alpha \cdot \left((\delta_{KL} + \delta_{entropy}) - H\left(\pi_\theta(\cdot|s), \pi_{\theta_{old}}(\cdot|s)\right)\right)\right],$$
where $\delta_{KL}$ denotes the target KL value and $\delta_{entropy}$ denotes the target entropy. Note that $\log(\cdot)$ is used to ensure that $\alpha$ is greater than 0. $\beta$ is adjusted in the same way:
$$\min_\beta\; \mathbb{E}_{s \sim P_{\mathcal{D}}}\left[\log \beta \cdot \left(H\left(\pi_\theta(\cdot|s)\right) - \delta_{entropy}\right)\right].$$
The overall training process is summarized in Appendix A, and code can be found on our GitHub: https://github.com/SingerGao/ECAC.
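For a diagonal Gaussian policy, the entropy and cross entropy appearing in the expanded objective have closed forms, and the coefficient-adjustment objectives are simple scalar expressions. The sketch below illustrates them with numpy; the closed-form Gaussian formulas are standard, but the function names and values are our illustration, not the paper's code:

```python
import numpy as np

def gaussian_entropy(sigma):
    # H(p) for a diagonal Gaussian: 0.5 * sum_i (1 + log(2*pi*sigma_i^2))
    return 0.5 * np.sum(1.0 + np.log(2.0 * np.pi * sigma ** 2))

def gaussian_cross_entropy(mu_p, sigma_p, mu_q, sigma_q):
    # H(p, q) = -E_p[log q] for diagonal Gaussians
    return np.sum(0.5 * np.log(2.0 * np.pi * sigma_q ** 2)
                  + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2))

def coefficient_losses(alpha, beta, cross_ent, ent, delta_kl, delta_entropy):
    # Scalar objectives whose minimization adjusts alpha and beta:
    # alpha grows when the cross entropy exceeds delta_kl + delta_entropy,
    # beta grows when the entropy falls below delta_entropy.
    loss_alpha = np.log(alpha) * ((delta_kl + delta_entropy) - cross_ent)
    loss_beta = np.log(beta) * (ent - delta_entropy)
    return loss_alpha, loss_beta

# Identical consecutive policies: KL = H(p, q) - H(p) = 0
mu, sigma = np.zeros(2), np.ones(2)
kl = gaussian_cross_entropy(mu, sigma, mu, sigma) - gaussian_entropy(sigma)
```

In a full implementation these quantities would be computed from the policy network's outputs and differentiated with an autodiff framework; here they only demonstrate the decomposition $D_{KL} = H(p, q) - H(p)$ used above.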

4. EXPERIMENTS

The experiments aim to evaluate the effectiveness of the proposed strategy for lowering the approximation error and to verify that the proposed method can outperform other mainstream RL algorithms. The ablation studies and comparative evaluations are conducted on several challenging continuous control tasks from the OpenAI Gym (Brockman et al., 2016) environments, including the MuJoCo (Todorov et al., 2012) and PyBullet (Coumans & Bai, 2016-2019) versions. Implementation details and hyperparameters of ECAC are presented in Appendix B.

4.1. ABLATION STUDY

Ablation studies are performed to verify the contribution of the KL-limitation operation. We compared the performance of ECAC with that of the same method with the KL limitation removed. Figure 2 compares five instances of each method, with and without the KL limitation, using different random seeds; each instance performs five evaluation episodes every 1,000 environment steps. The solid curves correspond to the mean and the shaded regions to the minimum and maximum returns over the five runs. The results show that the method with the KL limitation performs better than the one without it. Figure 3 demonstrates that with ECAC the KL divergence remains comfortably low during the whole training process. Furthermore, to verify that confining the KL divergence decreases the approximation error in the Q-function, we measured the normalized approximation error on 100 random states every 10,000 environment steps. The normalized approximation error is defined as
$$e_Q = \frac{\left|Q_{approx} - Q_{true}\right|}{\left|Q_{true}\right|}, \qquad (26)$$
where $Q_{approx}$ is the approximate Q-value given by the current Q-network and $Q_{true}$ is the true discounted return. The true value is estimated using the average discounted return over 100 episodes following the current policy, starting from states sampled from the replay buffer. Figure 4 shows that the method with the KL limitation has lower error in the Q-function. The results of the ablation studies indicate that the approximation error of the Q-function can be decreased, and the performance of the RL algorithm improved, by restricting the KL divergence between every two consecutive policies.
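Once Monte Carlo returns have been collected, the normalized error of Eq. (26) is a one-line computation; the sketch below uses illustrative values of ours rather than measurements from the experiments:

```python
import numpy as np

def normalized_q_error(q_approx, mc_returns):
    """e_Q = |Q_approx - Q_true| / |Q_true|, where Q_true is estimated by
    the average discounted Monte Carlo return over evaluation episodes."""
    q_true = np.mean(mc_returns)
    return np.abs(q_approx - q_true) / np.abs(q_true)

# e.g. the critic predicts 11.0 while rollouts average to 10.0 -> 10% error
err = normalized_q_error(11.0, np.array([9.0, 10.0, 11.0]))
```

Tracking this quantity over training, as in Figure 4, makes the effect of the KL limitation on the critic's accuracy directly visible.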

4.2. COMPARATIVE EVALUATION

Comparative evaluations are conducted to verify that our method can outperform other mainstream RL methods, including A2C, PPO, TD3, and SAC. Five individual runs of each algorithm with different random seeds were performed, and each run executes five evaluation episodes every 1,000 environment steps. Our results are reported over five random seeds (one per run) of the Gym simulator, the network initialization, and the sampling of actions from the policy during training. The maximum average returns over the five runs on all 10 tasks are presented in Table 1. ECAC outperforms all other algorithms on all tasks except Hopper-v3 and HumanoidBulletEnv-v0, on which it is second only to TD3. Figures 5 and 6 show the learning curves of the comparative evaluation on the 10 continuous control tasks (MuJoCo and PyBullet versions, respectively).

5. CONCLUSION

This paper presented a model-free actor-critic method based on the finding that the approximation error in the value function of RL methods can be decreased by placing restrictions on the KL divergence between every two consecutive policies. Our method increases the similarity between every two consecutive policies during training and therefore reduces the optimization difficulty of the value function. In the ablation studies, we compared the approximation error in the Q-function, the KL divergence, and the performance of the methods with and without the KL limitation. The results show that the proposed method decreases the approximation error and improves performance. Moreover, the results of the comparative evaluation demonstrate that ECAC outperforms other model-free deep RL algorithms, including A2C, PPO, TD3, and SAC.

A THE OVERALL TRAINING PROCESS OF ECAC

Algorithm 1: Error Controlled Actor-Critic.
Require: initial policy parameters, θ; Q-function parameters, φ1 and φ2; discount rate, γ; the coefficients of the cross-entropy and entropy terms, α and β; empty replay buffer, D; the number of episodes, M; the maximum number of steps in each episode, T; minibatch size, N.
Ensure: optimized policy parameters, θ*.
1: for episode = 1, M do
2:   Reset environment.
3:   for t = 1, T do
4:     Observe state s and select action a ∼ π_θ(·|s).
5:     Execute a in the environment.
6:     Observe next state s′ and reward r.
7:     Store (s, a, r, s′) in replay buffer D.
8:     Randomly sample a minibatch of N transitions, B = {(s, a, r, s′)}, from D.
9:     Compute targets for the Q functions: y(r, s′) = r + γ min_{i=1,2} Q_{φi}(s′, a′), a′ ∼ π_θ(·|s′).
10:    Update the Q functions by one step of gradient descent using ∇_{φi} (1/N) Σ_{(s,a,r,s′)∈B} (Q_{φi}(s, a) − y(r, s′))², for i = 1, 2.
11:    Back up the old policy, θ_old ← θ.
12:    Update α and β by one step of gradient descent using
         ∇_α (1/N) Σ_{s∈B} [log α · ((δ_KL + δ_entropy) − H(π_θ(·|s), π_{θ_old}(·|s)))],
         ∇_β (1/N) Σ_{s∈B} [log β · (H(π_θ(·|s)) − δ_entropy)],
       where δ_KL and δ_entropy denote the target KL divergence and the target entropy, respectively.
13:    Update the policy by one step of gradient ascent using
         ∇_θ (1/N) Σ_{s∈B} (min_{i=1,2} Q_{φi}(s, a_θ(s)) − αH(π_θ(·|s), π_{θ_old}(·|s)) + βH(π_θ(·|s))),
       where a_θ(s) is a sample from π_θ(·|s), which is differentiable with respect to θ via the reparameterization trick.
14:   end for
15: end for

B IMPLEMENTATION DETAILS AND HYPERPARAMETERS OF ECAC

For the implementation of ECAC, two-layer feedforward neural networks of 256 hidden units each, with rectified linear units (ReLU) between the layers, are used to build the two Q functions and the policy. The parameters of the neural networks and the two coefficients (i.e., α and β) are optimized using Adam (Kingma & Ba, 2014). The hyperparameters of ECAC are listed in Table 2. Moreover, we adopt the target-network technique in ECAC, which is common in previous works (Lillicrap et al., 2016; Fujimoto et al., 2018). We also adopt the reward-scale trick presented in Haarnoja et al. (2018); the reward-scale parameter is listed in Table 3.

Figure 1: The training process of an actor-critic method alternates between value-function and policy updates.

Figure 2: Performance comparison of the methods with and without the KL limitation on the Hopper-v3 and Walker2d-v3 benchmarks. The method with the KL limitation performs better. Curves are smoothed uniformly for visual clarity.

Figure 3: Comparison of the KL divergence of the methods with and without the KL limitation on the Hopper-v3 and Walker2d-v3 benchmarks.

Figure 5: Results of the comparative evaluation on the MuJoCo version of the OpenAI Gym continuous control tasks. Curves are smoothed uniformly for visual clarity.

Figure 6: Results of the comparative evaluation on the PyBullet version of the OpenAI Gym continuous control tasks. Curves are smoothed uniformly for visual clarity.

Table 1: Max average return over five runs on all 10 tasks. The maximum value for each task is bolded. ± corresponds to one standard deviation over runs.

Table 2: ECAC hyperparameters.