FORK: A FORWARD-LOOKING ACTOR FOR MODEL-FREE REINFORCEMENT LEARNING

Abstract

In this paper, we propose a new type of Actor, named forward-looking Actor or FORK for short, for Actor-Critic algorithms. FORK can be easily integrated into a model-free Actor-Critic algorithm. Our experiments on six Box2D and MuJoCo environments with continuous state and action spaces demonstrate the significant performance improvement FORK can bring to state-of-the-art algorithms. A variation of FORK can further solve BipedalWalkerHardcore in as few as four hours using a single GPU.

1. INTRODUCTION

Deep reinforcement learning has had tremendous successes, sometimes achieving even superhuman performance, in a wide range of applications including board games (Silver et al., 2016), video games (Vinyals et al., 2019), and robotics (Haarnoja et al., 2018a). A key to these recent successes is the use of deep neural networks as high-capacity function approximators that can harvest a large amount of data samples to approximate high-dimensional state or action value functions, which tackles one of the most challenging issues in reinforcement learning problems with very large state and action spaces. Many modern reinforcement learning algorithms are model-free, so they are applicable in different environments and can readily react to new and unseen states. This paper considers model-free reinforcement learning for problems with continuous state and action spaces, in particular the Actor-Critic method, where Critic evaluates the state or action values of the Actor's policy and Actor improves the policy based on the value estimation from Critic.

To draw an analogy between Actor-Critic algorithms and human decision making, consider the scenario where a high school student is deciding which college to attend after graduation. The student, like Actor, is likely to make her/his decision based on the perceived values of the colleges, where the value of a college is based on many factors, including (i) the quality of education it offers, its culture, and its diversity, which can be viewed as instantaneous rewards of attending the college; and (ii) the career opportunities after finishing the college, which can be thought of as the future cumulative reward. We now take this analogy one step further: in human decision making, we often not only consider the "value" of the current state and action, but also forecast the outcome of the current decision and the value of the next state. In the example above, a student often explicitly takes into consideration the first job she/he may have after finishing college, and the "value" of that first job.

Since forward-looking reasoning is common in human decision making, we are interested in understanding whether such forward-looking decision making can help Actor; in particular, whether it is useful for Actor to forecast the next state and use the value of future states to improve the policy. To our great surprise, a relatively straightforward implementation of a forward-looking Actor, as an add-on to existing Actor algorithms, improves Actor's performance by a large margin. Our new Actor, named FOrward-looKing Actor or FORK for short, mimics human decision making where we think multiple steps ahead. In particular, FORK includes a neural network that forecasts the next state given the current state and action, called the system network, and a neural network that forecasts the reward given a (state, action) pair, called the reward network. With the system network and the reward network, FORK can forecast the next state and consider the value of the next state when improving the policy. For example, consider Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), which updates the parameters of Actor as follows:
$$\phi \leftarrow \phi + \beta \nabla_\phi Q_\psi(s_t, A_\phi(s_t)),$$
where $s_t$ is the state at time $t$, $\phi$ are Actor's parameters, $\beta$ is the learning rate, $Q_\psi(s, a)$ is the Critic network, and $A_\phi(s)$ is the Actor network.
With DDPG-FORK, the parameters can be updated as follows:
$$\phi \leftarrow \phi + \beta\Big(\nabla_\phi Q_\psi(s_t, A_\phi(s_t)) + \nabla_\phi R_\eta(s_t, A_\phi(s_t)) + \gamma \nabla_\phi R_\eta(\hat s_{t+1}, A_\phi(\hat s_{t+1})) + \gamma^2 \nabla_\phi Q_\psi(\hat s_{t+2}, A_\phi(\hat s_{t+2}))\Big), \quad (1)$$
where $R_\eta$ is the reward network, and $\hat s_{t+1}$ and $\hat s_{t+2}$ are the future states forecast by the system network $F_\theta$. We will see that FORK can be easily incorporated into most deep Actor-Critic algorithms, by adding two additional neural networks (the system network and the reward network), and by adding extra terms to the loss function when training Actor, e.g. adding the term $R_\eta(s_t, A_\phi(s_t)) + \gamma R_\eta(\hat s_{t+1}, A_\phi(\hat s_{t+1})) + \gamma^2 Q_\psi(\hat s_{t+2}, A_\phi(\hat s_{t+2}))$ for each sampled state $s_t$ to implement Equation (1). We remark that Equation (1) is just one example of FORK; FORK can have different implementations (a detailed discussion can be found in Section 3). We further remark that learning the system model is not a new idea and has a long history in reinforcement learning, known as model-based reinforcement learning (some state-of-the-art model-based reinforcement learning algorithms and benchmarks can be found in (Wang et al., 2019)). Model-based reinforcement learning uses the model in a sophisticated way, often based on deterministic or stochastic optimal control theory, to optimize the policy based on the model. FORK only uses the system network as a blackbox to forecast future states, and does not use it as a mathematical model for optimizing control actions. With this key distinction, any model-free Actor-Critic algorithm with FORK remains model-free.

In our experiments, we added FORK to two state-of-the-art model-free algorithms, according to recent benchmark studies (Duan et al., 2016a; Wang et al., 2019): TD3 (Fujimoto et al., 2018) (for deterministic policies) and SAC (Haarnoja et al., 2018b) (for stochastic policies). The evaluations on six challenging environments with continuous state and action spaces show significant improvement when adding FORK. In particular, TD3-FORK performs the best among all the algorithms we tested. For Ant-v3, it improves the average cumulative reward by more than 50% over TD3, and achieves TD3's best performance using only 35% of the training samples. BipedalWalker-v3 is considered "solved" when the agent obtains an average cumulative reward of at least 300 (https://github.com/openai/gym/blob/master/gym/envs/box2d/bipedal_walker.py). TD3-FORK needs only 0.23 million Actor training steps to solve the problem, half of that under TD3. Furthermore, a variation of TD3-FORK solves BipedalWalkerHardcore, a well-known difficult environment, in as few as four hours using a single GPU.
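To make the add-on concrete, the following is a minimal PyTorch-style sketch of how the extra FORK terms in Equation (1) could be added to a standard DDPG/TD3 actor loss. The module names (`actor`, `critic`, `system_net`, `reward_net`) and the weight `w` are illustrative placeholders, not the authors' released code.

```python
def fork_actor_loss(actor, critic, system_net, reward_net, state, gamma=0.99, w=1.0):
    """Actor loss with two-step forward-looking terms (a sketch of Equation (1)).

    Only the actor's parameters are meant to be updated from this loss; the
    critic, system, and reward networks act purely as fixed evaluators here.
    """
    action = actor(state)
    # Standard deterministic policy-gradient term.
    loss = -critic(state, action)

    # Forecast the next two states with the learned system network.
    next_state = system_net(state, action)
    next_action = actor(next_state)
    next_next_state = system_net(next_state, next_action)
    next_next_action = actor(next_next_state)

    # Forward-looking terms: predicted rewards plus the value of the state
    # two steps ahead, discounted accordingly.
    loss = loss - w * reward_net(state, action)
    loss = loss - w * gamma * reward_net(next_state, next_action)
    loss = loss - w * gamma**2 * critic(next_next_state, next_next_action)
    return loss.mean()
```

In a training loop, the returned loss would be minimized by an optimizer that holds only the actor's parameters, so the remaining networks effectively serve as fixed evaluators of the forecast trajectory.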

1.1. RELATED WORK

The idea of using learned models in reinforcement learning is not new and has a long history. At a high level, FORK shares a similar spirit with model-based reinforcement learning and rollout. However, in terms of implementation, FORK is very different and much simpler. Rollout in general requires Monte-Carlo methods (Silver et al., 2017) to simulate a finite number of future states from the current state, and then combines them with value function approximations to decide the action to take at the current time. FORK does not require any high-fidelity simulation. The key distinction between FORK and model-based reinforcement learning is that model-based reinforcement learning uses the learned model in a sophisticated manner. For example, in SVG (Heess et al., 2015), the learned system model is integrated as part of the calculation of the value gradient; in (Gu et al., 2016), a refitted local linear model and rollouts are used to derive a linear-Gaussian controller; and (Bansal et al., 2017) uses a learned dynamical model to compute the trajectory distribution of a given policy and consequently estimates the corresponding cost using a Bayesian optimization-based policy search. More model-based reinforcement learning algorithms and related benchmarking can be found in (Wang et al., 2019). FORK, on the other hand, only uses the system network to predict future states, and does not use the system model beyond that. Other related work that accelerates reinforcement learning includes acceleration through exploration strategies (Gupta et al., 2018), optimizers (Duan et al., 2016b), and intrinsic rewards (Zheng et al., 2018), just to name a few. These approaches are complementary to ours; FORK can be added to further accelerate learning.

2. BACKGROUND

Reinforcement learning algorithms aim at learning policies that maximize the cumulative reward by interacting with the environment. We consider a standard reinforcement learning setting defined by a Markov decision process (MDP) $(S, A, p_0, p, r, \gamma)$, where $S$ is the set of states, $A$ is the action space, $p_0(s)$ is the probability distribution of the initial state, $p : S \times S \times A \to [0, \infty)$ is the transition density function, which represents the distribution of the next state $s_{t+1}$ given the current state $s_t$ and action $a_t$, $r : S \times A \to [r_{\min}, r_{\max}]$ is the bounded reward function on each transition, and $\gamma \in (0, 1]$ is the discount factor. We consider a discrete-time system. At each time step $t$, given the current state $s_t \in S$, the agent selects an action $a_t \in A$ based on a (deterministic or stochastic) policy $\pi(a_t|s_t)$, which moves the environment to the next state $s_{t+1}$ and yields a reward $r_t = r(s_t, a_t)$ to the agent. We consider stationary policies in this paper, under which the action is taken based on $s_t$ and is independent of other historical information.

Starting from time 0, the return of a given policy $\pi$ is the discounted cumulative reward
$$J^\pi(i) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\,\middle|\, s_0 = i\right].$$
$J^\pi(i)$ is also called the state-value function. Our goal is to learn a policy $\pi^*$ that maximizes this cumulative reward: $\pi^* \in \arg\max_\pi J^\pi(i)$ for all $i$. We assume our policy is parameterized by parameter $\phi$, denoted by $\pi_\phi$, e.g. by the Actor network in Actor-Critic algorithms. In this case, our goal is to identify the optimal parameter $\phi^* \in \arg\max_\phi J^{\pi_\phi}(i)$. Instead of the state-value function, it is often convenient to work with the action-value function (Q-function), which is defined as
$$Q^\pi(s, a) = \mathbb{E}\left[r(s, a) + \gamma J^\pi(s')\right],$$
where $s'$ is the next state given current state $s$ and action $a$. The optimal policy is a policy that satisfies the following Bellman equation (Bellman, 1957):
$$Q^{\pi^*}(s, a) = \mathbb{E}\left[r(s, a) + \gamma \max_{a' \in A} Q^{\pi^*}(s', a')\right].$$
When neural networks are used to approximate action-value functions, we denote the action-value function by $Q_\psi(s, a)$, where $\psi$ are the parameters of the neural network.
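As a small illustration of the return defined above, the snippet below computes the discounted cumulative reward of one finite trajectory; the reward list and discount factor are made-up values for the example, not quantities from the paper.

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=0}^{T} gamma^t * r_t for one finite trajectory."""
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret

# Example with made-up rewards: 1 + 0.99*0 + 0.99^2*2 = 2.9602
print(discounted_return([1.0, 0.0, 2.0], gamma=0.99))
```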

3. FORK -FORWARD-LOOKING ACTOR

This paper focuses on Actor-Critic algorithms, where Critic estimates the state or action value functions of the current policy, and Actor improves the policy based on the value functions. We propose a new type of Actor, FORK; more precisely, a new training algorithm that improves the policy by considering not only the action-value of the current state (or the states of the current mini-batch), but also future states and actions forecast using a learned system model and a learned reward model. This forward-looking Actor is illustrated in Figure 1.

Figure 1: FORK includes three neural networks: the policy network $A_\phi$, the system model $F_\theta$, and the reward model $R_\eta$.

In FORK, we introduce two additional neural networks.

The system network $F_\theta$. This network is used to predict the next state of the environment, i.e., given current state $s_t$ and action $a_t$, it predicts the next state $\hat s_{t+1} = F_\theta(s_t, a_t)$. With experiences $(s_t, a_t, s_{t+1})$, training the system network is a supervised learning problem. The network can be trained using mini-batches from the replay buffer with the smooth-L1 loss
$$L(\theta) = \left\|s_{t+1} - F_\theta(s_t, a_t)\right\|_{\text{smooth L1}}.$$

The reward network $R_\eta$. This network predicts the reward given current state $s_t$ and action $a_t$, i.e. $\hat r_t = R_\eta(s_t, a_t)$. The network can be trained from experiences $(s_t, a_t, r_t)$ with the MSE loss
$$L(\eta) = \left(r_t - R_\eta(s_t, a_t)\right)^2.$$

FORK. With the system network and the reward network, the agent can forecast the next state, the next next state, and so on. Actor can then use the forecast to improve the policy. For example, we consider the following loss function:
$$L(\phi) = \mathbb{E}\left[-Q_\psi(s_t, A_\phi(s_t)) - R_\eta(s_t, A_\phi(s_t)) - \gamma R_\eta(\hat s_{t+1}, A_\phi(\hat s_{t+1})) - \gamma^2 Q_\psi(\hat s_{t+2}, A_\phi(\hat s_{t+2}))\right]. \quad (2)$$
In the loss function above, the states $s_t$ are from data samples (e.g. the replay buffer), and $\hat s_{t+1}$ and $\hat s_{t+2}$ are calculated from the system network as shown below:
$$\hat s_{t+1} = F_\theta(s_t, A_\phi(s_t)) \quad \text{and} \quad \hat s_{t+2} = F_\theta(\hat s_{t+1}, A_\phi(\hat s_{t+1})). \quad (3)$$
Note that when training Actor $A_\phi$ with loss function $L(\phi)$, all parameters in $L(\phi)$ other than $\phi$ are regarded as constants (see the PyTorch code in the supplemental materials). The action-value function $Q$, without function approximation, under the current policy $A_\phi$ satisfies
$$Q(s_t, A_\phi(s_t)) = \mathbb{E}\left[r(s_t, A_\phi(s_t)) + \gamma r(s_{t+1}, A_\phi(s_{t+1})) + \gamma^2 Q(s_{t+2}, A_\phi(s_{t+2}))\right],$$
where $r$, $s_{t+1}$ and $s_{t+2}$ are the actual rewards and states under the current policy, not estimated values. Therefore, the loss function $L(\phi)$ can be viewed as the average of two estimators. Given action values from Critic and with a mini-batch of size $N$, FORK updates its parameters as $\phi \leftarrow \phi - \beta_t \nabla_\phi L(\phi)$, where $\beta_t$ is the learning rate and the (negative) gradient is estimated by
$$-\nabla_\phi L(\phi) = \frac{1}{N}\sum_{i=1}^{N}\Big(\nabla_a Q_\psi(s_i, a)\big|_{a=A_\phi(s_i)} \nabla_\phi A_\phi(s_i) + \nabla_a R_\eta(s_i, a)\big|_{a=A_\phi(s_i)} \nabla_\phi A_\phi(s_i) + \gamma \nabla_a R_\eta(\hat s_{i+1}, a)\big|_{a=A_\phi(\hat s_{i+1})} \nabla_\phi A_\phi(\hat s_{i+1}) + \gamma^2 \nabla_a Q_\psi(\hat s_{i+2}, a)\big|_{a=A_\phi(\hat s_{i+2})} \nabla_\phi A_\phi(\hat s_{i+2})\Big),$$
where $\hat s_{i+1}$ and $\hat s_{i+2}$ are the next state and the next next state estimated from the system network. We note that it is important to use the system network to generate future states as in Equation (3), because they mimic the states under the current policy. If we instead sampled a sequence of consecutive states from the replay buffer, the sequence would come from an old policy, which does not help the learning. Figure 2 compares TD3-FORK, TD3, and TD3-MT, which samples a sequence of three consecutive states, on the BipedalWalker-v3 environment. We can clearly see that simply using consecutive states from experiences does not help improve learning; in fact, it significantly hurts it.
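As a concrete illustration, below is a minimal PyTorch sketch of how the system and reward networks described above could be fit from replay-buffer samples. The network sizes, the optimizer handling, and the helper names are assumptions for the example, not the paper's exact architecture; the reward network is assumed to be a similar MLP with a one-dimensional output.

```python
import torch
import torch.nn as nn

class SystemNetwork(nn.Module):
    """Predicts the next state from (state, action)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def train_models(system_net, reward_net, sys_opt, rew_opt, batch):
    """One supervised step on a replay-buffer minibatch (s, a, r, s')."""
    state, action, reward, next_state = batch

    # System network: smooth-L1 regression onto the observed next state.
    sys_loss = nn.functional.smooth_l1_loss(system_net(state, action), next_state)
    sys_opt.zero_grad()
    sys_loss.backward()
    sys_opt.step()

    # Reward network: MSE regression onto the observed reward.
    rew_loss = nn.functional.mse_loss(reward_net(state, action), reward)
    rew_opt.zero_grad()
    rew_loss.backward()
    rew_opt.step()
    return sys_loss.item(), rew_loss.item()
```

The returned system-network loss can also be compared against the threshold discussed later in this section to decide whether the FORK terms should be switched off.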
Modified Reward Network: We found from our experiments that the reward network can more accurately predict the reward $r_t$ when the next state $s_{t+1}$ is included as an input (an example can be found in Appendix A.1). Therefore, we use a modified reward network $R_\eta(s_t, a_t, s_{t+1})$ in FORK.

Adaptive Weight: The loss function $L(\phi)$ in our algorithm uses the system network and the reward network to boost learning. In our experiments, we found that the forecasting can significantly improve the performance, except at the end of learning. Since the system and reward networks are not perfect, errors in their predictions can introduce noise. To overcome this issue, we found it helpful to use an adaptive weight $w$, so that FORK accelerates learning at the beginning but its weight decreases gradually as the agent gets close to the learning goal. A comparison between fixed weights and adaptive weights can be found in Appendix A.2. We use a simple adaptive weight
$$w = w_0\left[1 - \frac{\bar r}{r_0}\right]_0^1,$$
where $\bar r$ is the moving average of the cumulative reward (per episode), $r_0$ is a predefined goal, $w_0$ is the initial weight, and $[a]_0^1 = a$ if $0 \le a \le 1$, $= 0$ if $a < 0$, and $= 1$ if $a > 1$. The loss function with the adaptive weight becomes
$$L(\phi) = \mathbb{E}\left[-Q_\psi(s_t, A_\phi(s_t)) - w R_\eta(s_t, A_\phi(s_t)) - w\gamma R_\eta(\hat s_{t+1}, A_\phi(\hat s_{t+1})) - w\gamma^2 Q_\psi(\hat s_{t+2}, A_\phi(\hat s_{t+2}))\right]. \quad (4)$$
Furthermore, we set a threshold and let $w = 0$ if the loss of the system network is larger than the threshold. This is to avoid using FORK when the system and reward networks are still very noisy. We note that in our experiments, the thresholds were chosen such that $w = 0$ for around 20,000 steps at the beginning of each instance, which includes the first 10,000 random exploration steps.

Different Implementations of FORK: It is easy to see that FORK can be implemented in different forms. For example, instead of looking two steps ahead, we can look one step ahead:
$$L(\phi) = \mathbb{E}\left[-Q_\psi(s_t, A_\phi(s_t)) - w R_\eta(s_t, A_\phi(s_t)) - w\gamma Q_\psi(\hat s_{t+1}, A_\phi(\hat s_{t+1}))\right], \quad (5)$$
or only use future action values:
$$L(\phi) = \mathbb{E}\left[-Q_\psi(s_t, A_\phi(s_t)) - w\left(Q_\psi(\hat s_{t+1}, A_\phi(\hat s_{t+1})) + w\, Q_\psi(\hat s_{t+2}, A_\phi(\hat s_{t+2}))\right)\right]. \quad (6)$$
We compared these two versions with FORK; the performance comparison can be found in Appendix B.3.
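The adaptive weighting and the system-loss threshold are simple to implement. The sketch below shows one way to compute $w$ each episode, assuming the goal reward `r0`, base weight `w0`, and threshold `sys_loss_threshold` are chosen per environment as in Table 4; the variable names are ours, not the paper's.

```python
def adaptive_fork_weight(avg_reward, r0, w0, sys_loss, sys_loss_threshold):
    """Scale the FORK terms down as the moving-average reward approaches the goal r0,
    and switch them off entirely while the system network is still inaccurate."""
    if sys_loss > sys_loss_threshold:
        return 0.0
    # Clip 1 - avg_reward/r0 into [0, 1], then scale by the base weight w0.
    frac = 1.0 - avg_reward / r0
    return w0 * min(max(frac, 0.0), 1.0)

# Example with made-up numbers: goal reward 300, base weight 0.6.
w = adaptive_fork_weight(avg_reward=150.0, r0=300.0, w0=0.6,
                         sys_loss=0.01, sys_loss_threshold=0.02)
# w == 0.3 here: halfway to the goal, so the FORK terms get half the base weight.
```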

4. EXPERIMENTS

In this section, we evaluate FORK as an add-on to existing algorithms. We name an algorithm with FORK as algorithm-FORK, e.g. TD3-FORK or SAC-FORK. As an example, a detailed description of TD3-FORK can be found in Appendix A.3. We focused on two algorithms: TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018b) because they were found to have the best performance among model-free reinforcement learning algorithms in recent benchmarking studies (Duan et al., 2016a; Wang et al., 2019) . We compared the performance of TD3-FORK and SAC-FORK with TD3, SAC and DDPG (Lillicrap et al., 2015) .

4.1. BOX2D AND MUJOCO ENVIRONMENTS

We selected six environments: BipedalWalker-v3 from Box2D (Catto, 2011), and Ant-v3, Hopper-v3, HalfCheetah-v3, Humanoid-v3 and Walker2d-v3 from MuJoCo (Todorov et al., 2012), as shown in Figure 3. All these environments have continuous state and action spaces.

Hyperparameters. Because FORK is an add-on, for TD3 we used the authors' implementation (https://github.com/sfujim/TD3); for SAC, we used a PyTorch version (https://github.com/vitchyr/rlkit) recommended by the authors, without any change except adding FORK. The hyperparameters of both TD3 and SAC are summarized in Table 3 in Appendix A.4, and the hyperparameters related to FORK are summarized in Table 4 in the same appendix. We can see that TD3-FORK does not require much hyperparameter tuning. The system network and reward network are the same across environments, except for Humanoid-v3, for which we use larger system and reward networks because its state dimension is higher than in the other environments. The base weight $w_0$ is the same for all environments, the base rewards are the typical cumulative rewards under TD3 after a successful training, and the system thresholds are the typical estimation errors after about 20,000 steps. SAC-FORK requires slightly more hyperparameter tuning: the base weights were chosen to be smaller values, the base rewards are the typical cumulative rewards under SAC, and the system thresholds are the same as those under TD3-FORK.

Initial Exploration. For each task and each algorithm, we use a random policy for exploration for the first 10,000 steps. Each step is one interaction with the environment.

Duration of Experiments.

For each environment and each algorithm, we ran five instances with different random seeds. Since we focus on Actor performance, Actor was trained for 0.5 million steps in each instance. Since TD3 uses delayed Actor updates with frequency 2 (i.e. Actor and Critic are trained at a 1:2 ratio), Critic was trained one million steps under TD3 and TD3-FORK. For SAC, SAC-FORK and DDPG, Critic was trained 0.5 million steps. The performance with the same amount of total training, including Critic training and Actor training, can be found in Appendix B.2, where for each algorithm, Critic and Actor together were trained 1.5 million steps.

4.3. RESULTS

Figure 4 shows the average cumulative rewards, where we evaluated the policies every 5,000 steps during the training process, without exploration noise. Each evaluation was averaged over 10 episodes. We trained five instances of each algorithm, using the same set of random seeds across algorithms. The solid curves show the average cumulative rewards (per episode), and the shaded regions represent the standard deviations. The best average cumulative rewards (the definition can be found in Appendix B.1) are summarized in Table 1. We can see that TD3-FORK outperforms all other algorithms. For Ant-v3, TD3-FORK improves the best average cumulative reward by more than 50% (5699.37 under TD3-FORK versus 3652.11 under TD3). We also studied the improvement in terms of sample complexity; the results are summarized in Table 2.

In summary, FORK improves the performance of both TD3 and SAC when included as an add-on. The improvement is more significant when adding FORK to TD3 than to SAC. FORK improves TD3 in all six environments, and improves SAC in three of the six environments. Furthermore, TD3-FORK performs the best in all six environments. More statistics about this set of experiments can be found in Appendix B.1. In Appendix B.2, we also present experimental results where Actor and Critic together have the same amount of training across all algorithms (i.e. under TD3 and TD3-FORK, Actor was trained 0.5 million steps and Critic was trained 1 million steps; under the other algorithms, Actor and Critic were each trained 0.75 million steps). In this case, TD3-FORK performs the best in four of the six environments, and SAC-FORK performs the best in the remaining two.

4.4. BIPEDALWALKER-HARDCORE-V3

A variation of TD3-FORK can also solve a well-known difficult environment, BipedalWalkerHardcore-v3, in as few as four hours using a single GPU. To the best of our knowledge, a previously known solution needs to train for days on a 72-CPU AWS EC2 instance with 64 worker processes, taking raw frames as input (https://github.com/dgriff777/a3c_continuous). The performance on BipedalWalkerHardcore-v3 during and after training can be viewed at https://youtu.be/0nYQpXtxh-Q. The implementation details can be found in Appendix C.

5. CONCLUSIONS

This paper proposes FORK, a forward-looking Actor, as an add-on to Actor-Critic algorithms. The evaluation on six environments demonstrated the significant performance improvements obtained by adding FORK to two state-of-the-art model-free reinforcement learning algorithms. A variation of TD3-FORK further solved BipedalWalkerHardcore in as few as four hours with a single GPU.

This appendix provides additional details about FORK and additional experiments. The appendix is organized as follows:
• In Section A, we provide additional details about FORK used in our experiments, including a description of TD3-FORK and the hyperparameters.
• In Section B, we present additional experimental results, including additional statistics of the experiments conducted in Section 4, a performance comparison under the same amount of Actor+Critic training, and the performance of different implementations of FORK.
• In Section C, we present the additional changes we made when using a variation of TD3-FORK, TD3-FORK-DQ, to solve BipedalWalkerHardcore-v3.

A ADDITIONAL DETAILS OF FORK

A.1 REVISED REWARD NETWORK

We found from our experiments that the reward network can more accurately predict the reward $r_t$ when the next state $s_{t+1}$ is included as an input to the reward network. Figure 5 shows the mean-square errors (MSE) of the reward network with $(s_t, a_t)$ as the input versus with $(s_t, a_t, s_{t+1})$ as the input, for BipedalWalker-v3 during the first 10,000 steps. We can clearly see that the MSE is lower for the revised reward network.
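For concreteness, a revised reward network of this form can be a small MLP over the concatenated (state, action, next state) vector, as in the hedged sketch below; the hidden size and layer count are our choices for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RevisedRewardNetwork(nn.Module):
    """Predicts r_t from (s_t, a_t, s_{t+1}) instead of (s_t, a_t) only."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action, next_state):
        return self.net(torch.cat([state, action, next_state], dim=-1))
```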

A.3 TD3-FORK

The detailed description of TD3-FORK can be found in Algorithm 1, and the code is also submitted as supplemental material.

A.4 HYPERPARAMETERS

Table 3 lists the hyperparameters used in DDPG, SAC, SAC-FORK and TD3-FORK. We kept the same hyperparameter values used in the SAC and TD3 codes provided or recommended by the authors. We did not tune these parameters because the goal is to show that FORK is a simple yet powerful add-on to existing Actor-Critic algorithms. Table 4 summarizes the environment-specific parameters, in particular the base weight and base cumulative reward used in implementing the adaptive weight, and the threshold for adding FORK. The base cumulative rewards for TD3-FORK are the typical cumulative rewards under TD3 after training Actor for 0.5 million steps. The base cumulative rewards for SAC-FORK are similarly chosen, but with more careful tuning. The thresholds are the typical loss values after training the system networks for about 20,000 steps, including the first 10,000 exploration steps. In the implementation, FORK is added to Actor training only after the system network can predict the next state reasonably well. We observed that TD3-FORK with our intuitive choices of hyperparameters worked well across different environments and required little tuning, while SAC-FORK required some careful tuning of the base weights and the base cumulative rewards.

Algorithm 1 TD3-FORK
  Initialize critic networks $Q_{\psi_1}$, $Q_{\psi_2}$, system network $F_\theta$, reward network $R_\eta$, and actor network $A_\phi$ with random parameters $\psi_1, \psi_2, \theta, \eta, \phi$
  Initialize target networks $\phi' \leftarrow \phi$, $\psi_1' \leftarrow \psi_1$, $\psi_2' \leftarrow \psi_2$
  Initialize replay buffer $B$ and soft update parameter $\tau$
  Initialize base reward $r_0$, base weight $w_0$, threshold $l$, and moving average reward $\bar r \leftarrow 0$
  Initialize noise clip bound $c$ and state bound $(o_{\min}, o_{\max})$
  for episode $e = 1, \dots, M$ do
    Reset the episode reward $r \leftarrow 0$ and receive the initial state
    for each time step $t$ do
      Select action $a_t$ according to the current policy with exploration noise, $a_t \sim A_\phi(s_t) + \epsilon_t$, $\epsilon_t \sim \mathcal N(0, \sigma)$
      Execute action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$
      Store the transition tuple $(s_t, a_t, r_t, s_{t+1})$ into the replay buffer $B$
      Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $B$
      $\tilde a_i \leftarrow A_{\phi'}(s_{i+1}) + \epsilon$, $\epsilon \sim \mathrm{clip}(\mathcal N(0, \tilde\sigma), -c, c)$
      Set $y_i = r_i + \gamma \min_{j=1,2} Q_{\psi_j'}(s_{i+1}, \tilde a_i)$
      $r \leftarrow r + r_t$
      Update the critic networks by minimizing the loss $L(\psi_j) = \frac{1}{N}\sum_i \left(y_i - Q_{\psi_j}(s_i, a_i)\right)^2$, $j = 1, 2$
      Update the system network by minimizing the loss $L(\theta) = \left\|s_{i+1} - F_\theta(s_i, a_i)\right\|_{\text{smooth L1}}$
      Update the reward network by minimizing the loss $L(\eta) = \frac{1}{N}\sum_i \left(r_i - R_\eta(s_i, a_i, s_{i+1})\right)^2$
      if $t \bmod d$ then
        Update $\phi$ by the sampled policy gradient:
        if $L(\theta) > l$ then
          $-\nabla_\phi L(\phi) = \frac{1}{N}\sum_i \nabla_a Q_{\psi_1}(s_i, a)\big|_{a=A_\phi(s_i)} \nabla_\phi A_\phi(s_i)$
        else
          $\hat s_{i+1} = \mathrm{clip}(F_\theta(s_i, A_\phi(s_i)), o_{\min}, o_{\max})$, $\hat s_{i+2} = \mathrm{clip}(F_\theta(\hat s_{i+1}, A_\phi(\hat s_{i+1})), o_{\min}, o_{\max})$
          $-\nabla_\phi L(\phi) = \frac{1}{N}\sum_i \Big[\nabla_a Q_{\psi_1}(s_i, a)\big|_{a=A_\phi(s_i)} \nabla_\phi A_\phi(s_i) + w\nabla_a R_\eta(s_i, a, \hat s_{i+1})\big|_{a=A_\phi(s_i)} \nabla_\phi A_\phi(s_i) + w\gamma \nabla_a R_\eta(\hat s_{i+1}, a, \hat s_{i+2})\big|_{a=A_\phi(\hat s_{i+1})} \nabla_\phi A_\phi(\hat s_{i+1}) + w\gamma^2 \nabla_a Q_{\psi_1}(\hat s_{i+2}, a)\big|_{a=A_\phi(\hat s_{i+2})} \nabla_\phi A_\phi(\hat s_{i+2})\Big]$
        end if
        $\phi \leftarrow \phi - \beta_t \nabla_\phi L(\phi)$
        Update the target networks:
          $\phi' \leftarrow \tau\phi + (1-\tau)\phi'$
          $\psi_j' \leftarrow \tau\psi_j + (1-\tau)\psi_j'$, $j = 1, 2$
      end if
    end for
    Update $\bar r \leftarrow ((e-1)\bar r + r)/e$
    Update the adaptive weight $w \leftarrow \min\left(1 - \max\left(0, \frac{\bar r}{r_0}\right), 1\right) w_0$
  end for

B ADDITIONAL EXPERIMENTAL RESULTS

B.1 ADDITIONAL STATISTICS

Table 5 summarizes the best average cumulative rewards, the associated standard deviations, and the best instance cumulative rewards. They are defined as follows. Recall that each algorithm is trained for five instances, where each instance includes 0.5 million steps of Actor training. During the training process, we evaluated the algorithm every 5,000 steps without exploration noise. For each evaluation, we calculated the average cumulative reward (without discount) over 10 episodes, where each episode lasts up to 1,600 steps under BipedalWalker-v3 and up to 1,000 steps under the MuJoCo environments.

B.2 PERFORMANCE WITH THE SAME AMOUNT OF ACTOR+CRITIC TRAINING

In Section 4, the algorithms were compared assuming the same amount of Actor training, since our focus is on the performance of Actor. Since TD3 uses delayed Actor training, Critic of TD3 and TD3-FORK is trained twice as much as Critic of SAC and SAC-FORK when Actor is trained for the same number of steps, which gives an advantage to TD3 and TD3-FORK. To further compare TD3-FORK and SAC-FORK, we present results where, for each algorithm, Actor and Critic together were trained 1.5 million steps. In particular, Actor was trained 0.5 million steps and Critic was trained 1 million steps under TD3 and TD3-FORK, and Actor and Critic were each trained 0.75 million steps under SAC and SAC-FORK. The results can be found in Figure 7.

B.3 DIFFERENT IMPLEMENTATIONS OF FORK

We compared TD3-FORK with TD3-FORK-S, which uses the one-step loss (Equation (5)), TD3-FORK-Q, standing for Q-FORK, which only uses future action values (Equation (6)), and TD3-FORK-DQ, the variation described in Appendix C. From Table 7, we can see that in terms of the best average cumulative reward, TD3-FORK performs the best in four out of the six environments and TD3-FORK-S performs the best in the remaining two. This is the reason we selected the current form of FORK.
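To make the variants being compared concrete, the snippet below sketches, under the same assumptions and placeholder names as the earlier actor-loss sketch, how the one-step (FORK-S) and Q-only (FORK-Q) actor losses from Section 3 could be formed.

```python
def fork_s_actor_loss(actor, critic, system_net, reward_net, state, gamma=0.99, w=1.0):
    """One-step variant (FORK-S): look a single step ahead, as in Equation (5)."""
    action = actor(state)
    next_state = system_net(state, action)
    next_action = actor(next_state)
    loss = -critic(state, action)
    loss = loss - w * reward_net(state, action)
    loss = loss - w * gamma * critic(next_state, next_action)
    return loss.mean()

def fork_q_actor_loss(actor, critic, system_net, state, w=1.0):
    """Q-only variant (FORK-Q): use only future action values, as in Equation (6)."""
    action = actor(state)
    next_state = system_net(state, action)
    next_action = actor(next_state)
    next_next_state = system_net(next_state, next_action)
    next_next_action = actor(next_next_state)
    loss = -critic(state, action)
    loss = loss - w * (critic(next_state, next_action)
                       + w * critic(next_next_state, next_next_action))
    return loss.mean()
```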

C BIPEDALWALKERHARDCORE

TD3-FORK-DQ can solve the difficult BipedalWalkerHardcore-v3 environment in as few as four hours. The hardcore version is much more difficult than BipedalWalker. For example, a known solution needs to train for days on a 72-CPU AWS EC2 instance with 64 worker processes, taking raw frames as input (https://github.com/dgriff777/a3c_continuous). TD3-FORK-DQ, a variation of TD3-FORK, can solve the problem in as few as four hours using the default GPU setting provided by Google Colab (https://colab.research.google.com/notebooks/intro.ipynb) and sensory data (not images). The performance on BipedalWalkerHardcore-v3 during and after training can be viewed at https://youtu.be/0nYQpXtxh-Q. The code has been submitted as supplementary material.

To solve BipedalWalkerHardcore, we made several additional changes. (i) We changed the -100 reward to -5. (ii) We increased the other rewards by a factor of 5. (iii) We implemented a replay buffer where failed episodes, in which the bipedal walker fell down at the end, and successful episodes are added to the replay buffer at a 5:1 ratio. The changes to the rewards, (i) and (ii), were suggested in a blog post (https://mp.weixin.qq.com/s?__biz=MzA5MDMwMTIyNQ==&mid=2649294554&idx=1&sn=9f893801b8917575779430cae89829fb&scene=21#wechat_redirect); using reward scaling to improve performance has also been reported in (Henderson et al., 2017). We made change (iii) because we found that failed episodes are more useful for learning than successful ones. The reason, we believe, is that once the bipedal walker already knows how to handle a terrain, there is no need to train further on the same type of terrain. When the training is near the end, most episodes are successful, so adding these successful episodes overwhelms the more useful (failed) episodes, which slows down the learning.
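The reward reshaping and the failed/successful buffer bias could be implemented as in the sketch below. The buffer interface and the exact bookkeeping for the 5:1 ratio are our own simplifications for illustration, not the released code; here the ratio is approximated by keeping only a fraction of successful episodes.

```python
import random
from collections import deque

def reshape_reward(r):
    """Changes (i) and (ii): soften the -100 fall penalty to -5 and scale other rewards by 5."""
    return -5.0 if r == -100 else 5.0 * r

class BiasedReplayBuffer:
    """Change (iii): bias the buffer toward failed episodes (walker fell down)."""
    def __init__(self, capacity=1_000_000, failed_to_success_ratio=5):
        self.buffer = deque(maxlen=capacity)
        self.ratio = failed_to_success_ratio
        self.success_count = 0

    def add_episode(self, transitions, failed):
        if failed:
            # Always keep failed episodes.
            self.buffer.extend(transitions)
        else:
            # Keep roughly one successful episode per `ratio` successful episodes.
            self.success_count += 1
            if self.success_count % self.ratio == 0:
                self.buffer.extend(transitions)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```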






Figure 2: TD3-FORK versus TD3-MT

Figure 3: The six environments used in our experiments

Figure 4: Learning curves of six environments. Curves were smoothed uniformly for visual clarity.

Figure 5: Training losses under the two different reward networks

Figure 7: Learning curves of the six environments. Under each algorithm, Actor and Critic, together, were trained for 1.5 million steps. Curves are smoothed uniformly for visual clarity.

Figure 8: Learning curves of TD3-FORK, TD3, TD3-FORK-S, TD3-FORK-Q and TD3-FORK-DQ. Curves are smoothed uniformly for visual clarity.

In Table 2, we summarize the number of Actor training steps required under TD3-FORK (SAC-FORK) to achieve the best average cumulative reward under TD3 (SAC). For example, for BipedalWalker-v3, TD3 achieved its best average cumulative reward with 0.4925 million steps of Actor training; TD3-FORK achieved the same value with only 0.225 million steps of Actor training, reducing the required samples by more than half.

Table 1: The best average cumulative rewards of the algorithms. The best value for each environment is highlighted in bold text.

Table 2: Sample complexity (million steps). The number of Actor training steps needed for TD3-FORK (SAC-FORK) to achieve the best average cumulative reward under TD3 (SAC). The numbers under TD3 (SAC) are the time steps at which TD3 (SAC) achieved its best average cumulative reward.



Table 5: Best average cumulative rewards, standard deviations, and best instance cumulative rewards of TD3-FORK, TD3, DDPG, SAC, and SAC-FORK over the six environments. The best value for each environment is in bold text.

Table 6 summarizes the best average cumulative rewards, standard deviations, and best instance cumulative rewards under the same amount of Actor+Critic training. We can see that, in terms of the best average cumulative rewards, TD3-FORK performed the best in four out of the six environments, including BipedalWalker, Ant, Hopper and HalfCheetah, and SAC-FORK performed the best in the remaining two, Humanoid and Walker2d.

Table 6: Best average cumulative reward, standard deviation, and best instance cumulative reward. The best value for each environment is in bold text.

Table 7: Best average cumulative rewards, standard deviations, and best instance cumulative rewards of TD3-FORK, TD3, TD3-FORK-S, TD3-FORK-Q, and TD3-FORK-DQ over the six environments. The best values are in bold text.

