FORK: A FORWARD-LOOKING ACTOR FOR MODEL-FREE REINFORCEMENT LEARNING

Abstract

In this paper, we propose a new type of Actor, named forward-looking Actor, or FORK for short, for Actor-Critic algorithms. FORK can be easily integrated into a model-free Actor-Critic algorithm. Our experiments on six Box2D and MuJoCo environments with continuous state and action spaces demonstrate the significant performance improvement FORK can bring to state-of-the-art algorithms. A variation of FORK can further solve BipedalWalkerHardcore in as little as four hours using a single GPU.

1. INTRODUCTION

Deep reinforcement learning has had tremendous successes, and sometimes even superhuman performance, in a wide range of applications including board games (Silver et al., 2016), video games (Vinyals et al., 2019), and robotics (Haarnoja et al., 2018a). A key to these recent successes is the use of deep neural networks as high-capacity function approximators that can harvest a large amount of data samples to approximate high-dimensional state or action value functions, which tackles one of the most challenging issues in reinforcement learning: problems with very large state and action spaces. Many modern reinforcement learning algorithms are model-free, so they are applicable in different environments and can readily react to new and unseen states.

This paper considers model-free reinforcement learning for problems with continuous state and action spaces, in particular the Actor-Critic method, where Critic evaluates the state or action values of Actor's policy and Actor improves the policy based on the value estimates from Critic. To draw an analogy between Actor-Critic algorithms and human decision making, consider the scenario where a high school student is deciding which college to attend after graduation. The student, like Actor, is likely to make her/his decision based on the perceived values of the colleges, where the value of a college is based on many factors, including (i) the quality of education it offers, its culture, and its diversity, which can be viewed as instantaneous rewards of attending the college; and (ii) the career opportunities after finishing the college, which can be thought of as the future cumulative reward. We now take this analogy one step further: in human decision making, we often not only consider the "value" of the current state and action, but also forecast the outcome of the current decision and the value of the next state.
In the example above, a student often explicitly takes into consideration the first job she/he may have after finishing college, and the "value" of that first job. Since forward-looking is common in human decision making, we are interested in understanding whether such forward-looking decision making can help Actor; in particular, whether it is useful for Actor to forecast the next state and use the value of future states to improve the policy. To our great surprise, a relatively straightforward implementation of a forward-looking Actor, as an add-on to existing Actor algorithms, improves Actor's performance by a large margin. Our new Actor, named FOrward-looKing Actor, or FORK for short, mimics human decision making where we think multiple steps ahead. In particular, FORK includes a neural network that forecasts the next state given the current state and action, called the system network, and a neural network that forecasts the reward given a (state, action) pair, called the reward network. With the system network and reward network, FORK can forecast the next state and consider the value of the next state when improving the policy. For example, consider Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), which updates the parameters of Actor as follows:

    φ ← φ + β ∇_φ Q_ψ(s_t, A_φ(s_t)),

where s_t is the state at time t, φ are Actor's parameters, β is the learning rate, Q_ψ(s, a) is the Critic network, and A_φ(s) is the Actor network. With DDPG-FORK, the parameters can instead be updated as follows:

    φ ← φ + β (∇_φ Q_ψ(s_t, A_φ(s_t)) + ∇_φ R_η(s_t, A_φ(s_t)) + γ ∇_φ R_η(s_{t+1}, A_φ(s_{t+1})) + γ^2 ∇_φ Q_ψ(s_{t+2}, A_φ(s_{t+2}))),    (1)

where R_η is the reward network, and s_{t+1} and s_{t+2} are the future states forecast by the system network F_θ.
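To make the base DDPG update concrete, the following sketch uses tiny linear stand-ins for the Actor and Critic networks; all names, dimensions, and weights here are illustrative assumptions, not the paper's architectures. For a linear Critic, the gradient ∇_φ Q_ψ(s_t, A_φ(s_t)) has a closed form, so one ascent step can be checked directly:

```python
import numpy as np

# Linear stand-ins for the networks (illustrative assumption only).
rng = np.random.default_rng(1)
S_DIM, A_DIM = 4, 2
W = rng.normal(size=(A_DIM, S_DIM)) * 0.1  # actor A_phi(s) = W @ s, phi = W
w_s = rng.normal(size=S_DIM)               # critic weights on the state
w_a = rng.normal(size=A_DIM)               # critic weights on the action

def critic(s, a):
    """Linear critic Q_psi(s, a) = w_s . s + w_a . a."""
    return w_s @ s + w_a @ a

s_t = rng.normal(size=S_DIM)
beta = 0.1  # actor learning rate

# For this linear critic, grad_W Q_psi(s_t, W @ s_t) = outer(w_a, s_t).
grad = np.outer(w_a, s_t)

q_before = critic(s_t, W @ s_t)
W = W + beta * grad                        # phi <- phi + beta * grad_phi Q
q_after = critic(s_t, W @ s_t)
print(q_after > q_before)  # prints True: the ascent step increases Q at s_t
```

Here the improvement is guaranteed because q_after − q_before = β (w_a·w_a)(s_t·s_t) > 0; with neural networks the same step is taken via automatic differentiation.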
We will see that FORK can be easily incorporated into most deep Actor-Critic algorithms by adding two additional neural networks (the system network and the reward network), and by adding extra terms to the loss function when training Actor, e.g., adding the term R_η(s_t, A_φ(s_t)) + γ R_η(s_{t+1}, A_φ(s_{t+1})) + γ^2 Q_ψ(s_{t+2}, A_φ(s_{t+2})) for each sampled state s_t to implement (1). We remark that Equation (1) is just one example of FORK; FORK can have different implementations (a detailed discussion can be found in Section 3). We further remark that learning the system model is not a new idea and has a long history in reinforcement learning, known as model-based reinforcement learning (some state-of-the-art model-based reinforcement learning algorithms and benchmarks can be found in (Wang et al., 2019)). Model-based reinforcement learning uses the model in a sophisticated way, often based on deterministic or stochastic optimal control theory, to optimize the policy based on the model. FORK only uses the system network as a blackbox to forecast future states, and does not use it as a mathematical model for optimizing control actions. With this key distinction, any model-free Actor-Critic algorithm with FORK remains model-free. In our experiments, we added FORK to two state-of-the-art model-free algorithms, according to recent benchmark studies (Duan et al., 2016a; Wang et al., 2019): TD3 (Fujimoto et al., 2018) (for deterministic policies) and SAC (Haarnoja et al., 2018b) (for stochastic policies). Evaluations on six challenging environments with continuous state and action spaces show significant improvement when FORK is added. In particular, TD3-FORK performs the best among all the algorithms we tested. For Ant-v3, it improves the average cumulative reward by more than 50% over TD3, and achieves TD3's best performance using only 35% of the training samples.
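The following self-contained sketch shows how such a forward-looking actor objective can be computed and ascended. It again uses tiny linear stand-ins for the four networks, and a finite-difference gradient in place of a deep-learning library's autograd; the names, dimensions, and this gradient method are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Linear stand-ins for the four networks (illustrative assumption only).
rng = np.random.default_rng(0)
S_DIM, A_DIM = 3, 2
gamma = 0.99

W = rng.normal(size=(A_DIM, S_DIM)) * 0.1            # actor A_phi(s) = W @ s
w_q = rng.normal(size=S_DIM + A_DIM)                 # critic Q_psi(s, a)
w_r = rng.normal(size=S_DIM + A_DIM)                 # reward network R_eta(s, a)
W_f = rng.normal(size=(S_DIM, S_DIM + A_DIM)) * 0.1  # system network F_theta(s, a)

def actor(W, s): return W @ s
def Q(s, a): return w_q @ np.concatenate([s, a])
def R(s, a): return w_r @ np.concatenate([s, a])
def F(s, a): return W_f @ np.concatenate([s, a])     # forecast of the next state

def fork_objective(W, s_t):
    """Q(s_t,a_t) + R(s_t,a_t) + gamma R(s_{t+1},a_{t+1}) + gamma^2 Q(s_{t+2},a_{t+2}),
    with s_{t+1}, s_{t+2} forecast by the system network, as in Equation (1)."""
    a_t = actor(W, s_t)
    s1 = F(s_t, a_t); a1 = actor(W, s1)
    s2 = F(s1, a1);   a2 = actor(W, s2)
    return Q(s_t, a_t) + R(s_t, a_t) + gamma * R(s1, a1) + gamma ** 2 * Q(s2, a2)

def grad_fd(W, s_t, eps=1e-6):
    """Central-difference gradient w.r.t. the actor weights
    (autograd would do this in one line in practice)."""
    g = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps; Wm[i, j] -= eps
            g[i, j] = (fork_objective(Wp, s_t) - fork_objective(Wm, s_t)) / (2 * eps)
    return g

s_t = rng.normal(size=S_DIM)
beta = 1e-3
before = fork_objective(W, s_t)
W = W + beta * grad_fd(W, s_t)                       # one forward-looking ascent step
after = fork_objective(W, s_t)
print(after > before)
```

Note that the gradient flows through the forecast states, since s_{t+1} and s_{t+2} themselves depend on the actor's parameters; the system network is used only as a blackbox forecaster, consistent with FORK remaining model-free.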
BipedalWalker-v3 is considered "solved" when the agent obtains an average cumulative reward of at least 300 [1]. TD3-FORK needs only 0.23 million actor training steps to solve the problem, half the number required by TD3. Furthermore, a variation of TD3-FORK solves BipedalWalkerHardcore, a well-known difficult environment, in as little as four hours using a single GPU.

1.1. RELATED WORK

The idea of using learned models in reinforcement learning is not new, and actually has a long history in reinforcement learning. At a high level, FORK shares a similar spirit as model-based reinforcement learning and rollout. However, in terms of implementation, FORK is very different and much simpler. Rollout in general requires the Monte-Carlo method (Silver et al., 2017) to simulate a finite number of future states from the current state and then combines that with value function approximations to decide the action to take at the current time. FORK does not require any high-fidelity simulation. The key distinction between FORK and model-based reinforcement learning is that model-based reinforcement learning uses the learned



[1] https://github.com/openai/gym/blob/master/gym/envs/box2d/bipedal_walker.py

