FORK: A FORWARD-LOOKING ACTOR FOR MODEL-FREE REINFORCEMENT LEARNING

Abstract

In this paper, we propose a new type of Actor, named forward-looking Actor or FORK for short, for Actor-Critic algorithms. FORK can be easily integrated into a model-free Actor-Critic algorithm. Our experiments on six Box2D and MuJoCo environments with continuous state and action spaces demonstrate the significant performance improvements FORK brings to state-of-the-art algorithms. A variation of FORK can further solve BipedalWalkerHardcore in as few as four hours using a single GPU.

1. INTRODUCTION

Deep reinforcement learning has had tremendous successes, sometimes even superhuman performance, in a wide range of applications including board games (Silver et al., 2016), video games (Vinyals et al., 2019), and robotics (Haarnoja et al., 2018a). A key to these recent successes is the use of deep neural networks as high-capacity function approximators, which can harvest a large amount of data samples to approximate high-dimensional state or action value functions. This tackles one of the most challenging issues in reinforcement learning problems with very large state and action spaces. Many modern reinforcement learning algorithms are model-free, so they are applicable in different environments and can readily react to new and unseen states. This paper considers model-free reinforcement learning for problems with continuous state and action spaces, in particular, the Actor-Critic method, where Critic evaluates the state or action values of the Actor's policy and Actor improves the policy based on the value estimation from Critic.

To draw an analogy between Actor-Critic algorithms and human decision making, consider the scenario where a high school student is deciding which college to attend after graduation. The student, like Actor, is likely to make her/his decision based on the perceived values of the colleges, where the value of a college depends on many factors, including (i) the quality of education it offers, its culture, and its diversity, which can be viewed as instantaneous rewards of attending the college; and (ii) the career opportunities after finishing the college, which can be thought of as the future cumulative reward. We now take this analogy one step further: in human decision making, we often not only consider the "value" of the current state and action, but also forecast the outcome of the current decision and the value of the next state.
In the example above, a student often explicitly takes into consideration the first job she/he may have after finishing college, and the "value" of that first job. Since forward-looking reasoning is common in human decision making, we are interested in understanding whether it can help Actor; in particular, whether it is useful for Actor to forecast the next state and use the value of future states to improve the policy. To our great surprise, a relatively straightforward implementation of a forward-looking Actor, as an add-on to existing Actor algorithms, improves Actor's performance by a large margin. Our new Actor, named FOrward-looKing Actor or FORK for short, mimics human decision making where we think multiple steps ahead. In particular, FORK includes a neural network that forecasts the next state given the current state and current action, called the system network; and a neural network that forecasts the reward given a (state, action) pair, called the reward network. With the system network and reward network,
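To make the one-step-ahead idea concrete, the following is a minimal sketch (not the paper's exact implementation) of a forward-looking value estimate: the learned reward network predicts the immediate reward, the learned system network forecasts the next state, and the critic evaluates the policy's action at that forecasted state. All models here are hypothetical toy functions standing in for trained neural networks.

```python
import numpy as np

# Hypothetical stand-ins for learned networks (toy functions for illustration).
def system_model(state, action):
    # Forecasts the next state given (state, action); here: s' = s + a.
    return state + action

def reward_model(state, action):
    # Forecasts the immediate reward; here: negative squared distance
    # of the reached state from the origin.
    return -np.sum((state + action) ** 2)

def q_value(state, action):
    # Toy critic: value of taking `action` in `state`.
    return -np.sum((state + action) ** 2)

def policy(state):
    # Toy deterministic Actor: move halfway toward the origin.
    return -0.5 * state

def forward_looking_value(state, gamma=0.99):
    """One-step forward-looking estimate for training the Actor:
    predicted immediate reward plus the discounted critic value
    at the forecasted next state."""
    action = policy(state)
    next_state = system_model(state, action)   # forecast s' with the system network
    next_action = policy(next_state)           # a' = pi(s')
    return reward_model(state, action) + gamma * q_value(next_state, next_action)
```

In an Actor-Critic algorithm, the Actor's parameters would be updated to maximize such a forward-looking estimate (by backpropagating through the differentiable system and reward networks), rather than only the critic's value of the current action.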

