WHAT MATTERS FOR ON-POLICY DEEP ACTOR-CRITIC METHODS? A LARGE-SCALE STUDY

Abstract

In recent years, reinforcement learning (RL) has been successfully applied to many different continuous control tasks. While RL algorithms are often conceptually simple, their state-of-the-art implementations make numerous low- and high-level design decisions that strongly affect the performance of the resulting agents. These choices are usually not extensively discussed in the literature, leading to a discrepancy between published descriptions of algorithms and their implementations. This makes it hard to attribute progress in RL and slows down overall progress [27]. As a step towards filling that gap, we implement >50 such "choices" in a unified on-policy deep actor-critic framework, allowing us to investigate their impact in a large-scale empirical study. We train over 250'000 agents in five continuous control environments of different complexity and provide insights and practical recommendations for the training of on-policy deep actor-critic RL agents.

1. INTRODUCTION

Deep reinforcement learning (RL) has seen increased interest in recent years, owing to the ability of neural-network-based agents to learn to act in environments through interaction. For continuous control tasks, on-policy algorithms such as REINFORCE [2], TRPO [10], A3C [14], and PPO [17], and off-policy algorithms such as DDPG [13] and SAC [21], have enabled successful applications such as quadrupedal locomotion [20], self-driving [30], and dexterous in-hand manipulation [20, 25, 32]. Many of these papers investigate different algorithmic ideas in depth, for example different loss functions and learning paradigms. Yet, it is less visible that behind successful experiments in deep RL there are complicated code bases containing a large number of low- and high-level design decisions that are usually not discussed in research papers. While one may assume that such "choices" do not matter, there is evidence that they are in fact crucial for, or even driving, good performance [27]. While open-source implementations are available to practitioners, the situation remains unsatisfactory: research publications often contain one-to-one comparisons of different algorithmic ideas based on implementations in different code bases. This makes it impossible to assess whether improvements are due to the underlying algorithmic idea or to the implementation. In fact, it is hard to assess the performance of high-level algorithmic ideas without an understanding of lower-level choices, as performance may strongly depend on the tuning of hyperparameters and implementation-level details. Overall, this makes it hard to attribute progress in reinforcement learning and slows down further research [15, 22, 27].

Our contributions. Our key goal in this paper is to investigate such lower-level choices in depth and to understand their impact on final agent performance.
Hence, as our key contributions, we (1) implement >50 choices in a unified on-policy deep actor-critic implementation, (2) conduct a large-scale (more than 250'000 agents trained) experimental study that covers different aspects of the training process, and (3) analyze the experimental results to provide practical insights and recommendations for the training of on-policy deep actor-critic RL agents.

Most surprising finding. While many of our experimental findings confirm common RL practices, some of them are quite surprising; e.g., the policy initialization scheme significantly influences performance while it is rarely even mentioned in RL publications. In particular, we have found that initializing the network so that the initial action distribution has zero mean and a rather low standard deviation, and is independent of the observation, significantly improves the training speed (Sec. 3.2).

Paper outline. The rest of this paper is structured as follows: We describe our experimental setup and the performance metrics used in Sec. 2. In Sec. 3 we present and analyze the experimental results, and we finish with related work in Sec. 4 and conclusions in Sec. 5. The appendices contain a detailed description of all design choices we experiment with (App. B), default hyperparameters (App. C), and the raw experimental results (App. D - K).
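To make the initialization finding concrete, the following sketch shows one common way to obtain an initial action distribution whose mean is approximately zero for every observation and whose standard deviation is low and state-independent: scale down the final mean layer and use a free log-std parameter. This is our own illustration, not the paper's exact configuration; the function names and the 0.5 / 0.01 constants are assumptions.

```python
import numpy as np

def init_policy_head(obs_dim, act_dim, init_std=0.5, last_layer_scale=0.01):
    """Sketch of a Gaussian policy head initialization (illustrative values)."""
    rng = np.random.default_rng(0)
    # Scaling the final mean layer down (here by 0.01; using exactly 0 also
    # works) makes the initial action mean approximately zero for any input.
    w = rng.standard_normal((obs_dim, act_dim)) * last_layer_scale
    b = np.zeros(act_dim)
    # The standard deviation is a free parameter, not a function of the
    # observation, started at a deliberately low value.
    log_std = np.full(act_dim, np.log(init_std))
    return w, b, log_std

def initial_action_dist(obs, params):
    w, b, log_std = params
    mean = obs @ w + b        # close to 0 for any observation
    std = np.exp(log_std)     # identical for all observations
    return mean, std

params = init_policy_head(obs_dim=11, act_dim=3)
mean, std = initial_action_dist(np.ones(11), params)
```

With this scheme, early rollouts consist of small, centered random actions, which plausibly explains the faster training the study reports.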

2. STUDY DESIGN

Considered setting. In this paper, we consider the setting of on-policy deep actor-critic reinforcement learning for continuous control. We define on-policy learning in the following loose sense: we consider policy iteration algorithms that alternate between generating experience using the current policy and using that experience to improve the policy. This is the standard modus operandi of algorithms usually considered on-policy, such as PPO [17]. However, we note that such algorithms often perform several model updates per policy improvement iteration and thus may technically operate on off-policy data within a single iteration. As benchmark environments, we consider five widely used continuous control environments from OpenAI Gym [12] of varying complexity: Hopper-v1, Walker2d-v1, HalfCheetah-v1, Ant-v1, and Humanoid-v1.

Unified on-policy deep actor-critic algorithm. We took the following approach to create a highly configurable unified on-policy deep actor-critic algorithm with as many choices as possible:

1. We researched prior work and popular code bases to compile a list of commonly used choices, i.e., different loss functions (both for value functions and policies), architectural choices such as initialization methods, heuristic tricks such as gradient clipping, and all their corresponding hyperparameters.
2. Based on this, we implemented a single, unified on-policy deep actor-critic agent and corresponding training protocol, starting from the SEED RL code base [28]. Whenever we faced implementation decisions that could not be clearly motivated or that had alternative solutions, we added them as additional choices.
3. We verified that when all choices are selected as in the PPO implementation from OpenAI Baselines, we obtain performance similar to that reported in the PPO paper [17]. We chose PPO because it is probably the most commonly used on-policy deep actor-critic RL algorithm at the moment.
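The loose on-policy scheme described above can be sketched as a toy loop; `collect_rollout` and `update` are hypothetical stand-ins (a scalar "policy" and a toy gradient step), not the paper's code. The point is the structure: experience is always gathered with the current policy, but several update epochs reuse the same fixed batch, so epochs after the first technically consume slightly off-policy data.

```python
def collect_rollout(policy, steps=8):
    # Experience generated by the *current* policy (on-policy collection).
    return [(s, policy * s) for s in range(steps)]

def update(policy, batch, lr=0.01):
    # Toy improvement step on the fixed batch; after the first epoch the
    # batch is stale relative to the updated parameters.
    grad = sum(s for s, _ in batch) / len(batch)
    return policy + lr * grad

def train(policy=0.0, num_iterations=3, epochs_per_iter=4):
    for _ in range(num_iterations):
        batch = collect_rollout(policy)       # regenerate with current policy
        for _ in range(epochs_per_iter):      # multiple updates per batch
            policy = update(policy, batch)
    return policy

final_policy = train()
```

PPO's clipped objective exists precisely to keep these within-iteration updates from drifting too far from the data-collecting policy.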
The resulting agent implementation is detailed in Appendix B. Its key property is that it exposes all choices as configuration options in a unified manner. For convenience, we mark each choice in this paper with a number (e.g., C1) and a fixed name (e.g., num_envs (C1)) that can be easily used to find its description in Appendix B.

Difficulty of investigating choices. The primary goal of this paper is to understand how the different choices affect the final performance of an agent and to derive recommendations for these choices. There are two key reasons why this is challenging: First, we are mainly interested in insights on choices within good hyperparameter configurations. Yet, if all choices are sampled randomly, performance is very poor and little (if any) training progress is made. This may be explained by the presence of sub-optimal settings (e.g., hyperparameters of the wrong scale) that prevent learning altogether. With many choices, the probability of such a failure increases exponentially in their number.
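A back-of-the-envelope calculation illustrates the exponential failure argument: if each of k independently sampled choices lands in a "workable" range with probability p, the chance that all of them do decays as p^k. The p = 0.9 value below is purely illustrative, not an estimate from the study.

```python
def prob_all_workable(k, p=0.9):
    # Probability that all k independently sampled choices land in a
    # workable range, assuming each does so with probability p.
    return p ** k

print(prob_all_workable(5))   # ~0.59: a handful of choices is usually fine
print(prob_all_workable(50))  # ~0.005: with 50 choices, random sampling
                              # almost always hits at least one bad setting
```

This motivates the study design of sampling choices around known-good baseline configurations rather than uniformly at random.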



The implementation is available at https://github.com/google-research/seed_rl.

It has been noticed that the version of the MuJoCo physics simulator [5] can slightly influence the behaviour of some of the environments (see https://github.com/openai/gym/issues/1541); we used MuJoCo 2.0 in our experiments.

