WHAT MATTERS FOR ON-POLICY DEEP ACTOR-CRITIC METHODS? A LARGE-SCALE STUDY

Abstract

In recent years, reinforcement learning (RL) has been successfully applied to many different continuous control tasks. While RL algorithms are often conceptually simple, their state-of-the-art implementations take numerous low- and high-level design decisions that strongly affect the performance of the resulting agents. Those choices are usually not extensively discussed in the literature, leading to discrepancies between published descriptions of algorithms and their implementations. This makes it hard to attribute progress in RL and slows down overall progress [27]. As a step towards filling that gap, we implement >50 such "choices" in a unified on-policy deep actor-critic framework, allowing us to investigate their impact in a large-scale empirical study. We train over 250'000 agents in five continuous control environments of different complexity and provide insights and practical recommendations for the training of on-policy deep actor-critic RL agents.

1. INTRODUCTION

Deep reinforcement learning (RL) has seen increased interest in recent years due to its ability to train neural-network-based agents to act in environments through interaction. For continuous control tasks, on-policy algorithms such as REINFORCE [2], TRPO [10], A3C [14] and PPO [17], as well as off-policy algorithms such as DDPG [13] and SAC [21], have enabled successful applications such as quadrupedal locomotion [20], self-driving [30], or dexterous in-hand manipulation [20, 25, 32]. Many of these papers investigate different algorithmic ideas in depth, for example different loss functions and learning paradigms. Yet, it is less visible that behind successful experiments in deep RL there are complicated code bases that contain a large number of low- and high-level design decisions that are usually not discussed in research papers. While one may assume that such "choices" do not matter, there is evidence that they are in fact crucial for, or even driving, good performance [27]. While open-source implementations are available to practitioners, this is still unsatisfactory: research publications often contain one-to-one comparisons of different algorithmic ideas based on implementations in different code bases. This makes it impossible to assess whether improvements are due to the underlying algorithmic idea or due to the implementation. In fact, it is hard to assess the performance of high-level algorithmic ideas without an understanding of lower-level choices, as performance may strongly depend on the tuning of hyperparameters and implementation-level details. Overall, this makes it hard to attribute progress in reinforcement learning and slows down further research [15, 22, 27].

Our contributions. Our key goal in this paper is to investigate such lower-level choices in depth and to understand their impact on final agent performance.
Hence, as our key contributions, we (1) implement >50 choices in a unified on-policy deep actor-critic implementation¹, (2) conduct a large-scale experimental study (more than 250'000 agents trained) that covers different aspects of the training process, and (3) analyze the experimental results to provide practical insights and recommendations for the training of on-policy deep actor-critic RL agents.
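To make concrete what the on-policy actor-critic family referenced above shares, the following is a minimal sketch of its core objective: a policy-gradient loss weighted by advantages plus a value-regression loss for the critic. This is an illustrative sketch only, using a simple Monte-Carlo advantage estimator; the function names are hypothetical and it is not the paper's framework, which implements many additional design choices (e.g. GAE, normalization, clipping).

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute discounted returns G_t = sum_k gamma^k * r_{t+k} for one episode."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def actor_critic_losses(log_probs, values, rewards, gamma=0.99):
    """Generic on-policy actor-critic objective on a single trajectory.

    log_probs: log pi(a_t | s_t) for the actions actually taken.
    values:    critic estimates V(s_t).
    Returns (policy_loss, value_loss); their gradients drive learning
    in an autodiff framework.
    """
    returns = discounted_returns(np.asarray(rewards, dtype=np.float64), gamma)
    # Monte-Carlo advantage: empirical return minus the critic's baseline.
    advantages = returns - np.asarray(values, dtype=np.float64)
    # Policy gradient: raise log-probability of actions with positive advantage.
    policy_loss = -np.mean(np.asarray(log_probs, dtype=np.float64) * advantages)
    # Critic regression towards the empirical return.
    value_loss = np.mean((returns - np.asarray(values, dtype=np.float64)) ** 2)
    return policy_loss, value_loss
```

Algorithms such as A3C and PPO vary exactly the kinds of details this sketch glosses over: how advantages are estimated, how the policy loss is clipped or constrained, and how the two losses are weighted and optimized.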

¹ The implementation is available at https://github.com/google-research/seed_rl.

