REGULARIZATION MATTERS IN POLICY OPTIMIZATION - AN EMPIRICAL STUDY ON CONTINUOUS CONTROL

Abstract

Deep Reinforcement Learning (Deep RL) has been receiving increasing attention thanks to its encouraging performance on a variety of control tasks. Yet, conventional regularization techniques for training neural networks (e.g., L2 regularization, dropout) have been largely ignored in RL methods, possibly because agents are typically trained and evaluated in the same environment, and because the deep RL community focuses more on high-level algorithm design. In this work, we present the first comprehensive study of regularization techniques with multiple policy optimization algorithms on continuous control tasks. Interestingly, we find that conventional regularization techniques on the policy networks can often bring large improvements, especially on harder tasks. We also compare these techniques with the more widely used entropy regularization. Our findings are shown to be robust against variations in training hyperparameters. In addition, we study regularizing different components and find that regularizing only the policy network is typically best. Finally, we discuss and analyze why regularization may help generalization in RL from four perspectives: sample complexity, return distribution, weight norm, and noise robustness. We hope our study provides guidance for future practice in regularizing policy optimization algorithms. Our code is available at https://github.com/xuanlinli17/iclr2021_rlreg.

1. INTRODUCTION

The use of regularization methods to prevent overfitting is a key technique in successfully training neural networks. Perhaps the most widely recognized regularization methods in deep learning are L2 regularization (also known as weight decay) and dropout (Srivastava et al., 2014). These techniques are standard practice in supervised learning tasks across many domains. Major tasks in computer vision, e.g., image classification (Krizhevsky et al., 2012; He et al., 2016) and object detection (Ren et al., 2015; Redmon et al., 2016), use L2 regularization as a default option. In natural language processing, for example, the Transformer (Vaswani et al., 2017) uses dropout, and the popular BERT model (Devlin et al., 2018) uses L2 regularization. In fact, it is rare to see state-of-the-art neural models trained without regularization in a supervised setting. However, in deep reinforcement learning (deep RL), these conventional regularization methods are largely absent or underutilized in past research, possibly because in most cases we are maximizing the return on the same task as in training. In other words, there is no generalization gap from the training environment to the test environment (Cobbe et al., 2018). Heretofore, researchers in deep RL have focused on high-level algorithm design and largely overlooked issues related to network training, including regularization. For popular policy optimization algorithms such as Trust Region Policy Optimization (TRPO) (Schulman et al., 2015), Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Soft Actor-Critic (SAC) (Haarnoja et al., 2018), conventional regularization methods were not considered. In popular codebases such as OpenAI Baselines (Dhariwal et al., 2017), L2 regularization and dropout were not incorporated.
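To make the two conventional techniques concrete, the following is a minimal NumPy sketch, not the paper's implementation: an L2 (weight decay) penalty added to a policy loss, and inverted dropout applied to activations during training. The coefficient, shapes, and the placeholder surrogate loss are illustrative assumptions.

```python
import numpy as np

def l2_penalty(weights, coeff=1e-4):
    """L2 regularization: coefficient times the sum of squared weights."""
    return coeff * sum(np.sum(w ** 2) for w in weights)

def dropout(x, rate=0.1, rng=None, training=True):
    """Inverted dropout: zero each activation with probability `rate`,
    rescale survivors by 1/(1 - rate) so the expected value is unchanged."""
    if not training or rate == 0.0:
        return x
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# Illustrative use: add the L2 penalty of the policy network's weights
# to a (placeholder) surrogate objective before taking a gradient step.
policy_weights = [np.ones((4, 4)), np.ones((4, 2))]   # toy weight matrices
surrogate_loss = 1.0                                  # stand-in for e.g. a PPO clipped loss
total_loss = surrogate_loss + l2_penalty(policy_weights, coeff=1e-4)
```

In a real training loop the penalty would be differentiated along with the surrogate objective (or applied via the optimizer's weight-decay option), and dropout would be disabled at evaluation time.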
Instead, a commonly used regularization in RL is entropy regularization, which penalizes high-certainty outputs from the policy network to encourage exploration and prevent the agent from overfitting to certain actions. The entropy

