REGULARIZATION MATTERS IN POLICY OPTIMIZATION - AN EMPIRICAL STUDY ON CONTINUOUS CONTROL

Abstract

Deep Reinforcement Learning (Deep RL) has been receiving increasingly more attention thanks to its encouraging performance on a variety of control tasks. Yet, conventional regularization techniques in training neural networks (e.g., L2 regularization, dropout) have been largely ignored in RL methods, possibly because agents are typically trained and evaluated in the same environment, and because the deep RL community focuses more on high-level algorithm design. In this work, we present the first comprehensive study of regularization techniques with multiple policy optimization algorithms on continuous control tasks. Interestingly, we find that conventional regularization techniques on the policy networks can often bring large improvements, especially on harder tasks. We also compare these techniques with the more widely used entropy regularization. Our findings are shown to be robust against variations in training hyperparameters. In addition, we study regularizing different components and find that regularizing only the policy network is typically best. Finally, we discuss and analyze why regularization may help generalization in RL from four perspectives: sample complexity, return distribution, weight norm, and noise robustness. We hope our study provides guidance for future practice in regularizing policy optimization algorithms. Our code is available at https://github.com/xuanlinli17/iclr2021_rlreg.

1. INTRODUCTION

The use of regularization methods to prevent overfitting is a key technique in successfully training neural networks. Perhaps the most widely recognized regularization methods in deep learning are L2 regularization (also known as weight decay) and dropout (Srivastava et al., 2014). These techniques are standard practice in supervised learning across many domains. Major tasks in computer vision, e.g., image classification (Krizhevsky et al., 2012; He et al., 2016) and object detection (Ren et al., 2015; Redmon et al., 2016), use L2 regularization as a default option. In natural language processing, for example, the Transformer (Vaswani et al., 2017) uses dropout, and the popular BERT model (Devlin et al., 2018) uses L2 regularization. In fact, it is rare to see state-of-the-art neural models trained without regularization in a supervised setting. However, in deep reinforcement learning (deep RL), these conventional regularization methods are largely absent or underutilized in past research, possibly because in most cases we are maximizing the return on the same task used for training; in other words, there is no generalization gap from the training environment to the test environment (Cobbe et al., 2018). To date, researchers in deep RL have focused on high-level algorithm design and largely overlooked issues related to network training, including regularization. For popular policy optimization algorithms such as Trust Region Policy Optimization (TRPO) (Schulman et al., 2015), Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Soft Actor-Critic (SAC) (Haarnoja et al., 2018), conventional regularization methods were not considered. In popular codebases such as OpenAI Baselines (Dhariwal et al., 2017), L2 regularization and dropout were not incorporated.
Instead, a commonly used regularization in RL is entropy regularization, which penalizes high-certainty outputs from the policy network to encourage more exploration and prevent the agent from overfitting to certain actions. Entropy regularization was first introduced by Williams & Peng (1991) and is now used by many contemporary algorithms (Mnih et al., 2016; Schulman et al., 2017; Teh et al., 2017; Farebrother et al., 2018). In this work, we take an empirical approach to assess the conventional paradigm of omitting common regularization when training deep RL models. We study agent performance on the current task (the environment the agent is trained on), rather than its generalization ability to different environments as in many recent works (Zhao et al., 2019; Farebrother et al., 2018; Cobbe et al., 2018). We specifically focus our study on policy optimization methods, which are increasingly popular and have achieved top performance on various tasks. We evaluate four popular policy optimization algorithms, namely SAC, PPO, TRPO, and the synchronous version of Advantage Actor-Critic (A2C), on multiple continuous control tasks. Various conventional regularization techniques are considered, including L2/L1 weight regularization, dropout, weight clipping (Arjovsky et al., 2017), and Batch Normalization (BN) (Ioffe & Szegedy, 2015). We compare the performance of these regularization techniques to training without regularization, as well as to entropy regularization. Surprisingly, even though the training and testing environments are the same, we find that many of the conventional regularization techniques, when imposed on the policy network, can still improve performance, sometimes significantly. Among these regularizers, L2 regularization tends to be the most effective overall; L1 regularization and weight clipping can boost performance in many cases.
Dropout and Batch Normalization tend to bring improvements only with off-policy algorithms. Additionally, all regularization methods tend to be more effective on more difficult tasks. We also verify our findings under a wide range of training hyperparameters and network sizes, and the results suggest that imposing proper regularization can sometimes save the effort of tuning other training hyperparameters. We further study which part of the policy optimization system should be regularized, and conclude that regularizing only the policy network generally suffices, as imposing regularization on value networks usually does not help. Finally, we discuss and analyze possible reasons for some of our experimental observations. Our main contributions can be summarized as follows:

• To the best of our knowledge, we provide the first systematic study of common regularization methods in policy optimization, which have been largely ignored in the deep RL literature.

• We find that conventional regularizers can be effective on continuous control tasks (especially harder ones) with statistical significance, under randomly sampled training hyperparameters. Interestingly, simple regularizers (L2, L1, weight clipping) can perform better than entropy regularization, with L2 generally the best. BN and dropout only help in off-policy algorithms.

• We study which part of the network(s) should be regularized. The key lesson is to regularize the policy network but not the value network.

• We analyze why regularization may help generalization in RL through sample complexity, return distribution, weight norm, and robustness to training noise.
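To make the contrast between these regularizers concrete, the sketch below shows how each penalty could be computed for a policy network. This is a minimal illustration, not the exact setup used in our experiments: the coefficient values and the diagonal-Gaussian policy parameterization are assumptions for exposition.

```python
import numpy as np

def l2_penalty(weights, coeff=1e-4):
    """Conventional L2 regularization: penalize squared weight magnitudes."""
    return coeff * sum(np.sum(w ** 2) for w in weights)

def l1_penalty(weights, coeff=1e-4):
    """L1 regularization: penalize absolute weight magnitudes."""
    return coeff * sum(np.sum(np.abs(w)) for w in weights)

def clip_weights(weights, c=0.5):
    """Weight clipping: hard-constrain every parameter to [-c, c] after each update."""
    return [np.clip(w, -c, c) for w in weights]

def entropy_bonus(log_std, coeff=0.01):
    """Entropy regularization for a diagonal Gaussian policy: reward
    high-entropy (high-uncertainty) action distributions."""
    entropy = np.sum(log_std + 0.5 * np.log(2.0 * np.pi * np.e))
    return coeff * entropy

# A regularized policy loss combines the surrogate objective with one such term:
#   loss = surrogate_loss + l2_penalty(policy_weights)       (parameter-space)
#   loss = surrogate_loss - entropy_bonus(policy_log_std)    (output-space)
weights = [np.ones((4, 2)), np.ones(2)]   # toy 4->2 linear policy layer
print(l2_penalty(weights))                # 1e-4 * (8 + 2) = 0.001
```

Note that L2, L1, and weight clipping act on the network parameters, whereas entropy regularization acts on the network's output distribution; this distinction is revisited in the related-work discussion.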

2. RELATED WORK

Regularization in Deep RL. There have been many prior works studying the theory of regularization in policy optimization (Farahmand et al., 2009; Neu et al., 2017; Zhang et al., 2020). In practice, conventional regularization methods have rarely been applied in deep RL. One rare case of such use is in Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2016), where Batch Normalization is applied to all layers of the actor and some layers of the critic, and L2 regularization is applied to the critic. Some recent studies have developed more complicated regularization approaches to continuous control tasks (Parisi et al., 2019). Also, these techniques consider regularizing the output of the network, while conventional methods mostly regularize the parameters directly. In this work, we focus on studying these simpler but under-utilized regularization methods.

Generalization in Deep RL typically refers to how the model performs in an environment different from the one it was trained on. The generalization gap can come from different modes/levels/difficulties of a game (Farebrother et al., 2018), simulation vs. the real world (Tobin et al., 2017), parameter variations (Pattanaik et al., 2018), or different random seeds in environment generation (Zhang et al.,

