DIFFERENTIABLE TRUST REGION LAYERS FOR DEEP REINFORCEMENT LEARNING

Abstract

Trust region methods are a popular tool in reinforcement learning as they yield robust policy updates in continuous and discrete action spaces. However, enforcing such trust regions in deep reinforcement learning is difficult. Hence, many approaches, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), are based on approximations. Due to those approximations, they violate the constraints or fail to find the optimal solution within the trust region. Moreover, they are difficult to implement, often lack sufficient exploration, and have been shown to depend on seemingly unrelated implementation choices. In this work, we propose differentiable neural network layers to enforce trust regions for deep Gaussian policies via closed-form projections. Unlike existing methods, those layers formalize trust regions for each state individually and can complement existing reinforcement learning algorithms. We derive trust region projections based on the Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius norm for Gaussian distributions. We empirically demonstrate that those projection layers achieve similar or better results than existing methods while being almost agnostic to specific implementation choices.

1. INTRODUCTION

Deep reinforcement learning has shown considerable advances in recent years, with prominent application areas such as games (Mnih et al., 2015; Silver et al., 2017), robotics (Levine et al., 2015), and control (Duan et al., 2016). In policy search, policy gradient (PG) methods have been highly successful and have gained great popularity (Peters & Schaal, 2008). However, it is often difficult to tune learning rates for vanilla PG methods, because they tend to reduce the entropy of the policy too quickly. This results in a lack of exploration and, as a consequence, in premature or slow convergence. A common practice to mitigate these limitations is to impose a constraint on the allowed change between two successive policies. Kakade & Langford (2002) provided a theoretical justification for this in the approximate policy iteration setting. Two of the arguably most favored policy search algorithms, Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) and Proximal Policy Optimization (PPO) (Schulman et al., 2017), follow this idea using the Kullback-Leibler (KL) divergence between successive policies as a constraint.

We propose closed-form projections for Gaussian policies, realized as differentiable neural network layers. These layers constrain the change in successive policies by projecting the updated policy onto trust regions. First, this approach is more stable with respect to what Engstrom et al. (2020) refer to as code-level optimizations than other approaches. Second, it comes with the benefit of imposing constraints for individual states, allowing for the possibility of state-dependent trust regions. This allows us to constrain the state-wise maximum change of successive policies. Here we differ from previous works, which constrain only the expected change and thus cannot rely on exact guarantees of monotonic improvement. Furthermore, we propose three different similarity measures, the KL
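To make the idea of a per-state closed-form projection concrete, the following is a minimal NumPy sketch for the simplest of the three similarity measures, the Frobenius norm. It projects an updated Gaussian (mean and covariance) back onto a trust region around the old policy whenever the bound is violated; the mean step is a Euclidean-ball projection and the covariance step a Frobenius-ball projection. The function name, argument names, and bounds `eps_mu` / `eps_cov` are illustrative assumptions, and a real implementation would live inside an automatic-differentiation framework so gradients flow through the projection.

```python
import numpy as np

def frobenius_projection(mu, cov, mu_old, cov_old, eps_mu, eps_cov):
    """Hypothetical sketch: project a Gaussian (mu, cov) onto a
    Frobenius-norm trust region around the old policy (mu_old, cov_old).

    Mean bound:       ||mu - mu_old||^2        <= eps_mu
    Covariance bound: ||cov - cov_old||_F^2    <= eps_cov
    If a bound already holds, the corresponding part is returned unchanged.
    """
    # Mean part: Euclidean projection onto a ball of radius sqrt(eps_mu).
    diff = mu - mu_old
    d_mean = float(diff @ diff)
    if d_mean > eps_mu:
        mu = mu_old + diff * np.sqrt(eps_mu / d_mean)

    # Covariance part: projection onto a Frobenius ball of radius sqrt(eps_cov).
    cov_diff = cov - cov_old
    d_cov = float(np.sum(cov_diff ** 2))
    if d_cov > eps_cov:
        cov = cov_old + cov_diff * np.sqrt(eps_cov / d_cov)

    return mu, cov
```

Because both steps only rescale the difference to the old policy, the projection is differentiable almost everywhere, which is what allows it to be used as a network layer and trained end-to-end.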

