DIFFERENTIABLE TRUST REGION LAYERS FOR DEEP REINFORCEMENT LEARNING

Abstract

Trust region methods are a popular tool in reinforcement learning as they yield robust policy updates in continuous and discrete action spaces. However, enforcing such trust regions in deep reinforcement learning is difficult. Hence, many approaches, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), are based on approximations. Due to those approximations, they violate the constraints or fail to find the optimal solution within the trust region. Moreover, they are difficult to implement, often lack sufficient exploration, and have been shown to depend on seemingly unrelated implementation choices. In this work, we propose differentiable neural network layers to enforce trust regions for deep Gaussian policies via closed-form projections. Unlike existing methods, those layers formalize trust regions for each state individually and can complement existing reinforcement learning algorithms. We derive trust region projections based on the Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius norm for Gaussian distributions. We empirically demonstrate that those projection layers achieve similar or better results than existing methods while being almost agnostic to specific implementation choices.

1. INTRODUCTION

Deep reinforcement learning has shown considerable advances in recent years, with prominent application areas such as games (Mnih et al., 2015; Silver et al., 2017), robotics (Levine et al., 2015), and control (Duan et al., 2016). In policy search, policy gradient (PG) methods have been highly successful and have gained great popularity (Peters & Schaal, 2008). However, it is often difficult to tune learning rates for vanilla PG methods, because they tend to reduce the entropy of the policy too quickly. This results in a lack of exploration and, as a consequence, in premature or slow convergence. A common practice to mitigate these limitations is to impose a constraint on the allowed change between two successive policies. Kakade & Langford (2002) provided a theoretical justification for this in the approximate policy iteration setting. Two of the arguably most favored policy search algorithms, Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) and Proximal Policy Optimization (PPO) (Schulman et al., 2017), follow this idea using the Kullback-Leibler (KL) divergence between successive policies as a constraint. We propose closed-form projections for Gaussian policies, realized as differentiable neural network layers. These layers constrain the change in successive policies by projecting the updated policy onto trust regions. First, this approach is more stable than other approaches with respect to what Engstrom et al. (2020) refer to as code-level optimizations. Second, it comes with the benefit of imposing constraints for individual states, allowing for state-dependent trust regions. This enables us to constrain the state-wise maximum change of successive policies. Here we differ from previous works, which constrain only the expected change and thus cannot rely on exact guarantees of monotonic improvement.
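The trust region idea shared by TRPO and PPO can be summarized as a constrained surrogate objective (notation here is the standard one, with $\hat{A}$ an advantage estimate); note that the expectation in the constraint is over states, which is precisely the expected rather than state-wise formulation discussed above:

```latex
\max_{\theta} \; \mathbb{E}_{s,a \sim \pi_{\theta_{\text{old}}}}
  \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, \hat{A}(s,a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s} \left[ D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right) \right] \le \delta .
```

A state-wise trust region instead requires $D_{\mathrm{KL}}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \right) \le \delta$ for every state $s$ individually.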
Furthermore, we propose three different similarity measures, the KL divergence, the Wasserstein L2 distance, and the Frobenius norm, on which to base our trust region approach. The last layer of the projected policy is now the trust region layer, which relies on the old policy as input. Naively, this would result in an ever-growing stack of policies, rendering the approach clearly infeasible. To circumvent this issue, we introduce a penalty term into the reinforcement learning objective to ensure the input and output of the projection stay close together. While this still results in an approximation of the trust region update, we show that the trust regions are properly enforced. We also extend our approach to allow for a controlled evolution of the entropy of the policy, which has been shown to increase performance in difficult exploration problems (Pajarinen et al., 2019; Akrour et al., 2019). We compare and discuss the effect of the different similarity measures as well as the entropy control on the optimization process. Additionally, we benchmark our algorithm against existing methods and demonstrate that we achieve similar or better performance.
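To make the projection-plus-penalty idea concrete, the following is a minimal NumPy sketch, not the paper's exact layers: it projects a Gaussian mean onto a squared-norm trust region around the old mean (the closed-form Euclidean ball projection, in the spirit of the Frobenius-based variant) and adds a penalty keeping the unprojected output close to the projection. The names `project_mean`, `trust_region_penalty`, `eps`, and `alpha` are illustrative, not from the paper.

```python
import numpy as np

def project_mean(mean, old_mean, eps):
    """Closed-form per-state projection of a Gaussian mean.

    Solves  argmin_m ||m - mean||^2  s.t.  ||m - old_mean||^2 <= eps
    by rescaling the update direction back onto the constraint
    boundary whenever the bound is violated.
    """
    diff = mean - old_mean
    dist_sq = np.dot(diff, diff)
    if dist_sq <= eps:
        return mean  # already inside the trust region, identity map
    return old_mean + diff * np.sqrt(eps / dist_sq)

def trust_region_penalty(mean, projected_mean, alpha=10.0):
    """Penalty term encouraging the unprojected policy output to stay
    close to the projected one, so the projection layer stays
    near-identity and the stack of old policies is not needed."""
    return alpha * np.sum((mean - projected_mean) ** 2)

old_mean = np.zeros(2)
new_mean = np.array([3.0, 4.0])            # proposed update, distance 5
proj = project_mean(new_mean, old_mean, eps=1.0)
penalty = trust_region_penalty(new_mean, proj)
# proj lies on the constraint boundary: ||proj - old_mean|| = sqrt(eps)
```

Since the projection is piecewise smooth in its inputs, the same computation can be written in an autodiff framework and used as a differentiable layer.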

2. RELATED WORK

Approximate Trust Regions. Bounding the size of the policy update in policy search is a common approach. While Kakade & Langford (2002) originally focused on a method based on mixing policies, nowadays most approaches use KL trust regions to bound the updates. Peters et al. (2010) proposed a first approach to such trust regions by formulating the problem as a constrained optimization and provided a solution based on the dual of that optimization problem. Still, this approach does not extend straightforwardly to highly non-linear policies, such as neural networks. In an attempt to transfer those ideas to deep learning, TRPO (Schulman et al., 2015a) approximates the KL constraint using the Fisher information matrix and natural policy gradient updates (Peters & Schaal, 2008; Kakade, 2001), along with a backtracking line search to enforce a hard KL constraint. Yet, the resulting algorithm scales poorly. Thus, Schulman et al. (2017) introduced PPO, which does not directly enforce the KL trust region but clips the probability ratio in the importance sampling objective. This allows using efficient first-order optimization methods while maintaining robust training. However, Engstrom et al. (2020) and Andrychowicz et al. (2020) recently showed that implementation choices are essential for achieving state-of-the-art results with PPO. Code-level optimizations, such as reward scaling as well as value function, observation, reward, and gradient clipping, can even compensate for removing core parts of the algorithm, e.g. the clipping of the probability ratio. Additionally, PPO heavily relies on its exploration behavior and might get stuck in local optima (Wang et al., 2019). Tangkaratt et al. (2018) use a closed-form solution for the constrained optimization based on the method of Lagrangian multipliers. They, however, require a quadratic parametrization of the Q-function, which can limit performance. Pajarinen et al. (2019) introduced an approach based on compatible value function approximations to realize KL trust regions. Based on the reinforcement-learning-as-inference paradigm (Levine, 2018), Abdolmaleki et al. (2018) introduced an actor-critic approach using an Expectation-Maximization-based optimization with KL trust regions in both the E-step and M-step. Song et al. (2020) proposed an on-policy version of this approach using a similar optimization scheme and constraints.
Projections for Trust Regions. Akrour et al. (2019) proposed Projected Approximate Policy Iteration (PAPI), a projection-based solution to implement KL trust regions. Their method projects an intermediate policy that already satisfies the trust region constraint onto the constraint bounds, maximizing the size of the update step. However, PAPI relies on other trust region methods to generate this intermediate policy and cannot operate in a stand-alone setting. Additionally, the projection is not directly part of the policy optimization but applied afterwards, which can result in sub-optimal policies. Regarding computational complexity, both TRPO and PAPI simplify the constraint by leveraging the expected KL divergence. In contrast, we implement the projections as fully differentiable network layers and directly include them in the optimization process. Additionally, our projections enforce the constraints per state. This allows for better control of the change between subsequent policies and for state-dependent trust regions. For the KL-based projection layer we need to resort to numerical optimization and implicit gradients for convex optimizations (Amos & Kolter, 2017; Agrawal et al., 2019). Thus, we investigate two alternative projections based on the Wasserstein L2 distance and the Frobenius norm, which allow for closed-form solutions. Both the Wasserstein distance and the Frobenius norm have so far found only limited applications in reinforcement learning. Pacchiano et al. (2020) use the Wasserstein distance to score behaviors of
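The contrast between expected and per-state constraints drawn above can be illustrated with the standard closed-form KL divergence between diagonal Gaussians, evaluated independently per state. This is a textbook identity used for illustration, not code from the paper; the variable names and the bound `0.6` are arbitrary.

```python
import numpy as np

def gaussian_kl_diag(mean, std, old_mean, old_std):
    """KL( N(mean, diag(std^2)) || N(old_mean, diag(old_std^2)) ),
    computed independently for each state (rows of the inputs)."""
    var, old_var = std ** 2, old_std ** 2
    return 0.5 * np.sum(
        var / old_var
        + (old_mean - mean) ** 2 / old_var
        - 1.0
        + np.log(old_var / var),
        axis=-1,
    )

# Two states: averaged over states the KL satisfies a bound of 0.6,
# but the second state individually violates it. An expected-KL
# constraint (TRPO, PAPI) accepts this update; a per-state
# constraint does not.
old_mean = np.zeros((2, 1)); old_std = np.ones((2, 1))
mean = np.array([[0.1], [1.5]]); std = np.ones((2, 1))
kl = gaussian_kl_diag(mean, std, old_mean, old_std)
```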

