DIFFERENTIABLE TRUST REGION LAYERS FOR DEEP REINFORCEMENT LEARNING

Abstract

Trust region methods are a popular tool in reinforcement learning as they yield robust policy updates in continuous and discrete action spaces. However, enforcing such trust regions in deep reinforcement learning is difficult. Hence, many approaches, such as Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO), are based on approximations. Due to those approximations, they violate the constraints or fail to find the optimal solution within the trust region. Moreover, they are difficult to implement, often lack sufficient exploration, and have been shown to depend on seemingly unrelated implementation choices. In this work, we propose differentiable neural network layers to enforce trust regions for deep Gaussian policies via closed-form projections. Unlike existing methods, those layers formalize trust regions for each state individually and can complement existing reinforcement learning algorithms. We derive trust region projections based on the Kullback-Leibler divergence, the Wasserstein L2 distance, and the Frobenius norm for Gaussian distributions. We empirically demonstrate that those projection layers achieve similar or better results than existing methods while being almost agnostic to specific implementation choices.

1. INTRODUCTION

Deep reinforcement learning has shown considerable advances in recent years, with prominent application areas such as games (Mnih et al., 2015; Silver et al., 2017), robotics (Levine et al., 2015), and control (Duan et al., 2016). In policy search, policy gradient (PG) methods have been highly successful and have gained great popularity (Peters & Schaal, 2008). However, it is often difficult to tune learning rates for vanilla PG methods, because they tend to reduce the entropy of the policy too quickly. This results in a lack of exploration and, as a consequence, in premature or slow convergence. A common practice to mitigate these limitations is to impose a constraint on the allowed change between two successive policies. Kakade & Langford (2002) provided a theoretical justification for this in the approximate policy iteration setting. Two of the arguably most favored policy search algorithms, Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) and Proximal Policy Optimization (PPO) (Schulman et al., 2017), follow this idea using the Kullback-Leibler (KL) divergence between successive policies as a constraint. We propose closed-form projections for Gaussian policies, realized as differentiable neural network layers. These layers constrain the change in successive policies by projecting the updated policy onto trust regions. First, this approach is more stable with respect to what Engstrom et al. (2020) refer to as code-level optimizations than other approaches. Second, it comes with the benefit of imposing constraints for individual states, allowing for the possibility of state-dependent trust regions. This allows us to constrain the state-wise maximum change of successive policies. In this, we differ from previous works, which constrain only the expected change and thus cannot rely on exact guarantees of monotonic improvement.
Furthermore, we propose three different similarity measures, the KL divergence, the Wasserstein L2 distance, and the Frobenius norm, on which to base our trust region approach. The last layer of the projected policy is now the trust region layer, which relies on the old policy as input. This would result in an ever-growing stack of policies, rendering this approach clearly infeasible. To circumvent this issue, we introduce a penalty term into the reinforcement learning objective to ensure that the input and output of the projection stay close together. While this still results in an approximation of the trust region update, we show that the trust regions are properly enforced. We also extend our approach to allow for a controlled evolution of the entropy of the policy, which has been shown to increase performance in difficult exploration problems (Pajarinen et al., 2019; Akrour et al., 2019). We compare and discuss the effect of the different similarity measures as well as the entropy control on the optimization process. Additionally, we benchmark our algorithm against existing methods and demonstrate that we achieve similar or better performance.

2. RELATED WORK

Approximate Trust Regions. Bounding the size of the policy update in policy search is a common approach. While Kakade & Langford (2002) originally focused on a method based on mixing policies, nowadays most approaches use KL trust regions to bound the updates. Peters et al. (2010) proposed a first approach to such trust regions by formulating the problem as a constrained optimization and provided a solution based on the dual of that optimization problem. Still, this approach is not straightforwardly extendable to highly non-linear policies, such as neural networks. In an attempt to transfer those ideas to deep learning, TRPO (Schulman et al., 2015a) approximates the KL constraint using the Fisher information matrix and natural policy gradient updates (Peters & Schaal, 2008; Kakade, 2001), along with a backtracking line search to enforce a hard KL constraint. Yet, the resulting algorithm scales poorly. Thus, Schulman et al. (2017) introduced PPO, which does not directly enforce the KL trust region but clips the probability ratio in the importance sampling objective. This allows using efficient first-order optimization methods while maintaining robust training. However, Engstrom et al. (2020) and Andrychowicz et al. (2020) recently showed that implementation choices are essential for achieving state-of-the-art results with PPO. Code-level optimizations, such as reward scaling as well as value function, observation, reward, and gradient clipping, can even compensate for removing core parts of the algorithm, e.g., the clipping of the probability ratio. Additionally, PPO heavily relies on its exploration behavior and might get stuck in local optima (Wang et al., 2019). Tangkaratt et al. (2018) use a closed-form solution for the constrained optimization based on the method of Lagrangian multipliers. They, however, require a quadratic parametrization of the Q-function, which can limit the performance. Pajarinen et al. (2019) introduced an approach based on compatible value function approximations to realize KL trust regions. Based on the reinforcement learning as inference paradigm (Levine, 2018), Abdolmaleki et al. (2018) introduced an actor-critic approach using an Expectation-Maximization-based optimization with KL trust regions in both the E-step and the M-step. Song et al. (2020) proposed an on-policy version of this approach using a similar optimization scheme and constraints.

Projections for Trust Regions. Akrour et al. (2019) proposed Projected Approximate Policy Iteration (PAPI), a projection-based solution to implement KL trust regions. Their method projects an intermediate policy, which already satisfies the trust region constraint, onto the constraint bounds. This maximizes the size of the update step. However, PAPI relies on other trust region methods to generate this intermediate policy and cannot operate in a stand-alone setting. Additionally, the projection is not directly part of the policy optimization but applied afterwards, which can result in sub-optimal policies. In terms of computational complexity, both TRPO and PAPI simplify the constraint by leveraging the expected KL divergence. In contrast, we implement the projections as fully differentiable network layers and directly include them in the optimization process. Additionally, our projections enforce the constraints per state. This allows for better control of the change between subsequent policies and for state-dependent trust regions. For the KL-based projection layer we need to resort to numerical optimization and implicit gradients for convex optimizations (Amos & Kolter, 2017; Agrawal et al., 2019). Thus, we investigate two alternative projections based on the Wasserstein L2 distance and the Frobenius norm, which allow for closed-form solutions. Both the Wasserstein distance and the Frobenius norm have found only limited application in reinforcement learning. Pacchiano et al. (2020) use the Wasserstein distance to score behaviors of agents. Richemond & Maginnis (2017) proposed an alternative algorithm for bandits with Wasserstein-based trust regions. Song & Zhao (2020) focus on solving the trust region problem for distributional policies using both KL- and Wasserstein-based trust regions for discrete action spaces. Our projections are applicable independently of the underlying algorithm and only assume a Gaussian policy, a common assumption for continuous action spaces. Several authors (Dalal et al., 2018; Chow et al., 2019; Yang et al., 2020) used projections as network layers to enforce limitations in the action or state space given environmental restrictions, such as robotic joint limits.

Entropy Control. Abdolmaleki et al. (2015) introduced the idea of explicitly controlling the decrease in entropy during the optimization process, which was later extended to deep reinforcement learning by Pajarinen et al. (2019) and Akrour et al. (2019). They use either an exponential or linear decay of the entropy during policy optimization to control the exploration process and escape local optima. To leverage those benefits, we embed this entropy control mechanism in our differentiable trust region layers.

3. PRELIMINARIES AND PROBLEM STATEMENT

We consider the general problem of policy search in a Markov Decision Process (MDP) defined by the tuple (S, A, T, R, P_0, γ). We assume the state space S and action space A are continuous and the transition probabilities T : S × A × S → [0, 1] describe the probability of transitioning to state s_{t+1} ∈ S given the current state s_t ∈ S and action a_t ∈ A. We denote the initial state distribution as P_0 : S → [0, 1]. The reward returned by the environment is given by a function R : S × A → R, and γ ∈ [0, 1) denotes the discount factor. Our goal is to maximize the expected accumulated discounted reward $R_\gamma = \mathbb{E}_{\mathcal{T}, P_0, \pi}\left[\sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t)\right]$. To find the optimal policy, traditional PG methods often make use of the likelihood ratio gradient and an importance sampling estimator. Moreover, instead of directly optimizing the returns, it has been shown to be more effective to optimize the advantage function, as this yields an unbiased gradient estimator with less variance

$$\max_\theta \hat{J}(\pi_\theta, \pi_{\theta_\text{old}}) = \max_\theta \mathbb{E}_{(s,a)\sim\pi_{\theta_\text{old}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_\text{old}}(a|s)} A^{\pi_{\theta_\text{old}}}(s, a)\right], \quad (1)$$

where $A^{\pi}(s, a) = \mathbb{E}\left[R_\gamma \mid s_0 = s, a_0 = a; \pi\right] - \mathbb{E}\left[R_\gamma \mid s_0 = s; \pi\right]$ is the advantage function and the expectation is w.r.t. $\pi_{\theta_\text{old}}$, i.e., $s' \sim \mathcal{T}(\cdot|s, a)$, $a \sim \pi_{\theta_\text{old}}(\cdot|s)$, $s_0 \sim P_0(s_0)$, $s \sim \rho_{\pi_{\theta_\text{old}}}$, where $\rho_{\pi_{\theta_\text{old}}}$ is a stationary distribution of policy $\pi_{\theta_\text{old}}$. The advantage function is commonly estimated by generalized advantage estimation (GAE) (Schulman et al., 2015b). Trust region methods use additional constraints for the given objective. Using a constraint on the maximum KL over the states has been shown to guarantee monotonic improvement of the policy (Schulman et al., 2015a). However, since all current approaches use an expected rather than a maximum KL constraint, this guarantee of monotonic improvement does not hold exactly either. We are not aware of such results for the W2 distance or the Frobenius norm.
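Since the objective above relies on an advantage estimate obtained via GAE, a minimal numpy sketch may help make the estimator concrete (our own illustration, not the authors' implementation; the trajectory arrays and the default values of γ and λ are assumptions):

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimation (Schulman et al., 2015b), sketched
    for a single trajectory of length T with value estimates V(s_t)."""
    advantages = np.zeros(len(rewards))
    next_advantage, next_value = 0.0, last_value
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # GAE recursion: A_t = delta_t + gamma * lam * A_{t+1}
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages
```

With γ = λ = 1 and zero value estimates, the estimator reduces to the plain reward-to-go, which is a useful sanity check.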
For our projections we assume Gaussian policies π_θold(a_t|s_t) = N(a_t | µ_old(s_t), Σ_old(s_t)) and π_θ(a_t|s_t) = N(a_t | µ(s_t), Σ(s_t)), representing the old and the current policy, respectively. We explore three trust regions on top of Equation 1 that employ different similarity measures between old and new distributions, more specifically the frequently used reverse KL divergence, the Wasserstein L2 distance, and the Frobenius norm.

Reverse KL Divergence. The KL divergence between two Gaussian distributions with means µ_1 and µ_2 and covariances Σ_1 and Σ_2 can generally be written as

$$\mathrm{KL}\left(\{\mu_1, \Sigma_1\} \,\|\, \{\mu_2, \Sigma_2\}\right) = \frac{1}{2}\left[(\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1) + \log\frac{|\Sigma_2|}{|\Sigma_1|} + \mathrm{tr}\left\{\Sigma_2^{-1}\Sigma_1\right\} - d\right],$$

where d is the dimensionality of µ_1, µ_2. The KL uses the Mahalanobis distance to measure the similarity between the two mean vectors. The difference of the covariances is measured by the difference in shape, i.e., the difference in scale, given by the log ratio of the determinants, plus the difference in rotation, given by the trace term. Given that the KL is non-symmetric, it is clearly not a distance, yet still a frequently used divergence between distributions. We will use the more common reverse KL for our trust region, where the first argument is the new policy and the second is the old policy.

Wasserstein Distance. The Wasserstein distance is a distance measure based on an optimal transport formulation; for more details see Villani (2008). The Wasserstein-2 distance between two Gaussian distributions can generally be written as

$$W_2\left(\{\mu_1, \Sigma_1\}, \{\mu_2, \Sigma_2\}\right) = \|\mu_1 - \mu_2\|^2 + \mathrm{tr}\left(\Sigma_1 + \Sigma_2 - 2\left(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2}\right)^{1/2}\right).$$

A key difference to the KL divergence is that the Wasserstein distance is a symmetric distance measure, i.e., W_2(q, p) = W_2(p, q).
Our experiments also revealed that it is beneficial to measure the W2 distance in a metric space defined by the covariance of the old policy distribution, denoted here as Σ_2, as the distance measure is then more sensitive to the data-generating distribution. The W2 distance in this metric space reads

$$W_{2,\Sigma_2}\left(\{\mu_1, \Sigma_1\}, \{\mu_2, \Sigma_2\}\right) = (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1) + \mathrm{tr}\left(\Sigma_2^{-1}\Sigma_1 + I - 2\Sigma_2^{-1}\left(\Sigma_2^{1/2}\Sigma_1\Sigma_2^{1/2}\right)^{1/2}\right).$$

Frobenius Norm. The Frobenius norm is a matrix norm and can directly be applied to the difference of the covariance matrices of the Gaussian distributions. To measure the distance between the mean vectors we will, similar to the KL divergence, employ the Mahalanobis distance, as this empirically leads to improved performance in comparison to just taking the squared distance. Hence, we denote the following metric as the Frobenius norm between two Gaussian distributions

$$F\left(\{\mu_1, \Sigma_1\}, \{\mu_2, \Sigma_2\}\right) = (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1) + \mathrm{tr}\left((\Sigma_2 - \Sigma_1)^T(\Sigma_2 - \Sigma_1)\right).$$

The Frobenius norm also constitutes a symmetric distance measure.
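The three similarity measures can be made concrete with a small numpy sketch (our own illustration, not the authors' code; the W2 covariance term uses the simplification that holds for commuting, e.g. diagonal, covariances):

```python
import numpy as np

def gauss_kl(mu1, cov1, mu2, cov2):
    """Reverse KL(N1 || N2) between two Gaussians; d is the mean dimension."""
    d = mu1.shape[0]
    prec2 = np.linalg.inv(cov2)
    maha = (mu2 - mu1) @ prec2 @ (mu2 - mu1)
    log_ratio = np.log(np.linalg.det(cov2) / np.linalg.det(cov1))
    return 0.5 * (maha + log_ratio + np.trace(prec2 @ cov1) - d)

def gauss_w2_metric(mu1, cov1, mu2, cov2):
    """W2 distance in the metric of cov2 (the old policy), assuming
    commuting (e.g. diagonal) covariances so that the inner square root
    simplifies to cov2^{-1/2} cov1^{1/2}."""
    prec2 = np.linalg.inv(cov2)
    maha = (mu2 - mu1) @ prec2 @ (mu2 - mu1)
    c1_sqrt = np.sqrt(cov1)                      # valid elementwise for diagonals
    c2_inv_sqrt = np.linalg.inv(np.sqrt(cov2))
    cov_part = np.trace(prec2 @ cov1 + np.eye(len(mu1)) - 2 * c2_inv_sqrt @ c1_sqrt)
    return maha + cov_part

def gauss_frob(mu1, cov1, mu2, cov2):
    """Mahalanobis mean term plus squared Frobenius norm of the cov difference."""
    prec2 = np.linalg.inv(cov2)
    maha = (mu2 - mu1) @ prec2 @ (mu2 - mu1)
    diff = cov2 - cov1
    return maha + np.trace(diff.T @ diff)
```

All three vanish for identical distributions and grow with the Mahalanobis distance of the means, which matches the decomposition into mean and covariance parts used below.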

4. DIFFERENTIABLE TRUST-REGION LAYERS FOR GAUSSIAN POLICIES

We present projections based on the three similarity measures, i.e., the Frobenius norm, the Wasserstein L2 distance, and the KL divergence. These projections realize state-wise trust regions and can directly be integrated into the optimization process as differentiable neural network layers. Additionally, we extend the trust region layers to include an entropy constraint to gain control over the evolution of the policy entropy during optimization. The trust regions are defined by a distance or divergence d(π(·|s), π_old(·|s)) between probability distributions. Complementing Equation 1 with the trust region constraint leads to

$$\max_\theta \hat{J}(\pi_\theta, \pi_{\theta_\text{old}}) \quad \text{s.t.} \quad d\left(\pi_{\theta_\text{old}}(\cdot|s), \pi_\theta(\cdot|s)\right) \le \epsilon \quad \forall s \in \mathcal{S}. \quad (2)$$

While, in principle, we want to enforce the constraint for every possible state, in practice we can only enforce it for states sampled from rollouts of the current policy. To solve the problem in Equation 2, a standard neural network first outputs the parameters µ, Σ of a Gaussian distribution π_θ, ignoring the trust region bounds. These parameters are provided to the trust region layers, together with the mean and covariance of the old policy and a parameter ε specifying the size of the trust region. The new policy is then given by the output of the trust region layer. Since the old policy distribution is fixed, all distances or divergences used in this paper can be decomposed into a mean-dependent part and a covariance-dependent part. This enables us to use separate trust regions as well as bounds for mean and covariance, allowing for more flexibility in the algorithm. The trust region layers aim to project π_θ into the trust region by finding parameters μ̃ and Σ̃ that are closest to the original parameters µ and Σ while satisfying the trust region constraints. The projection is based on the same distance or divergence that was used to define the respective trust region. Formally, this corresponds to the following optimization problems for each state s

$$\arg\min_{\tilde{\mu}_s} d_\text{mean}\left(\tilde{\mu}_s, \mu(s)\right) \quad \text{s.t.} \quad d_\text{mean}\left(\tilde{\mu}_s, \mu_\text{old}(s)\right) \le \epsilon_\mu, \quad (3)$$

$$\arg\min_{\tilde{\Sigma}_s} d_\text{cov}\left(\tilde{\Sigma}_s, \Sigma(s)\right) \quad \text{s.t.} \quad d_\text{cov}\left(\tilde{\Sigma}_s, \Sigma_\text{old}(s)\right) \le \epsilon_\Sigma, \quad (4)$$

where μ̃_s and Σ̃_s are the optimization variables for state s. Here, d_mean is the mean-dependent part and d_cov the covariance-dependent part of the employed distance or divergence. For brevity of notation we will omit all dependencies on the state in the following. We denote the projected policy as π̃(a|s) = N(a | μ̃, Σ̃).

4.1. PROJECTION OF THE MEAN

For all three trust region objectives we make use of the same distance measure for the mean, the Mahalanobis distance. Hence, the optimization problem for the mean is given by

$$\arg\min_{\tilde{\mu}} (\mu - \tilde{\mu})^T \Sigma_\text{old}^{-1} (\mu - \tilde{\mu}) \quad \text{s.t.} \quad (\mu_\text{old} - \tilde{\mu})^T \Sigma_\text{old}^{-1} (\mu_\text{old} - \tilde{\mu}) \le \epsilon_\mu. \quad (5)$$

By making use of the method of Lagrangian multipliers (see Appendix B.2), we can formulate the dual and solve it for the projected mean μ̃ as

$$\tilde{\mu} = \frac{\mu + \omega\mu_\text{old}}{1 + \omega} \quad \text{with} \quad \omega = \sqrt{\frac{(\mu_\text{old} - \mu)^T \Sigma_\text{old}^{-1} (\mu_\text{old} - \mu)}{\epsilon_\mu}} - 1. \quad (6)$$

This equation can directly be used as the mean of the Gaussian policy and easily allows computing gradients. Note that for the mean part of the KL we would need to use Σ̃^{-1} instead of Σ_old^{-1} in the objective of Equation 5. Yet, this objective still results in a valid trust region problem that is much easier to optimize.
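A minimal numpy sketch of this mean projection (our own illustration; `prec_old` denotes Σ_old^{-1}, and the projection is only applied when the constraint is violated):

```python
import numpy as np

def project_mean(mu, mu_old, prec_old, eps_mu):
    """Project the predicted mean onto the Mahalanobis trust region
    around mu_old, following the closed form of Equation 6."""
    maha = (mu_old - mu) @ prec_old @ (mu_old - mu)
    if maha <= eps_mu:          # already inside the trust region
        return mu
    omega = np.sqrt(maha / eps_mu) - 1.0
    # interpolate between predicted and old mean
    return (mu + omega * mu_old) / (1.0 + omega)
```

After an active projection the Mahalanobis distance of the projected mean to the old mean equals exactly ε_µ, i.e., the projection lands on the constraint boundary.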

4.2. PROJECTION OF THE COVARIANCE

Frobenius Projection. The Frobenius projection formalizes the trust region for the covariance with the squared Frobenius norm of the matrix difference, which yields

$$\arg\min_{\tilde{\Sigma}} \mathrm{tr}\left((\Sigma - \tilde{\Sigma})^T(\Sigma - \tilde{\Sigma})\right) \quad \text{s.t.} \quad \mathrm{tr}\left((\Sigma_\text{old} - \tilde{\Sigma})^T(\Sigma_\text{old} - \tilde{\Sigma})\right) \le \epsilon_\Sigma.$$

We again use the method of Lagrangian multipliers (see Appendix B.3) and obtain the projected covariance Σ̃ as

$$\tilde{\Sigma} = \frac{\Sigma + \eta\Sigma_\text{old}}{1 + \eta} \quad \text{with} \quad \eta = \sqrt{\frac{\mathrm{tr}\left((\Sigma_\text{old} - \Sigma)^T(\Sigma_\text{old} - \Sigma)\right)}{\epsilon_\Sigma}} - 1, \quad (7)$$

where η is the corresponding Lagrangian multiplier.

Wasserstein Projection. Deriving the Wasserstein projection follows the same procedure. We obtain the following optimization problem

$$\arg\min_{\tilde{\Sigma}} \mathrm{tr}\left(\Sigma_\text{old}^{-1}\Sigma + \Sigma_\text{old}^{-1}\tilde{\Sigma} - 2\Sigma_\text{old}^{-1}\left(\Sigma^{1/2}\tilde{\Sigma}\Sigma^{1/2}\right)^{1/2}\right) \ \text{s.t.} \ \mathrm{tr}\left(I + \Sigma_\text{old}^{-1}\tilde{\Sigma} - 2\Sigma_\text{old}^{-1}\left(\Sigma_\text{old}^{1/2}\tilde{\Sigma}\Sigma_\text{old}^{1/2}\right)^{1/2}\right) \le \epsilon_\Sigma, \quad (8)$$

where I is the identity matrix. A closed-form solution to this optimization problem can be found using the methods outlined in Takatsu (2011). However, we found the resulting solution for the projected covariance matrices to be numerically unstable. Therefore, we make the simplifying assumption that both the current covariance Σ and the old covariance Σ_old commute with Σ̃. Under the common premise of diagonal covariances, this commutativity assumption always holds. For the more general case of arbitrary covariance matrices, we would need to ensure the matrices are sufficiently close together, which is effectively ensured by Equation 8. Again, we introduce Lagrangian multipliers and solve the dual problem to obtain the optimal primal and dual variables (see Appendix B.4). Note, however, that here we choose the square root of the covariance matrix as the primal variable. The corresponding projection for the square root covariance Σ̃^{1/2} is then

$$\tilde{\Sigma}^{1/2} = \frac{\Sigma^{1/2} + \eta\Sigma_\text{old}^{1/2}}{1 + \eta} \quad \text{with} \quad \eta = \sqrt{\frac{\mathrm{tr}\left(I + \Sigma_\text{old}^{-1}\Sigma - 2\Sigma_\text{old}^{-1/2}\Sigma^{1/2}\right)}{\epsilon_\Sigma}} - 1, \quad (9)$$

where η is the corresponding Lagrangian multiplier. We see the same pattern emerging as for the Frobenius projection: the chosen similarity measure reappears in the expression for the Lagrangian multiplier, and the primal variables are weighted averages of the corresponding parameters of the old and the predicted Gaussian.

KL Projection. Identically to the previous two projections, we reformulate Equation 4 as

$$\arg\min_{\tilde{\Sigma}} \mathrm{tr}\left(\Sigma^{-1}\tilde{\Sigma}\right) + \log\frac{|\Sigma|}{|\tilde{\Sigma}|} \quad \text{s.t.} \quad \mathrm{tr}\left(\Sigma_\text{old}^{-1}\tilde{\Sigma}\right) - d + \log\frac{|\Sigma_\text{old}|}{|\tilde{\Sigma}|} \le \epsilon_\Sigma, \quad (10)$$

where d is the dimensionality of the action space. It is impossible to acquire a fully closed-form solution for this problem. However, following Abdolmaleki et al. (2015), we can obtain the projected precision matrix Λ̃ = Σ̃^{-1} by interpolating between the precision matrices of the old policy π_old and the current policy π

$$\tilde{\Lambda} = \frac{\eta^*\Lambda_\text{old} + \Lambda}{\eta^* + 1}, \quad \eta^* = \arg\min_{\eta} g(\eta) \ \text{s.t.} \ \eta \ge 0, \quad (11)$$

where η is the corresponding Lagrangian multiplier and g(η) the dual function. While this dual cannot be solved in closed form, an efficient solution exists using a standard numerical optimizer, such as BFGS, since it is a one-dimensional convex optimization. Regardless, we want a differentiable projection and thus also need to backpropagate the gradients through the numerical optimization. To this end, we follow Amos & Kolter (2017) and compute those gradients by taking the differentials of the KKT conditions of the dual. We refer to Appendix B.5 for more details and derivations.

Entropy Control. Previous works (Akrour et al., 2019; Abdolmaleki et al., 2015) have shown the benefits of introducing an entropy constraint H(π_θ) ≥ β in addition to the trust region constraints. Such a constraint allows for more control over the exploration behavior of the policy. In order to endow our algorithm with this improved exploration behavior, we make use of the results from Akrour et al. (2019) and scale the standard deviation of the Gaussian distribution with a scalar factor exp{(β − H(π_θ))/d}, which can also be computed individually per state.
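The Frobenius and Wasserstein covariance projections admit a compact numpy sketch under the diagonal-covariance assumption stated above (our own illustration, not the authors' code; the KL projection is omitted since it requires a numerical dual optimization):

```python
import numpy as np

def project_cov_frob(cov, cov_old, eps_cov):
    """Frobenius projection of the covariance (Equation 7)."""
    diff = cov_old - cov
    dist = np.trace(diff.T @ diff)
    if dist <= eps_cov:                     # constraint already satisfied
        return cov
    eta = np.sqrt(dist / eps_cov) - 1.0
    return (cov + eta * cov_old) / (1.0 + eta)

def project_cov_w2(cov, cov_old, eps_cov):
    """W2 projection for commuting (e.g. diagonal) covariances,
    interpolating matrix square roots (Equation 9)."""
    sqrt_c, sqrt_old = np.sqrt(cov), np.sqrt(cov_old)   # elementwise for diagonals
    dist = np.trace(np.eye(len(cov)) + np.linalg.inv(cov_old) @ cov
                    - 2.0 * np.linalg.inv(sqrt_old) @ sqrt_c)
    if dist <= eps_cov:
        return cov
    eta = np.sqrt(dist / eps_cov) - 1.0
    sqrt_proj = (sqrt_c + eta * sqrt_old) / (1.0 + eta)
    return sqrt_proj @ sqrt_proj
```

In both cases an active projection lands exactly on the constraint boundary, mirroring the behavior of the mean projection.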

4.3. ANALYSIS OF THE PROJECTIONS

It is instructive to compare the three projections. The covariance update is an interpolation for all three projections, but the quantities that are interpolated differ. For the Frobenius projection we directly interpolate between the old and current covariances (Equation 7), for the W2 projection between their respective matrix square roots (Equation 9), and for the KL projection between their inverses (Equation 11). In other words, each projection suggests which parametrization to use for the covariance matrix. The different interpolations also have an interesting effect on the entropy of the resulting covariances, which can be observed in Figure 1. Further, we can prove the following theorem about the entropy of the projected distributions.

Theorem 1. Let π_θ and π_θold be Gaussian and η ≥ 0. Then, for the entropy of the projected distribution H(π̃), it holds that H(π̃) ≥ min(H(π_θ), H(π_θold)) for the Frobenius (Equation 7) and the Wasserstein projection (Equation 9), as well as H(π̃) ≤ max(H(π_θ), H(π_θold)) for the KL projection (Equation 11).

The proof is based on the multiplicative version of the Brunn-Minkowski inequality and can be found in Appendix B.1. Intuitively, this implies that the Frobenius and Wasserstein projections act more aggressively, i.e., they rather yield a higher entropy, while the KL projection acts more conservatively, i.e., it rather yields a smaller entropy. This could also explain why many KL-based trust region methods lose entropy too quickly and converge prematurely. By introducing an explicit entropy control, those effects can be mitigated.
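A quick numerical illustration of the Frobenius case of Theorem 1 (our own sketch; the covariances and η are arbitrary choices, not values from the paper):

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of N(mu, cov); independent of the mean."""
    d = cov.shape[0]
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.log(np.linalg.det(cov)))

# Frobenius-style interpolation between covariances (Equation 7) for some
# arbitrary eta >= 0; the interpolated entropy never drops below the minimum
# of the two input entropies, as Theorem 1 states.
cov, cov_old, eta = np.diag([0.2, 3.0]), np.diag([1.0, 0.5]), 0.7
cov_interp = (cov + eta * cov_old) / (1.0 + eta)
assert gaussian_entropy(cov_interp) >= min(gaussian_entropy(cov),
                                           gaussian_entropy(cov_old))
```

The underlying reason is the Minkowski determinant inequality: the determinant of a convex combination of positive definite matrices is at least the smaller of the two determinants.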

4.4. SUCCESSIVE POLICY UPDATES

The above projections can directly be used for training the current policy. Note, however, that at each epoch i the policy π_i predicted by the network before the projection layer does not respect the constraints and thus relies on calling this layer. The policy π̃_i output by the projection layer depends not only on the parameters of π_i but also on the old policy network π_{i,old} = π̃_{i−1}. This would result in an ever-growing stack of policy networks that becomes increasingly costly to evaluate: π̃_i would be computed using all stored networks π_i, π_{i−1}, ..., π_0. We now discuss the parametrization of π̃ via amortized optimization. We need to encode the information of the projection layer into the parameters θ of the next policy, i.e., π̃(a|s; θ) = p ∘ π_θ(a|s) is a composed function in which p denotes the projection layer. The output of π_θ is (µ, Σ), and p computes (μ̃, Σ̃) according to Equations 6, 7, 9, or 11. Formally, we aim to find a set of parameters

$$\theta^* = \arg\min_\theta \mathbb{E}_{s \sim \rho_{\pi_\text{old}}}\left[d\left(\tilde{\pi}(\cdot|s), \pi_\theta(\cdot|s)\right)\right],$$

where ρ_πold is the state distribution of the old policy and d is the similarity measure used for the projection, such that we minimize the expected distance or divergence between the projection and the current policy prediction. The most intuitive way to solve this problem is to use the existing samples for additional regression steps after the policy optimization. Still, this adds computational overhead. Therefore, we propose to concurrently optimize both objectives during training by adding a penalty to the main objective, i.e.,

$$\arg\max_\theta \mathbb{E}_{(s,a)\sim\pi_{\theta_\text{old}}}\left[\frac{\tilde{\pi}(a|s;\theta)}{\pi_{\theta_\text{old}}(a|s)} A^{\pi_\text{old}}(s, a)\right] - \alpha\, \mathbb{E}_{s \sim \rho_{\pi_\text{old}}}\left[d\left(\tilde{\pi}(\cdot|s;\theta), \pi_\theta(\cdot|s)\right)\right]. \quad (12)$$

Note that the importance sampling ratio is computed based on the Gaussian distribution generated by the trust region layer and not directly from the network output. Furthermore, the gradient of the regression penalty does not flow through the projection; it solely acts as a supervised learning signal.
As appropriate similarity measures d for the penalty, we resort to the measures used in each projection. For a detailed algorithmic view see Appendix A.
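The penalized objective can be sketched as a simple batch computation (our own illustration; all array names are hypothetical, and in the full method the penalty gradient is stopped at the projection so that it only pulls the unprojected network output toward the projected policy):

```python
import numpy as np

def penalized_surrogate(ratios, advantages, distances, alpha):
    """Sketch of the penalized objective (Equation 12): the importance-
    sampling surrogate minus the trust-region regression penalty.
    `ratios` holds pi_tilde(a|s)/pi_old(a|s) per sample, `distances` the
    per-state values of d(pi_tilde, pi_theta); the result is maximized."""
    return np.mean(ratios * advantages) - alpha * np.mean(distances)
```

Larger values of α pull the unprojected policy more strongly toward the projected one, which is exactly the trade-off examined in the ablation in Section 5.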

5. EXPERIMENTS

Mujoco Benchmarks. We evaluate the performance of our trust region layers regarding sample complexity and final reward in comparison to PAPI and PPO on the OpenAI gym benchmark suite (Brockman et al., 2016). We explicitly did not include TRPO in the evaluation, as Engstrom et al. (2020) showed that it can achieve similar performance to PPO. For our experiments, the PAPI projection and its conservative PPO version are executed in the setting sent to us by the author. The hyperparameters for all three projections and PPO have been selected with Optuna (Akiba et al., 2019). See Appendix D for a full listing of all hyperparameters. We use a shared set of hyperparameters for all environments except for the Humanoid, which we optimized separately. Next to the standard PPO implementation with all code-level optimizations, we further evaluate PPO-M, which only leverages the core PPO algorithm. Our projections and PPO-M solely use the observation normalization, network architecture, and initialization from the original PPO implementation. All algorithms parametrize the covariance as a non-contextual diagonal matrix. We refer to the Frobenius projection as FROB, the Wasserstein projection as W2, and the KL projection as KL. Table 1 gives an overview of the final performance and convergence speed on the Mujoco benchmarks; Figure 4 in the appendix displays the full learning curves. After each epoch, we evaluate five episodes without applying exploration noise to obtain the return values. Note that we initially do not include the entropy projection to provide a fair comparison to PPO. The results show that our trust region layers are able to perform similarly to or better than PPO and PAPI across all tasks. While the performance on Hopper-v2 is comparable, the projections significantly outperform all baselines on HalfCheetah-v2. The KL projection even demonstrates the best performance on the remaining three environments.
Besides that, the experiments show a relatively balanced performance between the projections, PPO, and PAPI. The differences are more apparent when comparing the projections to PPO-M, which uses the same implementation details as our projections. The asymptotic performance of PPO-M is on par for Humanoid-v2, but it converges much more slowly and is noticeably weaker on the remaining tasks. Consequently, the approximate trust region of PPO alone is not sufficient for good performance; it works well only when paired with certain implementation choices. Still, the original PPO cannot fully replace a mathematically sound trust region such as ours, even though it does not exhibit a strong performance difference. To illustrate this, Figure 2 visualizes the mean KL divergence at the end of each epoch for all methods. Despite the fact that neither the W2 nor the Frobenius projection uses the KL, we leverage it here as a standardizing measure to compare the change in the policy distributions. All projections are characterized by an almost constant change, whereas for PPO-M the changes are highly inconsistent. The code-level optimizations of PPO can mitigate this to some extent but cannot properly enforce the desired constant change in the policy distribution. In particular, we have found that primarily the learning rate decay contributes to the relatively good behavior of PPO. Although PAPI provides a similarly principled trust region projection to ours, it still shows some inconsistency by approaching the bound iteratively.

Entropy Control

To demonstrate the effect of combining our projections with entropy control, as described in Section 4.2, we evaluate all Mujoco tasks again in this extended setting. The target entropy in iteration i is computed by exponentially decaying the initial entropy H_0 to κ with temperature τ as κ + (H_0 − κ)τ^{10i/N}, where N is the total number of training steps. The bottom of Table 1 shows the results for our projections with entropy control. Especially on the more complex tasks, which require more exploration, all three projections significantly benefit from the entropy control. Their asymptotic performance for HalfCheetah-v2, Ant-v2, and Humanoid-v2 increases, with much faster convergence in the latter. For the other Mujoco tasks the performance remains largely constant, since the complexity of these tasks is insufficient to benefit from an explicit entropy control, as also noted by Pajarinen et al. (2019) and Abdolmaleki et al. (2015).

Contextual Covariances. To emphasize the advantage of state-wise trust regions, we consider the case of policies with state-dependent covariances. Existing methods, such as PPO and TRPO, are rarely used in this setting. In addition, PAPI cannot project the covariance in the contextual case. Further, Andrychowicz et al. (2020) demonstrated that for the standard Mujoco benchmarks, contextual covariances are not beneficial in an on-policy setting. Therefore, we choose to evaluate on a task motivated by optimal control which benefits from a contextual covariance. We extend the Mujoco Reacher-v2 to a 5-link planar robot, the distance penalty to the target is only provided in the last time step, t = 200, and the observation space also contains the current time step t. This semi-sparse reward specification imposes a significantly harder exploration problem, as the agent is only provided with feedback at the last time step. We again tuned all hyperparameters using Optuna (Akiba et al., 2019) and did not include the entropy projection.
All feasible approaches are compared with and without contextual covariances; the results are presented in Figure 2 (right). All three projections significantly outperform the baseline methods with the non-contextual covariance. Additionally, both the W2 and the KL projection improve their results in the contextual case. In contrast, all baselines decrease in performance and are not able to leverage the advantage of contextual information. This poor performance mainly originates from incorrect exploitation: PPO reduces the covariance too quickly, whereas PAPI reduces it too slowly, leading to suboptimal performance for both. The Frobenius projection, however, does not benefit from contextual covariances either, since numerical instabilities arise from too small covariance values close to convergence. Those issues can be mitigated using a smaller covariance bound, but they cannot be entirely avoided. The KL projection, while yielding the best results throughout all experiments, relies on a numerical optimization. Generally, this is computationally expensive; however, by leveraging an efficient C++ implementation this overhead can be largely negated (see Appendix B.5). As a bonus, the KL projection has all properties of existing KL-based trust region methods that come with monotonic improvement guarantees. Nevertheless, for quick benchmarks the W2 projection is preferable, given that it is slightly less prone to hyperparameter choices and does not require a dedicated custom implementation.

Trust Region Regression Loss. Lastly, we investigate the main approximation of our approach, the trust region regression loss (Equation 12). In the following ablation, we evaluate how different choices of the regression weight α affect constraint satisfaction. Figure 2 (center) shows the Mahalanobis distance between the unprojected and the old policy means for different α values.
In addition, for one run we set α = 0 and execute the trust region regression separately after each epoch for several iterations. One key observation is that decreasing the penalty up to a certain threshold leads to larger changes in the policy and pushes the mean closer to its maximum bound. Intuitively, this can be explained by the construction of the bound: the penalty is added to the loss only when the bound is violated, hence larger changes in the policy are punished while smaller steps do not directly affect the loss negatively. By selecting a larger α, this behavior is reinforced. Furthermore, some smaller values of α yield behavior similar to the full regression setting. Consequently, it is justified to use the computationally simpler penalty instead of performing a full regression after each epoch.
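The intuition above — that the penalty only becomes active once the trust region is violated — follows from the projection being the identity inside the region. A minimal sketch, restricted to the mean part for illustration (function and variable names are ours, and this simplifies the actual regression loss of Equation 12):

```python
import numpy as np

def regression_penalty(mean, proj_mean, old_prec, alpha):
    """Illustrative penalty term of the trust region regression loss:
    alpha times the squared Mahalanobis distance between the unprojected
    and the projected mean, measured with old_prec = Sigma_old^{-1}.
    Inside the trust region the projection returns the mean unchanged,
    so the penalty is exactly zero and small steps are not punished."""
    diff = mean - proj_mean
    return alpha * diff @ old_prec @ diff

old_prec = np.eye(2)            # Sigma_old^{-1}
mean = np.array([0.3, -0.1])
# inside the trust region the projection is the identity -> zero penalty
inside = regression_penalty(mean, mean, old_prec, alpha=1.0)
# outside, the projected mean differs and the penalty is positive
outside = regression_penalty(mean, np.zeros(2), old_prec, alpha=1.0)
```

This makes concrete why larger policy changes are punished while steps within the bound leave the loss unaffected.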

6. DISCUSSION AND FUTURE WORK

In this work we proposed differentiable projection layers that enforce trust region constraints for Gaussian policies in deep reinforcement learning. While being more stable than existing methods, they also offer the benefit of imposing the constraints at the state level. Unlike previous approaches, which only constrain the expected change between successive policies and for which monotonic improvement guarantees therefore hold only approximately, we can constrain the maximum change. Our results illustrate that trust regions are an effective tool in policy search for a wide range of different similarity measures. Apart from the commonly used reverse KL divergence, we also leverage the Wasserstein distance and the Frobenius norm. We demonstrated the subtle but important differences between these three types of trust regions and showed that our benchmark performance is on par with or better than existing methods that use more code-level optimizations. For future work, we plan to continue our research with more exploration-heavy environments, in particular with contextual covariances. Additionally, more sophisticated heuristics or learning methods could be used to adapt the trust region bounds for better performance. Lastly, we are interested in applying our trust region layers to other deep reinforcement learning approaches, such as actor-critic methods.

A ALGORITHM

Algorithm 1 Differentiable Trust Region Layer. The trust region layer acts as the final layer after predicting a Gaussian distribution. It projects the predicted Gaussian onto the trust region whenever it violates the specified bounds. As output, it generates a projected mean and covariance that satisfy the respective trust region bounds. The entropy control in the last step can be disabled.

Initialize bounds $\epsilon_\mu$ and $\epsilon_\Sigma$, temperature $\tau$, as well as target entropy $\kappa$ and initial entropy $H_0$.
1: procedure TRUSTREGIONLAYER($\mu$, $\Sigma$, $\mu_\text{old}$, $\Sigma_\text{old}$)
2:     if $d_\text{mean}(\mu, \mu_\text{old}) > \epsilon_\mu$ then
3:         Compute $\tilde\mu$ with Equation 6
4:     else
5:         $\tilde\mu = \mu$
6:     if $d_\text{cov}(\Sigma, \Sigma_\text{old}) > \epsilon_\Sigma$ then
7:         Compute $\tilde\Sigma$ with Equation 7, 9, or 11
8:     else
9:         $\tilde\Sigma = \Sigma$
10:    $\beta = \kappa + (H_0 - \kappa)\tau^{10i/N}$   ▷ (optional) entropy control as described in Section 4.2
11:    if $H(\tilde\Sigma) < \beta$ then
12:        $c = \exp\{(\beta - H(\tilde\Sigma))/\dim(a)\}$
13:        $\tilde\Sigma = c\,\tilde\Sigma$
14:    return $\tilde\mu$, $\tilde\Sigma$

Algorithm 2 Algorithmic view of the proposed trust region projections. The trust region projections themselves do not require approximations; the old-policy update in the last step is the only point where we introduce an approximation. This update would normally require additional supervised regression steps that minimize the distance between the network output and the projection. However, by leveraging the regression penalty during policy optimization, this optimization step can be omitted. Both approaches yield a policy that is independent of the old policy distribution, i.e., it can act without the projection while maintaining the trust region. However, the penalty does not require additional computation, and the policy can directly generate new trajectories, as in other trust region methods such as PPO.

1: Initialize policy $\theta_{0,0}$
2: for $i = 0, 1, \ldots, N$ do   ▷ epochs
3:     Collect set of trajectories $D_i = \{\tau_k\}$ with $\pi(\theta_{i,0})$
4:     Compute advantage estimates $\hat A_t$ with GAE
5:     for $j = 0, 1, \ldots, M$ do
6:         Use $\pi(\theta_{i,j})$ to predict Gaussian action distributions $\mathcal{N}(\mu_{i,j}, \Sigma_{i,j})$ for $D_i$
7:         $\tilde\pi = \text{TRUSTREGIONLAYER}(\mu_{i,j}, \Sigma_{i,j}, \mu_{i,0}, \Sigma_{i,0})$
8:         Update the policy with Adam using the following policy gradient:
$$\theta_{i,j+1} \leftarrow \text{Adam}\left( \nabla_\theta \left[ \mathbb{E}_{\pi(\theta_{i,0})}\!\left[ \frac{\tilde\pi(a|s;\theta)}{\pi(a|s;\theta_{i,0})} \hat A_t \right] - \alpha\, \mathbb{E}_{s\sim p^{\pi(\theta_{i,0})}}\!\left[ d\left(\tilde\pi(\cdot|s;\theta), \pi(\cdot|s;\theta)\right) \right] \right]_{\theta=\theta_{i,j}} \right)$$
9:     Successive policy update: $\theta_{i+1,0} \leftarrow \theta_{i,M}$

B DERIVATIONS

B.1 PROOF OF THEOREM 1

This section provides a proof for Theorem 1. We mainly use the multiplicative version of the Brunn-Minkowski inequality,
$$\log\left|\alpha\Sigma_1 + \beta\Sigma_2\right| \geq \alpha\log|\Sigma_1| + \beta\log|\Sigma_2|,$$
where $\Sigma_1, \Sigma_2$ are p.s.d., $\alpha, \beta$ are positive, and $\alpha + \beta = 1$.

Frobenius Projection
$$
H(\tilde\pi) = \tfrac{1}{2}\log|2\pi e\tilde\Sigma|
= \tfrac{1}{2}\log\left|2\pi e\left(\tfrac{1}{\eta+1}\Sigma + \tfrac{\eta}{\eta+1}\Sigma_\text{old}\right)\right|
\geq \tfrac{1}{2}\log\left(\left|2\pi e\Sigma\right|^{\frac{1}{\eta+1}}\left|2\pi e\Sigma_\text{old}\right|^{\frac{\eta}{\eta+1}}\right)
= \tfrac{1}{\eta+1}\tfrac{1}{2}\log|2\pi e\Sigma| + \tfrac{\eta}{\eta+1}\tfrac{1}{2}\log|2\pi e\Sigma_\text{old}|
= \tfrac{1}{\eta+1}H(\pi_\theta) + \tfrac{\eta}{\eta+1}H(\pi_{\theta_\text{old}})
\geq \min\left(H(\pi_\theta), H(\pi_{\theta_\text{old}})\right)
$$

Wasserstein Projection. Let $k$ denote the dimensionality of the distributions under consideration.
$$
H(\tilde\pi) = \tfrac{1}{2}\log(2\pi e)^k|\tilde\Sigma|
= \tfrac{1}{2}\log(2\pi e)^k\left|\left(\tfrac{1}{\eta+1}\Sigma^{1/2} + \tfrac{\eta}{\eta+1}\Sigma_\text{old}^{1/2}\right)^2\right|
= \tfrac{1}{2}\left[\log(2\pi e)^k + 2\log\left|\tfrac{1}{\eta+1}\Sigma^{1/2} + \tfrac{\eta}{\eta+1}\Sigma_\text{old}^{1/2}\right|\right]
\geq \tfrac{1}{2}\left[\log(2\pi e)^k + 2\log\left(\left|\Sigma^{1/2}\right|^{\frac{1}{\eta+1}}\left|\Sigma_\text{old}^{1/2}\right|^{\frac{\eta}{\eta+1}}\right)\right]
= \tfrac{1}{2}\log(2\pi e)^k|\Sigma|^{\frac{1}{\eta+1}}|\Sigma_\text{old}|^{\frac{\eta}{\eta+1}}
= \tfrac{1}{2}\log|2\pi e\Sigma|^{\frac{1}{\eta+1}}|2\pi e\Sigma_\text{old}|^{\frac{\eta}{\eta+1}}
= \tfrac{1}{\eta+1}H(\pi_\theta) + \tfrac{\eta}{\eta+1}H(\pi_{\theta_\text{old}})
\geq \min\left(H(\pi_\theta), H(\pi_{\theta_\text{old}})\right)
$$

KL Projection
$$
H(\tilde\pi) = \tfrac{1}{2}\log|2\pi e\tilde\Sigma|
= \tfrac{1}{2}\log\left|\left(\tfrac{1}{\eta+1}(2\pi e\Sigma)^{-1} + \tfrac{\eta}{\eta+1}(2\pi e\Sigma_\text{old})^{-1}\right)^{-1}\right|
= -\tfrac{1}{2}\log\left|\tfrac{1}{\eta+1}(2\pi e\Sigma)^{-1} + \tfrac{\eta}{\eta+1}(2\pi e\Sigma_\text{old})^{-1}\right|
\leq -\tfrac{1}{2}\log\left(\left|(2\pi e\Sigma)^{-1}\right|^{\frac{1}{\eta+1}}\left|(2\pi e\Sigma_\text{old})^{-1}\right|^{\frac{\eta}{\eta+1}}\right)
= \tfrac{1}{2}\log|2\pi e\Sigma|^{\frac{1}{\eta+1}}|2\pi e\Sigma_\text{old}|^{\frac{\eta}{\eta+1}}
\quad \left(\text{using } \det(A^{-1}) = 1/\det(A)\right)
= \tfrac{1}{\eta+1}H(\pi_\theta) + \tfrac{\eta}{\eta+1}H(\pi_{\theta_\text{old}})
\leq \max\left(H(\pi_\theta), H(\pi_{\theta_\text{old}})\right)
$$

B.2 MEAN PROJECTION

First, we consider only the mean objective
$$\min_{\tilde\mu}\ (\mu - \tilde\mu)^T\Sigma_\text{old}^{-1}(\mu - \tilde\mu) \quad \text{s.t.} \quad (\mu_\text{old} - \tilde\mu)^T\Sigma_\text{old}^{-1}(\mu_\text{old} - \tilde\mu) \leq \epsilon_\mu,$$
which gives us the following Lagrangian
$$\mathcal{L}(\tilde\mu, \omega) = (\mu - \tilde\mu)^T\Sigma_\text{old}^{-1}(\mu - \tilde\mu) + \omega\left((\mu_\text{old} - \tilde\mu)^T\Sigma_\text{old}^{-1}(\mu_\text{old} - \tilde\mu) - \epsilon_\mu\right). \tag{13}$$
Differentiating w.r.t. $\tilde\mu$ yields
$$\frac{\partial\mathcal{L}(\tilde\mu, \omega)}{\partial\tilde\mu} = 2\Sigma_\text{old}^{-1}(\tilde\mu - \mu) + 2\omega\Sigma_\text{old}^{-1}(\tilde\mu - \mu_\text{old}).$$
Setting the derivative to 0 and solving for $\tilde\mu$ gives
$$\tilde\mu^* = \frac{\mu + \omega\mu_\text{old}}{1 + \omega}.$$
Inserting the optimal mean $\tilde\mu^*$ in Equation 13 results in
$$\mathcal{L}(\omega) = \frac{\omega^2(\mu - \mu_\text{old})^T\Sigma_\text{old}^{-1}(\mu - \mu_\text{old})}{(1+\omega)^2} + \omega\left(\frac{(\mu - \mu_\text{old})^T\Sigma_\text{old}^{-1}(\mu - \mu_\text{old})}{(1+\omega)^2} - \epsilon_\mu\right).$$
Thus, differentiating w.r.t. $\omega$ yields
$$\frac{\partial\mathcal{L}(\omega)}{\partial\omega} = \frac{(\mu - \mu_\text{old})^T\Sigma_\text{old}^{-1}(\mu - \mu_\text{old})}{(1+\omega)^2} - \epsilon_\mu.$$
Now solving $\partial\mathcal{L}(\omega)/\partial\omega \overset{!}{=} 0$ for $\omega$, we arrive at
$$\omega^* = \sqrt{\frac{(\mu - \mu_\text{old})^T\Sigma_\text{old}^{-1}(\mu - \mu_\text{old})}{\epsilon_\mu}} - 1.$$
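The closed-form mean projection above is only a few lines of code. A minimal sketch for the full-covariance case (function and variable names are ours, not from the paper's implementation; `old_prec` denotes $\Sigma_\text{old}^{-1}$):

```python
import numpy as np

def project_mean(mean, old_mean, old_prec, eps_mu):
    """Closed-form mean projection (Appendix B.2): if the Mahalanobis
    distance between new and old mean exceeds eps_mu, interpolate
    towards the old mean with omega* = sqrt(d / eps_mu) - 1."""
    diff = mean - old_mean
    d = diff @ old_prec @ diff          # squared Mahalanobis distance
    if d <= eps_mu:
        return mean                      # inside the trust region: identity
    omega = np.sqrt(d / eps_mu) - 1.0
    return (mean + omega * old_mean) / (1.0 + omega)

old_prec = np.linalg.inv(np.diag([0.5, 2.0]))
mu_old = np.zeros(2)
mu = np.array([2.0, -1.0])
proj = project_mean(mu, mu_old, old_prec, eps_mu=0.1)
# the projected mean lies exactly on the trust region boundary
d_proj = (proj - mu_old) @ old_prec @ (proj - mu_old)
```

Since the projected mean is $\tilde\mu^* = (\mu + \omega^*\mu_\text{old})/(1 + \omega^*)$, its distance to the old mean is $d/(1+\omega^*)^2 = \epsilon_\mu$, i.e., the bound holds with equality after projection.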

B.3 FROBENIUS COVARIANCE PROJECTION

We consider the following objective for the covariance part
$$\min_{\tilde\Sigma}\ \|\Sigma - \tilde\Sigma\|_F^2 \quad \text{s.t.} \quad \|\tilde\Sigma - \Sigma_\text{old}\|_F^2 \leq \epsilon_\Sigma$$
with the corresponding Lagrangian
$$\mathcal{L}(\tilde\Sigma, \eta) = \|\Sigma - \tilde\Sigma\|_F^2 + \eta\left(\|\tilde\Sigma - \Sigma_\text{old}\|_F^2 - \epsilon_\Sigma\right). \tag{14}$$
Differentiating w.r.t. $\tilde\Sigma$ yields
$$\frac{\partial\mathcal{L}(\tilde\Sigma, \eta)}{\partial\tilde\Sigma} = 2(\tilde\Sigma - \Sigma) + 2\eta(\tilde\Sigma - \Sigma_\text{old}).$$
We can again solve for $\tilde\Sigma$ by setting the derivative to 0, i.e.,
$$\tilde\Sigma^* = \frac{\Sigma + \eta\Sigma_\text{old}}{1 + \eta}.$$
Inserting $\tilde\Sigma^*$ into Equation 14 yields the dual function
$$g(\eta) = \left\|\frac{\Sigma + \eta\Sigma_\text{old}}{1+\eta} - \Sigma\right\|_F^2 + \eta\left(\left\|\frac{\Sigma + \eta\Sigma_\text{old}}{1+\eta} - \Sigma_\text{old}\right\|_F^2 - \epsilon_\Sigma\right).$$
Differentiating w.r.t. $\eta$ results in
$$\frac{\partial g(\eta)}{\partial\eta} = \frac{\|\Sigma - \Sigma_\text{old}\|_F^2}{(1+\eta)^2} - \epsilon_\Sigma.$$
Hence, $\partial g(\eta)/\partial\eta \overset{!}{=} 0$ yields
$$\eta^* = \frac{\|\Sigma - \Sigma_\text{old}\|_F}{\sqrt{\epsilon_\Sigma}} - 1.$$
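Like the mean projection, this Frobenius covariance projection admits a direct implementation. A minimal sketch (function and variable names are ours, not from the paper's implementation):

```python
import numpy as np

def project_cov_frobenius(cov, old_cov, eps_sigma):
    """Closed-form Frobenius covariance projection (Appendix B.3):
    eta* = ||Sigma - Sigma_old||_F / sqrt(eps_sigma) - 1 and
    Sigma~ = (Sigma + eta * Sigma_old) / (1 + eta)."""
    d = np.linalg.norm(cov - old_cov, ord="fro")
    if d ** 2 <= eps_sigma:
        return cov                       # inside the trust region: identity
    eta = d / np.sqrt(eps_sigma) - 1.0
    return (cov + eta * old_cov) / (1.0 + eta)

old_cov = np.diag([1.0, 1.0])
cov = np.diag([4.0, 0.25])
proj = project_cov_frobenius(cov, old_cov, eps_sigma=0.5)
# the projected covariance satisfies the bound with equality
viol = np.linalg.norm(proj - old_cov, ord="fro") ** 2
```

Because $\tilde\Sigma^* - \Sigma_\text{old} = (\Sigma - \Sigma_\text{old})/(1+\eta^*)$, the projected covariance again lands exactly on the bound when the original one violates it.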

B.4 WASSERSTEIN COVARIANCE PROJECTION

As described in the main text, the Gaussian distributions are assumed to have been rescaled by $\Sigma_\text{old}^{-1}$ to measure the distance in the metric space defined by the variance of the data. For notational simplicity, we show the derivation of the covariance projection only for the unscaled scenario; the scaled version can be obtained by a simple redefinition of the covariance matrices. For our covariance projection we are interested in solving the following optimization problem
$$\min_{\tilde\Sigma}\ \operatorname{tr}\left(\tilde\Sigma + \Sigma - 2\left(\Sigma^{1/2}\tilde\Sigma\Sigma^{1/2}\right)^{1/2}\right) \quad \text{s.t.} \quad \operatorname{tr}\left(\tilde\Sigma + \Sigma_\text{old} - 2\left(\Sigma_\text{old}^{1/2}\tilde\Sigma\Sigma_\text{old}^{1/2}\right)^{1/2}\right) \leq \epsilon_\Sigma,$$
which leads to the following Lagrangian function
$$\mathcal{L}(\tilde\Sigma, \eta) = \operatorname{tr}\left(\tilde\Sigma + \Sigma - 2\left(\Sigma^{1/2}\tilde\Sigma\Sigma^{1/2}\right)^{1/2}\right) + \eta\left(\operatorname{tr}\left(\tilde\Sigma + \Sigma_\text{old} - 2\left(\Sigma_\text{old}^{1/2}\tilde\Sigma\Sigma_\text{old}^{1/2}\right)^{1/2}\right) - \epsilon_\Sigma\right). \tag{15}$$
Assuming that $\tilde\Sigma$ commutes with $\Sigma$ as well as $\Sigma_\text{old}$, Equation 15 simplifies to
$$\mathcal{L}(\tilde\Sigma, \eta) = \operatorname{tr}\left(\tilde\Sigma + \Sigma - 2\tilde\Sigma^{1/2}\Sigma^{1/2}\right) + \eta\left(\operatorname{tr}\left(\tilde\Sigma + \Sigma_\text{old} - 2\tilde\Sigma^{1/2}\Sigma_\text{old}^{1/2}\right) - \epsilon_\Sigma\right) = \operatorname{tr}\left(S^2 + \Sigma - 2S\Sigma^{1/2}\right) + \eta\left(\operatorname{tr}\left(S^2 + \Sigma_\text{old} - 2S\Sigma_\text{old}^{1/2}\right) - \epsilon_\Sigma\right), \tag{16}$$
where $S$ is the unique positive semi-definite root of the positive semi-definite matrix $\tilde\Sigma$, i.e., $S = \tilde\Sigma^{1/2}$. Instead of optimizing the objective w.r.t. $\tilde\Sigma$, we optimize w.r.t. $S$, which greatly simplifies the calculation. That is, we solve
$$\frac{\partial\mathcal{L}(S, \eta)}{\partial S} = (1+\eta)2S - 2\left(\Sigma^{1/2} + \eta\Sigma_\text{old}^{1/2}\right) \overset{!}{=} 0$$
for $S$, which leads us to
$$S^* = \frac{\Sigma^{1/2} + \eta\Sigma_\text{old}^{1/2}}{1+\eta}, \qquad \tilde\Sigma^* = \frac{\Sigma + \eta^2\Sigma_\text{old} + 2\eta\Sigma^{1/2}\Sigma_\text{old}^{1/2}}{(1+\eta)^2}.$$
Inserting this into Equation 16 yields the dual function
$$g(\eta) = \frac{\eta\operatorname{tr}\left(\Sigma + \Sigma_\text{old} - 2\Sigma^{1/2}\Sigma_\text{old}^{1/2}\right)}{1+\eta} - \eta\epsilon_\Sigma.$$
The derivative of the dual w.r.t. $\eta$ is given by
$$\frac{\partial g(\eta)}{\partial\eta} = \frac{\operatorname{tr}\left(\Sigma + \Sigma_\text{old} - 2\Sigma^{1/2}\Sigma_\text{old}^{1/2}\right)}{(1+\eta)^2} - \epsilon_\Sigma.$$
Now solving $\partial g(\eta)/\partial\eta \overset{!}{=} 0$ for $\eta$, we arrive at
$$\eta^* = \sqrt{\frac{\operatorname{tr}\left(\Sigma + \Sigma_\text{old} - 2\Sigma^{1/2}\Sigma_\text{old}^{1/2}\right)}{\epsilon_\Sigma}} - 1.$$

B.5 KL-DIVERGENCE PROJECTION

We derive the KL-divergence projection in its general form, i.e., simultaneous projection of mean and covariance under an additional entropy constraint
$$\pi^* = \arg\min_\pi \text{KL}\left(\pi \| \pi_\theta\right) \quad \text{s.t.} \quad \text{KL}\left(\pi \| \pi_{\theta_\text{old}}\right) \leq \epsilon, \quad H(\pi) \geq \beta.$$
Instead of working with this minimization problem we consider the equivalent maximization problem
$$\pi^* = \arg\max_\pi -\text{KL}\left(\pi \| \pi_\theta\right) \quad \text{s.t.} \quad \text{KL}\left(\pi \| \pi_{\theta_\text{old}}\right) \leq \epsilon, \quad H(\pi) \geq \beta, \tag{17}$$
which is similar to the one considered in Model Based Relative Entropy Stochastic Search (MORE) (Abdolmaleki et al., 2015), with a few distinctions. To see those distinctions, let $\eta$ and $\omega$ denote the Lagrangian multipliers corresponding to the KL and entropy constraint, respectively, and consider the Lagrangian corresponding to the optimization problem in Equation 17
$$\mathcal{L} = -\text{KL}(\pi \| \pi_\theta) + \eta\left(\epsilon - \text{KL}(\pi \| \pi_{\theta_\text{old}})\right) + \omega\left(H(\pi) - \beta\right) = \mathbb{E}_\pi\left[\log\pi_\theta\right] + \eta\left(\epsilon - \text{KL}(\pi \| \pi_{\theta_\text{old}})\right) + (\omega + 1)H(\pi) - \omega\beta.$$
Opposed to Abdolmaleki et al. (2015), we are not working with an unknown reward but use the log density of the target distribution $\pi_\theta$ instead. Thus we do not need to fit a surrogate and can directly read off the parameters of the squared reward. They are given by the natural parameters of $\pi_\theta$, i.e., $\Lambda = \Sigma^{-1}$ and $q = \Sigma^{-1}\mu$. Additionally, we need to add a constant 1 to $\omega$ to account for the additional entropy term in the original objective, similar to Arenz et al. (2018). Following the derivations from Abdolmaleki et al. (2015) and Arenz et al. (2018), we can obtain a closed-form solution for the natural parameters of $\tilde\pi$, given the Lagrangian multipliers $\eta$ and $\omega$,
$$\tilde\Lambda = \frac{\eta\Lambda_\text{old} + \Lambda}{\eta + 1 + \omega} \quad \text{and} \quad \tilde q = \frac{\eta q_\text{old} + q}{\eta + 1 + \omega}. \tag{18}$$
To obtain the optimal Lagrangian multipliers we solve the following convex dual function using gradient descent
$$g(\eta, \omega) = \eta\epsilon - \omega\beta + \eta\left(-\tfrac{1}{2}q_\text{old}^T\Lambda_\text{old}^{-1}q_\text{old} + \tfrac{1}{2}\log\det\Lambda_\text{old} - \tfrac{k}{2}\log(2\pi)\right) + (\eta + 1 + \omega)\left(\tfrac{1}{2}\tilde q^T\tilde\Lambda^{-1}\tilde q - \tfrac{1}{2}\log\det\tilde\Lambda + \tfrac{k}{2}\log(2\pi)\right) + \text{const},$$
with gradients
$$\frac{\partial g(\eta, \omega)}{\partial\eta} = \epsilon - \text{KL}\left(\tilde\pi \| \pi_{\theta_\text{old}}\right) \quad \text{and} \quad \frac{\partial g(\eta, \omega)}{\partial\omega} = H(\tilde\pi) - \beta.$$
Given the optimal Lagrangian multipliers $\eta^*$ and $\omega^*$, we obtain the parameters of the optimal distribution $\pi^*$ using Equation 18.

Forward Pass. For the forward pass we compute the natural parameters of $\pi_\theta$, solve the dual optimization problem, and compute the mean and covariance of $\pi^*$ from the optimal natural parameters.
The corresponding compute graph is given in Figure 3. The analytic steps are
$$q = \Sigma^{-1}\mu, \quad \Lambda = \Sigma^{-1}, \qquad q^* = \frac{\eta^* q_\text{old} + q}{\eta^* + \omega^* + 1}, \quad \Lambda^* = \frac{\eta^*\Lambda_\text{old} + \Lambda}{\eta^* + \omega^* + 1}, \qquad \tilde\mu^* = \Lambda^{*-1}q^*, \quad \tilde\Sigma^* = \Lambda^{*-1},$$
where the multipliers $\eta^*$ and $\omega^*$ are obtained by numerical optimization of the dual.
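Restricted to diagonal Gaussians and given the multipliers, the analytic part of this forward pass can be sketched as follows. The names are ours; in the full layer $\eta^*$ and $\omega^*$ come from numerically optimizing the convex dual, whereas here they are passed in for illustration.

```python
import numpy as np

def kl_projection_analytic(mean, var, old_mean, old_var, eta, omega):
    """Analytic steps of the KL projection forward pass for diagonal
    Gaussians: map to natural parameters, interpolate with the (given)
    Lagrangian multipliers as in Equation 18, and map back."""
    lam, q = 1.0 / var, mean / var                       # natural params of pi_theta
    lam_old, q_old = 1.0 / old_var, old_mean / old_var   # natural params of pi_old
    lam_p = (eta * lam_old + lam) / (eta + 1.0 + omega)
    q_p = (eta * q_old + q) / (eta + 1.0 + omega)
    return q_p / lam_p, 1.0 / lam_p                      # projected mean, variance

mean, var = np.array([1.0, -2.0]), np.array([0.5, 2.0])
old_mean, old_var = np.zeros(2), np.ones(2)
# eta = omega = 0 recovers the target; eta -> infinity recovers the old policy
m0, v0 = kl_projection_analytic(mean, var, old_mean, old_var, eta=0.0, omega=0.0)
```

The interpolation in natural-parameter space makes the limiting behavior easy to check: large $\eta$ pulls the projection onto the old policy, while $\eta = \omega = 0$ leaves the predicted policy untouched.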


Figure 3: Compute graph of the KL projection layer. The layer first computes the natural parameters of $\pi_\theta$ from the mean and covariance. It then numerically optimizes the dual to obtain the optimal Lagrangian multipliers, which are used to compute the optimal natural parameters. Ultimately, the optimal mean and covariance are computed from the optimal natural parameters. We omit the dependency on constants, i.e., the bound $\epsilon$ and $\beta$ as well as the parameters of $\pi_\text{old}$, for clarity of the visualization.

Backward Pass. Given the computational graph in Figure 3, gradients can be propagated back through the layer using standard back-propagation. All gradients for the analytical computations (black arrows in Figure 3) are straightforward and can be found in Petersen & Pedersen (2012). For the gradients of the numerical optimization of the dual (red arrows in Figure 3), we follow Amos & Kolter (2017) and differentiate the KKT conditions around the optimal Lagrangian multipliers computed during the forward pass. The KKT conditions of the dual are given by
$$\nabla g(\eta^*, \omega^*) + \nabla(-\eta^*, -\omega^*)^T m = \begin{pmatrix} \epsilon - \text{KL}\left(\pi^* \| \pi_{\theta_\text{old}}\right) - m_1 \\ H(\pi^*) - \beta - m_2 \end{pmatrix} = 0 \quad \text{(stationarity)}$$
$$m_1(-\eta^*) = 0 \quad \text{and} \quad m_2(-\omega^*) = 0 \quad \text{(complementary slackness)},$$
where $m = (m_1, m_2)^T$ denotes the Lagrangian multipliers for the box constraints of the dual ($\eta$ and $\omega$ need to be non-negative). Taking the differentials of those conditions yields the equation system
$$
\begin{pmatrix}
-\frac{\partial\text{KL}(\pi^*\|\pi_{\theta_\text{old}})}{\partial\eta^*} & -\frac{\partial\text{KL}(\pi^*\|\pi_{\theta_\text{old}})}{\partial\omega^*} & -1 & 0 \\
\frac{\partial H(\pi^*)}{\partial\eta^*} & \frac{\partial H(\pi^*)}{\partial\omega^*} & 0 & -1 \\
-m_1 & 0 & -\eta^* & 0 \\
0 & -m_2 & 0 & -\omega^*
\end{pmatrix}
\begin{pmatrix} d\eta \\ d\omega \\ dm_1 \\ dm_2 \end{pmatrix}
=
\begin{pmatrix}
\frac{\partial\text{KL}(\pi^*\|\pi_{\theta_\text{old}})}{\partial q}dq + \frac{\partial\text{KL}(\pi^*\|\pi_{\theta_\text{old}})}{\partial\Lambda}d\Lambda \\
-\frac{\partial H(\pi^*)}{\partial q}dq - \frac{\partial H(\pi^*)}{\partial\Lambda}d\Lambda \\
0 \\ 0
\end{pmatrix},
$$
which is (analytically) solved to obtain the desired partial derivatives $\partial\eta/\partial q$, $\partial\eta/\partial\Lambda$, $\partial\omega/\partial q$, and $\partial\omega/\partial\Lambda$.

Implementation. We implemented the whole layer using C++, Armadillo, and OpenMP for parallelization.
The implementation saves all necessary quantities for the backward pass, so the numerical optimization is only required during the forward pass. Before performing the numerical optimization, we check whether it is actually necessary: if the target distribution $\pi_\theta$ is within the trust region, we can immediately set $\pi^* = \pi_\theta$, i.e., the forward and backward pass become the identity mapping. This check yields significant speed-ups, especially in early iterations, when the target is still close to the old distribution. If the projection is necessary, we use L-BFGS to optimize the two-dimensional convex dual, which is still fast. For example, for a 17-dimensional action space and a batch size of 512, as in the Humanoid-v2 experiments, the layer takes roughly 170ms for the forward pass and 3.5ms for the backward pass if all 512 Gaussians are actually projected¹. If none of the Gaussians needs to be projected, the forward and backward pass together take less than 1ms. Simplifications: If only diagonal covariances are considered, the implementation simplifies significantly, as computationally heavy operations (matrix inversions and Cholesky decompositions) reduce to pointwise operations (divisions and square roots). If only the covariance part of the KL is projected, we set $\mu_\text{old} = \mu = \tilde\mu^*$ and $d\mu = 0$, which again simplifies both the derivations and the implementation. If an entropy equality constraint is used instead of an inequality constraint, it is sufficient to remove the $\omega \geq 0$ constraint in the dual optimization.
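To illustrate how the diagonal case reduces to pointwise operations, here is a sketch of the Wasserstein covariance projection (Appendix B.4) for diagonal Gaussians, including the same check-then-project shortcut described above. Names and the diagonal restriction are ours; the closed form uses the commuting assumption, which holds trivially for diagonal matrices.

```python
import numpy as np

def project_var_w2_diag(var, old_var, eps_sigma):
    """Wasserstein covariance projection for diagonal Gaussians: matrix
    roots reduce to pointwise square roots. If the squared W2 distance
    is within the bound, the projection is the identity mapping."""
    sqrt_v, sqrt_old = np.sqrt(var), np.sqrt(old_var)
    d = np.sum((sqrt_v - sqrt_old) ** 2)     # squared W2 distance (diagonal case)
    if d <= eps_sigma:
        return var                            # identity, no optimization needed
    eta = np.sqrt(d / eps_sigma) - 1.0        # closed-form dual solution
    proj_sqrt = (sqrt_v + eta * sqrt_old) / (1.0 + eta)
    return proj_sqrt ** 2

old_var = np.ones(3)
var = np.array([4.0, 0.25, 1.0])
proj = project_var_w2_diag(var, old_var, eps_sigma=0.1)
d_proj = np.sum((np.sqrt(proj) - np.sqrt(old_var)) ** 2)
```

As with the other projections, the interpolated root lies exactly on the bound after projection, and the early-exit branch mirrors the identity-mapping speed-up used in the C++ layer.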

C ADDITIONAL RESULTS

Figure 4 shows the training curves for all Mujoco environments with 95% confidence intervals. Besides the projections, we also show the performance of PAPI and PPO. In Figure 5 the projections additionally leverage the entropy control based on Akrour et al. (2019). We trained 40 agents with different seeds for each environment, using five evaluation episodes for every data point. The plots show the total mean reward with 95% confidence intervals.

D HYPERPARAMETERS

Tables 2 and 3 show the hyperparameters used for the experiments in Table 1 . Target entropy, temperature, and entropy equality are only required when the entropy projection is included in the layer, otherwise those values are ignored. 



We assume the true matrix square root $\tilde\Sigma = \tilde\Sigma^{1/2}\tilde\Sigma^{1/2}$, and not a Cholesky factor $\tilde\Sigma = LL^T$, since the true root naturally appears in the expressions for the projected covariance from the original Wasserstein formulation.

¹ On an 8-core Intel Core i7-9700K CPU @ 3.60GHz.



Figure 1: (a), (b), and (c): Interpolated covariances for the different projections for various values of η. For the Frobenius and Wasserstein projections the intermediate distributions clearly have a larger entropy, while for the KL projection the intermediate entropy is smaller. (d): Entropy of the interpolated distributions. In this example π and π_old have the same entropy. It can be seen that the entropy increases for the Frobenius and Wasserstein projections when transitioning between the distributions, while it decreases for the KL projection. A more general statement regarding this can be found in Theorem 1.

Figure 2: (Left): Mean KL divergence for Ant-v2 as a standardizing measure to compare the policy changes among all methods. (Center): Mahalanobis distance between the mean values of the unprojected and old policy when using different α in comparison to a full regression. The mean bound for the W2 projection is set to 0.03 (dotted black line). (Right): Mean cumulative reward with 95% confidence interval based on 40 seeds for the semi-sparse 5-link Reacher task. For each method, besides PAPI, we train policies with (dashed) and without (solid) contextual covariances.

Figure 4: Training curves for the projection layers as well as PPO and PAPI on the test environments. We trained 40 agents with different seeds for each environment using five evaluation episodes for every data point. The plot shows the total mean reward with 95% confidence interval.

Figure 5: Training curves for the projection layers with entropy control (-E) as well as PPO and PAPI on the test environments. We trained 40 agents with different seeds for each environment using five evaluation episodes for every data point. The plot shows the total mean reward with 95% confidence interval.

Table 1: Mean return with 95% confidence interval of 20 epochs after completing 20% of the total training and for the last 20 epochs. We trained 40 different seeds for each experiment and computed five evaluation rollouts per epoch. The projections with (-E) and without entropy control are considered separately; therefore, each column may have up to two best runs (bold).

Table 2: Hyperparameters for all three projections as well as PAPI, PPO, and PPO-M on the Mujoco benchmarks from Table 1.

Table 3: Hyperparameters for all three projections as well as PAPI and PPO on Humanoid-v2 from Table 1.

Table 4: Hyperparameters for all three projections as well as PAPI and PPO on our ReacherSparse-v0 task from Figure 2. The second value for $\epsilon_\Sigma$ of the KL projection is the bound when using a contextual covariance.

