MONOTONIC ROBUST POLICY OPTIMIZATION WITH MODEL DISCREPANCY

Abstract

State-of-the-art deep reinforcement learning (DRL) algorithms tend to overfit to the specific environments they are trained in, due to the lack of data diversity in training. To mitigate the model discrepancy between the training and target (testing) environments, domain randomization (DR) can generate a large number of environments with sufficient diversity by randomly sampling environment parameters in a simulator. Although standard DR with a uniform sampling distribution improves the average performance over the whole range of environments, the worst-case environment is usually neglected, without any performance guarantee. Since the average and worst-case performance are equally important for generalization in RL, in this paper we propose a policy optimization approach that concurrently improves the policy's performance in the average case (i.e., over all possible environments) and in the worst-case environment. We theoretically derive a lower bound on the worst-case performance of a given policy over all environments. Guided by this lower bound, we formulate an optimization problem that jointly optimizes the policy and the sampling distribution, such that the constrained expected performance over all environments is maximized. We prove that the worst-case performance is monotonically improved by iteratively solving this optimization problem. Based on the proposed lower bound, we develop a practical algorithm, named monotonic robust policy optimization (MRPO), and validate it on several robot control tasks. By modifying the environment parameters in simulation, we obtain environments for the same task but with different transition dynamics for training and testing. We demonstrate that MRPO improves both the average and worst-case performance in the training environments, and equips the learned policy with better generalization capability in unseen testing environments.

1. INTRODUCTION

With deep neural network approximation, deep reinforcement learning (DRL) has extended classical reinforcement learning (RL) algorithms to successfully solve complex control tasks, e.g., playing computer games with human-level performance (Mnih et al., 2013; Silver et al., 2018) and continuous robotic control (Schulman et al., 2017). Because it relies on random exploration, DRL often requires tremendous amounts of data to train a reliable policy. This makes it infeasible for many tasks, such as robotic control and autonomous driving, since training in the real world is not only time-consuming and expensive, but also dangerous. Therefore, training is often conducted on a very limited set of samples, resulting in overfitting and poor generalization capability. One alternative is to learn a policy in a simulator (i.e., the source/training environment) and then transfer it to the real world (i.e., the target/testing environment). However, it is currently impossible to model the exact environment and physics of the real world; for instance, physical effects like nonrigidity and fluid dynamics are quite difficult to model accurately in simulation. How to mitigate the model discrepancy between the training and target environments thus remains a challenge for generalization in RL. To simulate the dynamics of the environment, domain randomization (DR), a simple but effective method, has been proposed. It randomizes the simulator (e.g., by randomizing the distribution of environment parameters) to generate a variety of environments for training the policy in the source domain. Compared with training in a single environment, recent research has shown that policies learned through an ensemble of environment dynamics obtained by DR achieve better generalization performance with respect to the expected return. The expected return refers to the average performance across all the trajectories sampled from different environments.
Since these trajectories, regardless of their performance, are uniformly sampled, the trajectories with the worst performance can severely degrade the overall performance. In contrast, another line of research on generalization in RL takes the perspective of control theory, i.e., learning policies that are robust to environment perturbations. Robust RL algorithms also learn policies using model ensembles produced by perturbing the parameters of the nominal model. EPOpt (Rajeswaran et al., 2017), a representative of these methods, trains the policy solely on the worst-performing subset, i.e., trajectories in the worst α percentile of returns, while discarding all higher-performing trajectories. In other words, it seeks a higher worst-case performance at the cost of degrading the average performance. In general, robust RL algorithms may sacrifice performance on many environment variants and focus only on the environments with the worst performance, so that the learned policy will not behave very badly in a previously unseen environment. In this paper, we focus on the generalization issue in RL, and aim to mitigate the model discrepancy of the transition dynamics between the training and target environments. Considering that the average and worst-case performance are equally important for evaluating the generalization capability of the policy, we propose a policy optimization approach in which the distribution of the sampled trajectories is specifically designed to concurrently improve both the average and worst-case performance. Our main contributions are summarized as follows.
• For a given policy and a wide range of environments, we theoretically derive a lower bound on the worst-case expected return of that policy over all the environments, and prove that maximizing this lower bound (equivalent to maximizing the worst-case performance) can be achieved by solving an average performance maximization problem, subject to constraints that bound the update step in policy optimization and the statistical distance between the worst-case and average-case environments. To the best of our knowledge, this theoretical analysis of the relationship between the worst-case and average performance is reported for the first time, and it provides practical guidance for updating policies towards maximizing both the worst-case and average performance.

• Trajectories obtained from diverse environments may contribute differently to the generalization capacity of the policy. Therefore, given a huge number of trajectories, one should consider which types of trajectories are most likely to affect the generalization performance. Unlike traditional uniform sampling, which offers no worst-case performance guarantee, and unlike worst-α-percentile sampling, in which the parameter α is preset empirically, we propose a criterion for selecting sampled trajectories based on the proposed worst-case and average performance maximization, which takes into account both environment diversity and the worst-case environments.

• Based on the proposed theorem, we develop a monotonic robust policy optimization (MRPO) algorithm to learn the optimal policy with maximum worst-case and average performance. Specifically, MRPO carries out a two-step optimization to update the policy and the distribution of the sampled trajectories, respectively.
We further prove that the policy optimization problem can be transformed into trust region policy optimization (TRPO) (Schulman et al., 2015) over all possible environments, such that the policy update can be implemented with the commonly used proximal policy optimization (PPO) algorithm (Schulman et al., 2017). Finally, we prove that by updating the policy with MRPO, the worst-case expected return is monotonically increased.

• To greatly reduce the computational complexity, we impose Lipschitz continuity assumptions on the transition dynamics and propose a practical implementation of MRPO. We then conduct experiments on five robot control tasks with variable environment transition dynamics, and show that MRPO improves both the average and worst-case performance in the training environments compared to DR and robust RL baselines, and significantly improves the learned policy's generalization capability in unseen testing environments.

2. BACKGROUND

Under the standard RL setting, the environment is modeled as a Markov decision process (MDP) defined by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R} \rangle$. $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space; for convenience of derivation, we assume both are finite. $\mathcal{T}: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ is the transition dynamics, determined by the environment parameter $p \in \mathcal{P}$, where $\mathcal{P}$ denotes the environment parameter space. For example, in robot control, the environment parameters could be physical coefficients that directly affect control, such as the friction of the joints and the torso mass. Throughout this paper, by environment $p$ we mean an environment whose transition dynamics are determined by parameter $p$. $\mathcal{R}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function. At each time step $t$, the agent observes the state $s_t \in \mathcal{S}$ and takes an action $a_t \in \mathcal{A}$ according to policy $\pi(a_t|s_t)$. The agent then receives a reward $r_t = \mathcal{R}(s_t, a_t)$, and the environment shifts from the current state $s_t$ to the next state $s_{t+1}$ with probability $\mathcal{T}(s_{t+1}|s_t, a_t, p)$. The goal of RL is to search for a policy $\pi$ that maximizes the expected cumulative discounted reward $\eta(\pi|p) = \mathbb{E}_\tau[G(\tau|p)]$, where $G(\tau|p) = \sum_{t=0}^{\infty} \gamma^t r_t$, $\tau = \{s_t, a_t, r_t, s_{t+1}\}_{t=0}^{\infty}$ denotes a trajectory generated by policy $\pi$ in environment $p$, and $\gamma \in [0, 1)$ is the discount factor. We can then define the state value function as $V_\pi(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s\right]$ and the action value function as $Q_\pi(s, a) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid s_t = s, a_t = a\right]$,
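To make the return definitions above concrete, the following is a minimal sketch (plain Python, with hypothetical reward sequences) of the discounted return $G(\tau|p)$ and a Monte-Carlo estimate of $\eta(\pi|p)$ from sampled trajectories:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G(tau) = sum_t gamma^t * r_t for a finite trajectory."""
    g = 0.0
    # Accumulate from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def expected_return(reward_sequences, gamma=0.99):
    """Monte-Carlo estimate of eta(pi|p): average return over trajectories
    sampled in one environment p under policy pi."""
    return sum(discounted_return(r, gamma) for r in reward_sequences) / len(reward_sequences)
```

The backward accumulation avoids recomputing powers of $\gamma$ and is the standard way to evaluate returns on finite rollouts.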

and the advantage function as

$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$. We denote the state distribution under environment $p$ and policy $\pi$ as $P_\pi(s|p)$, and that at time step $t$ as $P_\pi^t(s|p)$. For policy optimization in RL, when updating the current policy $\pi$ to a new policy $\tilde\pi$, Schulman et al. (2015) prove that
$$\eta(\tilde\pi|p) \ge L_\pi(\tilde\pi|p) - \frac{2\lambda\gamma}{(1-\gamma)^2}\beta^2, \quad L_\pi(\tilde\pi|p) = \eta(\pi|p) + \mathbb{E}_{s \sim P_\pi(\cdot|p),\, a \sim \pi(\cdot|s)}\left[\frac{\tilde\pi(a|s)}{\pi(a|s)} A_\pi(s, a)\right], \tag{1}$$
where $\lambda = \max_s |\mathbb{E}_{a \sim \tilde\pi(\cdot|s)}[A_\pi(s, a)]|$ is the maximum mean advantage following the current policy $\pi$, and $\beta = \max_s D_{TV}(\pi(\cdot|s) \,\|\, \tilde\pi(\cdot|s))$ is the maximum total variation (TV) distance between $\pi$ and $\tilde\pi$. The policy's expected return after the update can be monotonically improved by maximizing the lower bound in (1) w.r.t. $\tilde\pi$. Based on this and with certain approximations, Schulman et al. (2015) propose an algorithm named trust region policy optimization (TRPO) that optimizes $\tilde\pi$ in the direction of maximizing $L_\pi(\tilde\pi)$, subject to the trust-region constraint $\beta \le \delta$.

In standard RL, the environment parameter $p$ is fixed, without any model discrepancy. Under the domain randomization (DR) setting, because of the existence of model discrepancy, the environment parameter is instead a random variable $p$ following a probability distribution $P$ over $\mathcal{P}$. With DR, the goal of policy optimization is to maximize the mean expected cumulative discounted reward over all possible environment parameters, i.e., $\max_\pi \mathbb{E}_{p \sim P}[\eta(\pi|p)]$. In the face of model discrepancy, our goal is to provide a performance improvement guarantee for the worst-case environment, while also improving the average performance over all environments.

Lemma 1. For a given policy $\pi$, there exists a non-negative constant $C \ge 0$ such that the expected cumulative discounted reward in the environment with the worst-case performance satisfies
$$\eta(\pi|p_w) - \mathbb{E}_{p \sim P}[\eta(\pi|p)] \ge -C, \tag{2}$$
where environment $p_w$ corresponds to the worst-case performance, and $C$ depends on $p_w$ and $\pi$.

Proof. See Appendix A.1 for details.

Theorem 1.
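The total variation distance in the trust-region constraint $\beta \le \delta$ can be illustrated with a small sketch (the tabular policies below are hypothetical; real implementations work with parametric policies and estimate the maximum over visited states):

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions:
    D_TV(p || q) = 0.5 * sum_a |p(a) - q(a)|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def max_tv_over_states(policy_old, policy_new):
    """beta = max_s D_TV(pi(.|s) || pi~(.|s)), where each policy maps a
    state to a list of action probabilities."""
    return max(tv_distance(policy_old[s], policy_new[s]) for s in policy_old)


def within_trust_region(policy_old, policy_new, delta):
    """Check the TRPO-style constraint beta <= delta before accepting an update."""
    return max_tv_over_states(policy_old, policy_new) <= delta
```

In practice, TRPO and PPO enforce this constraint approximately (via a KL penalty or ratio clipping) rather than computing the exact maximum over all states.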
In MDPs with a bounded reward function, for any distribution $P$ over $\mathcal{P}$, updating the current policy $\pi$ to a new policy $\tilde\pi$ satisfies the following bound:
$$\eta(\tilde\pi|p_w) \ge \mathbb{E}_{p \sim P}[\eta(\tilde\pi|p)] - \frac{2|r|_{\max}\gamma\, \mathbb{E}_{p \sim P}[\epsilon(p_w\|p)]}{(1-\gamma)^2} - \frac{4|r|_{\max}\alpha}{(1-\gamma)^2}, \tag{3}$$
where $\epsilon(p_w\|p) \triangleq \max_t \mathbb{E}_{s' \sim P^t(\cdot|p_w)} \mathbb{E}_{a \sim \pi(\cdot|s')}\left[D_{TV}(T(s|s',a,p_w) \,\|\, T(s|s',a,p))\right]$, environment $p_w$ corresponds to the worst-case performance under the current policy $\pi$, and $\alpha \triangleq \max_t \mathbb{E}_{s' \sim P^t(\cdot|p_w)}\left[D_{TV}(\pi(a|s') \,\|\, \tilde\pi(a|s'))\right]$.

Proof. See Appendix A.2 for details, and Appendix A.7 for the bounded reward function condition.

In (3), $\epsilon(p_w\|p)$ specifies the model discrepancy between the two environments $p_w$ and $p$, in terms of the maximum expected TV distance between their transition dynamics over all time steps of trajectories sampled in environment $p_w$ using policy $\pi$; $\alpha$ denotes the maximum expected TV distance between the two policies along trajectories sampled in environment $p_w$ using policy $\pi$. The RHS of (3) is thus a lower bound on the expected return achieved in the worst-case environment $p_w$: the first term is the mean expected cumulative discounted reward over all environments under the sampling distribution $P$, while the other two terms can be viewed as penalties on a large TV distance between the worst-case environment $p_w$ and the average case, and on a large update step from the current policy $\pi$ to the new policy $\tilde\pi$, respectively. Therefore, maximizing this lower bound improves the worst-case performance, which in practice is equivalent to the following constrained optimization problem:
$$\max_{\tilde\pi, P}\ \mathbb{E}_{p \sim P}[\eta(\tilde\pi|p)] \quad \text{s.t.} \quad \alpha \le \delta_1, \quad \mathbb{E}_{p \sim P}[\epsilon(p_w\|p)] \le \delta_2. \tag{4}$$
The objective is to maximize the mean expected cumulative discounted reward over all possible environments by updating not only the policy $\tilde\pi$ but also the sampling distribution $P$ of the environment parameters. The first constraint imposes a trust region similar to TRPO (Schulman et al., 2015) that bounds the update step in policy optimization. In addition, we propose a new trust-region constraint on the sampling distribution $P$: the TV distance between the worst-case environment $p_w$ and the average case over $P$ is bounded, so that achieving the objective in (4) also improves the worst-case performance.

Algorithm 1 Monotonic Robust Policy Optimization
1: Initialize policy $\pi_0$, uniform distribution $U$ over environment parameters, number of environment parameters sampled per iteration $M$, maximum number of iterations $N$, and maximum episode length $T$.
2: for $k = 0$ to $N-1$ do
3:   Sample a set of environment parameters $\{p_i\}_{i=0}^{M-1}$ according to $U$.
4:   for $i = 0$ to $M-1$ do
5:     Sample $L$ trajectories $\{\tau_{i,j}\}_{j=0}^{L-1}$ in environment $p_i$ using $\pi_k$.
6:     Determine $p_w^k = \arg\min_{p_i \in \{p_i\}_{i=0}^{M-1}} \sum_{j=0}^{L-1} G(\tau_{i,j}|p_i)/L$.
7:     Compute $\hat E(p_i, \pi_k) = \sum_{j=0}^{L-1} G(\tau_{i,j}|p_i)/L - \frac{2|r|_{\max}\gamma\, \epsilon(p_i\|p_w^k)}{(1-\gamma)^2}$ for environment $p_i$.
8:   end for
9:   Select the trajectory set $\mathcal{T} = \{\tau_i : \hat E(p_i, \pi_k) \ge \hat E(p_w^k, \pi_k)\}$.
10:  Use PPO for policy optimization on $\mathcal{T}$ to obtain the updated policy $\pi_{k+1}$.
11: end for

To solve the constrained optimization problem in (4), we need to search for both the optimal policy $\tilde\pi$ and the distribution $P$ of the sampled trajectories. In practice, we carry out a two-step optimization procedure to reduce the computational complexity. First, we fix the policy by letting $\tilde\pi = \pi$, and optimize the objective in (4) w.r.t. the distribution $P$.
In this case, we no longer need to consider the first constraint on the policy update, and we can fold the second constraint on the sampling distribution into the objective, guided by Theorem 1, formulating the following unconstrained optimization problem:
$$\max_P\ \mathbb{E}_{p \sim P}[E(p, \pi)], \tag{5}$$
where we denote $E(p, \pi) \triangleq \eta(\pi|p) - \frac{2|r|_{\max}\gamma\, \epsilon(p\|p_w)}{(1-\gamma)^2}$. The first term in $E(p, \pi)$ indicates policy $\pi$'s performance in environment $p$, while the second term measures the model discrepancy between environments $p$ and $p_w$. Since the objective in (5) is linear in $P$, we can update $P$ by assigning higher probability to environments $p$ with higher $E(p, \pi)$. Consequently, sampling according to $E(p, \pi)$ increases the sampling probability of environments with both poor and good-enough performance, and avoids being trapped in the worst-case environment. Specifically, we propose to select samples from environments $p$ that satisfy $E(p, \pi) \ge E(p_w, \pi)$ for training policy $\pi$, which is equivalent to assigning zero probability to the remaining samples. In the second step, we optimize the policy $\tilde\pi$ with the updated distribution $P$ fixed, i.e., we solve the optimization problem
$$\max_{\tilde\pi}\ \mathbb{E}_{p \sim P}[\eta(\tilde\pi|p)] \quad \text{s.t.} \quad \alpha \le \delta_1. \tag{6}$$
The optimization in (6) can be transformed into a trust-region policy optimization similar to TRPO and solved in practice with PPO (refer to Appendix A.3 and Schulman et al. (2017) for more information). To summarize, we present monotonic robust policy optimization (MRPO) in Algorithm 1. At each iteration $k$, we uniformly sample $M$ environments and collect trajectories in each sampled environment.
For each environment $p_i$, we sample $L$ trajectories $\{\tau_{i,j}\}_{j=0}^{L-1}$, approximate $\eta(\pi_k|p_i)$ with $\sum_{j=0}^{L-1} G(\tau_{i,j}|p_i)/L$, and determine the worst-case environment $p_w^k$ based on these estimates over the sampled set $\{p_i\}_{i=0}^{M-1}$. We then optimize the policy with PPO on the trajectory subset $\mathcal{T}$ selected by comparing $\hat E(p_i, \pi_k)$ with $\hat E(p_w^k, \pi_k)$. We now formally show that, by maximizing the lower bound provided in Theorem 1, the worst-case performance over all the environments is monotonically improved by MRPO.

Theorem 2. The sequence of policies $\{\pi_1, \pi_2, \ldots, \pi_N\}$ generated by Algorithm 1 is guaranteed monotonic worst-case performance improvement, i.e., $\eta(\pi_1|p_w^1) \le \eta(\pi_2|p_w^2) \le \cdots \le \eta(\pi_N|p_w^N)$, where $p_w^k$ denotes the parameter of the environment with the worst-case performance under the current policy $\pi_k$ at iteration $k$.

Proof. See Appendix A.4 for details.
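The trajectory-selection step (Line 9 of Algorithm 1) can be sketched as follows (a minimal illustration; the per-environment scores $\hat E$ are assumed to be precomputed as in Line 7, and environment IDs are hypothetical):

```python
def select_training_set(scores, trajectories, worst_env):
    """Keep trajectories from environments whose score
    E_hat(p_i) = mean return - discrepancy penalty is at least that of
    the worst-case environment (Line 9 of Algorithm 1).

    scores:       dict mapping environment id -> E_hat score
    trajectories: dict mapping environment id -> its trajectory data
    worst_env:    id of the worst-case environment p_w
    """
    threshold = scores[worst_env]
    # Keeping only envs with score >= E_hat(p_w) is equivalent to assigning
    # zero sampling probability to the discarded environments.
    return [traj for env, traj in trajectories.items() if scores[env] >= threshold]
```

Note that the worst-case environment itself always passes the filter, since its score equals the threshold.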

3. PRACTICAL IMPLEMENTATION USING SIMULATOR

Motivated by Theorem 1, we proposed Algorithm 1, which provably yields monotonic improvement of the policy's performance in the worst-case environment according to Theorem 2. However, Theorem 1 requires calculating $\epsilon(p_w\|p)$, which involves estimating the expected total variation distance between the worst-case environment and every other sampled environment at each time step. Estimation by sampling takes exponential complexity. Moreover, in the model-free setting we have no access to an analytical form of the environment's transition dynamics, so the total variation distance between two environments cannot be computed directly. Under deterministic state transitions, an environment shifts its state with probability one; hence we could set the state of one environment's simulator to a state along a trajectory from the other environment, take the same action, and compare the resulting next states to compute the total variation distance step by step. Though feasible, this method is computationally expensive. In this section, we instead propose a practical implementation of Algorithm 1. Following Appendix A.9, we first make a strong assumption that the transition dynamics model is $L_p$-Lipschitz in the environment parameter $p$:
$$|T(s|s', a, p) - T(s|s', a, p_w)| \le L_p \|p - p_w\|.$$
Then, we can simplify the calculation of $\epsilon(p_w\|p)$ via
$$\epsilon(p_w\|p) \le \max_t \mathbb{E}_{s' \sim P^t(\cdot|p_w,\pi)} \mathbb{E}_{a \sim \pi(\cdot|s')}\left[\tfrac{1}{2}\textstyle\sum_s L_p \|p - p_w\|\right] = \tfrac{1}{2}\textstyle\sum_s L_p \|p - p_w\|.$$
As seen from its expression, $\epsilon(p_w\|p)$ measures the distance between the transition dynamics of the worst-case environment $p_w$ and a specific environment $p$. In a simulator, the transition dynamics vary with the environment parameters, such as the friction and mass in robot simulation; hence the difference between environment parameters reflects the distance between the transition dynamics.
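Under this Lipschitz simplification, the discrepancy penalty reduces to a scaled distance between parameter vectors. A minimal sketch (the scale `kappa` and the parameter vectors are illustrative, not values from the paper):

```python
import math


def param_distance(p, p_w):
    """Euclidean distance between environment parameter vectors, used as a
    surrogate for the transition-dynamics discrepancy under the
    Lipschitz assumption."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, p_w)))


def e_hat(mean_return, p, p_w, kappa=1.0):
    """Practical score E_hat(p, pi) = mean return - kappa * ||p - p_w||,
    where kappa absorbs the Lipschitz constant and the penalty coefficient."""
    return mean_return - kappa * param_distance(p, p_w)
```

This replaces the intractable TV-distance estimate in Line 7 of Algorithm 1 with a cheap parameter-space computation.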
In addition, if we used the penalty coefficient of $\epsilon(p_w\|p)$ recommended by Theorem 1 in practice, the subset $\mathcal{T}$ would be very small. Therefore, we absorb it together with $L_p$ into a single tunable hyperparameter $\kappa$, and propose a practical version of MRPO in Algorithm 2 in Appendix A.6.

4. EXPERIMENTS

We now evaluate the proposed MRPO on five robot control benchmarks designed for evaluating generalization under changeable dynamics. These five environments are modified from the open-source generalization benchmarks (Packer et al., 2018) based on the complex robot control tasks introduced in (Schulman et al., 2017). We compare MRPO with two baselines, PPO-DR and PW-DR. In PPO-DR, PPO is applied for policy optimization under DR. In PW-DR, we train using only trajectories from the 10% worst-case environments, and still apply PPO for policy optimization. Note that PW-DR is an implementation of the EPOpt algorithm proposed in (Rajeswaran et al., 2017) without the value function baseline. Since this value function baseline in EPOpt is not proposed for generalization improvement (in fact, it could yield a performance improvement for all the policy optimization algorithms), we do not adopt it in PW-DR. We use two 64-unit hidden layers to construct the policy network and value function in PPO. For MRPO, we use the practical implementation described in Algorithm 2. Further note that, during the experiments, we find that environments yielding poor performance can be far away from each other in terms of the TV distance between their transition dynamics as the dimension of the changeable parameters increases. In other words, a single worst-case environment usually cannot represent all the environments where the current policy performs very poorly. Hence, we use the 10% worst-case environments in place of the single worst case when computing $\hat E(p_w^k, \pi_k)$ in Algorithm 2. That is, the trajectories added to the subset $\mathcal{T}$ must have $\hat E(p_i, \pi_k)$ greater than or equal to those of all the 10% worst-case environments.
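Selecting the 10% worst-case environments can be sketched as follows (a minimal illustration; environment IDs and returns are hypothetical):

```python
def worst_fraction(returns_by_env, frac=0.1):
    """Return the IDs of the worst `frac` of environments by mean return
    (at least one environment is always kept)."""
    ranked = sorted(returns_by_env, key=returns_by_env.get)  # ascending by return
    k = max(1, int(len(ranked) * frac))
    return ranked[:k]
```

The selection threshold for the subset $\mathcal{T}$ is then taken over the scores of all environments in this worst set, rather than over a single worst-case environment.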
From the expression $\hat E(p_i, \pi_k) = \sum_{j=0}^{L-1} G(\tau_{i,j}|p_i)/L - \kappa \|p_i - p_w^k\|$ for environment $p_i$, it can be seen that trajectory selection is based on a trade-off between performance and distance to the worst-case environment, as described in detail in the paragraph following (5). Our experiments are designed to investigate two classes of questions:

• Can MRPO effectively improve the policy's worst-case and average performance over the whole range of environments during training? Compared to DR, will the policy's average performance degrade when using MRPO? Will MRPO outperform PW-DR in the worst-case environments?

• How much does the performance of a robust policy trained with MRPO degrade when deployed in environments with unseen dynamics? And what determines the generalization performance to unseen environments during training?

4.1. TRAINING PERFORMANCE WITH DIFFERENT DYNAMICS

The five robot control tasks are as follows. (1) Walker2d: control a 2D bipedal robot to walk; (2) Hopper: control a 2D one-legged robot to hop as fast as possible; (3) HalfCheetah: control a 2D cheetah robot to run (Brockman et al., 2016); (4) InvertedDoublePendulum: control a cart (attached to a two-link pendulum system by a joint) by applying a force to prevent the two-link pendulum from falling over; and (5) Cartpole: control a cart (attached to a pole by a joint) by applying a force to prevent the pole from falling over. In robot control, the environment dynamics are directly related to the values of certain physical coefficients. For example, if the sliding friction of the joints is large, it is more difficult for the agent to manipulate the robot than with a smaller sliding friction. Hence, a policy that performs well under small friction may not generalize to an environment with large friction, due to the change of dynamics. In the simulator, by randomizing certain environment parameters, we obtain a set of environments with the same goal but different dynamics (Packer et al., 2018). The ranges of parameters preset for training in each environment are shown in Table 1. We run each algorithm for the same number of iterations N, with environment parameters sampled from the preset ranges. At each iteration k, we generate trajectories from M = 100 environments sampled according to a uniform distribution U. The results are obtained by running each algorithm with five different random seeds. The average return is computed over the returns of the M sampled environments at each iteration k. We show the training curves of Walker2d, Hopper, Halfcheetah, InvertedDoublePendulum, Cartpole, and HalfcheetahBroadRange in Figs. 1(a)-1(c) and 1(g)-1(i), respectively. In Figure 1, the solid curve represents the average performance of each algorithm over the five seeds, while the shaded area denotes the standard error.
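The environment-generation setup described above can be sketched as follows (the parameter names and ranges are illustrative placeholders, not the actual values from Table 1):

```python
import random


def sample_env_params(ranges, m, seed=None):
    """Sample M environment-parameter vectors uniformly from per-dimension
    ranges, e.g. {'friction': (0.5, 1.5), 'torso_mass': (3.0, 9.0)}.
    Each sampled dict defines one training environment's dynamics."""
    rng = random.Random(seed)
    return [{name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
            for _ in range(m)]
```

Each sampled parameter dict would be passed to the simulator to instantiate one environment with its own transition dynamics; testing uses ranges outside the training ones.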
It can be seen that DR steadily improves the average performance over the whole training range, as expected, while MRPO does not significantly degrade the average performance in any of the tasks. PW-DR, on the other hand, focuses on worst-case performance optimization, leading to an obvious degradation of average performance.

4.2. GENERALIZATION TO UNSEEN ENVIRONMENTS

Our theoretical analysis shows that MRPO optimizes both the average and worst-case performance during training. Here, we carry out experiments to show that MRPO can also generalize to a broader range of unseen environments at test time. To this end, we compare the testing performance on unseen environments of Walker2d, Hopper and HalfcheetahBroadRange, using the best policies obtained by MRPO, DR and PW-DR during training, with the parameter ranges set for testing shown in Table 1. We observe that all policies degrade as friction decreases, while the impact of unseen density is less pronounced than that of friction. The heatmaps of return achieved in all the testing Hopper environments by each algorithm are shown and compared in Fig. 2. It can be seen that MRPO generalizes better to the unseen environments, while DR can hardly generalize at test time. Compared to PW-DR, MRPO has a broader generalization range. We remark that both the worst-case and average performance during training are crucial for generalization to unseen environments.

5. RELATED WORK

With the success of RL in recent years, many works have focused on improving the generalization ability of RL. Learning a policy that is robust to the worst-case environment is one strategy. Based on the theory of H∞ control (Zhou et al., 1996), robust RL takes into account disturbances of the environment parameters and models them as an adversary that perturbs the transition dynamics to prevent the agent from achieving higher rewards (Morimoto & Doya, 2005). Policy optimization is then formulated as a zero-sum game between the adversary and the RL agent. Pinto et al. (2017) incorporate robust RL into DRL, improving the robustness of DRL in complex robot control tasks. To solve the robust RL problem, robust dynamic programming formulates a robust value function and, accordingly, a robust Bellman operator (Iyengar, 2005; Mankowitz et al., 2020). The optimal robust policy can then be obtained by iteratively applying the robust Bellman operator, analogously to standard value iteration (Sutton & Barto, 2018). Besides, Rajeswaran et al. (2017) leverage data from the worst-case environments as adversarial samples to train a robust policy. However, these robust formulations can lead to unstable learning. Worse, their focus on the worst-case environments hampers improvement of the average performance over the whole range of environments. In contrast, in addition to the worst-case formulation, we also aim to improve the average performance. For generalization across different state spaces, an effective approach is domain adaptation, which maps different state spaces to a common embedding space.
The policy trained on this common space can then be generalized to a specific environment (Higgins et al., 2017b; James et al., 2019; Ammar et al., 2015) through a learned mapping, with mapping methods such as β-VAE (Higgins et al., 2017a), cGAN (Isola et al., 2017), and manifold alignment (Wang & Mahadevan, 2009). Function approximation enables RL to solve complex tasks with high-dimensional state and action spaces, but it also inherits the generalization issues of supervised learning. Deep neural networks (DNNs) suffer from overfitting due to the distribution discrepancy between training and testing sets. $\ell_2$-regularization, dropout and dataset augmentation (Goodfellow et al., 2016) play a significant role in generalization in deep learning, and have also been shown to improve policy generalization on some specifically designed environments (Cobbe et al., 2019; Farebrother et al., 2018). In terms of theoretical analysis, Murphy (2005) provides a generalization error bound for Q-learning, where the generalization error is the distance between the expected discounted reward achieved by the converged Q-learning policy and that of the optimal policy. Wang et al. (2019) analyze the generalization gap in reparameterizable RL under Lipschitz assumptions on the transition dynamics, policy and reward function. For monotonic policy optimization in RL, Schulman et al. (2015) propose to optimize a constrained surrogate objective, which guarantees performance improvement for the updated policy. In the context of model-based RL, Janner et al. (2019) and Luo et al. (2019) formulate a lower bound on a policy's performance in the true environment in terms of its performance on the learned model; the performance in the true environment can therefore be monotonically improved by maximizing this lower bound.
Different from these works, the proposed MRPO guarantees the robustness of the policy in terms of a monotonically increasing worst-case performance, while also improving the average performance.

A APPENDIX

A.1 PROOF OF LEMMA 1

Proof. First, define $\eta(\pi|p_w) - \max_{p \in \mathcal{P}} \eta(\pi|p) \triangleq -C_0$, where $C_0 \ge 0$ depends on $\pi$ and $\mathcal{P}$. Then, given a policy $\pi$ and any environment $p \in \mathcal{P}$, and for any constant $C \ge C_0 \ge 0$, we have:
$$\eta(\pi|p_w) - \mathbb{E}_{p \sim P}[\eta(\pi|p)] \ge \eta(\pi|p_w) - \max_{p \in \mathcal{P}} \eta(\pi|p) = -C_0 \ge -C.$$

A.2 PROOF OF THEOREM 1

Lemma 2. For any two joint distributions $P_1(x, y) = P_1(x)P_1(y|x)$ and $P_2(x, y) = P_2(x)P_2(y|x)$ over $x$ and $y$, the total variation distance between them is bounded by
$$D_{TV}(P_1(x,y) \,\|\, P_2(x,y)) \le D_{TV}(P_1(x) \,\|\, P_2(x)) + \max_x D_{TV}(P_1(y|x) \,\|\, P_2(y|x)).$$

Proof.
$$\begin{aligned}
D_{TV}(P_1(x,y) \,\|\, P_2(x,y)) &= \tfrac{1}{2}\sum_{x,y} |P_1(x,y) - P_2(x,y)| \\
&= \tfrac{1}{2}\sum_{x,y} |P_1(x)P_1(y|x) - P_2(x)P_2(y|x)| \\
&= \tfrac{1}{2}\sum_{x,y} |P_1(x)P_1(y|x) - P_1(x)P_2(y|x) + P_1(x)P_2(y|x) - P_2(x)P_2(y|x)| \\
&\le \tfrac{1}{2}\sum_{x,y} P_1(x)|P_1(y|x) - P_2(y|x)| + \tfrac{1}{2}\sum_x |P_1(x) - P_2(x)| \\
&= \mathbb{E}_{x \sim P_1}\left[D_{TV}(P_1(y|x) \,\|\, P_2(y|x))\right] + D_{TV}(P_1(x) \,\|\, P_2(x)).
\end{aligned}$$

Lemma 3. Suppose the initial state distributions $P_1^0(s)$ and $P_2^0(s)$ are the same. Then the distance between the state marginals at time step $t$ is bounded as
$$D_{TV}(P_1^t(s) \,\|\, P_2^t(s)) \le t \max_t \mathbb{E}_{s' \sim P_1^t}\left[D_{TV}(P_1(s|s') \,\|\, P_2(s|s'))\right].$$

Proof.
$$\begin{aligned}
|P_1^t(s) - P_2^t(s)| &= \Big|\sum_{s'} P_1(s_t = s|s')P_1^{t-1}(s') - \sum_{s'} P_2(s_t = s|s')P_2^{t-1}(s')\Big| \\
&\le \sum_{s'} |P_1(s_t = s|s')P_1^{t-1}(s') - P_2(s_t = s|s')P_2^{t-1}(s')| \\
&= \sum_{s'} |P_1(s_t = s|s')P_1^{t-1}(s') - P_2(s_t = s|s')P_1^{t-1}(s') + P_2(s_t = s|s')P_1^{t-1}(s') - P_2(s_t = s|s')P_2^{t-1}(s')| \\
&\le \mathbb{E}_{s' \sim P_1^{t-1}}|P_1(s|s') - P_2(s|s')| + \sum_{s'} P_2(s|s')|P_1^{t-1}(s') - P_2^{t-1}(s')|.
\end{aligned}$$
Therefore,
$$\begin{aligned}
D_{TV}(P_1^t(s) \,\|\, P_2^t(s)) &= \tfrac{1}{2}\sum_s |P_1^t(s) - P_2^t(s)| \\
&\le \tfrac{1}{2}\sum_s \Big[\mathbb{E}_{s' \sim P_1^{t-1}}|P_1(s|s') - P_2(s|s')| + \sum_{s'} P_2(s|s')|P_1^{t-1}(s') - P_2^{t-1}(s')|\Big] \\
&= \mathbb{E}_{s' \sim P_1^{t-1}}\left[D_{TV}(P_1(s|s') \,\|\, P_2(s|s'))\right] + D_{TV}(P_1^{t-1}(s') \,\|\, P_2^{t-1}(s')) \\
&\le \sum_{i=1}^{t} \mathbb{E}_{s' \sim P_1^{i-1}}\left[D_{TV}(P_1(s|s') \,\|\, P_2(s|s'))\right] \\
&\le t \max_t \mathbb{E}_{s' \sim P_1^t}\left[D_{TV}(P_1(s|s') \,\|\, P_2(s|s'))\right].
\end{aligned}$$

Theorem 3. (A modified version of Theorem 1.)
For any distribution $P$ over $\mathcal{P}$, by updating the current policy $\pi$ to a new policy $\tilde\pi$, the following bound holds:
$$\eta(\tilde\pi|p_w) - \mathbb{E}_{p \sim P}[\eta(\tilde\pi|p)] \ge -\frac{2|r|_{\max}\gamma\,\mathbb{E}_{p \sim P}[\epsilon(p_w\|p)]}{(1-\gamma)^2} - \frac{4|r|_{\max}\alpha}{(1-\gamma)^2},$$
where $\epsilon(p_w\|p) \triangleq \max_t \mathbb{E}_{s' \sim P^t(\cdot|p_w)} \mathbb{E}_{a \sim \pi(\cdot|s')}\left[D_{TV}(T(s|s',a,p_w) \,\|\, T(s|s',a,p))\right]$, environment $p_w$ corresponds to the worst-case performance, and $\alpha \triangleq \max_t \mathbb{E}_{s' \sim P^t(\cdot|p_w)}\left[D_{TV}(\pi(a|s') \,\|\, \tilde\pi(a|s'))\right]$.

Proof. We can rewrite the LHS of (2) as
$$\eta(\tilde\pi|p_w) - \mathbb{E}_{p \sim P}[\eta(\tilde\pi|p)] = \eta(\tilde\pi|p_w) - \eta(\pi|p_w) + \eta(\pi|p_w) - \mathbb{E}_{p \sim P}[\eta(\tilde\pi|p)].$$
For the last two terms, we have
$$\begin{aligned}
\mathbb{E}_{p \sim P}|\eta(\pi|p_w) - \eta(\tilde\pi|p)| &= \mathbb{E}_{p \sim P}\Big|\sum_t \gamma^t \sum_{s,a}\big(P^t(s,a|p_w) - P^t(s,a|p)\big)R(s,a)\Big| \\
&\le \mathbb{E}_{p \sim P}\sum_t \gamma^t \sum_{s,a}|P^t(s,a|p_w) - P^t(s,a|p)|\,|R(s,a)| \\
&= 2|r|_{\max}\sum_t \gamma^t\, \mathbb{E}_{p \sim P}\left[D_{TV}(P^t(s,a|p_w) \,\|\, P^t(s,a|p))\right], \quad (29)
\end{aligned}$$
where $P^t(s,a|p_w) = \pi(a|s)P^t(s|p_w)$ and $P^t(s,a|p) = \tilde\pi(a|s)P^t(s|p)$. By Lemma 2, we have
$$\mathbb{E}_{p \sim P}\left[D_{TV}(P^t(s,a|p_w) \,\|\, P^t(s,a|p))\right] \le \mathbb{E}_{s \sim P^t(\cdot|p_w)}\left[D_{TV}(\pi(a|s) \,\|\, \tilde\pi(a|s))\right] + \mathbb{E}_{p \sim P}\left[D_{TV}(P^t(s|p_w) \,\|\, P^t(s|p))\right]. \quad (30)$$
Note that
$$P(s|s', p_w) = \sum_a T(s|s',a,p_w)\pi(a|s'), \qquad P(s|s', p) = \sum_a T(s|s',a,p)\tilde\pi(a|s').$$
Similar to Lemma 2, we have
$$\begin{aligned}
D_{TV}(P(s|s',p_w) \,\|\, P(s|s',p)) &= \frac{1}{2}\sum_s\Big|\sum_a T(s|s',a,p_w)\,\pi(a|s') - T(s|s',a,p)\,\tilde{\pi}(a|s')\Big| && (33,34)\\
&\le \frac{1}{2}\sum_s\sum_a |T(s|s',a,p_w) - T(s|s',a,p)|\,\pi(a|s')\\
&\quad + \frac{1}{2}\sum_s\sum_a T(s|s',a,p)\,|\pi(a|s') - \tilde{\pi}(a|s')| && (35)\\
&= \mathbb{E}_{a\sim\pi(\cdot|s')}\big[D_{TV}(T(s|s',a,p_w) \,\|\, T(s|s',a,p))\big] + D_{TV}(\pi(a|s') \,\|\, \tilde{\pi}(a|s')). && (36)
\end{aligned}$$
By Lemma 3 and Jensen's inequality, we have
$$\begin{aligned}
\mathbb{E}_{p\sim P}\big[D_{TV}(P^t(s|p_w) \,\|\, P^t(s|p))\big] &\le t\,\mathbb{E}_{p\sim P}\max_{t'} \mathbb{E}_{s'\sim P^{t'}(\cdot|p_w)}\big[D_{TV}(P(s|s',p_w) \,\|\, P(s|s',p))\big]\\
&\le t\,\mathbb{E}_{p\sim P}\max_{t'} \mathbb{E}_{s'\sim P^{t'}(\cdot|p_w)}\,\mathbb{E}_{a\sim\pi(\cdot|s')}\big[D_{TV}(T(s|s',a,p_w) \,\|\, T(s|s',a,p))\big]\\
&\quad + t\max_{t'} \mathbb{E}_{s'\sim P^{t'}(\cdot|p_w)}\big[D_{TV}(\pi(a|s') \,\|\, \tilde{\pi}(a|s'))\big]. && (37)
\end{aligned}$$
Since $\epsilon(p_w \,\|\, p) = \max_t \mathbb{E}_{s'\sim P^t(\cdot|p_w)}\mathbb{E}_{a\sim\pi(\cdot|s')}[D_{TV}(T(s|s',a,p_w) \,\|\, T(s|s',a,p))]$ and $\alpha = \max_t \mathbb{E}_{s'\sim P^t(\cdot|p_w)}[D_{TV}(\pi(a|s') \,\|\, \tilde{\pi}(a|s'))]$, combining (29), (30) and (37), we have
$$\begin{aligned}
\big|\eta(\pi|p_w) - \mathbb{E}_{p\sim P}[\eta(\tilde{\pi}|p)]\big| &\le \mathbb{E}_{p\sim P}\big|\eta(\pi|p_w) - \eta(\tilde{\pi}|p)\big|\\
&\le 2|r|_{\max}\sum_t \gamma^t\big[t\big(\mathbb{E}_{p\sim P}[\epsilon(p_w \,\|\, p)] + \alpha\big) + \alpha\big]\\
&= 2|r|_{\max}\left[\frac{\gamma\,\mathbb{E}_{p\sim P}[\epsilon(p_w \,\|\, p)]}{(1-\gamma)^2} + \frac{\alpha}{(1-\gamma)^2}\right], && (38)
\end{aligned}$$
where we use $\sum_t t\gamma^t = \gamma/(1-\gamma)^2$ and $\sum_t \gamma^t = 1/(1-\gamma)$. From (38), it follows that
$$\eta(\pi|p_w) - \mathbb{E}_{p\sim P}[\eta(\tilde{\pi}|p)] \ge -2|r|_{\max}\left[\frac{\gamma\,\mathbb{E}_{p\sim P}[\epsilon(p_w \,\|\, p)]}{(1-\gamma)^2} + \frac{\alpha}{(1-\gamma)^2}\right]. \qquad (39)$$
Similar to the derivation of (38), and referring to Janner et al. (2019), we have
$$\eta(\tilde{\pi}|p_w) - \eta(\pi|p_w) \ge -\frac{2|r|_{\max}\alpha}{(1-\gamma)^2}. \qquad (40)$$
Combining the above results, we end up with the proof:
$$\eta(\tilde{\pi}|p_w) - \mathbb{E}_{p\sim P}[\eta(\tilde{\pi}|p)] \ge -\frac{2|r|_{\max}\gamma\,\mathbb{E}_{p\sim P}[\epsilon(p_w \,\|\, p)]}{(1-\gamma)^2} - \frac{4|r|_{\max}\alpha}{(1-\gamma)^2}.$$

A.3 DERIVATION OF POLICY OPTIMIZATION STEP

In the policy optimization step, we aim to solve the following optimization problem:
$$\max_{\tilde{\pi}} \ \mathbb{E}_{p\sim P}[\eta(\tilde{\pi}|p)] \quad \text{s.t.} \quad \alpha \le \delta_1. \qquad (42)$$
Referring to (1), we have:
$$\mathbb{E}_{p\sim P}[\eta(\tilde{\pi}|p)] \ge \mathbb{E}_{p\sim P}[L_{\pi}(\tilde{\pi}|p)] - \frac{2\lambda\gamma}{(1-\gamma)^2}\beta^2. \qquad (43)$$
We now maximize the RHS of (43) to maximize the objective in (6) under the constraint $\alpha \le \delta_1$:
$$\max_{\tilde{\pi}} \ \mathbb{E}_{p\sim P}[L_{\pi}(\tilde{\pi}|p)] - \frac{2\lambda\gamma}{(1-\gamma)^2}\beta^2 \quad \text{s.t.} \quad \alpha \le \delta_1. \qquad (44)$$
Note that we have:
$$\alpha = \max_t \mathbb{E}_{s'\sim P^t(\cdot|p_w)}\big[D_{TV}(\pi(a|s') \,\|\, \tilde{\pi}(a|s'))\big] \le \max_s D_{TV}(\pi(a|s) \,\|\, \tilde{\pi}(a|s)) = \beta. \qquad (45)$$
Following the approximation in Schulman et al. (2015), (44) can be equivalently transformed to:
$$\max_{\tilde{\pi}} \ \mathbb{E}_{p\sim P}\,\mathbb{E}_{s\sim P_{\pi}(\cdot|p),\,a\sim\pi(\cdot|s)}\left[\frac{\tilde{\pi}(a|s)}{\pi(a|s)}\,A_{\pi}(s,a)\right] \quad \text{s.t.} \quad \beta \le \delta, \qquad (46)$$
which can be solved using PPO (Schulman et al., 2017).

A.4 PROOF OF THEOREM 2

Proof. Denote $H(\pi_k \,\|\, \pi_{k+1}) \triangleq \max_t \mathbb{E}_{s'\sim P^t(\cdot|p_w)}\big[D_{TV}(\pi_k(a|s') \,\|\, \pi_{k+1}(a|s'))\big]$. Updating $\pi_k$ to $\pi_{k+1}$ at each iteration $k$ and following Theorem 1, we have
$$\eta(\pi_{k+1}|p_w^k) \ge \mathbb{E}_{p\sim P_{k+1}}\left[\eta(\pi_{k+1}|p) - \frac{2|r|_{\max}\gamma\,\epsilon(p \,\|\, p_w^k)}{(1-\gamma)^2}\right] - \frac{4|r|_{\max} H(\pi_k \,\|\, \pi_{k+1})}{(1-\gamma)^2}. \qquad (47)$$
Since $P_{k+1}$ and $\pi_{k+1}$ are obtained by maximizing the RHS of (3), we have
$$\begin{aligned}
&\mathbb{E}_{p\sim P_{k+1}}\left[\eta(\pi_{k+1}|p) - \frac{2|r|_{\max}\gamma\,\epsilon(p \,\|\, p_w^k)}{(1-\gamma)^2}\right] - \frac{4|r|_{\max} H(\pi_k \,\|\, \pi_{k+1})}{(1-\gamma)^2} && (48)\\
&\ge \mathbb{E}_{p\sim P_{k+1}}\left[\eta(\pi_k|p) - \frac{2|r|_{\max}\gamma\,\epsilon(p \,\|\, p_w^k)}{(1-\gamma)^2}\right] - \frac{4|r|_{\max} H(\pi_k \,\|\, \pi_k)}{(1-\gamma)^2} && (49)\\
&= \mathbb{E}_{p\sim P_{k+1}}\left[\eta(\pi_k|p) - \frac{2|r|_{\max}\gamma\,\epsilon(p \,\|\, p_w^k)}{(1-\gamma)^2}\right], && (50)
\end{aligned}$$
since $H(\pi_k \,\|\, \pi_k) = 0$. From Line 9 in Algorithm 1, every environment selected for training satisfies:
$$\eta(\pi_k|p) - \frac{2|r|_{\max}\gamma\,\epsilon(p \,\|\, p_w^k)}{(1-\gamma)^2} \ge \eta(\pi_k|p_w^k) - \frac{2|r|_{\max}\gamma\,\epsilon(p_w^k \,\|\, p_w^k)}{(1-\gamma)^2} = \eta(\pi_k|p_w^k). \qquad (51)$$
Therefore, combining (47)–(51), we have:
$$\eta(\pi_{k+1}|p_w^{k+1}) \approx \eta(\pi_{k+1}|p_w^k) \ge \mathbb{E}_{p\sim P_{k+1}}[\eta(\pi_k|p_w^k)] = \eta(\pi_k|p_w^k), \qquad (52)$$
where the approximation is made under the assumption that the expected returns of the worst-case environment at two consecutive iterations are similar. This assumption stems from the trust-region constraint we impose on the update step between the current and new policies, and is also validated empirically in Appendix A.5.
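As a numerical sanity check on the quantities used in these proofs, the sketch below (ours, not part of the paper's implementation; all constants are hypothetical) verifies the joint-distribution total variation bound of Lemma 2 on random discrete distributions, and evaluates the magnitude of the Theorem 3 lower bound for sample values of $|r|_{\max}$, $\gamma$, $\mathbb{E}[\epsilon]$ and $\alpha$:

```python
import numpy as np

rng = np.random.default_rng(0)

def tv(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Lemma 2: D_TV(P1(x,y) || P2(x,y)) <= D_TV(P1(x) || P2(x))
#                                      + max_x D_TV(P1(y|x) || P2(y|x)).
nx, ny = 4, 5
P1_x, P2_x = rng.dirichlet(np.ones(nx)), rng.dirichlet(np.ones(nx))
P1_yx = rng.dirichlet(np.ones(ny), size=nx)  # row i is P1(y | x=i)
P2_yx = rng.dirichlet(np.ones(ny), size=nx)  # row i is P2(y | x=i)
lhs = tv((P1_x[:, None] * P1_yx).ravel(), (P2_x[:, None] * P2_yx).ravel())
rhs = tv(P1_x, P2_x) + max(tv(P1_yx[i], P2_yx[i]) for i in range(nx))
assert lhs <= rhs + 1e-12

def worst_case_gap_lower_bound(r_max, gamma, eps_mean, alpha):
    # Theorem 3 lower bound on eta(pi~|p_w) - E_{p~P}[eta(pi~|p)]:
    # -2|r|_max * gamma * E[eps] / (1-gamma)^2 - 4|r|_max * alpha / (1-gamma)^2
    denom = (1.0 - gamma) ** 2
    return -(2.0 * r_max * gamma * eps_mean + 4.0 * r_max * alpha) / denom
```

With $|r|_{\max} = 1$, $\gamma = 0.99$, $\mathbb{E}[\epsilon] = 0.01$ and $\alpha = 0.005$, the bound evaluates to $-398$, illustrating how strongly the $(1-\gamma)^{-2}$ factor amplifies even small dynamics and policy gaps.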

A.5 EMPIRICAL VERIFICATION OF ASSUMPTION IN THEOREM 2

To verify the assumption made in Theorem 2, we study in Fig. 3 how the parameters of poorly performing environments scatter in the parameter space. Specifically, we plot the heatmap of return over the range of Hopper environments used for training, before and after a policy update by MRPO. It can be seen that at iteration k = 300, the poorly performing environments of the two policies before and after the MRPO update concentrate in the same region, i.e., the area of small frictions. The same result can be observed at iteration k = 350. In both cases, the Monte Carlo estimates of the worst-case return before and after the update (reported with Fig. 3) are close, which supports the assumption made in (52), i.e., that the expected returns of the worst-case environment at two consecutive iterations are similar.

A.6 PRACTICAL IMPLEMENTATION OF MRPO

It can be seen from the expression of ε(p_w ‖ p) that it measures the distance between the transition dynamics of the worst-case environment p_w and those of a specific environment p. In the simulator, the transition dynamics vary with the environment parameters, such as the friction and mass in robot simulation; hence the difference between environment parameters can reflect the distance between transition dynamics. In addition, if we used in practice the penalty coefficient of ε(p_w ‖ p) recommended by Theorem 1, the subset T would be very small. Therefore, we integrate it with L_p in (9) into a tunable hyperparameter κ, and propose a practical version of MRPO in Algorithm 2.

3: Sample a set of environment parameters {p_i}_{i=0}^{M−1} according to U.
4: for i = 0 to M − 1 do
5:     Sample L trajectories {τ_{i,j}}_{j=0}^{L−1} in environment p_i using π_k.
6:     Determine p_w^k = argmin_{p_i ∈ {p_i}_{i=0}^{M−1}} Σ_{j=0}^{L−1} G(τ_{i,j}|p_i)/L.
7:     Compute Ê(p_i, π_k) = Σ_{j=0}^{L−1} G(τ_{i,j}|p_i)/L − κ‖p_i − p_w^k‖ for environment p_i.
8: end for
9: Select trajectory set T = {τ_i : Ê(p_i, π_k) ≥ Ê(p_w^k, π_k)}.
10: Use PPO for policy optimization on T to get the updated policy π_{k+1}.
11: end for
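The per-iteration scoring and selection logic of Algorithm 2 can be sketched as follows. This is a minimal NumPy sketch, not the paper's actual code; the function name `mrpo_select` and the array layout are our own:

```python
import numpy as np

def mrpo_select(returns, params, kappa=5.0):
    """Score sampled environments and pick the training subset T.

    returns: (M, L) array of trajectory returns G(tau_{i,j} | p_i).
    params:  (M, d) array of environment parameters p_i.
    """
    mean_ret = returns.mean(axis=1)                  # MC estimate of eta(pi_k | p_i)
    i_w = int(np.argmin(mean_ret))                   # worst-case environment p_w^k
    dist = np.linalg.norm(params - params[i_w], axis=1)
    score = mean_ret - kappa * dist                  # E_hat(p_i, pi_k)
    selected = np.flatnonzero(score >= score[i_w])   # indices whose trajectories form T
    return i_w, score, selected
```

Since the distance term vanishes for p_w^k itself, the selected set always contains the worst-case environment's trajectories; a larger κ shrinks the set toward environments near p_w^k.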

A.7 BOUNDED REWARD FUNCTION CONDITION IN ROBOT CONTROL TASKS

In Theorem 1, we require the reward function to be bounded. Referring to the source code of OpenAI Gym (Brockman et al., 2016), the reward functions for the five robot control tasks evaluated in this paper are listed below. Hopper and Walker2d: R = x_{t+1} − x_t + b − 0.001‖a_t‖²; Halfcheetah: R = x_{t+1} − x_t − 0.001‖a_t‖². In Hopper, Walker2d and Halfcheetah, x_{t+1} and x_t denote the positions of the robot at timesteps t + 1 and t, respectively. For Hopper and Walker2d, b ∈ {0, 1}, where b = 0 when the robot falls down and b = 1 otherwise. The squared norm of the action represents the energy cost of the system. Since the maximum distance the robot can move in one timestep and the energy cost of an action at each timestep are both bounded, these three tasks all have bounded reward functions. In Cartpole, the reward is always 1. In InvertedDoublePendulum, R = b − c_dist − c_vel, where b = 0 when the pendulum falls down and b = 10 otherwise, c_dist is the distance between the robot and the centre, and c_vel is the weighted sum of the two pendulums' angular velocities. Since all three terms b, c_dist and c_vel are physically bounded, the reward function, as a linear combination of them, is also bounded.
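As an illustration of how a concrete $|r|_{\max}$ follows from these bounded terms, the sketch below computes a bound for the Hopper/Walker2d reward under hypothetical limits `dx_max` on per-step displacement and `a_max` on the action norm (both values are assumptions for illustration, not taken from the simulator):

```python
def hopper_reward_bound(dx_max, a_max):
    # R = x_{t+1} - x_t + b - 0.001 * ||a_t||^2, with |x_{t+1} - x_t| <= dx_max,
    # b in {0, 1} and ||a_t|| <= a_max, so a valid bound is
    # |R| <= dx_max + 1 + 0.001 * a_max**2.
    return dx_max + 1.0 + 0.001 * a_max ** 2
```

For example, with a hypothetical per-step displacement limit of 0.1 and unit action norm, the bound is 1.101; any such finite value can serve as $|r|_{\max}$ in Theorem 1.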

A.8 ANALYSIS OF THE MONTE CARLO ESTIMATION OF η(π|p)

In Theorem 1, the worst-case environment parameter p_w needs to be selected according to the expected cumulative discounted reward η(π|p) of environment p. However, η(π|p) is infeasible to obtain exactly in a practical implementation. Therefore, following the common approach in Rajeswaran et al. (2017), we approximate the expectation η(π|p_i) = E_τ[G(τ|p_i)] of any environment p_i by the Monte Carlo (MC) estimate Σ_{j=0}^{L−1} G(τ_{i,j}|p_i)/L over L sampled trajectories. We then determine the worst-case environment p_w based on these estimates for the given set of environments {p_i}_{i=0}^{M−1}. In the following, we analyze the impact of L on the estimation error.

Theoretical analysis of the impact of L: By Chebyshev's inequality, for any environment p_i and any ε ≥ 0, with probability at least 1 − σ²/(Lε²), we have
$$\Big|\frac{1}{L}\sum_{j=0}^{L-1} G(\tau_{i,j}|p_i) - \eta(\pi|p_i)\Big| \le \varepsilon,$$
where σ² = Var(G(τ)|p_i) is the variance of the trajectory return. Hence the variance of the return does affect the MC estimation of η(π|p), and a larger L guarantees a higher probability that Σ_{j=0}^{L−1} G(τ_{i,j}|p_i)/L is close to η(π|p_i).

Empirical evaluation of the impact of L: In practice, we run MRPO on Hopper with different choices of L. We find that a larger L does not greatly affect the performance in terms of average return, as shown in Fig. 4(a), but significantly increases the training time, as shown in Fig. 4(b).

A.9 ANALYSIS OF THE LIPSCHITZ ASSUMPTION

In robot control tasks, classical optimal control methods commonly use differential equations to formulate the dynamics model. This indicates that the transition dynamics model is L_p-Lipschitz, and the formulated dynamics function can be used to estimate the Lipschitz constant L_p.
For example, the inverted double pendulum, one of our test environments, can be viewed as a two-link pendulum system (Chang et al., 2019). To simplify the analysis, we illustrate here a single inverted pendulum, the basic unit that forms the inverted double pendulum system. The single inverted pendulum has two state variables θ and θ̇, and one control input u, where θ and θ̇ represent the angular position from the inverted position and the angular velocity, respectively, and u is the torque. The system dynamics can therefore be described as
$$\ddot{\theta} = \frac{mgl\sin\theta + u - 0.1\,\dot{\theta}}{ml^2}, \qquad (53)$$
where m is the mass, g is the gravitational acceleration, and l is the length of the pendulum. In our setting, we may choose m as the variable environment parameter p. Since the above system dynamics are differentiable w.r.t. m, the maximum magnitude of the first derivative of the dynamics model w.r.t. m can be chosen as the Lipschitz constant L_p.
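A small sketch of this estimate (ours, with hypothetical torque and velocity limits): differentiating (53) with respect to m, the gravity term cancels and ∂θ̈/∂m = −(u − 0.1θ̇)/(m²l²), so its magnitude over m ≥ m_min is bounded by (u_max + 0.1·θ̇_max)/(m_min²l²), which serves as a Lipschitz constant in m:

```python
import numpy as np

def theta_ddot(theta, theta_dot, u, m, g=9.81, l=1.0):
    # Single inverted pendulum dynamics, Eq. (53):
    # theta_ddot = (m*g*l*sin(theta) + u - 0.1*theta_dot) / (m*l^2)
    return (m * g * l * np.sin(theta) + u - 0.1 * theta_dot) / (m * l ** 2)

def lipschitz_in_m(theta_dot_max, u_max, m_min, l=1.0):
    # d(theta_ddot)/dm = -(u - 0.1*theta_dot) / (m^2 * l^2); the gravity term
    # cancels, and the derivative magnitude is largest at m = m_min.
    return (u_max + 0.1 * theta_dot_max) / (m_min ** 2 * l ** 2)
```

For instance, with |u| ≤ 1, |θ̇| ≤ 2, l = 1 and m ≥ 1, the estimate gives L_p = 1.2, and a finite-difference check of θ̈ at two masses confirms the Lipschitz inequality on sample values.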

A.10 HYPERPARAMETER κ

In Algorithm 2, when we update the sampling distribution P for policy optimization, κ is a hyperparameter that controls the trade-off between the expected cumulative discounted reward η(π_k|p_i) and the distance ‖p_i − p_w^k‖ to the worst-case environment. Theoretically, a larger κ means that the policy cares more about the poorly performing environments, while a smaller κ pays more attention to the average performance. As an empirical evaluation, we run MRPO on Hopper with different choices of the hyperparameter κ. The training curves of the average return and the 10% worst-case return are shown in Figs. 5(a) and 5(b), respectively. Among the fixed choices of κ, the curve of κ = 5 outperforms those of κ = 20, 40, 60 in terms of the average return in Fig. 5(a), while the curve of κ = 60 outperforms those of κ = 5, 20, 40 in terms of the 10% worst-case return in Fig. 5(b). In the practical implementation, we therefore gradually increase κ during training, shifting the emphasis from the average performance toward the poorly performing environments.
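A gradually increasing κ can be realized, for example, by a simple ramp. The schedule below is purely illustrative (the endpoints 5 and 60 are taken from the values compared above, but the exact schedule shape and horizon are our own assumptions, not the paper's):

```python
def kappa_schedule(k, kappa_min=5.0, kappa_max=60.0, total_iters=500):
    # Hypothetical linear ramp: start by favoring average performance
    # (small kappa) and gradually emphasize poorly performing
    # environments (large kappa) as training progresses.
    frac = min(max(k / total_iters, 0.0), 1.0)
    return kappa_min + frac * (kappa_max - kappa_min)
```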



CONCLUSION

In this paper, we have proposed a robust policy optimization approach, named MRPO, for improving both the average and worst-case performance of policies. Specifically, we theoretically derived a lower bound for the worst-case performance of a given policy over all environments, and formulated an optimization problem to optimize the policy and sampling distribution together, subject to constraints that bounded the update step in policy optimization and the statistical distance between the worst-case and average-case environments. We proved that the worst-case performance is monotonically improved by iteratively solving this optimization problem. We have validated MRPO on several robot control tasks, demonstrating a performance improvement on both the worst-case and average-case environments, as well as a better generalization ability to a wide range of unseen environments.



Figure 1: Training curves of average return and 10% worst-case return.

Figure 2: Heatmap of return in unseen environments on Walker2d, Hopper and Halfcheetah with policies trained by MRPO, PW-DR and DR in the training environments.

…average performance on Hopper and Halfcheetah, and failure on Walker2d. We measure the worst-case performance by computing the worst 10% performance over all the sampled environments at iteration k; the corresponding training curves are illustrated in Figs. 1(d)-1(f) and 1(j)-1(l), respectively. It can be observed that MRPO achieves the best worst-case performance on Hopper, Cartpole and HalfcheetahBroadRange, while DR neglects the optimization of its worst-case performance and performs badly in these three tasks. PW-DR shows limited improvement of the worst-case performance compared to MRPO on Hopper, Halfcheetah, Cartpole and HalfcheetahBroadRange, and failure on Walker2d. Comparing Figs. 1(f) and 1(l), it can be concluded that the original parameter range we set for the Halfcheetah task (e.g., the friction range of [0.5, 1.1]) was too narrow to cause seriously poor performance on the 10% worst-case environments. By enlarging the friction range from [0.5, 1.1] to [0.2, 2.5] for HalfcheetahBroadRange, the training curves in Fig. 1(l) clearly demonstrate that MRPO outperforms the other baselines. A tabular comparison of the average and worst-case performance achieved during training by different algorithms in different tasks can be found in Table 2.

As shown in Figs. 3(a) and 3(b), at iteration k = 300, p_w^300 = (750, 0.5), the MC estimate of η(π^300|p_w^300) is 487.6 and that of η(π^301|p_w^300) is 532.0. At iteration k = 301, p_w^301 = (1027.8, 0.5) and the MC estimate of η(π^301|p_w^301) is 517.6. As shown in Figs. 3(c) and 3(d), at iteration k = 350, p_w^350 = (861.1, 0.5), the MC estimate of η(π^350|p_w^350) is 385.9 and that of η(π^351|p_w^350) is 422.2. At iteration k = 351, p_w^351 = (750, 0.5) and the MC estimate of η(π^351|p_w^351) is 394.0.

Figure 3: Heatmaps of return between policy updates at iterations k = 300 and k = 350, using MRPO on Hopper.

Algorithm 2 Practical Implementation of Monotonic Robust Policy Optimization
1: Initialize policy π_0, uniform distribution of environment parameters U, number of environment parameters sampled per iteration M, maximum number of iterations N, and maximum episode length T.
2: for k = 0 to N − 1 do

Figure 4: (a) Training curves of average return of MRPO on Hopper with different L; (b) time elapsed versus number of iterations during training.

Figure 5: Training curves of (a) average return and (b) 10% worst-case return of MRPO on Hopper with different κ.

Table 1: Range of parameters for each environment.

Table 2: Average and worst-case performance in training, where W, H, C, CP, I and CB refer to Walker2d, Hopper, Halfcheetah, Cartpole, InvertedDoublePendulum and HalfcheetahBroadRange, respectively.

