MONOTONIC ROBUST POLICY OPTIMIZATION WITH MODEL DISCREPANCY

Abstract

State-of-the-art deep reinforcement learning (DRL) algorithms tend to overfit in specific environments due to the lack of data diversity in training. To mitigate the model discrepancy between the training and target (testing) environments, domain randomization (DR) can generate plenty of environments with sufficient diversity by randomly sampling environment parameters in the simulator. Although standard DR with a uniform sampling distribution improves the average performance over the whole range of environments, the worst-case environment is usually neglected without any performance guarantee. Since the average and worst-case performance are equally important for generalization in RL, in this paper we propose a policy optimization approach that concurrently improves the policy's performance in the average case (i.e., over all possible environments) and in the worst-case environment. We theoretically derive a lower bound on the worst-case performance of a given policy over all environments. Guided by this lower bound, we formulate an optimization problem that jointly optimizes the policy and the sampling distribution, such that the constrained expected performance over all environments is maximized. We prove that the worst-case performance is monotonically improved by iteratively solving this optimization problem. Based on the proposed lower bound, we develop a practical algorithm, named monotonic robust policy optimization (MRPO), and validate it on several robot control tasks. By modifying the environment parameters in simulation, we obtain environments for the same task but with different transition dynamics for training and testing. We demonstrate that MRPO improves both the average and worst-case performance in the training environments, and equips the learned policy with better generalization capability in unseen testing environments.

1. INTRODUCTION

With deep neural network approximation, deep reinforcement learning (DRL) has extended classical reinforcement learning (RL) algorithms to successfully solve complex control tasks, e.g., playing computer games with human-level performance (Mnih et al., 2013; Silver et al., 2018) and continuous robotic control (Schulman et al., 2017). Because it relies on random exploration, DRL often requires tremendous amounts of data to train a reliable policy. It is thus infeasible for many tasks, such as robotic control and autonomous driving, as training in the real world is not only time-consuming and expensive, but also dangerous. Therefore, training is often conducted on a very limited set of samples, resulting in overfitting and poor generalization capability. One alternative solution is to learn a policy in a simulator (i.e., the source/training environment) and then transfer it to the real world (i.e., the target/testing environment). Currently, it is impossible to model the exact environment and physics of the real world. For instance, physical effects like nonrigidity and fluid dynamics are quite difficult to model accurately in simulation. How to mitigate the model discrepancy between the training and target environments thus remains a challenge for generalization in RL. Domain randomization (DR), a simple but effective method, has been proposed to address this: it randomizes the simulator (e.g., by randomizing the distribution of environment parameters) to generate a variety of environments for training the policy in the source domain. Compared with training in a single environment, recent studies have shown that policies learned over an ensemble of environment dynamics obtained by DR achieve better generalization performance with respect to the expected return. The expected return refers to the average performance across all the trajectories sampled from different environments.
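As a concrete illustration, standard DR in this setting reduces to uniformly sampling environment parameters and maximizing the mean return over the resulting environments. The sketch below is a minimal illustration under assumed names: the parameter ranges and the helper functions are hypothetical placeholders, not the paper's implementation.

```python
import random

# Hypothetical parameter ranges for a simulated robot (illustrative only).
PARAM_RANGES = {"friction": (0.5, 1.5), "mass": (0.8, 1.2)}

def sample_env_params(ranges):
    """Uniformly sample one environment parameter vector (standard DR)."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

def average_return(returns):
    """DR's training objective: the mean return over sampled environments."""
    return sum(returns) / len(returns)

# Example: sample a batch of randomized training environments.
random.seed(0)
batch = [sample_env_params(PARAM_RANGES) for _ in range(4)]
assert all(0.5 <= p["friction"] <= 1.5 for p in batch)
```

Each sampled parameter vector would instantiate one training environment; the policy is then updated against trajectories drawn from all of them, weighting every environment equally.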
Since these trajectories, regardless of their performance, are uniformly sampled, the trajectories with the worst performance can severely degrade the overall performance. In contrast, another line of research on generalization in RL takes the perspective of control theory, i.e., learning policies that are robust to environment perturbations. Robust RL algorithms likewise learn policies using model ensembles, produced by perturbing the parameters of a nominal model. EPOpt (Rajeswaran et al., 2017), a representative of this line, trains the policy solely on the worst-performing subset, i.e., the trajectories in the worst α percentile of returns, while discarding all higher-performing trajectories. In other words, it seeks higher worst-case performance at the cost of degrading the average performance. In general, robust RL algorithms may sacrifice performance on many environment variants and focus only on the environments with the worst performance, so that the learned policy will not behave very badly in a previously unseen environment. In this paper, we focus on the generalization issue in RL, and aim to mitigate the model discrepancy of the transition dynamics between the training and target environments. Considering that the average and worst-case performance are equally important for evaluating the generalization capability of the policy, we propose a policy optimization approach in which the distribution of the sampled trajectories is specifically designed to concurrently improve both the average and worst-case performance. Our main contributions are summarized as follows.
• For a given policy and a wide range of environments, we theoretically derive a lower bound on the worst-case expected return of that policy over all the environments, and prove that maximizing this lower bound (and hence the worst-case performance) can be achieved by solving an average-performance maximization problem, subject to constraints that bound the policy update step and the statistical distance between the worst-case and average-case environments. To the best of our knowledge, this is the first theoretical analysis of the relationship between the worst-case and average performance, and it provides practical guidance for updating policies toward maximizing both.

• Trajectories obtained from diverse environments may contribute differently to the generalization capacity of the policy. Faced with a huge number of trajectories, one must therefore consider which types of trajectories are most likely to affect the generalization performance. Unlike traditional uniform sampling, which offers no worst-case performance guarantee, and unlike worst-α-percentile sampling, in which the parameter α is empirically preset, we propose a criterion for selecting sampled trajectories based on the proposed worst-case and average performance maximization, which takes into account both environment diversity and the worst-case environments.

• Based on the proposed theorem, we develop a monotonic robust policy optimization (MRPO) algorithm to learn a policy that maximizes both the worst-case and average performance. Specifically, MRPO carries out a two-step optimization to update the policy and the distribution of the sampled trajectories, respectively. We further prove that the policy optimization problem can be transformed into trust region policy optimization (TRPO) (Schulman et al., 2015) over all possible environments, so that the policy update can be implemented with the commonly used proximal policy optimization (PPO) algorithm (Schulman et al., 2017). Finally, we prove that updating the policy with MRPO monotonically increases the worst-case expected return.

• To greatly reduce the computational complexity, we impose Lipschitz continuity assumptions on the transition dynamics and propose a practical implementation of MRPO. We conduct experiments on five robot control tasks with variable environment transition dynamics and show that, compared to DR and robust RL baselines, MRPO improves both the average and worst-case performance in the training environments, and equips the learned policy with significantly better generalization to unseen testing environments.
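The sampling strategies contrasted above can be made concrete with a small sketch. This is an illustration only: the return values are made up, and the helper functions schematize uniform averaging, the worst-case objective, and EPOpt-style worst-α-percentile selection, not MRPO's actual update rule.

```python
def average_return(returns):
    """Standard DR objective: mean return over all sampled environments."""
    return sum(returns) / len(returns)

def worst_case_return(returns):
    """Robust objective: return of the worst-performing environment."""
    return min(returns)

def worst_alpha_subset(returns, alpha):
    """EPOpt-style selection: keep only the worst alpha-fraction of returns."""
    k = max(1, int(alpha * len(returns)))
    return sorted(returns)[:k]

# Illustrative returns from five environments with different dynamics.
returns = [10.0, 8.0, 9.5, 2.0, 7.5]
assert worst_case_return(returns) == 2.0
assert worst_alpha_subset(returns, 0.2) == [2.0]
```

Training only on `worst_alpha_subset` raises the worst case but discards the high-return environments that drive the average; MRPO's two-step optimization instead aims to increase `average_return` and `worst_case_return` together, with the monotonic worst-case guarantee proved above.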

2. BACKGROUND

Under the standard RL setting, the environment is modeled as a Markov decision process (MDP) defined by a tuple ⟨S, A, T, R⟩. S is the state space and A is the action space; for convenience of derivation, we assume both are finite. T : S × A × S → [0, 1] is the transition dynamics determined by the environment parameter p ∈ P, where P denotes the environment pa-

