REVISITING DOMAIN RANDOMIZATION VIA RELAXED STATE-ADVERSARIAL POLICY OPTIMIZATION

Abstract

Domain randomization (DR) is widely used in reinforcement learning (RL) to bridge the gap between simulation and reality by maximizing the average return under perturbations of the environmental parameters. Although effective, DR has two limitations: (1) Even the most sophisticated simulators cannot capture every detail of reality because of their finite domain parameters and simplified physical models. (2) Previous methods often assume that the domain parameters follow a specific family of probability distributions, such as a normal or a uniform distribution, which may be incorrect. To enable robust RL via DR without these limitations, we rethink DR from the perspective of adversarial state perturbation, without the need to re-configure the simulator or rely on prior knowledge about the environment. We point out that perturbing agents to the worst states during training is naïve and can make the agents over-conservative. Hence, we present a Relaxed State-Adversarial Algorithm that tackles the over-conservatism issue by simultaneously maximizing the average-case and worst-case performance of policies. We compared our method against state-of-the-art methods for evaluation. Experimental results and theoretical proofs verify the effectiveness of our method.

1. INTRODUCTION

Most reinforcement learning (RL) agents are trained in simulated environments due to the difficulty of collecting data in real environments. However, the domain shift between the simulated and real environments can significantly reduce the agents' performance. To bridge this "reality gap", domain randomization (DR) methods perturb environmental parameters (Tobin et al., 2017; Rajeswaran et al., 2016; Jiang et al., 2021), such as the mass or the friction coefficient, to simulate the uncertainty in state transition probabilities, and expect the agents to maximize the return over the perturbed environments. Despite its wide applicability, DR suffers from two practical limitations: (i) DR requires direct access to the underlying parameters of the simulator, which may be infeasible if only off-the-shelf simulation platforms are available. (ii) To enable sampling of environmental parameters, DR requires a prior distribution over the feasible environmental parameters. However, the design of such a prior typically relies on domain knowledge and can significantly affect the performance in real environments.
To enable robust RL via DR without the above limitations, we rethink DR from the perspective of adversarial state perturbation, without the need to re-configure the simulator or rely on prior knowledge about the environment. The idea is that perturbing the transition probabilities can be equivalently achieved by imposing perturbations on the states after nominal state transitions. To substantiate the idea of state perturbations, a simple and generic approach from the robust optimization literature (Ben-Tal & Nemirovski, 1998) is to take a worst-case viewpoint and perturb the states to nearby states that have the lowest long-term expected return under the current policy (Kuang et al., 2021). While natural, such a worst-case strategy can suffer from severe over-conservatism.
We identify that the over-conservative behavior results from the tight coupling between the need for temporal-difference (TD) learning in robust RL and the worst-case operation of state perturbation. Specifically: (1) In robust RL, the value functions are learned with the help of bootstrapping in TD methods, since finding nearby worst-case states via Monte-Carlo sampling is NP-hard (Ho et al., 2018; Chow et al., 2015; Behzadian et al., 2021). (2) Under the worst-case state perturbations, TD methods would update the value function based on the local minimum within a neighborhood of the nominal next state and are therefore completely unaware of the value of the nominal next state. As a result, the learner could fail to identify or explore those states with potentially high returns.

Figure 1: We illustrate the over-conservatism issue of the naive worst-case state-adversarial policy optimization using a 4 × 4 shortest-path grid world environment. The star, cross, and dot represent the goal, the trap, and the initial state, respectively. The terminal rewards of the trap and the goal are 10 and 0. We use an arrow to represent the action that has the highest value at each state; multiple arrows in a state indicate that the actions have equal Q-values. We also use color to indicate the value of the best action at each state. In (a), the agent trained by the naïve worst-case state-adversarial approach fails to learn how to reach the goal. What's worse, under TD updates, the worst-case state perturbation makes the trap state indistinguishable from other states. As a result, the agent ultimately learns to move toward the trap state after 12 training iterations. In (b), our relaxed state-adversarial approach avoids the over-conservatism issue by considering both the average-case and worst-case environments. We refer readers to Appendix A.1 for the step-by-step evolution of the value functions.
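The worst-case bootstrapping step discussed above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the function name, the dict-based value table, and the explicit neighbor list are our own simplifications.

```python
def worst_case_td_target(V, nominal_next, neighbors, reward, gamma=0.99):
    """TD target under a worst-case state adversary (illustrative only).

    V maps states to current value estimates; `neighbors` is the
    perturbation set around the nominal next state.
    """
    # The adversary replaces the nominal next state with the lowest-valued
    # state in its neighborhood, so bootstrapping never sees V[nominal_next]
    # unless that state is itself the minimum.
    worst = min([nominal_next] + list(neighbors), key=lambda s: V[s])
    return reward + gamma * V[worst]
```

Because the target always bootstraps from the neighborhood minimum, a high-valued nominal next state (e.g., the goal in Figure 1) never contributes to the update, which is exactly the propagation failure described above.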
To further illustrate this phenomenon, we consider a toy grid world example of finding the shortest path toward the goal, as shown in Figure 1(a). Although the goal state has a high value, the TD updates cannot propagate this value to other states, since all nominal state transitions toward the goal state are perturbed away under the worst-case state-adversarial method. Worse still, the agent ultimately learns to move toward the trap state due to the compounding effect of TD updates and worst-case state-adversarial perturbations. Notably, beyond the grid world environment, such terminal trap states commonly exist in various RL problems, such as the locomotion tasks in MuJoCo. As a result, one critical question remains unanswered in robust RL: how can we fully unleash the power of the state-adversarial model in robustifying RL algorithms without suffering from over-conservatism? To answer this question, we introduce relaxed state-adversarial perturbations. Specifically: (1) Instead of taking a pure worst-case perspective, we simultaneously consider both the average-case and worst-case scenarios during training. By incorporating the average-case scenarios, the TD updates can successfully propagate the values of potentially high-return states to other states and thereby prevent the over-conservative behavior (Figure 1(b)). (2) To substantiate this idea, we introduce a relaxed state-adversarial transition kernel, in which the average-case environment is represented as an interpolation of the nominal and worst-case environments. Under this new formulation of DR, each interpolation coefficient corresponds to a distribution of state adversaries. (3) Based on this formulation, we theoretically quantify the performance gap between the average-case and worst-case environments, and prove that maximizing the average-case performance also benefits the worst-case performance.
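In tabular form, the interpolation underlying the relaxed state-adversarial transition kernel can be sketched as a convex combination of the nominal and worst-case kernels. The function name and explicit matrix representation below are ours; a practical implementation would perturb sampled next states rather than form full kernels.

```python
import numpy as np

def relaxed_kernel(P_nominal, P_worst, alpha):
    """Convex interpolation of two transition kernels (tabular sketch).

    alpha = 0 recovers the nominal environment; alpha = 1 recovers the
    pure worst-case adversary. Intermediate alpha values correspond to
    average-case environments between the two extremes.
    """
    assert 0.0 <= alpha <= 1.0
    P_nominal = np.asarray(P_nominal, dtype=float)
    P_worst = np.asarray(P_worst, dtype=float)
    # A convex combination of row-stochastic matrices is row-stochastic,
    # so the result is itself a valid transition kernel.
    return (1.0 - alpha) * P_nominal + alpha * P_worst
```

One design consequence: since every interpolated kernel remains a valid MDP transition kernel, standard policy-optimization machinery can be applied to the relaxed environment without modification.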
(4) Accordingly, we present Relaxed State-Adversarial Policy Optimization, a bi-level framework that optimizes the rewards of the two cases alternately and iteratively. One level updates the policy to maximize the average-case performance; the other updates the interpolation coefficient of the relaxed state-adversarial transition kernel to raise the lower bound on the return in the worst-case environment.
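The alternation between the two levels can be illustrated with a deliberately tiny toy problem. The scalar "policy" parameter and the quadratic return surrogates below are entirely hypothetical stand-ins for the paper's objectives, intended only to show the structure of the bi-level loop.

```python
def avg_case_return(theta, alpha):
    # Surrogate average-case return: its maximizer drifts from 1.0
    # (nominal optimum) toward 0.2 (worst-case optimum) as alpha grows.
    target = (1.0 - alpha) * 1.0 + alpha * 0.2
    return -(theta - target) ** 2

def worst_case_return(theta):
    # Surrogate return under the pure worst-case adversary.
    return -(theta - 0.2) ** 2

theta, alpha, lr = 0.0, 0.0, 0.1
for _ in range(500):
    # Level 1: gradient-ascent step on the policy parameter to
    # maximize the average-case surrogate at the current alpha.
    target = (1.0 - alpha) * 1.0 + alpha * 0.2
    theta += lr * (-2.0 * (theta - target))
    # Level 2: move alpha toward the adversary when the worst-case
    # return lags the average-case return, keeping alpha in [0, 1].
    gap = avg_case_return(theta, alpha) - worst_case_return(theta)
    alpha = min(1.0, max(0.0, alpha + lr * gap))
```

In this toy run, theta first tracks the average-case optimum and alpha only ratchets toward the worst case once the average-case objective is nearly satisfied, mirroring the alternation described above.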

2. RELATED WORK

Robust Markov Decision Process (MDP) and Robust RL. A robust MDP aims to maximize rewards in the worst case when the testing environment deviates from the training environment (Nilim & El Ghaoui, 2005; Iyengar, 2005; Wiesemann et al., 2013). Due to the large search space, the complexity of robust MDPs grows rapidly as the dimensionality increases. Therefore, Tamar et al. (2014) developed an approximate dynamic programming method to scale up the robust MDP paradigm. Roy et al. (2017) extended the method to nonlinear estimation and guaranteed convergence to a regional minimum. Afterward, the works of (Wang & Zou, 2021; Badrinath & Kalathil, 2021)

