FEASIBLE ADVERSARIAL ROBUST REINFORCEMENT LEARNING FOR UNDERSPECIFIED ENVIRONMENTS

Abstract

Robust reinforcement learning (RL) considers the problem of learning policies that perform well in the worst case among a set of possible environment parameter settings. In real-world environments, choosing the set of allowed parameter settings for robust RL can be a difficult task. When that set is specified too narrowly, the agent will be left vulnerable to reasonable parameter values unaccounted for. When specified too broadly, the agent will be too cautious. In this paper, we propose Feasible Adversarial Robust RL (FARR), a novel problem formulation and objective for automatically determining the set of environment parameter values over which to be robust. FARR implicitly defines the set of feasible parameter values as those on which an agent could achieve a benchmark reward given enough training resources. By formulating this problem as a two-player zero-sum game, optimizing the FARR objective jointly produces an adversarial distribution over parameter values with feasible support and a policy robust over this feasible parameter set. We demonstrate that approximate Nash equilibria for this objective can be found using a variation of the Policy-Space Response Oracles (PSRO) algorithm. Furthermore, we show that an optimal agent trained with the FARR objective is more robust to feasible adversarial parameter selection than agents trained with existing minimax, domain-randomization, and regret objectives in a parameterized gridworld and three MuJoCo control environments.
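The feasibility construction described above can be read as a rough formalization (notation ours, paraphrasing the abstract rather than quoting the paper's definitions): a parameter setting is feasible if some policy could attain the benchmark reward on it, and the protagonist is evaluated in the worst case only over feasible settings.

```latex
% Illustrative sketch of the FARR objective as described in the abstract
% (notation ours): R(\pi, \theta) denotes the expected return of policy \pi
% under environment parameters \theta, and \bar{R} the benchmark reward level.
\Theta_{\mathrm{feasible}} = \bigl\{ \theta \in \Theta \;:\; \max_{\pi'} R(\pi', \theta) \ge \bar{R} \bigr\},
\qquad
\max_{\pi} \; \min_{\theta \in \Theta_{\mathrm{feasible}}} R(\pi, \theta).
```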

1. INTRODUCTION

Recent advances in deep reinforcement learning (RL) show promise for applying the field to control in real-world environments by training in simulation (OpenAI et al., 2018; Hu et al., 2021; Li et al., 2021). In such deployment scenarios, details of the test-time environment layout and dynamics can differ from those experienced at training time, and it is important to account for these potential variations to achieve sufficient generalization and test-time performance.

Robust RL methods train on an adversarial distribution of difficult environment variations in an attempt to maximize worst-case performance at test time. This process can be formulated as a two-player zero-sum game between the primary learning agent, the protagonist, which maximizes its environment reward, and a second agent, the adversary, which alters the environment to minimize the protagonist's reward (Pinto et al., 2017); a toy illustration of this interaction is sketched below. By finding a Nash equilibrium of this game, the protagonist maximizes its worst-case performance over the set of environment variations the adversary can realize.

This work aims to address the growing challenge of specifying the robust adversarial RL formulation in complex domains. On the one hand, the protagonist's worst-case performance guarantee applies only to the set of environment variations realizable by the adversary, so it is desirable to let the adversary represent a large uncertainty set of environment variations. On the other hand, care must be taken to prevent the adversary from imposing unrealistically difficult conditions. If the adversary can pose an insurmountable problem in the protagonist's training curriculum, then under a standard robust RL objective it will learn to do so, and the protagonist will exhibit overly cautious behavior at test time (Ma et al., 2018). As the complexity of environment variations representable in simulation increases, the difficulty of correctly specifying the limits of the adversary's abilities is exacerbated. For example, consider an RL agent for a warehouse robot trained in simulation to accomplish object manipulation and retrieval tasks. It is conceivable that the developer cannot accurately specify the full range of conditions the robot will actually face at deployment.
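To make the two-player formulation above concrete, the following is a minimal, self-contained sketch (the toy one-parameter "environment", the function names, and the alternating best-response scheme are ours for illustration, not the paper's algorithm): the adversary repeatedly adds the environment parameter that most hurts the current protagonist, and the protagonist is retrained against the worst case over the adversary's accumulated choices, loosely in the spirit of the best-response dynamics used by PSRO-style solvers.

```python
# Illustrative sketch only (hypothetical toy setup, not the paper's method):
# an adversary picks environment parameters to minimize the protagonist's
# return, and the protagonist maximizes its worst-case return against them.
import numpy as np

rng = np.random.default_rng(0)

def episode_return(policy_param: float, env_param: float) -> float:
    """Toy 'environment': return is highest when the policy parameter matches
    the environment parameter, so each env_param favors a different policy."""
    return float(-(policy_param - env_param) ** 2)

def protagonist_best_response(env_params: np.ndarray, iters: int = 200) -> float:
    """Hill-climb a scalar policy parameter to maximize the worst-case return
    over the adversary's current finite set of environment parameters."""
    theta, step = 0.0, 0.5
    for _ in range(iters):
        candidate = theta + step * rng.standard_normal()
        if min(episode_return(candidate, p) for p in env_params) > \
           min(episode_return(theta, p) for p in env_params):
            theta = candidate
    return float(theta)

def adversary_best_response(policy_param: float, allowed_params: np.ndarray) -> float:
    """Adversary selects the environment parameter that minimizes the
    protagonist's return (the minimizing side of the zero-sum game)."""
    returns = [episode_return(policy_param, p) for p in allowed_params]
    return float(allowed_params[int(np.argmin(returns))])

# Alternating best responses: a crude stand-in for solving the zero-sum game.
allowed_env_params = np.linspace(-2.0, 2.0, 41)   # the adversary's allowed set
adversary_choices = np.array([0.0])
for _ in range(10):
    policy = protagonist_best_response(adversary_choices)
    worst = adversary_best_response(policy, allowed_env_params)
    adversary_choices = np.unique(np.append(adversary_choices, worst))

print(f"final policy {policy:.2f}, adversary's chosen parameters {adversary_choices}")
```

Note that the width of `allowed_env_params` plays exactly the role discussed above: too narrow a set leaves the protagonist unprepared for reasonable parameters outside it, while too wide a set lets the adversary impose conditions no policy could handle.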

