SAFE REINFORCEMENT LEARNING WITH NATURAL LANGUAGE CONSTRAINTS

Abstract

In this paper, we tackle the problem of learning control policies for tasks when provided with constraints in natural language. In contrast to instruction following, language here is used not to specify goals, but rather to describe situations that an agent must avoid during its exploration of the environment. Specifying constraints in natural language also differs from the predominant paradigm in safe reinforcement learning, where safety criteria are enforced by hand-defined cost functions. While natural language allows for easy and flexible specification of safety constraints and budget limitations, its ambiguous nature presents a challenge when mapping these specifications into representations that can be used by techniques for safe reinforcement learning. To address this, we develop a model that contains two components: (1) a constraint interpreter to encode natural language constraints into vector representations capturing spatial and temporal information on forbidden states, and (2) a policy network that uses these representations to output a policy with minimal constraint violations. Our model is end-to-end differentiable and we train it using a recently proposed algorithm for constrained policy optimization. To empirically demonstrate the effectiveness of our approach, we create a new benchmark task for autonomous navigation with crowd-sourced free-form text specifying three different types of constraints. Our method outperforms several baselines by achieving 6-7 times higher returns and 76% fewer constraint violations on average. Dataset and code to reproduce our experiments are available at https://sites.google.com/view/polco-hazard-world/.

1. INTRODUCTION

Reinforcement learning (RL) has shown great promise in a variety of control problems, including robot navigation (Anderson et al., 2018; Misra et al., 2018) and robotic control (Levine et al., 2016; Rajeswaran et al., 2017), where the main goal is to optimize for scalar returns. However, as RL is increasingly deployed in real-world problems, it is imperative to ensure the safety of both agents and their surroundings, which requires accounting for constraints that may be orthogonal to maximizing returns. While several safe RL algorithms exist in the literature (Achiam et al., 2017; Chow et al., 2019; Yang et al., 2020b), a major limitation they share is the need to manually specify constraint costs and budget limitations. In many real-world problems, safety criteria tend to be abstract and quite challenging to define, making their specification (e.g., as logical rules or mathematical constraints) an expensive task requiring domain expertise. Natural language, on the other hand, provides an intuitive and easily accessible medium for specifying constraints, not just for experts or system developers, but also for potential end users of the RL agent. For example, instead of specifying a safety constraint in the form of "if water not in previously visited states then do not visit lava", one can simply say "Do not visit the lava before visiting the water." The key challenge lies in training the RL agent to interpret natural language and accurately adhere to the constraints during exploration and execution.

In this paper, we develop a novel framework for safe reinforcement learning that can handle natural language constraints. This setting differs from traditional instruction following, in which text instructions are used to specify goals for the agent (e.g., "reach the key" or "go forward two steps"). To effectively learn a safe policy that obeys text constraints, we propose a model consisting of two key modules.
First, we use a constraint interpreter to encode language constraints into intermediate vector and matrix representations, capturing both spatial information about forbidden states and long-term dependencies on past states. Second, we design a policy network that operates on a combination of these intermediate representations and state observations and is trained using a constrained policy optimization algorithm (e.g., PCPO (Yang et al., 2020b)). This allows our agent to map abstract safety criteria (in language) into cost representations that are amenable to safe RL. We call our approach Policy Optimization with Language COnstraints (POLCO).

Since no standard benchmarks exist for safe RL with language constraints, we construct a new navigation task called Hazard World. Hazard World is a 2D grid world environment with diverse, free-form text representing three types of constraints: (1) budgetary constraints, which limit resource usage or the frequency of entering undesirable states (e.g., "The lava is really hot, so it will hurt you a lot. Please only walk on it 3 times"); (2) relational constraints, which specify forbidden states in relation to surrounding entities in the environment (e.g., "There should always be at least 3 squares between you and water"); and (3) sequential constraints, which depend on past events (e.g., "Grass will surround your boots and protect you from dangerous lava."). Fig. 1 provides a sample situation from the task.

In summary, we make the following key contributions. First, we formulate the problem of safe RL with safety criteria specified in natural language. Second, we propose POLCO, a new policy architecture and two-stage safe RL algorithm that first encodes natural language constraints into quantitative representations and then uses these representations to learn a constraint-satisfying policy.
Third, we introduce a new safe RL dataset (Hazard World) containing three broad classes of abstract safety criteria, all described in diverse free-form text. Finally, we empirically compare POLCO against several baselines in Hazard World, showing that it achieves 6-7 times higher returns and 76% fewer constraint violations on average across the three constraint types. We also perform extensive evaluations and analyses of our model and provide insights for future improvements.
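To make the three constraint classes concrete, the sketch below shows one plausible way each could be turned into the scalar cost signal that safe RL algorithms consume. The class, its method names, and the thresholds are hypothetical illustrations mirroring the example sentences above, not the paper's actual cost functions.

```python
class ConstraintChecker:
    """Illustrative cost functions for the three Hazard World constraint
    classes (budgetary, relational, sequential). All names and defaults
    are hypothetical, chosen to match the example constraints in the text."""

    def budgetary_cost(self, visited, entity="lava", budget=3):
        # "Please only walk on it 3 times": cost is the number of
        # visits to the entity beyond the stated budget.
        return max(0, visited.count(entity) - budget)

    def relational_cost(self, agent_pos, entity_pos, min_dist=3):
        # "At least 3 squares between you and water": unit cost whenever
        # the Manhattan distance to the entity falls below the threshold.
        d = abs(agent_pos[0] - entity_pos[0]) + abs(agent_pos[1] - entity_pos[1])
        return 1 if d < min_dist else 0

    def sequential_cost(self, visited, protector="grass", hazard="lava"):
        # "Grass will protect you from lava": stepping on the hazard only
        # becomes safe after the protecting entity has been visited.
        cost, protected = 0, False
        for cell in visited:
            if cell == protector:
                protected = True
            elif cell == hazard and not protected:
                cost += 1
        return cost

checker = ConstraintChecker()
# Four lava visits against a budget of three: one visit over budget.
over_budget = checker.budgetary_cost(["lava", "lava", "lava", "lava"])
```

In this framing, POLCO's constraint interpreter must infer which cost function (and which entities and thresholds) a free-form sentence implies, rather than being handed the function directly.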

2. RELATED WORK

Policy optimization with constraints. Learning constraint-satisfying policies has been explored in prior work on safe RL (see Garcia & Fernandez (2015) for a survey). Typically, the agent learns policies either by (1) exploring the environment to identify forbidden behaviors (Achiam et al., 2017; Tessler et al., 2018; Chow et al., 2019; Yang et al., 2020b; Stooke et al., 2020), or (2) using expert demonstration data to recognize safe trajectories (Ross et al., 2011; Rajeswaran et al., 2017; Gao et al., 2018; Yang et al., 2020a). Critically, these works all require a human to specify the cost constraints manually. In contrast, we use natural language to describe the cost constraints, which allows for easier and more flexible specification of safety constraints.

Instruction following without constraints. Instruction following for 2D and 3D navigation has been explored in the context of deep RL (MacMahon et al., 2006; Vogel & Jurafsky, 2010; Chen & Mooney, 2011; Artzi & Zettlemoyer, 2013; Kim & Mooney, 2013; Andreas & Klein, 2015; Thomason et al., 2020; Luketina et al., 2019; Tellex et al., 2020). Prior work either provides datasets with real-life visual urban or household environments (e.g., Google Street View) (Bisk et al., 2018; Chen et al., 2019; Anderson et al., 2018; de Vries et al., 2018), or proposes computational models that learn multi-modal representations fusing 2D or 3D images with goal instructions (Janner et al., 2018; Blukis et al., 2018; Fried et al., 2018; Liu et al., 2019; Jain et al., 2019; Gaddy & Klein, 2019; Hristov et al., 2019; Fu et al., 2019). These works use text to specify goals, not environmental hazards. In contrast, we use language to describe the constraints that the agent must obey.

Constraints in natural language. Misra et al. (2018) propose two datasets called LANI and CHAI to study spatial and temporal reasoning, as well as perception and planning.
Their dataset contains a few trajectory constraints, which specify goal locations (e.g., "go past the house by the right
Figure 1: Learning to navigate with language constraints. The figure shows (1) three types of language constraints, (2) items which provide rewards when collected, and (3) a third-person view of the environment. The objective is to maximize total reward without violating text constraints.
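To illustrate the two-component design described in the introduction, the following is a minimal, untrained sketch: a stand-in constraint interpreter that pools hashed token vectors into a single constraint embedding, and a linear-softmax policy head over the concatenation of state features and that embedding. Every name and dimension here is hypothetical, and the hash-based encoder replaces the paper's learned text encoder purely to keep the sketch self-contained; in POLCO both components are trained end-to-end with a constrained policy optimization algorithm such as PCPO.

```python
import hashlib
import math
import random

def interpret_constraint(text, dim=8):
    """Constraint interpreter (stand-in): hash each token to a fixed
    vector and mean-pool into one constraint embedding. The real
    interpreter is a learned encoder; this only fixes the interface."""
    vecs = []
    for tok in text.lower().split():
        h = hashlib.md5(tok.encode()).digest()
        vecs.append([(b / 255.0) - 0.5 for b in h[:dim]])
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

class PolicyNetwork:
    """Policy head (stand-in): one random linear layer over the
    concatenated state features and constraint embedding, then softmax
    over actions. Training these weights is what POLCO does."""
    def __init__(self, state_dim, embed_dim, n_actions, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.gauss(0.0, 0.1) for _ in range(state_dim + embed_dim)]
                  for _ in range(n_actions)]

    def action_probs(self, state, constraint_vec):
        x = list(state) + list(constraint_vec)
        logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in self.w]
        return softmax(logits)

c_vec = interpret_constraint("Please only walk on the lava 3 times")
policy = PolicyNetwork(state_dim=4, embed_dim=8, n_actions=4)
probs = policy.action_probs([0.1, 0.2, 0.3, 0.4], c_vec)
```

The design point the sketch preserves is that the policy conditions on the constraint representation at every step, so the same network can behave differently under different text constraints without retraining.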

