CONBAT: CONTROL BARRIER TRANSFORMER FOR SAFETY-CRITICAL POLICY LEARNING

Abstract

Large-scale self-supervised models have recently revolutionized our ability to perform a variety of tasks within the vision and language domains. However, using such models for autonomous systems is challenging because of safety requirements: besides executing correct actions, an autonomous agent must also avoid high-cost and potentially fatal mistakes. Traditionally, self-supervised training mostly focuses on imitating previously observed behaviors, and the training demonstrations carry no notion of which behaviors should be explicitly avoided. In this work, we propose Control Barrier Transformer (ConBaT), an approach that learns safe behaviors from demonstrations in a self-supervised fashion. ConBaT is inspired by the concept of control barrier functions in control theory and uses a causal transformer that learns to predict safe robot actions autoregressively, aided by a critic that requires minimal safety data labeling. During deployment, we employ a lightweight online optimization to find actions that ensure future states lie within the safe set. We apply our approach to different simulated control tasks and show that our method results in safer control policies compared to other classical and learning-based methods.

1. INTRODUCTION

Mobile robots are finding increasing use in complex environments for tasks such as autonomous navigation, delivery, and inspection (Ning et al., 2021; Gillula et al., 2011). Any unsafe behavior in the real world, such as a collision, carries great risk and can result in catastrophic outcomes. Hence, robots are expected to execute their actions in a safe, reliable manner while achieving the desired tasks. Yet, learning safe behaviors such as navigation comes with several challenges. Primarily, notions of safety are often indirect and only implicitly found in datasets, as it is customary to show examples of optimal actions (what the robot should do) as opposed to giving examples of failures (what to avoid). In fact, defining explicit safety criteria in most real-world scenarios is a complex task and requires deep domain knowledge (Gressenbuch & Althoff, 2021; Braga et al., 2021; Kreutzmann et al., 2013). In addition, learning algorithms can struggle to directly infer safety constructs from high-dimensional observations, as most robots do not operate with global ground-truth state information. We find examples of safe navigation using both classical and learning-based methods. Classical methods often rely on carefully crafted models and safety constraints expressed as optimization problems, and require expensive tuning of parameters for each scenario (Zhou et al., 2017; Van den Berg et al., 2008; Trautman & Krause, 2010). The difficulty of translating safety definitions into explicit rules makes it hard to deploy classical methods in complex settings. The mathematical structure of such planners can also make them prone to adversarial attacks (Vemprala & Kapoor, 2021).
Within the domain of safe learning-based approaches we see instances of reinforcement and imitation learning leveraging safety methods (Brunke et al., 2022; Turchetta et al., 2020), as well as learning applied to reachability analysis and control barrier functions (Herbert et al., 2021; Luo et al., 2022). The application of learning-based approaches to safe navigation is significantly hindered by the fact that while expert demonstrations may reveal one way to solve a certain task, they rarely reflect which types of unsafe behaviors should be avoided by the agent. We can draw similarities and differences with other domains: natural language (NL) and vision models can learn how to generate grammatically correct text or temporally consistent future image frames by following patterns consistent with the training corpus. However, for control tasks, notions of safety are less evident from demonstrations. While the cost of a mistake is not fatal in NL and vision, in autonomous navigation states that disobey the safety rules can have significant negative consequences for a physical system. Within this context, our paper aims to take a step toward answering a fundamental question: how can we use agent demonstrations to learn a policy that is both effective for the desired task and also respects safety-critical constraints? Recently, the success of large language models (Vaswani et al., 2017; Brown et al., 2020) has inspired the development of a class of Transformer-based models for decision-making that use auto-regressive losses over sequences of demonstrated state and action pairs (Reed et al., 2022; Bonatti et al., 2022). While such models are able to learn task-specific policies from expert data, they lack a clear notion of safety and are unable to explicitly avoid unsafe actions.
Our work builds upon this paradigm of large autoregressive Transformers applied to perception-action sequences, and introduces methods to learn policies in a safety-critical fashion. Our method, named Control Barrier Transformer (ConBaT), takes inspiration from barrier functions in control theory (Ames et al., 2019). Our architecture consists of a causal Transformer augmented with a safety critic that implicitly models a control barrier function to evaluate states for safety. Instead of relying on a complex set of hand-defined safety rules, our proposed critic only requires a binary label of whether a certain demonstration is deemed safe or unsafe. This control barrier critic then learns to map observations to a continuous safety score, inferring safety constraints in a self-supervised way. If a proposed action from the policy is deemed unsafe by the critic, ConBaT attempts to compute a better action that ensures the safety of reachable states. A lightweight optimization scheme operates on the critic values to minimally modify the proposed action into a safer alternative, a process inspired by optimal control methods and enabled by the fully differentiable structure of the model. Unlike conventional formulations of the control barrier function, our critic operates in the embedding space of the transformer as opposed to ground-truth states, making it applicable to a wide variety of systems. We list our contributions below:

• We propose the Control Barrier Transformer (ConBaT) architecture, built upon causal Transformers with the addition of a differentiable safety critic inspired by control barrier functions. Our model can be trained auto-regressively with state-action pairs and can be applied to safety-critical applications such as safe navigation.
• We present a loss formulation that enables the critic to map latent embeddings to a continuous safety value, and during deployment we couple the critic with a lightweight optimization scheme over possible actions to keep future states within the safe set.

• We apply our method to two simulated environments: a simplified F1 car simulator upon which we perform several analyses, and a simulated LiDAR-based vehicle navigation scenario. We compare our method with imitation learning and reinforcement learning baselines, as well as a classical model-predictive control method. We show that ConBaT achieves lower collision rates and longer safe trajectory lengths.

• We show that novel safety definitions (beyond collision avoidance) can be quickly learned by ConBaT with minimal data labeling. Using new demonstrations we can finetune the safety critic towards new constraints such as 'avoid driving in straight lines', and the adapted policy shifts towards the desired distribution.



Figure 1: (Left) An agent trained to imitate expert demonstrations may just focus on the end result of the task without explicit notions of safety. (Right) Our proposed method ConBaT learns a safety critic on top of the control policy and uses the critic's control barrier to actively optimize the policy for safe actions.

