CCIL: CONTEXT-CONDITIONED IMITATION LEARNING FOR URBAN DRIVING

Abstract

Imitation learning is a promising solution to the challenging task of autonomous urban driving, since experienced human drivers can effortlessly handle highly complex driving scenarios. Behavior cloning is the most widely applied imitation learning approach in autonomous driving because it requires no potentially risky online interaction, but it suffers from the covariate shift problem. To mitigate this problem, we propose a context-conditioned imitation learning approach that learns a policy mapping the context state to the ego vehicle's future states, instead of the typical formulation that maps both ego and context states to the ego action. Moreover, to fully exploit the spatial and temporal relations in the context when inferring the ego vehicle's future states, we design a novel policy network based on the Transformer, whose attention mechanism has demonstrated excellent performance in capturing such relations. Finally, during evaluation, a linear quadratic regulator produces a smooth plan from the states predicted by the policy network. Experiments on the real-world, large-scale Lyft and nuPlan datasets demonstrate that our method significantly surpasses state-of-the-art methods.

1. INTRODUCTION

Planning a safe, comfortable, and efficient trajectory in a complex urban environment for a self-driving vehicle (SDV) is an important and challenging task in autonomous driving (Yurtsever et al., 2020). Unlike highway driving (Henaff et al., 2019), urban driving requires handling more varied road geometry, such as roundabouts and intersections, while interacting with traffic lights, pedestrians, and other vehicles. Classic manually designed rule-based approaches (Fan et al., 2018) have achieved some success in industry but demand tedious human engineering to cope with diverse real-world cases. Meanwhile, the rapid development of deep learning motivates researchers (Bojarski et al., 2016; Pan et al., 2020) to model the complicated driving policy with a deep neural network. To learn such a policy, imitation learning (IL) from human drivers' demonstrations is a promising solution, as experienced drivers can handle even extremely challenging situations and their driving data can be collected at scale. The simplest IL algorithm is behavior cloning (BC), which is widely applied in autonomous driving (Pomerleau, 1988; Bojarski et al., 2016; Codevilla et al., 2018b) because it requires no potentially dangerous online interaction. It learns a policy in a supervised fashion by minimizing the difference between the learner's action and the expert's action on the expert state distribution. However, BC suffers from the covariate shift issue (Ross et al., 2011), i.e., the state induced by the learner's policy cumulatively deviates from the expert's distribution. To overcome this obstacle, existing methods such as DAgger (Ross et al., 2011) and DART (Laskey et al., 2017) query supervisor corrections at the learner's states or at perturbed expert states.
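As a toy illustration of the BC objective described above (not the paper's model), one can fit a linear policy by least squares on expert state-action pairs; the function names and the linear policy class are purely illustrative:

```python
import numpy as np

def fit_bc_policy(states, expert_actions):
    """Behavior cloning with a toy linear policy a(s) = s @ W:
    least-squares fit minimizing ||S W - A||^2 over expert data.

    states:         (N, d_s) array of expert-visited states
    expert_actions: (N, d_a) array of the expert's actions
    Returns W with shape (d_s, d_a).
    """
    W, *_ = np.linalg.lstsq(states, expert_actions, rcond=None)
    return W

def bc_policy_act(W, state):
    """Query the cloned policy at a (possibly unseen) state."""
    return state @ W
```

Because the fit only supervises actions on expert-visited states, nothing constrains the policy once its own small errors drive it off that distribution, which is exactly the covariate shift issue discussed above.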
Since human supervision is hard to collect, recent works such as GAIL (Ho & Ermon, 2016) instead provide feedback from a neural-network-based discriminator to recover from out-of-distribution states generated by the learner's policy. However, these data augmentation methods require either expert supervision or rolling out the policy in the real world or a realistic simulator, both of which are impractical in autonomous driving. Instead, some researchers constrain the learned policy formulation to ensure robustness to policy errors by incorporating control-theoretic prior knowledge. For example, Palan et al. (2020) and Havens & Hu (2021) impose Kalman or linear matrix inequality constraints on a learned linear policy to guarantee its closed-loop stability in a linear time-invariant (LTI) system. Yin et al. (2021) relax the linear policy formulation to a simple feed-forward neural network, and East (2022) extends the method of Havens & Hu (2021) to polynomial policies and dynamical systems. However, the urban driving task is too complex to be handled by these simple policy formulations. To learn a stable and general urban driving policy by imitating only offline human demonstrations, we propose a context-conditioned imitation learning (CCIL) method, in which a policy network predicts the SDV's future states using only its observed context, unlike the classic policy that takes both the ego state and the context as input and outputs an action (Codevilla et al., 2019; Bansal et al., 2019). The motivation is that the ego state is easily corrupted by the policy's own errors, leading to catastrophic distribution shift. In contrast, static context elements such as lanes or crosswalks are not influenced by the SDV, and dynamic context elements such as human drivers will try to recover from its perturbations.
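The interface difference between the two formulations can be sketched as follows; both functions, their parameter names, and the bare linear maps are hypothetical, kept minimal only to contrast the inputs and outputs:

```python
import numpy as np

def classic_bc_policy(ego_state, context, params):
    """Classic BC formulation: (ego state, context) -> ego action.
    The learner's past errors feed back through ego_state, so they
    compound over a closed-loop rollout."""
    features = np.concatenate([ego_state, context])
    return params["W_action"] @ features  # e.g. (acceleration, steering)

def context_conditioned_policy(context_features, params):
    """Context-conditioned formulation: context only -> future ego states.
    The input is not corrupted by the learner's own past mistakes."""
    return params["W_state"] @ context_features  # predicted ego states
```

The point of the sketch is only the signatures: the first policy's input contains a quantity the policy itself perturbs, while the second policy's input does not.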
Thus, under a stability assumption on the traffic system, we can prove theoretically that our policy formulation achieves closed-loop stability, addressing the distribution shift issue. In practice, because it is challenging to accurately plan a trajectory for an SDV without its historical trajectory, we build our policy network on the Transformer (Vaswani et al., 2017) to exploit spatial and temporal relations in highly interactive and constrained urban driving scenarios. Furthermore, during evaluation, we employ a linear quadratic regulator (LQR) (Åström & Murray, 2021) to yield a smooth action based on the SDV's current and predicted states. The main contributions of this paper can be summarized as follows:

1. To address the covariate shift issue in offline imitation learning, we propose a novel context-conditioned imitation learning method, in which a policy is learned to output the ego state using only its context as input. A robustness guarantee is provided for our policy formulation under an assumption on the context's stability.

2. To apply our method to urban driving, we remove the explicit ego-state input and propose a new ego-perturbed goal-oriented coordinate system to reduce the implicit ego information in the coordinate system. In addition, we design a Transformer-based planning network to make full use of the spatial and temporal information in the context.

3. To verify the effectiveness of our approach, we benchmark on the real-world large-scale urban driving Lyft (Houston et al., 2020) and nuPlan (Caesar et al., 2021) datasets, achieving state-of-the-art performance. The video and code can be found at https://sites.google.com/view/contextconditionedil.
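The LQR step mentioned above can be sketched with a generic finite-horizon Riccati recursion that tracks the states predicted by a policy network; the linearized dynamics (A, B), the costs, and all function names here are standard control boilerplate chosen for illustration, not the paper's exact implementation:

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, horizon):
    """Backward Riccati recursion for the discrete-time system
    x_{t+1} = A x_t + B u_t with stage cost x'Qx + u'Ru.
    Returns time-ordered gains K_t for the control law u_t = -K_t x_t."""
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        # K = (R + B'PB)^{-1} B'PA
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # gains[0] applies at the first step

def track(x0, refs, A, B, gains):
    """Steer the ego state along predicted reference states:
    u_t = -K_t (x_t - ref_t), yielding a smooth closed-loop trajectory."""
    x, traj = x0, [x0]
    for K, ref in zip(gains, refs):
        u = -K @ (x - ref)
        x = A @ x + B @ u
        traj.append(x)
    return traj
```

For instance, with a discretized double-integrator model of the ego vehicle, the tracking loop drives the state toward the reference trajectory while the quadratic control penalty R keeps the resulting actions smooth.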

2. RELATED WORK
2.1. IMITATION LEARNING FOR AUTONOMOUS DRIVING

The objective of applying IL to autonomous driving is to learn driving behavior that mimics human drivers. The most straightforward solution is BC, which minimizes the difference between the learner's and the expert's actions on expert states without demanding extra manually labeled data or online interaction. Early BC applications in autonomous driving, such as ALVINN (Pomerleau, 1988) and PilotNet (Bojarski et al., 2016), learn an end-to-end policy that directly maps sensor inputs to vehicle control commands using a large amount of human driving experience. More recently, ChauffeurNet (Bansal et al., 2019) produces intermediate plans from perception results to improve generalization and transparency. However, BC typically leads to a covariate shift between the training distribution and the deployment distribution, as minor policy errors cause the vehicle to deviate from expert states, which in turn induces larger errors. IL methods that address the distribution shift challenge can be categorized into online methods and offline model-free and model-based methods. Online methods try to directly match the expert state-action distribution instead of matching only the state-conditioned action distribution as BC does. For instance, the method in Zhang & Cho (2017), based on DAgger (Ross et al., 2011), queries supervisor actions at the states the learner visits and adds the new data to the dataset, thus adjusting the expert state distribution to match the learner's state distribution. To remove the requirement of an interactive expert, several methods such as Wang et al. (2021), based on adversarial imitation learning (Ho & Ermon, 2016), utilize a discriminator to measure the difference between the learner's and the expert's state-

