CAUSALAGENTS: A ROBUSTNESS BENCHMARK FOR MOTION FORECASTING

Abstract

As machine learning models become increasingly prevalent in motion forecasting for autonomous vehicles (AVs), it is critical to ensure that model predictions are safe and reliable. However, exhaustively collecting and labeling the data necessary to fully test the long tail of rare and challenging scenarios is difficult and expensive. In this work, we construct a new benchmark for evaluating and improving model robustness by applying perturbations to existing data. Specifically, we conduct an extensive labeling effort to identify causal agents, or agents whose presence influences human drivers' behavior in any way, in the Waymo Open Motion Dataset (WOMD), and we use these labels to perturb the data by deleting non-causal agents from the scene. We evaluate a diverse set of state-of-the-art deep-learning model architectures on our proposed benchmark and find that all models exhibit large shifts under even non-causal perturbations: we observe a 25-38% relative change in minADE compared to the original. We also investigate techniques to improve model robustness, including increasing the training dataset size and using targeted data augmentations that randomly drop non-causal agents throughout training. Finally, we release the causal agent labels as an additional attribute to WOMD, along with the robustness benchmarks, to aid the community in building more reliable and safe deep-learning models for motion forecasting (see supplementary).

1. INTRODUCTION

Machine learning models are increasingly prevalent in trajectory prediction and motion planning tasks for autonomous vehicles (AVs) (Casas et al., 2020; Chai et al., 2019; Cui et al., 2019; Ettinger et al., 2021; Varadarajan et al., 2021; Rhinehart et al., 2019; Lee et al., 2017; Hong et al., 2019; Salzmann et al., 2020; Zhao et al., 2019; Mercat et al., 2020; Khandelwal et al., 2020; Liang et al., 2020). To be deployed safely, such models must make reliable, robust predictions across a diverse range of scenarios, and they must be insensitive to spurious features, i.e., patterns in the data that fail to generalize to new environments. However, collecting and labeling the data required to both evaluate and improve model robustness is often expensive and difficult, in part due to the long tail of rare and challenging scenarios (Makansi et al., 2021). In this work, we propose perturbing existing data via agent deletions to evaluate and improve model robustness to spurious features. To be useful in our setting, the perturbations must preserve the correct labels and must not change the ground truth trajectory of the AV. Since generating such perturbations requires high-level scene understanding as well as causal reasoning, we propose using human labelers to identify irrelevant agents. Specifically, we define a non-causal agent as an agent whose deletion does not cause the ground truth trajectory of a given target agent to change. We then construct a robustness evaluation dataset that consists of perturbed examples in which we remove all non-causal agents from each scene, and we study model behavior under alternate perturbations, such as removing causal agents, removing a subset of non-causal agents, or removing stationary agents. Using our perturbed datasets, we then conduct an extensive experimental study exploring how factors such as model architecture, dataset size, and data augmentation affect model sensitivity.
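The deletion perturbations described above can be sketched as simple filters over the agents in a scene. This is an illustrative sketch, not the WOMD schema: the `Agent` class and its `is_causal` and `is_stationary` fields are hypothetical stand-ins for the human-provided causal labels and a speed-derived stationarity flag.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Agent:
    agent_id: int
    is_causal: bool      # hypothetical human-provided causal label
    is_stationary: bool  # hypothetical flag, e.g. from speed over history

def remove_noncausal(scene: List[Agent]) -> List[Agent]:
    """Main perturbation: delete all non-causal agents from the scene."""
    return [a for a in scene if a.is_causal]

def remove_causal(scene: List[Agent]) -> List[Agent]:
    """Alternate perturbation: delete the causal agents instead."""
    return [a for a in scene if not a.is_causal]

def remove_stationary(scene: List[Agent]) -> List[Agent]:
    """Alternate perturbation: delete stationary agents."""
    return [a for a in scene if not a.is_stationary]
```

Because non-causal agents by definition do not affect the target agent's ground truth trajectory, the original future trajectory remains a valid label for the perturbed scene.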
We also propose novel metrics to quantify model sensitivity, including one that captures the per-example absolute change in error between predicted and ground truth trajectories, and another that directly measures how the model outputs change under perturbation via IoU (intersection-over-union), without reference to the ground truth trajectory. The second metric helps to address the issue that the ground truth trajectory is only one sample from a distribution of many possibly correct trajectories. Additionally, we visualize scenes with large model sensitivity to understand why performance degrades under perturbation.
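The two kinds of sensitivity measures can be illustrated with a short sketch. This is a simplified illustration under stated assumptions, not the paper's exact metric definitions: `min_ade` is the standard minimum average displacement error over K predicted trajectories, the relative-change helper mirrors the per-example comparison of original vs. perturbed error, and `trajectory_iou` is a crude grid-cell IoU proxy for comparing two prediction sets without any ground truth.

```python
import numpy as np

def min_ade(preds: np.ndarray, gt: np.ndarray) -> float:
    """minADE: minimum over K predictions of the mean pointwise L2 error.
    preds: (K, T, 2) predicted trajectories; gt: (T, 2) ground truth."""
    ade = np.linalg.norm(preds - gt[None], axis=-1).mean(axis=-1)  # (K,)
    return float(ade.min())

def relative_minade_change(preds_orig: np.ndarray,
                           preds_pert: np.ndarray,
                           gt: np.ndarray) -> float:
    """Per-example relative change in minADE under perturbation."""
    a = min_ade(preds_orig, gt)
    b = min_ade(preds_pert, gt)
    return abs(b - a) / max(a, 1e-9)

def trajectory_iou(preds_a: np.ndarray, preds_b: np.ndarray,
                   cell: float = 1.0) -> float:
    """Ground-truth-free comparison: rasterize all predicted waypoints onto
    a grid and compute IoU of the occupied cells (an assumed, simplified
    proxy for comparing output distributions before/after perturbation)."""
    def cells(preds: np.ndarray) -> set:
        return {tuple(c) for c in np.floor(preds.reshape(-1, 2) / cell).astype(int)}
    a, b = cells(preds_a), cells(preds_b)
    return len(a & b) / max(len(a | b), 1)
```

A model that is robust to non-causal deletions would show a relative minADE change near 0 and a trajectory IoU near 1 on the perturbed scenes.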

