CAUSALAGENTS: A ROBUSTNESS BENCHMARK FOR MOTION FORECASTING

Abstract

As machine learning models become increasingly prevalent in motion forecasting for autonomous vehicles (AVs), it is critical to ensure that model predictions are safe and reliable. However, exhaustively collecting and labeling the data necessary to fully test the long tail of rare and challenging scenarios is difficult and expensive. In this work, we construct a new benchmark for evaluating and improving model robustness by applying perturbations to existing data. Specifically, we conduct an extensive labeling effort to identify causal agents, i.e., agents whose presence influences human drivers' behavior in any way, in the Waymo Open Motion Dataset (WOMD), and we use these labels to perturb the data by deleting non-causal agents from the scene. We evaluate a diverse set of state-of-the-art deep-learning model architectures on our proposed benchmark and find that all models exhibit large shifts even under non-causal perturbations: we observe a 25-38% relative change in minADE compared to the original scenes. We also investigate techniques to improve model robustness, including increasing the training dataset size and using targeted data augmentations that randomly drop non-causal agents throughout training. Finally, we release the causal agent labels as an additional attribute of WOMD, together with the robustness benchmark, to aid the community in building more reliable and safe deep-learning models for motion forecasting (see supplementary).

1. INTRODUCTION

Machine learning models are increasingly prevalent in trajectory prediction and motion planning tasks for autonomous vehicles (AVs) (Casas et al., 2020; Chai et al., 2019; Cui et al., 2019; Ettinger et al., 2021; Varadarajan et al., 2021; Rhinehart et al., 2019; Lee et al., 2017; Hong et al., 2019; Salzmann et al., 2020; Zhao et al., 2019; Mercat et al., 2020; Khandelwal et al., 2020; Liang et al., 2020). To safely deploy such models, they must produce reliable, robust predictions across a diverse range of scenarios, and they must be insensitive to spurious features, i.e., patterns in the data that fail to generalize to new environments. However, collecting and labeling the data required to evaluate and improve model robustness is often expensive and difficult, in part due to the long tail of rare and challenging scenarios (Makansi et al., 2021). In this work, we propose perturbing existing data via agent deletions to evaluate and improve model robustness to spurious features. To be useful in our setting, the perturbations must preserve the correct labels and must not change the ground truth trajectory of the AV. Since generating such perturbations requires high-level scene understanding as well as causal reasoning, we rely on human labelers to identify irrelevant agents. Specifically, we define a non-causal agent as an agent whose deletion does not cause the ground truth trajectory of a given target agent to change. We then construct a robustness evaluation dataset that consists of perturbed examples in which we remove all non-causal agents from each scene, and we study model behavior under alternative perturbations, such as removing causal agents, removing a subset of non-causal agents, or removing stationary agents. Using our perturbed datasets, we conduct an extensive experimental study exploring how factors such as model architecture, dataset size, and data augmentation affect model sensitivity.
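The deletion-based perturbations described above can be sketched as follows. The `Scene` and `Agent` containers, their field names, and the `is_causal` flag are illustrative stand-ins for the WOMD scenario format and the released causal-agent labels, not the actual schema; the target agent is always retained, so its ground truth trajectory is unchanged by construction.

```python
import random
from dataclasses import dataclass, replace
from typing import List, Tuple

@dataclass(frozen=True)
class Agent:
    agent_id: int
    is_causal: bool        # hypothetical flag derived from the human causal-agent labels
    is_stationary: bool
    trajectory: Tuple      # (x, y) waypoints; unused here, kept for completeness

@dataclass(frozen=True)
class Scene:
    target_id: int         # agent whose future is predicted (e.g., the AV)
    agents: List[Agent]

def perturb(scene: Scene, mode: str = "remove_noncausal") -> Scene:
    """Return a perturbed copy of the scene with a subset of agents deleted.

    The target agent is always kept, so the prediction label (its ground
    truth trajectory) is preserved by construction.
    """
    def keep(a: Agent) -> bool:
        if a.agent_id == scene.target_id:
            return True
        if mode == "remove_noncausal":
            return a.is_causal
        if mode == "remove_causal":
            return not a.is_causal
        if mode == "remove_stationary":
            return not a.is_stationary
        raise ValueError(f"unknown perturbation mode: {mode}")
    return replace(scene, agents=[a for a in scene.agents if keep(a)])

def drop_noncausal_augmentation(scene: Scene, drop_prob: float,
                                rng: random.Random) -> Scene:
    """Training-time augmentation: independently drop each non-causal agent."""
    kept = [a for a in scene.agents
            if a.agent_id == scene.target_id or a.is_causal
            or rng.random() >= drop_prob]
    return replace(scene, agents=kept)
```

With `drop_prob` between 0 and 1, the same filter doubles as the targeted training augmentation studied later; `mode="remove_noncausal"` reproduces the main benchmark perturbation.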
We also propose novel metrics to quantify model sensitivity: one captures the per-example absolute change in error between predicted and ground truth trajectories, and another directly measures how the model outputs change under perturbation via IoU (intersection-over-union), without referring to the ground truth trajectory. The second metric addresses the issue that the ground truth trajectory is only one sample from a distribution of many possibly correct trajectories. Additionally, we visualize scenes with large model sensitivity to understand why performance degrades under perturbation. Our results show that existing motion forecasting models are sensitive to deleting non-causal agents and can exhibit pathological dependencies on faraway agents. For example, Figure 1 shows an original scene (left) and a perturbed scene with non-causal agents removed (right). In the perturbed example, the model's prediction misses the right-turn mode, which corresponds to the ground truth trajectory. Such brittleness could lead to serious consequences in autonomous driving systems if we rely on deep-learning models without further safety assurance from complementary techniques such as optimization and classical robotics algorithms.

The main contributions of our work are as follows:
1. We contribute a new robustness benchmark for WOMD for evaluating trajectory prediction models' sensitivity to spurious correlations. We release the human-annotated causal agent labels as additional attributes of WOMD so that researchers can use the causal relationships between agents for robustness evaluation and for other tasks such as agent relevance or ranking (Refaat et al., 2019; Tolstaya et al., 2021).
2. We introduce two metrics to quantify the robustness of motion forecasting models to perturbations: the absolute per-example change in minADE and a trajectory-set metric that measures sensitivity without using the ground truth as a reference.
3. We evaluate the robustness of several state-of-the-art motion forecasting models, including MultiPath++ (Varadarajan et al., 2021), Wayformer (Nayakanti et al., 2022), and SceneTransformer (Ngiam et al., 2021). The absolute per-example change in minADE ranges from 0.07-0.23 m (a significant 25-38% change relative to the original minADE). We find that all models are sensitive to deleting non-causal agents, and that the model with the best overall performance on standard trajectory prediction metrics such as minADE is not necessarily the most robust.
4. We show that increasing the training dataset size and applying targeted data augmentations that remove non-causal agents both help improve model robustness.

Overall, this is the first work to focus on the robustness of trajectory prediction models to perturbations based on human labels. Such robustness is critical for models deployed in a self-driving car, where reliability and safety requirements are of utmost importance. Ultimately, our goal is to provide a robustness benchmark that helps the community better evaluate model reliability, detect possible spurious correlations in deep-learning-based trajectory prediction models, and facilitate the development of more robust models, as well as mitigation techniques such as optimization and traditional robotics algorithms as complementary solutions to minimize safety risks.
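The two sensitivity metrics can be sketched as follows, assuming each model emits K candidate trajectories per example as a (K, T, 2) array of (x, y) waypoints. The occupancy-grid rasterization used for the trajectory-set IoU, and its 0.5 m cell size, are illustrative choices, not necessarily the paper's exact formulation.

```python
import numpy as np

def min_ade(preds: np.ndarray, gt: np.ndarray) -> float:
    """minADE: smallest average displacement error over K predicted trajectories.

    preds: (K, T, 2) predicted trajectories; gt: (T, 2) ground truth.
    """
    ade = np.linalg.norm(preds - gt[None], axis=-1).mean(axis=-1)  # (K,)
    return float(ade.min())

def min_ade_sensitivity(preds_orig: np.ndarray, preds_pert: np.ndarray,
                        gt: np.ndarray) -> float:
    """Per-example absolute change in minADE under perturbation."""
    return abs(min_ade(preds_orig, gt) - min_ade(preds_pert, gt))

def occupied_cells(preds: np.ndarray, cell: float = 0.5) -> set:
    """Grid cells touched by any predicted waypoint: a coarse footprint of the
    whole trajectory set."""
    return set(map(tuple, np.floor(preds.reshape(-1, 2) / cell).astype(int)))

def trajectory_set_iou(preds_orig: np.ndarray, preds_pert: np.ndarray,
                       cell: float = 0.5) -> float:
    """IoU between the rasterized footprints of the original and perturbed
    prediction sets. 1.0 means the perturbation left the predictions
    unchanged; no ground truth is needed, which sidesteps the issue that the
    ground truth is only one sample from many plausible futures.
    """
    a = occupied_cells(preds_orig, cell)
    b = occupied_cells(preds_pert, cell)
    return len(a & b) / len(a | b)
```

Under this sketch, a robust model yields a minADE sensitivity near 0 and a trajectory-set IoU near 1 when only non-causal agents are removed.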

2. RELATED WORK

Robustness evaluation on perturbations. Machine learning models are known to have brittle predictions under distribution shift. Across multiple domains, researchers have proposed robustness evaluation protocols that move beyond a fixed test set (Recht et al., 2019; Biggio & Roli, 2018; Szegedy et al., 2013; Hendrycks & Dietterich, 2019; Gu et al., 2019; Shankar et al., 2019; Taori 



Figure 1: Trajectory prediction is sensitive to removing non-causal agents. We show a top-down view of a scene from WOMD (left) and a perturbed version of the scene in which we delete all non-causal agents (right). The AV and its trajectories predicted by the SceneTransformer model (Ngiam et al., 2021) are shown in blue, the ground truth trajectory of the AV in grey, and the ground truth trajectories of other agents in green. The perturbation causes a large shift in minADE because the model fails to predict the ground truth mode (a right turn), which indicates the brittleness of the model to such perturbations.

