HALENTNET: MULTIMODAL TRAJECTORY FORECASTING WITH HALLUCINATIVE INTENTS

Abstract

Motion forecasting is essential for making intelligent decisions in robotic navigation. As a result, multi-agent behavioral prediction has become a core component of modern human-robot interaction applications such as autonomous driving. Due to the various intentions of and interactions among agents, agent trajectories can have multiple possible futures; a motion forecasting model's ability to cover these possible modes is therefore essential for accurate prediction. Towards this goal, we introduce HalentNet, which better models the future motion distribution by augmenting a traditional trajectory regression objective with generative augmentation losses. We model intents with unsupervised discrete random variables whose training is guided by a collaboration between two key signals: a discriminative loss that encourages intents' diversity and a hallucinative loss that explores intent transitions (i.e., mixed intents) and encourages their smoothness. This regularizes the network's behavior to be more accurately predictive in uncertain scenarios through active yet careful exploration of possible future agent behavior. Our model's learned representation leads to better and more semantically meaningful coverage of the trajectory distribution. Our experiments show that our method improves over the state of the art on trajectory forecasting benchmarks, including vehicles and pedestrians, by about 20% on average FDE and 50% on road boundary violation rate when predicting 6-second futures. We also conducted human experiments showing that our predicted trajectories received 39.6% more votes than the runner-up approach and 32.2% more votes than our variant without the hallucinative mixed-intent loss.

1. INTRODUCTION

The ability to forecast trajectories of dynamic agents is essential for a variety of autonomous systems such as self-driving vehicles and social robots. It enables an autonomous system to foresee adverse situations and adjust its motion planning accordingly to prefer better alternatives. Because agents can make different decisions at any given time, the future motion distribution is inherently multi-modal. Due to the incomplete coverage of different modes in real data and the combinatorial nature of interacting agents, trajectory forecasting is challenging. Several existing works formulate multi-modal future prediction from training data alone (e.g., Tang & Salakhutdinov, 2019; Alahi et al., 2016; Casas et al., 2019; Deo & Trivedi, 2018; Sadeghian et al., 2018; Salzmann et al., 2020). This severely limits these models' ability to predict modes beyond the training data distribution, and some of the learned modes can be spurious, especially where the real predictive spaces are inadequately covered by the training data. To improve multi-modal prediction quality, our goal is to enrich the coverage of these less explored spaces while encouraging plausible behavior. Properly designing this exploratory learning process for motion forecasting as an implicit data augmentation approach is at the heart of this paper. Most data augmentation methods are geometric and operate on raw data. They have also mostly been studied on discrete label spaces such as classification tasks (e.g., Zhang et al., 2017; Yun et al., 2019; Wang et al., 2019; Cubuk et al., 2019; Ho et al., 2019; Antoniou et al., 2017; Elhoseiny & Elfeki, 2019; Mikołajczyk & Grochowski, 2019; Ratner et al., 2017). In contrast, we focus on a multi-agent future forecasting task where the label space for each agent is spatio-temporal. To our knowledge, augmentation techniques are far less explored for this task.
Our work builds on recent advances in the trajectory prediction problem (e.g., Tang & Salakhutdinov (2019); Salzmann et al. (2020)) that leverage discrete latent variables to represent driving behaviors/intents (e.g., turn left, speed up). Inspired by these advances, we propose HalentNet, a sequential probabilistic latent variable generative model that learns from both real and implicitly augmented multi-agent trajectory data. More concretely, we model driving intents with discrete latent variables z. Our method then hallucinates new intents by mixing different discrete latent variables along the temporal dimension to generate trajectories that are realistic-looking yet different from the training data, as judged by a discriminator Dis, thereby implicitly augmenting the behaviors/intents. The nature of our augmentation approach differs from existing methods since it operates on the latent space that represents the agent's behavior. The training of these latent variables is guided by a collaboration between discriminative and hallucinative learning signals. The discriminative loss increases the separation between intent modes; we impose it as a classification loss that recognizes the one-hot latent intents corresponding to the predicted trajectories. We call these discriminative latent intents classified intents since they are easy to classify to an existing one-hot latent intent (i.e., low entropy). This discriminative loss expands the predictive intent space, which we then encourage to explore via our hallucinated intents' loss. As we detail later, we define hallucinated intents as mixtures of the one-hot classified latent intents. We encourage the predicted trajectories corresponding to hallucinated intents to be hard to classify to the one-hot discrete latent intents via the hallucinative loss while, at the same time, remaining realistic via a real/fake loss that we impose.
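To make the idea of temporal intent mixing concrete, the following is a minimal sketch (not the paper's exact scheme) of how a hallucinated intent could transition between two one-hot classified intents over the prediction horizon; the function name and the linear mixing schedule are illustrative assumptions.

```python
import numpy as np

def hallucinate_intent(z_a, z_b, horizon):
    """Mix two one-hot intents over the prediction horizon.

    Hypothetical sketch: the mixing weight alpha_t varies with time, so
    z_h[t] = (1 - alpha_t) * z_a + alpha_t * z_b transitions smoothly
    from intent a to intent b. The paper's actual schedule may differ.
    """
    # Linear schedule from 0 to 1 over the horizon.
    alphas = np.linspace(0.0, 1.0, horizon)
    # Each row of the result is a valid probability vector over intents.
    return np.stack([(1 - a) * z_a + a * z_b for a in alphas])
```

Each mixed step remains a convex combination of one-hot intents, i.e., a distribution over the discrete intent codes rather than a single classified intent.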
The classification, hallucinative, and real/fake losses are all defined on top of a discriminator Dis, whose inputs are the predicted motion trajectories and the map information. We show that all three components are necessary to achieve good performance, and we also ablate our design choices. Our contributions are summarized as follows.
• We introduce a new framework that enables multi-modal trajectory forecasting to learn dynamically complementary augmented agent behaviors.
• We introduce the notions of classified intents and hallucinated intents in motion forecasting, captured by discrete latent variables z, together with two complementary learning mechanisms to better model latent behavior intentions and encourage the novelty of augmented agent behaviors, thereby improving generalization. The classified intents ẑ are defined not to change over time and are encouraged to be well separated from other classified intents with a classification loss. The hallucinated intents ẑh, on the other hand, change over the prediction horizon and are encouraged to deviate from the classified intents as augmented agent behaviors.
• Our experiments demonstrate up to 26% better results measured by average FDE compared to other state-of-the-art methods on motion forecasting datasets, verifying the effectiveness of our method. We also conducted human evaluation experiments showing that our forecasted motion is considered 39% safer than the runner-up approach's.
Code, pretrained models, and preprocessed datasets are available at https://github.com/Vision-CAIR/HalentNet
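The interplay of the three discriminator-based losses can be sketched as follows; this is an illustrative approximation under assumed forms (cross-entropy for classification, negative entropy for the hallucinative term, logistic real/fake loss), not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def discriminator_losses(intent_logits, realness_logit, target_intent, hallucinated):
    """Hypothetical sketch of the three losses defined on the discriminator.

    intent_logits:  scores over the K discrete intents for a predicted trajectory.
    realness_logit: scalar real/fake score.
    target_intent:  index of the one-hot classified intent (ignored when hallucinated).
    hallucinated:   whether the trajectory came from a hallucinated (mixed) intent.
    """
    p = softmax(intent_logits)
    if hallucinated:
        # Hallucinative loss: reward high entropy, i.e., make the
        # trajectory hard to classify to any single one-hot intent.
        intent_loss = np.sum(p * np.log(p + 1e-8))  # negative entropy
    else:
        # Classification loss: recognize the one-hot classified intent.
        intent_loss = -np.log(p[target_intent] + 1e-8)
    # Real/fake loss: hallucinated trajectories should still look realistic.
    realness_loss = np.log1p(np.exp(-realness_logit))  # softplus(-x)
    return intent_loss, realness_loss
```

The hallucinative term pushes mixed-intent predictions away from the one-hot modes, while the real/fake term keeps the exploration within plausible behavior.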

2. RELATED WORK

Trajectory Forecasting Trajectory forecasting of dynamic agents has received increasing attention recently because it is a core problem for a number of applications such as autonomous driving and social robots. Human motion is inherently multi-modal, and recent work (Lee et al., 2017; Cui et al., 2018; Chai et al., 2019; Rhinehart et al., 2019; Kosaraju et al., 2019; Tang & Salakhutdinov, 2019; Ridel et al., 2020; Salzmann et al., 2020; Huang et al., 2019; Mercat et al., 2019) has focused on learning the distribution from multi-agent trajectory data. (Cui et al., 2018; Chai et al., 2019; Ridel et al., 2020; Mercat et al., 2019) predict multiple future trajectories without learning low-dimensional latent agent behaviors. (Lee et al., 2017; Kosaraju et al., 2019; Rhinehart et al., 2019; Huang

