HALENTNET: MULTIMODAL TRAJECTORY FORECASTING WITH HALLUCINATIVE INTENTS

Abstract

Motion forecasting is essential for making intelligent decisions in robotic navigation. As a result, multi-agent behavioral prediction has become a core component of modern human-robot interaction applications such as autonomous driving. Because agents have varied intentions and interact with one another, their trajectories can unfold into multiple possible futures. A motion forecasting model's ability to cover these possible modes is therefore essential for accurate prediction. Towards this goal, we introduce HalentNet, which augments a traditional trajectory regression objective with generative augmentation losses to better model the future motion distribution. We model intents with unsupervised discrete random variables whose training is guided by a collaboration between two key signals: a discriminative loss that encourages intent diversity, and a hallucinative loss that explores intent transitions (i.e., mixed intents) and encourages their smoothness. This active yet careful exploration of possible future agent behavior regularizes the network to be more accurately predictive in uncertain scenarios. Our model's learned representation leads to better and more semantically meaningful coverage of the trajectory distribution. Our experiments show that our method improves over the state of the art on trajectory forecasting benchmarks covering both vehicles and pedestrians, by about 20% on average FDE and 50% on road boundary violation rate when predicting 6 seconds into the future. We also conducted human studies showing that our predicted trajectories received 39.6% more votes than the runner-up approach and 32.2% more votes than our variant without the hallucinative mixed-intent loss.

1. INTRODUCTION

The ability to forecast trajectories of dynamic agents is essential for a variety of autonomous systems such as self-driving vehicles and social robots. It enables an autonomous system to foresee adverse situations and adjust its motion planning accordingly to prefer better alternatives. Because agents can make different decisions at any given time, the future motion distribution is inherently multi-modal. Due to the incomplete coverage of different modes in real data and the combinatorial nature of interacting agents, trajectory forecasting is challenging. Several existing works formulate multi-modal future prediction only from training data (e.g., Tang & Salakhutdinov, 2019; Alahi et al., 2016; Casas et al., 2019; Deo & Trivedi, 2018; Sadeghian et al., 2018; Salzmann et al., 2020). This severely limits these models' ability to predict modes beyond the training data distribution, and some of the learned modes can be spurious, especially where the real predictive space is not covered, or only inadequately covered, by the training data. To improve multimodal prediction quality, our goal is to enrich the coverage of these less explored spaces while encouraging plausible behavior. Properly designing this exploratory learning process for motion forecasting as an implicit data augmentation approach is at the heart of this paper.

* Work done prior to Amazon.
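To make the idea of hallucinated mixed intents concrete, the following is a minimal NumPy sketch, not the paper's implementation: it assumes a small learned embedding table for discrete intents and a stand-in linear decoder (names such as `mixed_intent`, `decode`, and the dimensions are hypothetical). The key step is interpolating between two discrete intent embeddings to explore intent transitions that may be absent from the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_INTENTS = 4   # number of discrete latent intents (hypothetical)
EMBED_DIM = 8     # intent embedding size (hypothetical)
HORIZON = 12      # number of predicted future steps (hypothetical)

# Learned intent embedding table; random values stand in for trained weights.
intent_embeddings = rng.normal(size=(NUM_INTENTS, EMBED_DIM))

def mixed_intent(i, j, alpha):
    """Interpolate between two discrete intents: a 'hallucinated' mixed intent."""
    return alpha * intent_embeddings[i] + (1.0 - alpha) * intent_embeddings[j]

# Stand-in linear decoder mapping an intent embedding to a 2-D future trajectory.
decoder_weights = rng.normal(size=(EMBED_DIM, HORIZON * 2))

def decode(z):
    """Decode an intent embedding into HORIZON future (x, y) positions."""
    return (z @ decoder_weights).reshape(HORIZON, 2)

# Hallucinative exploration: sample a mixing coefficient and decode the result.
alpha = rng.uniform()
trajectory = decode(mixed_intent(0, 1, alpha))
assert trajectory.shape == (HORIZON, 2)
```

In the full model, such mixed-intent samples would be scored by the hallucinative loss so that trajectories decoded from interpolated intents remain smooth and plausible, rather than being free augmentation noise.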

