BOOTSTRAP MOTION FORECASTING WITH SELF-CONSISTENT CONSTRAINTS

Anonymous authors
Paper under double-blind review

Abstract

We present a novel framework to bootstrap Motion forecastIng with Self-Consistent Constraints (MISC). The motion forecasting task aims at predicting future trajectories of vehicles by incorporating spatial and temporal information from the past. A key design of MISC is the proposed Dual Consistency Constraints that regularize the predicted trajectories under spatial and temporal perturbation during training. In addition, to model the multi-modality in motion forecasting, we design a novel self-ensembling scheme that produces accurate teacher targets, which are used to enforce self-constraints with multi-modality supervision. With explicit constraints from multiple teacher targets, we observe a clear improvement in prediction performance. Extensive experiments on the Argoverse motion forecasting benchmark show that MISC significantly outperforms state-of-the-art methods. As the proposed strategies are general and can be easily incorporated into other motion forecasting approaches, we also demonstrate that our scheme consistently improves the prediction performance of several existing methods.

1. INTRODUCTION

Motion forecasting is a crucial task for self-driving vehicles that aims at predicting the future trajectories of agents (e.g., cars, pedestrians) involved in traffic. The predicted trajectories can further help self-driving vehicles plan their future actions and avoid potential accidents. Since the future is not deterministic, motion forecasting is intrinsically a multi-modal problem with substantial uncertainty. This implies that an ideal motion forecasting method should produce a distribution over future trajectories, or at least the multiple most likely ones. Due to this inherent uncertainty, motion forecasting remains challenging and far from solved. Recently, researchers have proposed different architectures based on various representations to encode the kinematic states and the context information from the HDMap in order to generate feasible multi-modal trajectories (Bansal et al., 2019; Chai et al., 2019; Gao et al., 2020; Gu et al., 2021; Liang et al., 2020; Liu et al., 2021; Ngiam et al., 2021; Varadarajan et al., 2021; Ye et al., 2021; Zeng et al., 2021; Zhao et al., 2020). These methods follow a traditional static training pipeline, where the frames of each scenario are split into historical frames (input) and future frames (ground truth) in a fixed pattern. Nevertheless, in real-world applications prediction is a streaming task: the current state becomes a historical state as time goes by, and historical states are held in a queue-like buffer from which successive predictions are made. Temporal consistency across these successive predictions thus becomes a crucial requirement for downstream tasks to tolerate faults and noise. To tackle this issue, trajectory stitching is widely applied in traditional planning algorithms (Fan et al., 2018) to ensure stability along the temporal horizon. However, as the trajectory stitching operation is non-differentiable, it cannot be easily incorporated into learning-based models.
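The queue-structured history buffer described above can be sketched with a fixed-length deque. This is an illustrative sketch only (constants such as `HISTORY_LEN` and the frame representation are hypothetical, not taken from the paper); it shows why consecutive model inputs overlap almost entirely, which is what makes temporal consistency between successive predictions matter.

```python
from collections import deque

# Hypothetical sketch of the streaming history buffer: a fixed-length
# queue where the newest frame pushes out the oldest one.
HISTORY_LEN = 20  # e.g. 20 past frames (2 s at 10 Hz), an assumed value

buffer = deque(maxlen=HISTORY_LEN)

def step(buffer, new_frame):
    """Push the current frame; return a model input once the buffer is full."""
    buffer.append(new_frame)
    if len(buffer) == buffer.maxlen:
        return list(buffer)  # ordered oldest -> newest
    return None  # still warming up

# Feed 21 frames: the last two windows share HISTORY_LEN - 1 = 19 frames.
window = None
for t in range(HISTORY_LEN + 1):
    window = step(buffer, t)
# final window is [1, ..., 20]; the previous one was [0, ..., 19]
```

Because each new input differs from the previous one by a single frame, a well-behaved predictor should produce nearly identical trajectories at consecutive timesteps, which is exactly the temporal consistency the static training pipeline fails to enforce.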
Though deep-learning-based models show unprecedented motion prediction performance compared with traditional counterparts, they do not explicitly consider temporal consistency, leading to unstable behaviors in downstream tasks such as planning. Inspired by these phenomena, we raise a question: can we explicitly enforce consistency when training a deep motion prediction model? On the one hand, the predicted trajectories should be consistent given successive inputs along the temporal horizon, namely temporal consistency. On the other hand, the predicted trajectories should be stable and robust against small spatial noise or disturbance, namely spatial consistency. In this work, we propose a self-supervised scheme to enforce consistency constraints in both the spatial and temporal domains, namely Dual Consistency Constraints. Our proposed framework, referred to as MISC, significantly improves the quality and robustness of motion forecasting without the need for extra data.

Beyond consistency, multi-modality is another core characteristic of the motion prediction task. Existing datasets (Chang et al., 2019; Sun et al., 2020) only provide a single ground-truth trajectory for each scenario, which cannot cover multi-choice situations such as junction scenarios. Most methods adopt winner-takes-all (WTA) (Lee et al., 2016) or its variants (Breuer et al., 2021; Narayanan et al., 2021) to alleviate this issue. However, WTA tends to produce ambiguous predictions when two candidate trajectories are very close. In contrast, our method addresses the multi-modality issue by introducing more powerful teacher targets from self-ensembling. With self-constraints from multiple soft teacher targets, our model is exposed to more high-quality samples, bootstrapping each modality.
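One plausible form of the two consistency terms described above can be sketched as follows. This is a toy illustration, not the paper's actual formulation: the "model" is a constant-velocity extrapolator, and all names (`toy_model`, `L_temporal`, `L_spatial`) are hypothetical. It shows the temporal term comparing successive predictions on their overlapping horizon, and the spatial term comparing predictions from clean versus perturbed inputs.

```python
import numpy as np

def toy_model(history):
    """Toy predictor: extrapolate 3 future positions at the last velocity."""
    v = history[-1] - history[-2]
    return np.stack([history[-1] + v * (k + 1) for k in range(3)])

history_t  = np.array([[0., 0.], [1., 0.], [2., 0.]])   # input at time t
history_t1 = np.array([[1., 0.], [2., 0.], [3., 0.]])   # input at time t+1

pred_t  = toy_model(history_t)    # covers steps t+1 .. t+3
pred_t1 = toy_model(history_t1)   # covers steps t+2 .. t+4

# Temporal consistency: successive predictions should agree on the
# overlapping horizon (here, steps t+2 and t+3).
L_temporal = np.mean((pred_t[1:] - pred_t1[:-1]) ** 2)

# Spatial consistency: a small input perturbation should not change
# the prediction much relative to the clean prediction.
noise = np.random.default_rng(0).normal(scale=0.01, size=history_t.shape)
pred_noisy = toy_model(history_t + noise)
L_spatial = np.mean((pred_noisy - pred_t) ** 2)

loss = L_temporal + L_spatial  # added on top of the usual regression loss
```

For this perfectly self-consistent toy model the temporal term is exactly zero; a learned predictor generally violates both terms, and minimizing them during training is what the Dual Consistency Constraints regularize.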
Our contributions are summarized as follows:
• We propose Dual Consistency Constraints to enforce temporal and spatial consistency in our model, which is shown to be a general and effective way to improve the overall performance in motion forecasting.
• We propose a self-ensembling training strategy that explicitly provides multi-modality supervision during training, enforcing self-consistency with teacher targets.
• We conduct extensive experiments on the Argoverse (Chang et al., 2019) motion forecasting benchmark, and our proposed approach achieves state-of-the-art results.
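Self-ensembling schemes of the kind referenced in the second contribution commonly maintain a teacher as an exponential moving average (EMA) of the student's weights and use the teacher's predictions as extra soft targets. Whether MISC uses exactly this EMA form is not stated in this section; the sketch below is the generic mean-teacher update rule with hypothetical names (`ema_update`, `decay`), shown only to make the self-ensembling idea concrete.

```python
import numpy as np

def ema_update(teacher_w, student_w, decay=0.99):
    """Generic mean-teacher rule: teacher <- decay*teacher + (1-decay)*student."""
    return decay * teacher_w + (1.0 - decay) * student_w

# Toy weight vectors standing in for full network parameters.
student = np.array([1.0, 2.0])
teacher = np.zeros(2)

# Over training steps the teacher slowly tracks the student, averaging
# over its recent history; its (multi-modal) predictions can then serve
# as soft targets for the self-consistency constraints.
for _ in range(500):
    teacher = ema_update(teacher, student)
```

Because the teacher averages the student over many steps, its targets are smoother than any single checkpoint's outputs, which is what makes them suitable as soft multi-modal supervision.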

2. RELATED WORK

Motion Forecasting. Traditional methods (Houenou et al., 2013; Schulz et al., 2018; Xie et al., 2017; Ziegler et al., 2014) for motion forecasting mainly utilize HDMap information for prior estimation and the Kalman filter (Kalman, 1960) for motion state prediction. With the recent progress of deep learning on big data, more and more works have been proposed to exploit the potential of data mining in motion forecasting. These methods (Bansal et al., 2019; Chai et al., 2019; Duvenaud et al., 2015; Gao et al., 2020; Henaff et al., 2015; Liang et al., 2020; Liu et al., 2021; Shuman et al., 2013; Song et al., 2021; Ye et al., 2021; Zeng et al., 2021) explore different representations, including rasterized images, graph representations, point cloud representations, and transformers, to extract features for the task and predict the final trajectories by regression or post-processing sampling. Most of these works focus on finding more effective and compact ways of extracting features from the surrounding environment (HDMap information) and agent interactions. On top of these representations, other approaches (Casas et al., 2018; Mangalam et al., 2020; Song et al., 2021; Zeng et al., 2021; 2019; Zhao et al., 2020) try to incorporate prior knowledge from traditional methods, taking predefined candidate trajectories obtained by sampling or clustering strategies as anchor trajectories. To some extent, these candidate trajectories can provide better guidance and goal coverage for trajectory regression thanks to their straightforward HDMap encoding. Nevertheless, this extra dependency ties the stability of the models to the quality of the trajectory proposals. Goal-guided approaches (Gilles et al., 2021; Gu et al., 2021; Gilles et al., 2022) are therefore introduced to optimize goals in an end-to-end manner, paired with sampling strategies that generate the final trajectories with a better coverage rate.

Consistency Regularization. Consistency regularization has been extensively studied in semi-supervised and self-supervised learning. Temporally related works (Wang et al., 2019; Lei et al., 2020; Zhou et al., 2017) have widely explored the idea of cyclic consistency; most of them apply pairwise matching, minimizing alignment differences through optical flow or correspondence matching to achieve temporal smoothness. Other works (Bachman et al., 2014; Földiák, 1991; Ouyang et al., 2021; Sajjadi et al., 2016; Wang et al., 2021) apply consistency constraints to predictions from the same input under different transformations in order to obtain perturbation-invariant representations. Our work can be seen as a combination of both types of consistency, fully accounting for the spatial and temporal continuity of motion forecasting.

Multi-hypothesis Learning. The motion forecasting task is inherently multi-modal, due to future uncertainty and the difficulty of acquiring accurate ground-truth labels. WTA (Guzman-Rivera

