HALENTNET: MULTIMODAL TRAJECTORY FORECASTING WITH HALLUCINATIVE INTENTS

Abstract

Motion forecasting is essential for making intelligent decisions in robotic navigation. As a result, the multi-agent behavioral prediction has become a core component of modern human-robot interaction applications such as autonomous driving. Due to various intentions and interactions among agents, agent trajectories can have multiple possible futures. Hence, the motion forecasting model's ability to cover possible modes becomes essential to enable accurate prediction. Towards this goal, we introduce HalentNet to better model the future motion distribution in addition to a traditional trajectory regression learning objective by incorporating generative augmentation losses. We model intents with unsupervised discrete random variables whose training is guided by a collaboration between two key signals: A discriminative loss that encourages intents' diversity and a hallucinative loss that explores intent transitions (i.e., mixed intents) and encourages their smoothness. This regulates the neural network behavior to be more accurately predictive on uncertain scenarios due to the active yet careful exploration of possible future agent behavior. Our model's learned representation leads to better and more semantically meaningful coverage of the trajectory distribution. Our experiments show that our method can improve over the state-of-the-art trajectory forecasting benchmarks, including vehicles and pedestrians, for about 20% on average FDE and 50% on road boundary violation rate when predicting 6 seconds future. We also conducted human experiments to show that our predicted trajectories received 39.6% more votes than the runner-up approach and 32.2% more votes than our variant without hallucinative mixed intent loss.

1. INTRODUCTION

The ability to forecast trajectories of dynamic agents is essential for a variety of autonomous systems such as self-driving vehicles and social robots. It enables an autonomous system to foresee adverse situations and adjust motion planning accordingly to prefer better alternatives. Because agents can make different decisions at any given time, the future motion distribution is inherently multi-modal. Trajectory forecasting is challenging because real data covers the different modes incompletely and because the interactions among agents are combinatorial in nature. Several existing works focus on formulating the multi-modal future prediction only from training data (e.g., Tang & Salakhutdinov, 2019; Alahi et al., 2016; Casas et al., 2019; Deo & Trivedi, 2018; Sadeghian et al., 2018; Salzmann et al., 2020). This severely limits the ability of these models to predict modes that lie beyond the training data distribution, and some of the learned modes could be spurious, especially where the real predictive space is not covered, or is inadequately covered, by the training data. To improve the multimodal prediction quality, our goal is to enrich the coverage of these less explored spaces while encouraging plausible behavior. Properly designing this exploratory learning process for motion forecasting as an implicit data augmentation approach is at the heart of this paper.
Most data augmentation methods are geometric and operate on raw data. They have also mostly been studied on discrete label spaces such as classification tasks (e.g., Zhang et al., 2017; Yun et al., 2019; Wang et al., 2019; Cubuk et al., 2019; Ho et al., 2019; Antoniou et al., 2017; Elhoseiny & Elfeki, 2019; Mikołajczyk & Grochowski, 2019; Ratner et al., 2017). In contrast, we focus on a multi-agent future forecasting task where the label space for each agent is spatio-temporal. To our knowledge, augmentation techniques are far less explored for this task.
Our work builds on recent advances in the trajectory prediction problem (e.g., Tang & Salakhutdinov (2019); Salzmann et al. (2020)) that leverage discrete latent variables to represent driving behaviors/intents (e.g., turn left, speed up). Inspired by these advances, we propose HalentNet, a sequential probabilistic latent variable generative model that learns from both real and implicitly augmented multi-agent trajectory data. More concretely, we model driving intents with discrete latent variables z. Our method then hallucinates new intents by mixing different discrete latent variables along the temporal dimension. This generates trajectories that look realistic but differ from the training data, as judged by a discriminator Dis, and thereby implicitly augments the behaviors/intents. The nature of our augmentation approach is different from existing methods since it operates on the latent space that represents the agent's behavior. The training of these latent variables is guided by a collaboration between discriminative and hallucinative learning signals. The discriminative loss increases the separation between intent modes; we impose this as a classification loss that recognizes the one-hot latent intents corresponding to the predicted trajectories. We call these discriminative latent intents classified intents since they are easy to classify into an existing one-hot latent intent (i.e., low entropy). This discriminative loss expands the predictive intent space, which we then encourage the model to explore through our hallucinated intents' loss. As we detail later, we define hallucinated intents as a mixture of the one-hot classified latent intents. Through the hallucinative loss, we encourage the predicted trajectories corresponding to hallucinated intents to be hard to classify into the one-hot discrete latent intents while, at the same time, remaining realistic through a real/fake loss that we impose.
The classification, hallucinative, and real/fake losses are all defined on top of a discriminator Dis, whose input is the predicted motion trajectories and the map information. We show that all three components are necessary to achieve good performance, and we also ablate our design choices. Our contributions are summarized as follows.
• We introduce a new framework that enables multi-modal trajectory forecasting to learn dynamically complementary augmented agent behaviors.
• We introduce the notion of classified intents and hallucinated intents in motion forecasting that can be captured by discrete latent variables z. We introduce two complementary learning mechanisms, one for each, to better model latent behavior intentions, encourage the novelty of augmented agent behaviors, and hence improve generalization. The classified intents ẑ are defined not to change over time and are encouraged to be well separated from other classified intents with a classification loss. The hallucinated intents ẑ_h, on the other hand, change over the prediction horizon and are encouraged to deviate from the classified intents as augmented agent behaviors.

2. RELATED WORK

Trajectory Forecasting Trajectory forecasting of dynamic agents has received increasing attention recently because it is a core problem in a number of applications such as autonomous driving and social robots. Since human motion is inherently multi-modal, recent work (Lee et al., 2017; Cui et al., 2018; Chai et al., 2019; Rhinehart et al., 2019; Kosaraju et al., 2019; Tang & Salakhutdinov, 2019; Ridel et al., 2020; Salzmann et al., 2020; Huang et al., 2019; Mercat et al., 2019) has focused on learning the distribution from multi-agent trajectory data. Some methods (Cui et al., 2018; Chai et al., 2019; Ridel et al., 2020; Mercat et al., 2019) predict multiple future trajectories without learning low-dimensional latent agent behaviors. Others (Lee et al., 2017; Kosaraju et al., 2019; Rhinehart et al., 2019; Huang et al., 2019) encode agent behaviors in a continuous low-dimensional latent space, while Tang & Salakhutdinov (2019) and Salzmann et al. (2020) use discrete latent variables. Discrete latent variables succinctly capture semantically meaningful modes such as turn left and turn right, and both of these works learn them without explicit labels. Building on this recent work, we hallucinate possible future behaviors by changing agent intents. As the forecast horizon is a few seconds, these are highly plausible. We use a discriminator to encourage augmented trajectories to look real.
Data augmentation Data augmentation is a popular technique to mitigate overfitting and improve generalization in training deep networks (Shorten & Khoshgoftaar, 2019). New data is typically generated by transforming real data samples in the original input space. These transformations include simple techniques (e.g., random flipping and mirroring for images, mixup (Zhang et al., 2017), and CutMix (Yun et al., 2019)). Augmentation in the feature space (Liu et al., 2018; Wang et al., 2019; Li et al., 2020a) has also been proposed.
ISDA (Wang et al., 2019) proposes a loss function to implicitly translate training samples along with semantic directions in the feature space. For example, a certain direction corresponds to the semantic translation of "make-bespectacled." When a person's feature without glasses is translated along this direction, the new feature may correspond to the same person but with glasses. MoEx (Li et al., 2020a) proposes a new augmentation method that leverages the first and second moments extracted and re-injected by feature normalization. Specifically, it replaces the moments of the learned features of one training image by those of another and interpolates the target labels. Our data augmentation is also in the latent space, which represents agent behavior. Imaginative/Hallucinative models. GANs (Goodfellow et al., 2014; Radford et al., 2015) are a powerful generative model, yet they are not explicitly trained to go beyond the training data to improve generalization. Inspired by the theory of human creativity (Martindale, 1990) , recent approaches on generative models were proposed to encourage novel visual content generation in art and fashion designs. In (Elgammal et al., 2017) , the authors adapted GANs to generate unconditional creative content (paintings) by encouraging the model to deviate from existing painting styles. In the fashion domain, (Sbai et al., 2018) showed that their model is capable of producing a non-existing shape like "pants to extended arm sleeves" that some designers found interesting. The key mechanism in these methods is the addition of a deviation loss, which encourages the generator to produce novel content. More recently, (Elhoseiny & Elfeki, 2019) proposed a method for understanding unseen classes, also known as zero-shot learning (ZSL), by generating visual representations of synthesized unseen class descriptors. 
These visual representations are encouraged to deviate from seen classes, leading to better generalization compared to earlier generative ZSL methods. (Zhang et al., 2019) and (Li et al., 2020b) introduced methods to generate additional data based on saliency maps and adversarial learning for few-shot learning task, respectively. In the field of navigation, (Xiao et al., 2020a) and (Xiao et al., 2020b) utilized geometric information to hallucinate new navigation training data. In contrast to these earlier methods, our work has two key differences. First, our work is a sequential probabilistic generative model focusing on motion forecasting requiring time-series prediction in continuous space. Second, the deviation signal in (Elgammal et al., 2017; Sbai et al., 2018; Elhoseiny & Elfeki, 2019 ) is based on defining labeled discrete seen styles and seen classes, respectively. In contrast, we model the deviation from a discrete latent space guided by a deviation signal to help the model imagine driver intents without supervision signal. Similar to MoEx (Li et al., 2020a) , augmented trajectories are dynamically generated during training. We believe we are the first to propose a data augmentation method in latent space for trajectory forecasting.

3. METHOD

Problem Formulation We aim to predict the future trajectory y_gt of a specified agent given the input states x, which contain historical information such as the positions and heading angles of the agent itself and the surrounding agents, and a semantic map patch m, which offers context information such as the drivable region. To do so, we generate a distribution P(y|x, m) that models the distribution of the real future trajectory y_gt.
Figure 1: Overview of our architecture. The generator is trained to infer the behavior intents z and forecast the future trajectories. In addition to a GAN loss and a prediction loss like MLE, we propose classified latent intent behavior, which classifies the latent code ẑ behind trajectories, and hallucinative learning, which generates novel and plausible trajectories by mixing two latent codes. White and colored points denote the ground truth and generated trajectories, respectively.
Our Model Motion forecasting in the real world is a multi-modal task. There are usually multiple possible futures given the same state. To accurately model this diversity, we define a latent code z to represent different intents of the predicted agent, inspired by the literature (e.g., Tang & Salakhutdinov, 2019; Salzmann et al., 2020). We denote the input state as x, a local map as m, and the corresponding ground truth future as y_gt. The possible behaviors are modeled by the distribution of the latent code z conditioned on the input state and the map, P(z|x, m). Then, the predicted trajectory distribution is calculated by conditioning on the input state, the map, and the latent code, P(y|z, x, m). For motion forecasting tasks, we use the maximum likelihood estimation (MLE) loss on the ground truth future as the learning objective, L = −log P(y_gt|x, m). Note that we do not have the label for the latent code z in the dataset. Similar to (Tang & Salakhutdinov, 2019; Salzmann et al., 2020), we represent the latent code as a discrete random variable.
The learning objective can be rewritten as follows.
$$L = -\log P(y_{gt} \mid x, m) = -\log \sum_i P(z_i \mid x, m)\, P(y_{gt} \mid z_i, x, m) \quad (1)$$
Hence, we obtain an unsupervised latent code z that captures some uncertainty of the future without knowing its label. The distribution P(z|x, m) can be modeled by any model that outputs a categorical distribution. P(y|z, x, m) is usually modeled by models that output a multivariate Gaussian distribution. An overview of our model can be found in Fig. 1. It consists of two sub-networks, a generator module and a discriminator module, described in the following paragraphs.
Generator The generator is the prediction model that produces the future trajectory distribution P(y|x, m) given agent states x and a local map m. As the possible future is multi-modal, the output distribution should model this uncertainty. As we discussed earlier, we model the distributions P(z|x, m) and P(y|z, x, m) by neural networks. We use a discrete random variable to represent the latent code z. The uncertainty of the future trajectory can be factorized hierarchically into intent uncertainty and control uncertainty (Chai et al., 2019). The intent uncertainty reflects different intents or behavior modes of the agent, while the control uncertainty covers other minor noise. As the simple Gaussian distribution P(y|z, x, m) is not expressive enough to model the complex uncertainty of multi-modal behaviors, this framework encourages the latent code distribution P(z|x, m) to cover more of the intent uncertainty. We denote the modules that generate the latent code distribution and the trajectory distribution as the encoder Enc_θ and the decoder Dec_φ with parameters θ and φ, respectively. In addition, Enc_θ also encodes the agent states x and the local map m into a feature vector e, which is part of the decoder's input. Note that our method does not introduce further restrictions for the model structure.
Any model that fits this framework can be used as our generator, such as MFP (Tang & Salakhutdinov, 2019) and Trajectron++ (Salzmann et al., 2020). In our experiments, we select Trajectron++ as our generator. Its original learning objective L_traj++ is shown in Appx. B. In summary, the process of the generator can be represented by the following equations.
$$P(z \mid x, m),\ e = \mathrm{Enc}_\theta(x, m), \qquad \hat{z} \sim P(z \mid x, m), \qquad P(y \mid \hat{z}, x, m) = \mathrm{Dec}_\phi(\hat{z}, e) \quad (2)$$
Discriminator The discriminator Dis_ψ with parameters ψ takes either a real trajectory y_gt or a generated one ŷ sampled from our predicted distribution, together with the local map m, as input and judges whether the trajectory is real or generated, following the GAN framework (Goodfellow et al., 2014). This helps the decoder inject map information into the learning signal and alleviates the violation of road boundaries in prediction. Besides, we add a classification head to the discriminator. When the input data is generated, this head needs to recognize the latent code ẑ the generator used for creating the input trajectory. In this way, the generator is forced to increase the difference among latent codes and give us more distinct and semantically meaningful driving strategies. This is further discussed in the following paragraphs. The following equation describes the function of our discriminator.
$$D(y),\ P(z \mid y, m) = \mathrm{Dis}_\psi(y, m) \quad (3)$$
The trajectory y is either the ground truth future y_gt or a sample ŷ from the predicted trajectory distribution. D(y) is the score that indicates whether y is real or synthetic. P(z|y, m) is the classified distribution. Our discriminator is modified from the one in DCGAN (Radford et al., 2015) by adding a fully-connected classification head at the end to classify the latent code. Trajectories y are transformed into a format that convolutional layers can handle via the differentiable rasterizer trick (Wang et al., 2020) and stacked together with the local map m as the input for the discriminator, as described in Appx. B.
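To make the marginalized MLE objective over the discrete latent code and the sampling step of the generator more concrete, the following is a minimal sketch in plain Python. It is not the Trajectron++ implementation: the isotropic Gaussian decoder, the flattened trajectory representation, and the function names (`marginal_nll`, `sample_intent`) are assumptions made purely for illustration.

```python
import math
import random

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def gaussian_log_prob(y, mean, std):
    """Log density of an isotropic Gaussian over a flattened trajectory (assumed decoder form)."""
    return sum(
        -0.5 * ((yi - mi) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)
        for yi, mi in zip(y, mean)
    )

def marginal_nll(y_gt, intent_probs, intent_means, std=1.0):
    """-log sum_i P(z_i|x,m) P(y_gt|z_i,x,m), evaluated in log space."""
    terms = [
        math.log(p) + gaussian_log_prob(y_gt, mean, std)
        for p, mean in zip(intent_probs, intent_means)
        if p > 0.0  # intents with zero probability contribute nothing to the sum
    ]
    return -log_sum_exp(terms)

def sample_intent(intent_probs, rng=random):
    """Draw the index of a one-hot latent code z_hat from the categorical P(z|x,m)."""
    return rng.choices(range(len(intent_probs)), weights=intent_probs, k=1)[0]
```

Working in log space avoids underflow when the per-intent likelihoods of a long trajectory become very small, which is why the sum inside the logarithm is evaluated with a log-sum-exp.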
Learning Methods Our architecture can be trained by a GAN learning objective, together with the original learning loss of the generator module, which depends on the model we choose to combine with.
Algorithm 2 (training loop)
for each training iteration do
    // discriminator
    L_D = (1 − D(y_gt))² + (D(ŷ))² + L_c
    ψ ← ψ − α∇_ψ L_D
    // generator
    y_gt, x ∼ Dataset
    Generate hallucinated trajectory ŷ_h ∼ P(y|ẑ_h, x, m)
    L_G,c = (D(ŷ) − 1)² + L_c
    L_G,h = (D(ŷ_h) − 1)² + L_h
    (θ, φ) ← (θ, φ) − β∇_(θ,φ) (λ L_G,c + (1 − λ) L_G,h)
    // Trajectron++ loss
    y_gt, x ∼ Dataset
    (θ, φ) ← (θ, φ) − β∇_(θ,φ) L_traj++
end
Classified Latent Intent Behavior In the real world, humans can recognize different behavior intents by looking at the trajectories. To encourage the latent code to contain more information about the intent uncertainty and less about the control uncertainty, we mimic this phenomenon and let the discriminator classify the latent code behind the generated trajectories. The classification function can be trained by a cross-entropy loss.
$$L_c = -\sum_i \hat{z}_i \log \bar{z}_i, \quad \text{where } \hat{z} \sim P(z \mid x, m) \quad (4)$$
Here z_i denotes the i-th dimension of a vector z. ẑ is the latent code behind the input trajectory, which is a one-hot vector sampled from the multinoulli distribution P(z|x, m). z̄ = P(z|ŷ, m) denotes the classified categorical distribution generated by our discriminator. Minimizing this loss encourages the decoder to widen the difference among predictions from different z to reduce the classification difficulty for the discriminator. Therefore, we reduce the overlap among output distributions from different latent codes and sharpen them to increase accuracy. Since our model is trained to classify a trajectory into ẑ, we name ẑ the classified intent. Note that this loss is only applied to generated trajectories since we do not have the latent code for the ground truth trajectory.
Hallucinative Learning Latent codes z are trained to model intents in the training data. Each predicted trajectory ŷ is calculated from a single latent code ẑ for all the prediction steps.
Assume the prediction horizon is T and ŷ = [ŷ_1, ŷ_2, ..., ŷ_T]; each step ŷ_t is generated conditioned on the same ẑ. Our discriminator is trained to recognize ẑ given the synthetic trajectory ŷ by the classification loss. Besides, the MLE loss encourages synthetic trajectories to be similar to the training data. Therefore, the discriminator implicitly classifies the training data into one of the latent codes z. We propose a novel way to utilize this property and learn beyond the training data by encouraging the model to generate trajectories from unfamiliar driving behaviors. This is done by first sampling a second, different latent code ẑ′ in addition to the original one ẑ and randomly selecting a time step t_h. The prediction up to time step t_h, [ŷ_1, ..., ŷ_{t_h}], is conditioned on the first latent code ẑ, and we switch to ẑ′ for the remaining steps, [ŷ_{t_h+1}, ..., ŷ_T]. In this way, we hallucinate a new intent by stacking two learned intents along the temporal dimension. We denote this mixed intent as ẑ_h and name it the hallucinated intent. The predicted distribution from such an intent is denoted as P(y|ẑ_h, x, m). We aim to encourage the hallucinative trajectories ŷ_h to be plausible but different from the training data. To achieve this, we minimize the cross entropy between the uniform distribution and our intent class distribution.
$$L_h = -\sum_i \frac{1}{N} \log \bar{z}_i \quad (5)$$
N indicates the number of latent codes, and z̄_i is the i-th dimension of the classified distribution z̄ produced by the discriminator for the hallucinated trajectory. This loss encourages the hallucinative trajectory to be hard to classify into any latent code z and, therefore, to be different from the training data. The plausibility of the hallucinative trajectory is encouraged by the additional GAN loss. In this way, we implicitly apply data augmentation in the latent space to train a more powerful discriminator and improve the generator prediction quality.
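The temporal mixing of two intents and the two cross-entropy signals can be sketched as follows. This is an illustrative plain-Python sketch, not the authors' code: the function names are hypothetical, and drawing the switch step t_h uniformly from 1 to T − 1 (so that both intents are actually used) is an assumption, since the text only states that t_h is selected randomly.

```python
import math
import random

def hallucinated_schedule(z_first, z_second, horizon, rng=random):
    """Per-step intent indices: z_first up to step t_h, z_second for the remaining steps."""
    if z_first == z_second:
        raise ValueError("the two intents must differ")
    if horizon < 2:
        raise ValueError("horizon must allow at least one step per intent")
    # Assumption: t_h is uniform on [1, horizon - 1]; randint is inclusive on both ends.
    t_h = rng.randint(1, horizon - 1)
    return [z_first] * t_h + [z_second] * (horizon - t_h), t_h

def classification_loss(z_hat_index, classified):
    """Cross entropy between the one-hot intent z_hat and the classified distribution."""
    return -math.log(classified[z_hat_index])

def hallucinative_loss(classified):
    """Cross entropy between the uniform distribution and the classified distribution."""
    n = len(classified)
    return -sum(math.log(p) for p in classified) / n
```

For a classifier output over N intents, `hallucinative_loss` is smallest (equal to log N) when the output is uniform, which matches the stated goal of making hallucinated trajectories hard to assign to any single learned intent.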
We call this method hallucinative learning, inspired by the literature (e.g., Hariharan & Girshick (2017)).
Training We use the LSGAN loss (Mao et al., 2017) with spectral normalization (Miyato et al., 2018) as our GAN learning objective. We also keep the original Trajectron++ learning loss L_traj++ to maintain the performance when Trajectron++ is our generator. The combination of GAN learning, training of the original generator, classified latent intent behavior, and hallucinative learning is shown in Alg. 2 (detailed version in Appx. F). We use a hyperparameter λ to balance the training between classification learning and hallucinative learning for the generator by adjusting the weighting of the corresponding losses.
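As a small worked illustration of how the least-squares GAN terms and the λ weighting combine, the scalar losses from the training procedure can be written as below. The sketch only evaluates the loss values for given discriminator scores; it leaves out the networks, spectral normalization, and the gradient steps, and the function names are hypothetical.

```python
def discriminator_loss(d_real, d_fake, l_c):
    """L_D = (1 - D(y_gt))^2 + D(y_hat)^2 + L_c."""
    return (1.0 - d_real) ** 2 + d_fake ** 2 + l_c

def generator_loss(d_fake, d_hall, l_c, l_h, lam):
    """lam * L_{G,c} + (1 - lam) * L_{G,h}, with least-squares real/fake terms."""
    l_g_c = (d_fake - 1.0) ** 2 + l_c  # classified-intent branch
    l_g_h = (d_hall - 1.0) ** 2 + l_h  # hallucinated-intent branch
    return lam * l_g_c + (1.0 - lam) * l_g_h
```

With `lam = 1.0` the hallucinated branch drops out entirely, which corresponds to the variant without hallucinative learning; with `lam = 0.0` only the hallucinated branch drives the generator update.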

4. EXPERIMENTAL RESULTS

We compare the performance of our method with state-of-the-art models. To demonstrate our method's performance in complex scenarios, we focus our evaluation on the nuScenes dataset (Caesar et al., 2019a), which contains about 1000 driving scenes in 2 cities (Boston and Singapore) with dense traffic. Each scene has annotations for pedestrians and vehicles, is sampled at a rate of 2 Hz, and is about 20 seconds long (40 frames). Besides, both cities include maps, which are required by our method. In addition, we also evaluate our method on the widely used pedestrian datasets ETH (Pellegrini et al., 2009) and UCY (Leal-Taixé et al., 2014).
Evaluation Metrics We use the average ℓ2 displacement error (ADE) and the final ℓ2 displacement error (FDE) to evaluate the prediction performance. Each of them has several sub-versions. ADE-ML/FDE-ML is the ADE/FDE calculated using the most likely predicted trajectory. In minADE-k/minFDE-k, we select k candidate trajectories for each prediction and use the minimal value over all candidates as the final score. ADE-Full/FDE-Full represents the quality of the output distribution. To compute ADE-Full/FDE-Full, we randomly sample 2k trajectories and calculate the average score.
Model Setting Our models are trained in two different scenarios. In the first scenario, we train the model entirely from scratch, and in the second one, we finetune a pretrained generator and train the discriminator from scratch. The number of latent codes z is set to 25 following (Salzmann et al., 2020). For vehicles, our method is trained for 23 epochs with the pretrained generator and 35 epochs from scratch. Training with a pretrained model takes about 16 hours on a single NVIDIA V100 graphics card, and training from scratch takes about 24 hours.
Comparison Methods We compare our method to state-of-the-art methods. S-LSTM (Alahi et al., 2016) uses an LSTM to predict trajectories and pools the hidden states among agents to model their interaction.
CSP (Deo & Trivedi, 2018) discretizes behaviors into a fixed number of classes and predicts the best possible behaviors. CAR-Net (Sadeghian et al., 2018) utilizes a visual attention mechanism to encode the surrounding environment. SpAGNN (Casas et al., 2019) first detects agents from LIDAR and a semantic map; a graph neural network decoder then interactively predicts their trajectories. Trajectron++ (Salzmann et al., 2020) encodes surrounding vehicles using a graph neural network model and infers the behavior intents to produce a multi-modal prediction. As Trajectron++ is the best model among these baselines, we perform an extensive comparison with it using the released pretrained model.
nuScenes Dataset We run extensive experiments on the nuScenes dataset (Caesar et al., 2019b) to evaluate and analyze our trajectory forecasting performance and to verify the model's ability to learn dynamically complementary augmented agent behaviors. In this task setting, the model forecasts 3 seconds of the future from at most 4 seconds of history information during training. However, the prediction horizon for evaluation is up to 6 seconds to demonstrate our model's generalization capacity. The nuScenes dataset contains many agent categories such as adult pedestrian and truck. We group them into two semantic classes, vehicle and pedestrian, train individual models on them, and report the performance separately, following (Salzmann et al., 2020). Our method achieves the best FDE compared to other state-of-the-art approaches with at least 4 seconds of future information during testing; see Tab. 1. Due to the instability of GAN training, we remove the diverging training cases and average the numbers over 3 stable runs. Although other methods do not report values at 2s and 4s, we can see that HalentNet's relative performance improves, and that it outperforms existing approaches, as we predict more time steps into the future.
The complementary learning mechanism and hallucinated intents show a noticeable improvement in vehicle trajectory prediction. We run more experiments to further compare our method with Trajectron++ (Salzmann et al., 2020), since Trajectron++ outperforms the other baseline approaches. We use various metrics with prediction horizons from 1 second to 6 seconds for all tracked objects with at least 6 seconds of available future data. The evaluation results are shown in Tab. 3. Our method outperforms Trajectron++ in almost all metrics by a significant margin. Besides, our method also generalizes well when we extend the prediction horizon. In the 6-second prediction case, we obtain an improvement over the baseline model of about 26% on the average FDE over the output distribution (FDE-Full) and 52% on the road boundary violation rate. The advantage is larger when the prediction horizon is longer. HalentNet trajectories show more respect for road boundaries, and the model outputs plausible trajectories produced from hallucinated intents that change over the prediction horizon and are encouraged to deviate from the classified intents. The evaluation on the pedestrian nuScenes benchmark is listed in Tab. 2. We obtain an 8% improvement in the 4s prediction horizon case.
Pedestrian Datasets To further demonstrate our performance, we train our model on two widely used pedestrian datasets, ETH (Pellegrini et al., 2009) and UCY (Leal-Taixé et al., 2014).
UCY (no map). UCY does not provide map information, which is important for our method. We still test our method in this case in Tab. 10 in Appendix E. This can be viewed as a variant of our model since the map is not provided. We observe a slight improvement in the FDE results of about 7% over Trajectron++. As we show later, the improvement is more significant when map information is used, which we believe is available in most cases.
ETH (with map).
We split the data into a training set, a validation set, and a test set with 70%, 15%, and 15% of the data, respectively. Then, we combine these two sets into one big dataset and train both our method and Trajectron++ from scratch with map information. The assessment uses an observation period of 8 timesteps (3.2s) and a prediction horizon of 12 timesteps (4.8s). The results are shown in Tab. 4. Our method is significantly better than Trajectron++, with an improvement of about 20% on minFDE-20.
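For reference, the displacement metrics described under Evaluation Metrics in this section can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the evaluation code used for the reported tables; trajectories are assumed to be lists of (x, y) positions of equal length, and the helper names are hypothetical.

```python
import math

def ade(pred, gt):
    """Average l2 displacement error over all time steps."""
    errors = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(errors) / len(errors)

def fde(pred, gt):
    """Final l2 displacement error at the last time step."""
    return math.dist(pred[-1], gt[-1])

def min_over_k(metric, candidates, gt):
    """minADE-k / minFDE-k: best score among k candidate trajectories."""
    return min(metric(c, gt) for c in candidates)

def average_over_samples(metric, samples, gt):
    """ADE-Full / FDE-Full: mean score over trajectories sampled from the output distribution."""
    return sum(metric(s, gt) for s in samples) / len(samples)
```

The "min over k" variants reward a model if any one of its candidates is close to the ground truth, whereas the "Full" variants penalize probability mass placed on implausible trajectories, which is why the two can move differently for the same model.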

Ablation Study

To better demonstrate and understand each component's effect in our model, we create model variants by removing the evaluated components step by step and showing their performance. The evaluation is on the nuScenes dataset with vehicle prediction for all tracked objects with at least 6 seconds of available future data. The results are listed in Tab. 5. Dis, L_c, and L_h denote the discriminator, the classification learning, and the hallucinative learning, respectively. Compared to the variant without any of the components we list, the model with the discriminator is better by 15% on the average FDE over the output distribution (FDE-Full) and by 30% on the road boundary violation. The FDE of the most likely prediction is also better after 3 seconds. This indicates that the discriminator helps to improve the quality of the output distribution. One possible reason is the injection of the map information. Although the generator takes the local map m as input, we do not guarantee that the plain model will use it. As a trajectory that violates the road boundary can be easily recognized as fake data by the discriminator, the map information is injected into the GAN learning objective. Hence, optimizing this loss helps to push map information into the output distribution. We observe that we cannot gain additional improvement when we add the classification loss L_c. We think this is because L_c only encourages the classified intents ẑ to be distinguishable from each other, and this property does not have a clear relationship to the performance. Our method benefits from the implicit behavior augmentation by the hallucinated intent. When we implicitly augment the data with the hallucinative intent loss L_h, mixing intents during training with ẑ_h by combining classified intents, we observe a further boost in the performance.
The FDE is more than 5% better than that of the discriminator-only variant Dis, and the road boundary violation is about 20-30% better, showing the effectiveness of the hallucinative learning; see Tab. 5. Note that although L_c alone does not improve performance, it is still important for encouraging the hallucinative signal L_h to be more explorative. This is because the exploration of L_h depends on the diversity of the classified intents, which L_c increases. Mathematically, the classification loss L_c encourages reducing the entropy of the categorical output distribution over z, and the hallucinative loss promotes that mixing these intents can still be plausible.
Fig. 2 shows our exploration of how to balance the classification learning and the hallucinative learning. λ represents the importance of classification learning. When λ = 1, our method is reduced to the variant without hallucinative learning. We train separate models with λ in Alg. 2 set to values from 0.0 to 1.0 and plot the corresponding average FDE over the output distribution. The results suggest that properly balancing classified latent intent behavior and hallucinative learning helps improve performance.
The hallucinative loss L_h defined in Eq. 5 is used to increase the classification difficulty of the hallucinated trajectory ŷ_h. In our method, L_h is defined as the cross entropy between the uniform distribution and the classification results. Here we denote our original design choice as L_h(default). In addition to this design choice, we also experiment with two other possibilities: L_h(mixup) and L_h(N + 1). L_h(mixup) is defined as the cross entropy between the classification results and a discrete distribution that has non-zero probabilities (50% each) only on the 2 latent codes used together as the hallucinated intent ẑ_h. For L_h(N + 1), we define a new class label for all the hallucinated trajectories. L_h(N + 1) is the cross entropy between this new class and the prediction results.
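The three design choices differ only in the target distribution that the classifier output is compared against, which the following plain-Python sketch makes explicit. The helper names are hypothetical, and the sketch assumes that the N + 1 variant uses a classifier with one extra output for the new class, which the text implies but does not spell out.

```python
import math

def cross_entropy(target, predicted):
    """Cross entropy -sum_i target_i * log(predicted_i); zero-weight targets are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)

def target_default(n):
    """L_h(default): uniform distribution over the n latent codes."""
    return [1.0 / n] * n

def target_mixup(n, first, second):
    """L_h(mixup): 50% on each of the two latent codes mixed into the hallucinated intent."""
    target = [0.0] * n
    target[first] = 0.5
    target[second] = 0.5
    return target

def target_n_plus_1(n):
    """L_h(N + 1): one-hot on an extra class reserved for hallucinated trajectories."""
    return [0.0] * n + [1.0]
```

Under this view, the default target asks the classifier to be maximally uncertain, the mixup target asks it to be uncertain only between the two mixed intents, and the N + 1 target asks it to recognize hallucinated trajectories as a separate category.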
Results are shown in Tab. 6. Our design choice achieves the best performance, but the TwoHot variant also shows comparable results. The performance of AdditionClass is much worse than that of our design and TwoHot.

Human Evaluation We use Amazon Mechanical Turk (MTurk) to evaluate the quality of our predictions. We randomly selected 150 paired scenes, each of which is evaluated by five human subjects on MTurk, who are asked to judge which model predicts the better trajectory given a scene. Therefore, each comparison contains 750 votes in total. Our method generates better trajectories than our variant without hallucinative learning, receiving 32.2% more votes, and than Trajectron++, receiving 39.6% more votes. Results are shown in Tab. 7.

5. CONCLUSION

In this paper, we propose HalentNet, a probabilistic latent variable framework that hallucinates novel trajectories via transformations in discrete latent agent behavior space. Our method contains two complementary learning mechanisms that encourage diverse and novel generation to regulate the neural network behavior and achieve more accurate predictions on uncertain scenarios. We show that HalentNet can significantly improve generalization for multi-modal future predictions in multi-agent settings and reduce the boundary violation metric by more than 50%.

6. ACKNOWLEDGEMENT

This work is funded by KAUST BAS/1/1685-01-0. The authors wish to thank the Amazon Mechanical Turk workers who helped with our human studies.

D QUALITATIVE RESULTS

Here, we show qualitative results of our method compared with Trajectron++ for 4-second prediction, with models trained for 3-second prediction. We randomly sample 50 trajectories from the model for each prediction, use kernel density estimation to approximate the overall output distribution from the samples, and plot it in Fig. 3. The ground truth trajectories are represented by white points. Compared to Trajectron++, our method reduces the uncertainty of the future by a large margin and also increases the accuracy. The classified latent intent behavior helps us widen the difference among trajectories from different behavior intents. To demonstrate this, we plot trajectories for every latent intent, 25 intents in total including the intents that are unlikely given the input data, for both our method and Trajectron++ in Fig. 6. The white points are ground truth trajectories. The red trajectories are the behaviors ẑ with at least 5% probability (p(ẑ|x, m) ≥ 0.05). The gray trajectories are behaviors that are less likely to occur (p(ẑ|x, m) < 0.05). From the visualization we can see that the latent behaviors in our method are more diverse and distinguishable than those of Trajectron++.

E PEDESTRIAN EXPERIMENTS

Here, we train our model on the pedestrian datasets ETH (Pellegrini et al., 2009) and UCY (Leal-Taixé et al., 2014) without map information. A leave-one-out technique is used for evaluation, similar to previous work (Alahi et al., 2016; Gupta et al., 2018; Ivanovic & Pavone, 2019; Kosaraju et al., 2019; Sadeghian et al., 2019; Salzmann et al., 2020), where the model is trained on four datasets and tested on the fifth. The assessment uses an observation period of 8 timesteps (3.2s) and a prediction horizon of 12 timesteps (4.8s). Note that, unlike the experiments on the nuScenes dataset, our model is trained from scratch here. Tab. 10 shows our performance on the UCY datasets.
In addition, the model's deterministic most-likely (ML) output scheme is used, which produces the single most likely trajectory of the model. Using only the notion of classified intents and hallucinated intents, which can be captured by a discrete latent vector ẑ, we see a slight improvement in the FDE results of almost 7% over Trajectron++.
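The leave-one-out protocol and the displacement metrics used in this evaluation can be sketched as follows. The scene names are the five commonly used ETH/UCY splits and should be read as an assumption here, as should the helper names; the 0.4 s step length is implied by 8 timesteps covering 3.2 s.

```python
import math

# Assumed names for the five ETH/UCY scenes; not listed in the text above.
SCENES = ["eth", "hotel", "univ", "zara1", "zara2"]

DT = 0.4                        # seconds per timestep (8 steps = 3.2 s, 12 steps = 4.8 s)
OBS_STEPS, PRED_STEPS = 8, 12

def leave_one_out_splits(scenes):
    """Yield (train_scenes, test_scene): train on four scenes, test on the fifth."""
    for held_out in scenes:
        yield [s for s in scenes if s != held_out], held_out

def ade(pred, gt):
    """Average displacement error: mean Euclidean distance over all timesteps."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def fde(pred, gt):
    """Final displacement error: Euclidean distance at the last timestep."""
    return math.dist(pred[-1], gt[-1])

# Toy example: the prediction drifts sideways by 0.1 m per step.
gt = [(0.4 * t, 0.0) for t in range(1, PRED_STEPS + 1)]
pred = [(0.4 * t, 0.1 * t) for t in range(1, PRED_STEPS + 1)]
example_ade = ade(pred, gt)
example_fde = fde(pred, gt)
```

In the most-likely (ML) setting, `pred` would be the single most likely trajectory of the model; for distribution-level metrics the same functions would be averaged over sampled trajectories.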



Figure 2: Balancing classified latent intent behavior and hallucinative learning by selecting a proper λ (λ = 0.5 in Alg. 2) gives the best FDE (averaged over 2000 random samples).

Figure 3: Qualitative results of our method and Trajectron++. Compared to Trajectron++, our method significantly reduces the uncertainty of the prediction in all scenes with improved accuracy. White points denote the ground truth trajectories.

Figure 4: The trajectories from all behavior intents generated by our method and Trajectron++. We force the model to predict a trajectory for every behavior, regardless of whether the model judges that behavior to be possible. White points denote the ground truth trajectories. The other points denote predicted trajectories with different behavior intents. With the help of classified latent intent behavior, we obtain more diverse behaviors than Trajectron++. Note that red points come from intents that the model judges likely given the input data. The gray points come from intents that are very unlikely to happen and that we forcibly set for demonstration. Note that Trajectron++ predicts unsafe trajectories with a high likelihood, while our method is capable of predicting diverse trajectories but assigns unsafe modes a very low likelihood.
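The red/gray split used in this visualization follows the 5% threshold on p(ẑ|x, m). A minimal sketch of that partition is given below; the function name and the example probabilities are hypothetical and only illustrate the rule.

```python
def split_intents(intent_probs, threshold=0.05):
    """Partition intent indices into likely (p >= threshold) and unlikely (p < threshold)."""
    likely = [i for i, p in enumerate(intent_probs) if p >= threshold]
    unlikely = [i for i, p in enumerate(intent_probs) if p < threshold]
    return likely, unlikely

# Hypothetical p(z | x, m) over 25 intents: two dominant modes, the rest spread thin.
probs = [0.6, 0.3] + [0.1 / 23] * 23
likely, unlikely = split_intents(probs)  # likely intents would be drawn in red, the rest in gray
```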

# ------------ discriminator ------------
# 1. Compute the high-level features e
#    and the distribution of the latent code z.
# 2. Sample the classified intent z.
# 3. Convert the actions into trajectories
#    by the integration model.
# 4. The discriminator judges whether the given
#    trajectory is real/fake and to which z it is classified.
# 5. z represents the classification result P(z|y, m).
# ...
# 1. Sample two intents and a time step t_h to assemble the hallucinated intent z_h.
# 2. Generate the hallucinated trajectory y_h.
# 3. Make y_h hard to classify by
#    reducing the cross entropy between
#    a uniform distribution and the classification results z_h.

Our experiments demonstrate up to 26% better results measured by average FDE compared to other state-of-the-art methods on motion forecasting datasets, which verifies the effectiveness of our method. We also conducted human evaluation experiments showing that our forecasted motion is considered 39% safer than the runner-up approach. Code, pretrained models, and preprocessed datasets are available at https://github.com/Vision-CAIR/HalentNet

To learn a better behavior representation and improve the quality of the predicted trajectory distribution, we introduce two new training methods: Classified Latent Intent Behavior and Hallucinative Latent Intent.

nuScenes: Vehicle prediction

nuScenes: Pedestrian prediction

nuScenes: Detailed comparison with Trajectron++. ∆% is relative improvement.

With Map. We evaluate our model and Trajectron++ (both trained from scratch) on the combination of ETH and Hotel datasets with maps. Maps help to learn a better discriminator, hence increasing the performance of our method on the ETH and Hotel sets. Note that Trajectron++ also takes maps as input in this experiment for a fair comparison. Our method is significantly better.

Ablation study on nuScenes dataset.

Ablation study on different designs for hallucinative loss L_h on nuScenes.

In other words, L_c trains the discriminator to classify trajectories, and L_h trains the generator to output trajectories that are hard to classify when a hallucinated intent is used. If we remove L_c and keep only L_h, the classifier cannot be trained, which makes L_h meaningless. Hence, the two losses are complementary. The importance of L_c to L_h is shown by the drop in performance when only L_c is discarded (second row in Tab. 5).

Without Map. The performance of our method on the UCY pedestrian datasets. No map information is available in this experiment. Nevertheless, our method still achieves performance comparable to other state-of-the-art methods and obtains the best FDE of the most likely trajectory averaged over 3 datasets. Lower is better. Bold indicates best. Our method is significantly better. Compared methods: (Gupta et al., 2018), (Vemula et al., 2018), (Salzmann et al., 2020).

A APPENDIX

In this document, we present more explanations and details about the training and the testing results on the nuScenes dataset and the pedestrian datasets, with more visualizations for our method. It contains the following sections:

• Training Details
• Qualitative Results
• Pedestrian Experiments

The code and pretrained models will be released in the future (soon).

B TRAINING DETAILS

Original Loss of Generator In this dataset, we initialize the generator of our Halent model with the pretrained Trajectron++ model (Salzmann et al., 2020). The original Trajectron++ training loss L_traj is kept in our method. Here, k_2 is set to 1. Instead of directly learning the distribution of latent intents p(z|x, m), Trajectron++ learns q(z|x, m, y_gt), which is additionally conditioned on the ground truth trajectory during training. p(z|x, m) is learned by reducing the KL divergence between q(z|x, m, y_gt) and p(z|x, m). k_1 is gradually increased to enhance the information transfer. Note that only p(z|x, m) is used during testing.

Differentiable Rasterizer To combine the trajectory y and the local map m into a format acceptable to the CNN-based discriminator, we use a differentiable rasterizer (Wang et al., 2020) to convert y, which can be represented by a sequence of T positions {(x_1, y_1), (x_2, y_2), ..., (x_T, y_T)}, into T 2D occupancy grids {G_1, G_2, ..., G_T}. Each grid G_t is a tensor with the same width and height as m. In detail, it creates a bivariate Gaussian distribution N(µ_t, Σ) for every time step t, where µ_t = f_a(x_t, y_t) and Σ = diag(σ^2, σ^2); σ is a hyperparameter. The value of cell (i, j) of G_t is the scaled probability density at location (i, j) in the map coordinate system. Here, we normalize the occupancy grids so that the maximal amplitude equals k. In this way, we obtain 2D trajectory grids {G_1, G_2, ..., G_T}, which can be processed by a CNN and are differentiable w.r.t. the original trajectory. In our experiments, we set k = 9 and σ = 5 based on a hyperparameter search on the validation set.

Training We set λ = 0.5 to balance the classified latent intent behavior and hallucinative learning. The model is trained with the Adam optimizer (Kingma & Ba, 2014). The pretrained model is trained for 12 epochs. We continued the training for another 23 epochs with our method and kept the original learning rate for our generator. The learning rate of the discriminator is lower than that of the generator to avoid large gradients at the beginning of training.
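The rasterization step described above can be illustrated with a small pure-Python sketch. It assumes that positions are already expressed in map-grid coordinates (that is, f_a has been applied) and that normalization is done with respect to the continuous Gaussian peak, so the amplitude at µ_t is exactly k. The function names are hypothetical. Plain Python lists carry no gradients; in practice the same expression would be written in an autodiff framework, where it is smooth in µ_t.

```python
import math

def rasterize_position(mu, height, width, sigma=5.0, k=9.0):
    """Return a height x width grid holding an isotropic Gaussian bump centered at mu.

    mu is (row, col) in map-grid coordinates. The unnormalized Gaussian
    exp(-d^2 / (2 sigma^2)) peaks at 1, so scaling by k gives amplitude k at mu.
    """
    mu_i, mu_j = mu
    return [
        [k * math.exp(-((i - mu_i) ** 2 + (j - mu_j) ** 2) / (2.0 * sigma ** 2))
         for j in range(width)]
        for i in range(height)
    ]

def rasterize_trajectory(positions, height, width, sigma=5.0, k=9.0):
    """Convert T positions into T occupancy grids {G_1, ..., G_T}."""
    return [rasterize_position(mu, height, width, sigma, k) for mu in positions]

# Two-step toy trajectory on an 8 x 8 map.
grids = rasterize_trajectory([(2, 3), (4, 4)], height=8, width=8)
```

Each grid can then be stacked with the map channels as input to a CNN-based discriminator.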

C ADDITIONAL RESULTS ON NUSCENES

Here we report the ADE scores of our method compared to Trajectron++ and the variants of our method in Tab. 8, as a supplement to Tab. 1 and Tab. 5. The evaluation is on the nuScenes dataset, using vehicle prediction for all tracked objects with at least 4 seconds of available future data. Compared to Trajectron++ (the last row), we obtain a 25 cm improvement in the 4s case measured by ADE-Full, which is about 21% better.

Published as a conference paper at ICLR 2021

In Tab. 9, we show the importance of the map information to the discriminator by removing the map input to the discriminator and keeping all other parts the same (the map input to the generator is kept). Due to the lack of map information, the discriminator cannot be well trained and the performance drops compared to our full model. Models are evaluated on the nuScenes dataset.

… ] = Inte(y_0, â_1, â_2, ..., â_T)
// The discriminator judges whether the given trajectory is real/fake and to which z it is classified
// z represents the classification result P(z|ŷ, m). The i-th element z_i is the probability that ŷ belongs to the i-th z, as judged by the discriminator

F DETAILED ALGORITHM

…, x, m ∼ Dataset
P(z|x, m), e = Enc_θ(x, m)
// Sample ẑ, ẑ and a time step t_h. Their combination is viewed as the hallucinated intent ẑ_h
ẑ, ẑ, …
[…, ŷ_h,2, ..., ŷ_h,T] = Inte(y_0, â_h,1, â_h,2, ..., â_h,T)
D(ŷ_h), z_h = Dis_ψ(ŷ_h, m)
// Make ŷ_h hard to be classified by reducing the cross entropy between a uniform distribution and the classification results z_h. N is the number of latent codes
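The sampling step of the hallucination branch can be sketched as follows. The algorithm states only that two sampled intents and a time step t_h are combined into the hallucinated intent; the specific rule used here (the first intent drives the steps before t_h and the second drives the steps from t_h onward) is an assumption made for illustration, as are all function names.

```python
import random

def sample_intent(probs, rng):
    """Draw one intent index from the categorical distribution p(z | x, m)."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def hallucinated_intent_schedule(probs, horizon, rng):
    """Sample two intents and a switch step t_h; return the per-step intent indices.

    Assumption: intent z_a is active for steps t < t_h and z_b for steps t >= t_h.
    """
    z_a = sample_intent(probs, rng)
    z_b = sample_intent(probs, rng)
    t_h = rng.randrange(1, horizon)  # switch step in [1, horizon - 1]
    schedule = [z_a if t < t_h else z_b for t in range(horizon)]
    return schedule, (z_a, z_b, t_h)

rng = random.Random(0)
schedule, (z_a, z_b, t_h) = hallucinated_intent_schedule([0.5, 0.3, 0.2], horizon=12, rng=rng)
```

The resulting per-step intents would condition the decoder that produces the actions â_h,1, ..., â_h,T, which the integration model then turns into the hallucinated trajectory ŷ_h.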

