TRAJECTORY PREDICTION USING EQUIVARIANT CONTINUOUS CONVOLUTION

Abstract

Trajectory prediction is a critical part of many AI applications, for example, the safe operation of autonomous vehicles. However, current methods are prone to making inconsistent and physically unrealistic predictions. We leverage insights from fluid dynamics to overcome this limitation by considering internal symmetry in real-world trajectories. We propose a novel model, Equivariant Continuous COnvolution (ECCO), for improved trajectory prediction. ECCO uses rotationally-equivariant continuous convolutions to embed the symmetries of the system. On both vehicle and pedestrian trajectory datasets, ECCO attains competitive accuracy with significantly fewer parameters. It is also more sample efficient, generalizing automatically from few data points in any orientation. Lastly, ECCO improves generalization through equivariance, resulting in more physically consistent predictions. Our method provides a fresh perspective towards increasing trust and transparency in deep learning models. Our code and data can be found at

1. INTRODUCTION

Trajectory prediction is one of the core tasks in AI, from the movement of basketball players to fluid particles to car traffic (Sanchez-Gonzalez et al., 2020; Gao et al., 2020; Shah & Romijnders, 2016). A common abstraction underlying these tasks is the movement of many interacting agents, analogous to a many-particle system. Therefore, understanding the states of these particles, their dynamics, and their hidden interactions is critical to accurate and robust trajectory forecasting. Even for purely physical systems, such as in particle physics, the complex interactions among a large number of particles make this a difficult problem. For vehicle or pedestrian trajectories, this challenge is further compounded by latent factors such as human psychology. Given these difficulties, current approaches require large amounts of training data and many model parameters. State-of-the-art methods in this domain, such as Gao et al. (2020), are based on graph neural networks. They do not exploit the physical properties of the system and often make predictions which are not self-consistent or physically meaningful. Furthermore, they predict a single agent trajectory at a time instead of multiple agents simultaneously.

Our model is built upon a key insight about many-particle systems: their intricate internal symmetry. Consider a model which predicts the trajectories of cars on a road. To be successful, such a model must understand the physical behavior of vehicles together with human psychology. It should distinguish left from right turns, and give consistent outputs for intersections rotated to different orientations. As shown in Figure 1, a driver's velocity rotates with the entire scene, whereas vehicle interactions are invariant to such a rotation. Likewise, psychological factors such as reaction speed or attention may be considered vectors with prescribed transformation properties.
Data augmentation is a common practice to deal with rotational invariance, but it cannot guarantee invariance and requires longer training: since rotation is a continuous group, augmentation requires sampling from infinitely many possible angles. In this paper, we propose an equivariant continuous convolutional model, ECCO, for trajectory forecasting. Continuous convolution generalizes discrete convolution and is adapted to data in many-particle systems with complex local interactions. Ummenhofer et al. (2019) designed a model using continuous convolutions for particle-based fluid simulations. Meanwhile, equivariance to group symmetries has proven to be a powerful tool to integrate physical intuition in physical science applications (Wang et al., 2020; Brown & Lunter, 2019; Kanwar et al., 2020). Here, we test the hypothesis that an equivariant model can also capture internal symmetry in non-physical human behavior. Our model utilizes a novel weight sharing scheme, torus kernels, and is rotationally equivariant.

We evaluate our model on two real-world trajectory datasets: the Argoverse autonomous vehicle dataset (Chang et al., 2019) and the TrajNet++ pedestrian trajectory forecasting challenge (Kothari et al., 2020). We demonstrate prediction accuracy on par with or better than baseline models and data augmentation, with fewer parameters, better sample efficiency, and stronger generalization properties. Lastly, we demonstrate theoretically and experimentally that our polar-coordinate-indexed filters have lower equivariance discretization error because they are better adapted to the symmetry group. Our main contributions are as follows:

• We propose Equivariant Continuous COnvolution (ECCO), a rotationally equivariant deep neural network that can capture internal symmetry in trajectories.
• We design ECCO using a novel weight sharing scheme based on orbit decomposition and polar-coordinate-indexed filters.
• We implement equivariance for both the standard representation and the regular representation L²(SO(2)).
• On the benchmark Argoverse and TrajNet++ datasets, ECCO demonstrates comparable accuracy while enjoying better generalization, fewer parameters, and better sample complexity.

2. RELATED WORK

Trajectory Forecasting For vehicle trajectories, classic models in transportation include the Car-Following model (Pipes, 1966) and the Intelligent Driver model (Kesting et al., 2010). Deep learning has also received considerable attention; for example, Liang et al. (2020) and Gao et al. (2020) use graph neural networks to predict vehicle trajectories, and Djuric et al. (2018) use rasterizations of the scene with CNNs. See the review paper by Veres & Moussa (2019) for deep learning in transportation. For human trajectory modeling, Alahi et al. (2016) propose Social LSTM to learn human-human interactions. TrajNet (Sadeghian et al., 2018) and TrajNet++ (Kothari et al., 2020) introduce benchmarks for human trajectory forecasting. We refer readers to Rudenko et al. (2020) for a comprehensive survey. Nevertheless, many deep learning models are purely data-driven: they require large amounts of data, have many parameters, and can generate physically inconsistent predictions.

Continuous Convolution Continuous convolutions over point clouds (CtsConv) have been successfully applied to classification and segmentation tasks (Wang et al., 2018; Lei et al., 2019; Xu et al., 2018; Wu et al., 2019; Su et al., 2018; Li et al., 2018; Hermosilla et al., 2018; Atzmon et al., 2018; Hua et al., 2018). More recently, a few works have used continuous convolution for modeling trajectories or flows. For instance, Wang et al. (2018) use CtsConv for inferring flow on LIDAR data, and Schenck & Fox (2018) and Ummenhofer et al. (2019) model fluid simulation using CtsConv. Closely related to our work is Ummenhofer et al. (2019), who design a continuous convolution network for particle-based fluid simulations. However, they use a ball-to-sphere mapping which is not well adapted for rotational equivariance, and they only encode 3 frames of input. Graph neural networks (GNNs) are a related strategy which has been used for modeling particle system dynamics (Sanchez-Gonzalez et al., 2020).
GNNs are also permutation invariant, but they do not natively encode relative positions and local interactions as a CtsConv-based network does.

Equivariant and Invariant Deep Learning Developing neural nets that preserve symmetries has been a fundamental task in image recognition (Cohen et al., 2019b; Weiler & Cesa, 2019; Cohen & Welling, 2016a; Chidester et al., 2018; Lenc & Vedaldi, 2015; Kondor & Trivedi, 2018; Bao & Song, 2019; Worrall et al., 2017; Cohen & Welling, 2016b; Weiler et al., 2018; Dieleman et al., 2016; Maron et al., 2020). Equivariant networks have also been used to predict dynamics: for example, Wang et al. (2020) predict fluid flow using Galilean equivariance, but only for gridded data. Fuchs et al. (2020) use SE(3)-equivariant transformers to predict trajectories for a small number of particles as a regression task. As in this paper, both Bekkers (2020) and Finzi et al. (2020) address the challenge of parameterizing a kernel over continuous Lie groups; Finzi et al. (2020) apply their method to trajectory prediction on point clouds using a small number of points following strict physical laws. Worrall et al. (2017) also parameterize convolutional kernels using polar coordinates, but map these onto a rectilinear grid for application to image data.

3. BACKGROUND

We first review the necessary background of continuous convolution and rotational equivariance.

3.1. CONTINUOUS CONVOLUTION

Continuous convolution (CtsConv) generalizes discrete convolution to point clouds. It provides an efficient and spatially aware way to model the interactions of nearby particles. Let f^(i) ∈ R^{c_in} denote the feature vector of particle i. Thus f is a vector field which assigns to each point x^(i) a vector in R^{c_in}. The kernel of the convolution K : R² → R^{c_out×c_in} is a matrix field: for each point x ∈ R², K(x) is a c_out × c_in matrix. Let a be a radial local attention map with a(r) = 0 for r > R. The output feature vector g^(i) of particle i from the continuous convolution is

g^(i) = CtsConv_{K,R}(x, f; x^(i)) = Σ_j a(‖x^(j) − x^(i)‖) K(x^(j) − x^(i)) · f^(j).   (1)

CtsConv is naturally equivariant to permutation of labels and is translation invariant. Equation 1 is closely related to graph neural networks (GNNs) (Kipf & Welling, 2017; Battaglia et al., 2018), which are also permutation invariant. Here the graph is dynamic and implicit, with nodes x^(i) and edges e_ij whenever ‖x^(i) − x^(j)‖ < R. Unlike a GNN, which applies the same weights to all neighbours, the kernel K depends on the relative position vector x^(i) − x^(j).
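Equation 1 can be sketched in a few lines of NumPy. This is an illustrative reference implementation, not the paper's code: it uses a naive O(n²) neighbor loop, and `radial_window` is one hypothetical choice of the attention map a(r), which the text leaves unspecified.

```python
import numpy as np

def radial_window(r, R):
    """A smooth attention map a(r): near 1 at the center, 0 for r > R (illustrative choice)."""
    return np.where(r < R, (1.0 - (r / R) ** 2) ** 3, 0.0)

def cts_conv(positions, features, kernel_fn, R):
    """Continuous convolution over a point cloud (Equation 1).

    positions : (n, 2) particle coordinates x^(i)
    features  : (n, c_in) feature vectors f^(i)
    kernel_fn : maps a relative position in R^2 to a (c_out, c_in) matrix K(x)
    R         : neighborhood radius
    Returns (n, c_out) output features g^(i).
    """
    n = positions.shape[0]
    c_out = kernel_fn(np.zeros(2)).shape[0]
    out = np.zeros((n, c_out))
    for i in range(n):
        for j in range(n):
            rel = positions[j] - positions[i]
            r = np.linalg.norm(rel)
            if r < R:
                # a(||x^(j) - x^(i)||) * K(x^(j) - x^(i)) @ f^(j)
                out[i] += radial_window(r, R) * kernel_fn(rel) @ features[j]
    return out
```

Since the kernel only sees relative positions, shifting every particle by the same offset leaves the output unchanged, which is the translation invariance noted above.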

3.2. ROTATIONAL EQUIVARIANCE

Continuous convolution is not naturally rotationally equivariant. Fortunately, we can translate the techniques for rotational equivariance in CNNs to continuous convolutions. We use the language of Lie groups and their representations; for more background, see Hall (2015) and Knapp (2013). We denote the symmetry group of 2D rotations by SO(2) = {Rot_θ : 0 ≤ θ < 2π}. As a Lie group, it has a group structure Rot_{θ1} ∘ Rot_{θ2} = Rot_{(θ1+θ2) mod 2π} which is a continuous map with respect to the topological structure. As a manifold, SO(2) is homeomorphic to the circle S¹ ≅ {x ∈ R² : ‖x‖ = 1}. The group SO(2) can act on a vector space R^c by specifying a representation map ρ : SO(2) → GL(R^c), which assigns to each element of SO(2) an element of the set of invertible c × c matrices GL(R^c). The map ρ must be a homomorphism: ρ(Rot_{θ1})ρ(Rot_{θ2}) = ρ(Rot_{θ1} ∘ Rot_{θ2}). For example, the standard representation ρ₁ on R² acts by 2 × 2 rotation matrices. The regular representation ρ_reg on L²(SO(2)) = {ϕ : SO(2) → R : |ϕ|² is integrable} is ρ_reg(Rot_φ)(ϕ) = ϕ ∘ Rot_{−φ}. Given an input f with representation ρ_in and an output with representation ρ_out, a map F is SO(2)-equivariant if F(ρ_in(Rot_θ)f) = ρ_out(Rot_θ)F(f). Discrete CNNs are equivariant to a group G if the input, output, and hidden layers carry a G-action and the linear layers and activation functions are all equivariant (Kondor & Trivedi, 2018). One method for constructing equivariant discrete CNNs is the steerable CNN (Cohen & Welling, 2016b). Cohen et al. (2019a) derive a general constraint for when a convolutional kernel K : R^b → R^{c_out×c_in} is G-equivariant. Assume G acts on R^b and that R^{c_out} and R^{c_in} carry G-representations ρ_out and ρ_in respectively; then K is G-equivariant if for all g ∈ G and x ∈ R^b, K(gx) = ρ_out(g) K(x) ρ_in(g⁻¹). For the group SO(2), Weiler & Cesa (2019) solve this constraint using circular harmonic functions to give a basis of discrete equivariant kernels.
In contrast, our method is much simpler and uses orbits and stabilizers to create continuous convolution kernels.
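The definitions above can be checked numerically. The kernel K(x) = xxᵀ below is a hypothetical example, not one from the paper, that happens to satisfy the constraint K(gx) = ρ_out(g)K(x)ρ_in(g⁻¹) for ρ_in = ρ_out = ρ₁, because rotation matrices are orthogonal (gᵀ = g⁻¹):

```python
import numpy as np

def rot(theta):
    """Standard representation rho_1: a 2x2 rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def K(x):
    """An illustrative equivariant kernel for rho_in = rho_out = rho_1:
    K(gx) = (gx)(gx)^T = g (x x^T) g^T = rho_out(g) K(x) rho_in(g^{-1})."""
    return np.outer(x, x)

x = np.array([1.3, -0.7])
g_theta = 0.9
# The equivariance constraint from Cohen et al. (2019a) holds:
assert np.allclose(K(rot(g_theta) @ x), rot(g_theta) @ K(x) @ rot(-g_theta))
# The representation map is a homomorphism: rho(theta1) rho(theta2) = rho(theta1 + theta2).
assert np.allclose(rot(0.4) @ rot(0.5), rot(0.9))
```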

4. ECCO: TRAJECTORY PREDICTION USING ROTATIONALLY EQUIVARIANT CONTINUOUS CONVOLUTION

In trajectory prediction, given the historical positions and velocities of n particles over t_in timesteps, we want to predict their positions over the next t_out timesteps. Denote the ground truth dynamics by ξ, which maps ξ(x_{t−t_in:t}, v_{t−t_in:t}) = x_{t:t+t_out}. Motivated by the observation in Figure 1, we wish to learn a model f that approximates the underlying dynamics while preserving the internal symmetry in the data, specifically rotational equivariance. We introduce ECCO, a model for trajectory prediction based on rotationally equivariant continuous convolution. We implement rotationally equivariant continuous convolutions using a weight sharing scheme based on orbit decomposition. We also describe equivariant per-particle linear layers, a special case of continuous convolution with radius R = 0, analogous to 1×1 convolutions in CNNs; such layers are useful for passing information between layers from each particle to itself. The high-level architecture of ECCO is illustrated in Figure 2. It is important to remember that the input, output, and hidden layers are all vector fields over the particles. Oftentimes, environmental information is also available in the form of road lane markers. Denote marker positions by x_map and direction vectors by v_map; this data is thus also a particle field, but a static one.

4.1. ECCO MODEL OVERVIEW

To design an equivariant network, one must choose the group representation. This choice plays an important role in shaping the learned hidden states. We focus on two representations of SO(2): ρ₁ and ρ_reg. The representation ρ₁ is that of our input features, and ρ_reg is used for the hidden layers. For ρ₁, we constrain the kernel in Equation 1. For ρ_reg, we further introduce a new operator: convolution with torus kernels. To make continuous convolution rotationally equivariant, we translate the general condition for discrete CNNs developed in Weiler & Cesa (2019) to continuous convolution. We define the convolution kernel K in polar coordinates K(θ, r). Let R^{c_out} and R^{c_in} carry SO(2)-representations ρ_out and ρ_in respectively; then the equivariance condition requires the kernel to satisfy

K(θ + φ, r) = ρ_out(Rot_θ) K(φ, r) ρ_in(Rot_θ⁻¹).   (3)

Imposing such a constraint for continuous convolution requires us to develop an efficient weight sharing scheme for the kernels which solves Equation 3.

4.2. WEIGHT SHARING BY ORBITS AND STABILIZERS

Given a point x ∈ R² and a group G, the set O_x = {gx : g ∈ G} is the orbit of the point x. The set of orbits gives a partition of R² into the origin and circles of radius r > 0. The set of group elements G_x = {g : gx = x} fixing x is called the stabilizer of the point x. We use the orbits and stabilizers to constrain the weights of K: simply put, we share weights across orbits and constrain weights according to stabilizers, as shown in Figure 3-Left. The ray D = {(0, r) : r ≥ 0} is a fundamental domain for the action of G = SO(2) on the base space R²; that is, D contains exactly one point from each orbit. We first define K(0, r) for each (0, r) ∈ D. Then we compute K(θ, r) from K(0, r) by setting φ = 0 in Equation 3:

K(θ, r) = ρ_out(Rot_θ) K(0, r) ρ_in(Rot_θ⁻¹).   (4)

For r > 0, the group acts freely on (0, r), i.e. the stabilizer contains only the identity.
This means that Equation 3 imposes no additional constraints on K(0, r); thus K(0, r) ∈ R^{c_out×c_in} is a matrix of freely learnable weights. For r = 0, however, the orbit O_(0,0) is a single point. The stabilizer of (0, 0) is all of G, which requires

K(0, 0) = ρ_out(Rot_θ) K(0, 0) ρ_in(Rot_θ⁻¹) for all θ.   (5)

Thus K(0, 0) is an equivariant per-particle linear map ρ_in → ρ_out.

Table 1: Equivariant linear maps for K(0, 0). Trainable weights are c ∈ R and κ : S¹ → R, where S¹ is the manifold underlying SO(2).

ρ_in \ ρ_out | ρ_out = ρ₁ | ρ_out = ρ_reg
ρ₁           | (a, b) → (ca, cb) | (a, b) → ca cos(θ) + cb sin(θ)
ρ_reg        | f → c(∫_{S¹} f(θ) cos(θ) dθ, ∫_{S¹} f(θ) sin(θ) dθ) | f → ∫_{S¹} κ(θ − φ) f(φ) dφ

We can analytically solve Equation 5 for K(0, 0) using representation theory. Equation 5 can also be solved using the methods of Weiler & Cesa (2019) and Thomas et al. (2018); both use harmonic functions, which require expensive evaluation of analytic functions at each point. Instead, we provide a simpler solution: we require only knowledge of the orbits, stabilizers, and input/output representations. Additionally, we bypass the Clebsch-Gordan decomposition used in Thomas et al. (2018) by mapping directly between the representations in our network. Next, we describe an efficient implementation of equivariant continuous convolution.
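The orbit weight sharing of Equation 4 can be sketched for the ρ₁ → ρ₁ case. The radii and the 2×2 free weights below are illustrative, not the paper's configuration:

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Free weights on the fundamental domain D = {(0, r)}: one unconstrained
# matrix per radius (hypothetical radii for illustration).
rng = np.random.RandomState(0)
K0 = {r: rng.randn(2, 2) for r in (1.0, 2.0, 3.0)}

def K(theta, r):
    """Weights shared along the rest of the orbit via Equation 4:
    K(theta, r) = rho_out(Rot_theta) K(0, r) rho_in(Rot_theta^{-1})."""
    return rot(theta) @ K0[r] @ rot(-theta)

# The resulting kernel satisfies the equivariance constraint (Equation 3):
theta, phi, r = 0.7, 1.1, 2.0
assert np.allclose(K(theta + phi, r), rot(phi) @ K(theta, r) @ rot(-phi))
```

Only the matrices on the ray D are trained; every other angle reuses them, which is where the parameter savings come from.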

4.3. POLAR COORDINATE KERNELS

Rotational equivariance informs our kernel discretization and implementation. We store the kernel K of the continuous convolution as a 4-dimensional tensor by discretizing its domain. Specifically, we discretize R² using polar coordinates with k_θ angular slices and k_r radial steps. We then evaluate K at any (θ, r) using bilinear interpolation from the four closest polar grid points. This method accelerates computation since we do not need to use Equation 4 to repeatedly compute K(θ, r) from K(0, r). The special case of K(0, 0) results in a polar grid with a "bullseye" at the center (see Figure 3-Left). We discretize angles finely and radii more coarsely. This choice is inspired by the real-world observation that drivers tend to be more sensitive to the angle of an incoming car than to its exact distance. Our equivariant kernels are computationally efficient and have very few parameters. Moreover, as we discuss in Section 4.5, despite discretization, the use of polar coordinates allows for very low equivariance error.
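The polar-grid lookup with bilinear interpolation can be sketched as follows. The (k_θ, k_r, c_out, c_in) weight layout and the clipping of radii to the outermost ring are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def eval_polar_kernel(weights, theta, r, r_max):
    """Bilinearly interpolate a discretized polar kernel at (theta, r).

    weights : (k_theta, k_r, c_out, c_in) learned grid (illustrative layout)
    Angles wrap around the circle; radii are clipped to the outermost ring.
    """
    k_theta, k_r = weights.shape[:2]
    # Fractional grid coordinates.
    t = (theta % (2 * np.pi)) / (2 * np.pi) * k_theta
    u = np.clip(r / r_max * (k_r - 1), 0, k_r - 1)
    t0, u0 = int(t) % k_theta, int(u)
    t1, u1 = (t0 + 1) % k_theta, min(u0 + 1, k_r - 1)
    dt, du = t - int(t), u - u0
    # Bilinear combination of the four nearest polar grid points.
    return ((1 - dt) * (1 - du) * weights[t0, u0]
            + dt * (1 - du) * weights[t1, u0]
            + (1 - dt) * du * weights[t0, u1]
            + dt * du * weights[t1, u1])
```

At an exact grid point the interpolation reduces to a plain lookup, so Equation 4 never needs to be re-evaluated at run time.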

4.4. HIDDEN LAYERS AS REGULAR REPRESENTATIONS

Regular representation ρ_reg has shown better performance than ρ₁ for finite groups (Cohen et al., 2019a; Weiler & Cesa, 2019). But the naive ρ_reg = {ϕ : G → R} for an infinite group G is too large to work with. We instead choose the space of square-integrable functions L²(G), which contains all irreducible representations of G and is compatible with pointwise non-linearities.

Discretization. However, L²(SO(2)) is still infinite-dimensional. We resolve this by discretizing the manifold S¹ underlying SO(2) into k_reg even intervals. We represent a function f ∈ L²(SO(2)) by the vector of values [f(Rot_{2πi/k_reg})]_{0≤i<k_reg} and evaluate f(Rot_θ) using interpolation. We separate the number of angular slices k_θ from the size of the regular representation k_reg. If we tied them together and set k_θ = k_reg, this would be equivalent to implementing cyclic group C_{k_reg} symmetry with the regular representation; increasing k_θ would then also increase k_reg, which incurs more parameters.

Convolution with Torus Kernel. In addition to constraining the kernel K of Equation 1 as in the ρ₁ case, ρ_reg poses an additional challenge since each feature is a function on a circle. We introduce a new operator from functions on the circle to functions on the circle, called a torus kernel. First, we replace input feature vectors f ∈ R^c with elements of L²(SO(2)). The input feature f becomes a ρ_reg-field; that is, for each x ∈ R², f(x) is a real-valued function S¹ → R. For the kernel K, we replace the matrix field with a map K : R² → ρ_reg ⊗ ρ_reg. Instead of a matrix, K(x) is a map S¹ × S¹ → R. Here (φ₁, φ₂) ∈ S¹ × S¹ plays the role of continuous matrix indices, and we may consider K(x)(φ₁, φ₂) ∈ R analogous to a matrix entry. Topologically, S¹ × S¹ is a torus, and hence we call K(x) a torus kernel.
The matrix multiplication K(x) · f(x) in Equation 1 must be replaced by the integral transform

[K(x) f(x)](φ₂) = ∫_{φ₁∈S¹} K(x)(φ₂, φ₁) f(x)(φ₁) dφ₁,

which is a linear transformation L²(SO(2)) → L²(SO(2)). Here K(θ, r)(φ₂, φ₁) denotes the (φ₂, φ₁) entry of the torus kernel at the point x = (θ, r); see the illustration in Figure 3-Right. Solving Equation 3 for ρ_reg → ρ_reg gives K(Rot_θ(x))(φ₂, φ₁) = K(x)(φ₂ − θ, φ₁ − θ), so we can use the same weight sharing scheme as in Section 4.2.
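On the discretized regular representation (a vector of k_reg samples), a grid rotation acts by a cyclic shift, and the torus-kernel constraint K(Rot_θ x)(φ₂, φ₁) = K(x)(φ₂ − θ, φ₁ − θ) becomes a diagonal shift of the stored matrix. A small numerical check of this equivariance (the 2π/k_reg factor is an illustrative quadrature normalization):

```python
import numpy as np

k_reg = 8
rng = np.random.RandomState(0)
K = rng.randn(k_reg, k_reg)   # torus kernel K(x) at one point x, indexed by (phi2, phi1)
f = rng.randn(k_reg)          # a rho_reg feature: samples of a function on S^1

def apply_torus(K, f):
    """Discretized integral transform [K f](phi2) = sum_phi1 K(phi2, phi1) f(phi1) dphi1."""
    return K @ f * (2 * np.pi / k_reg)

def rotate_kernel(K, j):
    """K(Rot_theta x)(phi2, phi1) = K(x)(phi2 - theta, phi1 - theta) on the grid."""
    return np.roll(np.roll(K, j, axis=0), j, axis=1)

j = 3
lhs = apply_torus(rotate_kernel(K, j), np.roll(f, j))  # rotate scene, then transform
rhs = np.roll(apply_torus(K, f), j)                    # transform, then rotate output
assert np.allclose(lhs, rhs)
```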

4.5. ANALYSIS: EQUIVARIANCE ERROR

Figure 4: Experimentally, we find that k_θ and the expected equivariance error are inversely proportional.

The practical value of equivariant neural networks has been demonstrated in a variety of domains. However, theoretical analysis (Kondor & Trivedi, 2018; Cohen et al., 2019a; Maron et al., 2020) of continuous Lie group symmetries is usually performed assuming continuous functions and using the integral representation of the convolution operator. In practice, discretization can cause the model f to be not exactly equivariant, with some equivariance error (EE)

EE = ‖f(T(x)) − T′(f(x))‖

with respect to group transformations T and T′ of the input and output respectively (Wang et al., 2020, A6). Rectangular grids are well suited to translations but poorly suited to rotations, and the resulting equivariance error can be so large as to practically undermine the advantages of a theoretically equivariant network. Our polar-coordinate-indexed circular filters are designed specifically to adapt well to the rotational symmetry. In Figure 4, we demonstrate experimentally that the expected EE is inversely proportional to the number of angular slices k_θ. For example, choosing k_θ ≥ 16 gives very low EE and does not increase the number of parameters. We also prove for ρ₁ features that the equivariance error is low in expectation; see Appendix A.6 for the precise statement and proof.

Proposition. Let α = 2π/k_θ, let θ̄ be θ rounded to the nearest value in Zα, and let Δθ = |θ − θ̄|. Let F = CtsConv_{K,R} and T = ρ₁(Rot_θ). For some constant C, the expected EE is bounded:

E_{K,f,x}[‖T(F(f, x)) − F(T(f), T(x))‖] ≤ |sin(Δθ)| C ≤ 2πC/k_θ.
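The inverse relationship between k_θ and EE can be reproduced in a few lines. This sketch snaps the kernel's angular coordinate to the nearest of k_θ slices (a simplification of the bilinear interpolation actually used in Section 4.3) and Monte-Carlo estimates the EE of a random ρ₁ kernel:

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def expected_ee(k_theta, trials=2000, seed=0):
    """Monte-Carlo estimate of the equivariance error of a rho_1 kernel whose
    angular coordinate is discretized into k_theta slices (nearest-slice
    lookup; an illustrative simplification of the paper's scheme)."""
    rng = np.random.RandomState(seed)
    K0 = rng.randn(2, 2)            # free weights on the fundamental domain
    alpha = 2 * np.pi / k_theta
    err = 0.0
    for _ in range(trials):
        theta = rng.uniform(0, 2 * np.pi)
        snapped = alpha * np.round(theta / alpha)
        K_exact = rot(theta) @ K0 @ rot(-theta)      # ideal continuous kernel
        K_disc = rot(snapped) @ K0 @ rot(-snapped)   # discretized kernel
        f = rng.randn(2)
        err += np.linalg.norm(K_exact @ f - K_disc @ f)
    return err / trials

# EE shrinks as the angular resolution grows, matching the 1/k_theta trend of Figure 4.
assert expected_ee(32) < expected_ee(16) < expected_ee(4)
```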

5. EXPERIMENTS

In this section, we present experiments in two different domains, traffic and pedestrian trajectory prediction, where interactions among agents are frequent and influential. We first introduce the statistics of the datasets and the evaluation metrics. We then compare different feature encoders and hidden feature representation types. Lastly, we compare our model with baselines.

5.1. EXPERIMENTAL SET UP

Dataset We discuss the performance of our models on (1) Argoverse autonomous vehicle motion forecasting (Chang et al., 2019), a recently released vehicle trajectory prediction benchmark, and (2) the TrajNet++ pedestrian trajectory forecasting challenge (Kothari et al., 2020). For Argoverse, the task is to predict three-second trajectories based on all vehicles' histories over the past 2 seconds. We split 32K samples from the validation set as our test set.

Baselines We compare against several state-of-the-art baselines used in Argoverse and TrajNet++. We use three original baselines from Chang et al. (2019): Constant Velocity, Nearest Neighbour, and Long Short-Term Memory (LSTM). We also compare with a non-equivariant continuous convolutional model, CtsConv (Ummenhofer et al., 2019), and a hierarchical GNN model, VectorNet (Gao et al., 2020). Note that VectorNet only predicts a single agent at a time, which is not directly comparable with our setting; we include it as a reference nevertheless.

Evaluation Metrics

We use domain-standard metrics to evaluate trajectory prediction performance: (1) Average Displacement Error (ADE), the average L2 displacement error between prediction and ground truth over all 30 timestamps, and (2) Displacement Error at t seconds (DE@ts), the L2 displacement error at a given timestep t. DE@ts at the last timestamp is also called the Final Displacement Error (FDE). For Argoverse, we report ADE and DE@ts for t ∈ {1, 2, 3}. For TrajNet++, we report ADE and FDE.
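These metrics can be stated precisely in a few lines; the (T, 2) trajectory layout is an assumption for illustration:

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean L2 error over all timesteps."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def de_at(pred, gt, step):
    """Displacement Error at a given timestep index (DE@ts)."""
    return np.linalg.norm(pred[step] - gt[step])

def fde(pred, gt):
    """Final Displacement Error: DE@ts at the last timestamp."""
    return de_at(pred, gt, -1)

# Trajectories are (T, 2) arrays of 2D positions; a toy 30-step example
# with a constant unit lateral offset gives ADE = FDE = 1.
gt = np.stack([np.linspace(0, 29, 30), np.zeros(30)], axis=1)
pred = gt + np.array([0.0, 1.0])
assert np.isclose(ade(pred, gt), 1.0)
assert np.isclose(fde(pred, gt), 1.0)
```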

5.2. PREDICTION PERFORMANCE COMPARISON

We evaluate the performance of different models from multiple aspects: forecasting accuracy, parameter efficiency, and the physical consistency of the predictions. The goal is to provide a comprehensive view of the various characteristics of our model to guide practical deployment. See Appendix A.9 for an additional ablative study.

Forecasting Accuracy We compare trajectory prediction accuracy across models on Argoverse and TrajNet++. Table 2 displays the ADE and FDE comparison. We see that ECCO with the regular representation ρ_reg achieves on par or better forecasting accuracy on both datasets. Comparing ECCO with CtsConv, the non-equivariant counterpart of our model, we observe a significant 14.8% improvement in forecasting accuracy. Compared with data augmentation, we also observe a 9% improvement over the non-equivariant CtsConv trained on a random-rotation-augmented dataset. These results demonstrate the benefits of incorporating equivariance principles into deep learning models.

Parameter Efficiency Another important consideration in deploying deep learning models to embedded systems such as autonomous vehicles is parameter efficiency. We report the number of parameters of each model in Table 2. Compared with LSTM, our forecasting performance is significantly better. CtsConv and VectorNet have competitive forecasting performance but use many more parameters than ECCO. By encoding equivariance into CtsConv, we drastically reduce the number of parameters needed in our model. For VectorNet, Gao et al. (2020) only provided the number of parameters for their encoder; a fair decoder size can be estimated based on an MLP taking 59 polygraphs of 64 dimensions each as input and predicting 30 timestamps, that is, 113K parameters.

Runtime and Memory Efficiency

We compare runtime and memory usage with VectorNet (Gao et al., 2020). Since VectorNet is not open-sourced, we compare with our own implementation of it. First, we compare floating point operations (FLOPs). VectorNet reported n × 0.041 GFLOPs for the encoder part of their model alone, where n is the number of predicted vehicles. We tested ECCO on a scene with 30 vehicles and approximately 180 lane marker nodes, which is similar to the test conditions used to compute FLOPs in Gao et al. (2020). Our full model used 1.03 GFLOPs versus 1.23 GFLOPs for VectorNet's encoder. On the same test machine, ECCO runs in 684 ms versus 1103 ms for VectorNet. Another disadvantage of VectorNet is the need to reprocess the scene for each agent, whereas ECCO predicts all agents simultaneously. For memory usage in the same test, ECCO uses 296 MB and VectorNet uses 171 MB.

Sample Efficiency A major benefit of incorporating the inductive bias of equivariance is improved sample efficiency. For each sample an equivariant model is trained on, it learns as if it were trained on all transformations of that sample by the symmetry group (Wang et al., 2020, Prop 3). Thus ECCO requires far fewer samples to learn from. In Figure 5, we plot validation FDE against the number of training samples and show that the equivariant models converge faster.

Physical Consistency

We also visualize the predictions from ECCO and the non-equivariant CtsConv, as shown in Figure 6. The top row visualizes the predictions on the original data. In the bottom row, we rotate the whole scene by 160° and make predictions on the rotated data; this mimics covariate shift in the real world. Note that CtsConv predicts inconsistently: a right turn in the top row but a left turn after the scene has been rotated. We see similar results for TrajNet++ (see Figure 8 in Appendix A.10).

6. CONCLUSION

We propose Equivariant Continuous Convolution (ECCO), a novel model for trajectory prediction that imposes symmetries as inductive biases. On two real-world vehicle and pedestrian trajectory datasets, ECCO attains competitive accuracy with significantly fewer parameters. It is also more sample efficient, generalizing automatically from few data points in any orientation. Lastly, equivariance gives ECCO improved generalization performance. Our method provides a fresh perspective towards increasing trust in deep learning models through guaranteed properties. Future directions include applying equivariance to probabilistic predictions with many possible trajectories, and developing a faster version of ECCO which does not require autoregressive computation. Moreover, our methods may be generalized from 2-dimensional space to R^n: the orbit-stabilizer weight sharing scheme and discretized regular representation generalize by replacing SO(2) with SO(n), and polar coordinate kernels generalize using spherical coordinates.

Published as a conference paper at ICLR 2021

By the Peter-Weyl theorem, L²(SO(2)) ≅ ⊕_{i=0}^∞ ρ_i. In the case of SO(2), this decomposition is also called the Fourier decomposition, or the decomposition into circular harmonics. Most importantly, there is exactly one copy of ρ₁ inside L²(SO(2)). Hence, up to scalar, there is a unique linear map i₁ : ρ₁ → L²(SO(2)), given by (a, b) → a cos(θ) + b sin(θ). The reverse mapping pr₁ : L²(SO(2)) → ρ₁ is the projection onto the ρ₁ summand and is given by the Fourier transform

pr₁(f) = (∫_{S¹} f(θ) cos(θ) dθ, ∫_{S¹} f(θ) sin(θ) dθ).

Per-particle linear mapping ρ_reg → ρ_reg. Though ρ_reg is not finite-dimensional, the fact that it decomposes into a direct sum of irreducible representations means that we may take ρ_in = ρ_out = ρ_reg above.
Practically, however, it is easier to realize the linear equivariant map ρ_reg^i → ρ_reg^j as a convolution over S¹,

O(θ) = ∫_{φ∈S¹} κ(θ − φ) I(φ) dφ,

where κ(θ) is an i × j matrix of trainable weights, independent for each θ.
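Discretized on k_reg samples, this per-particle map is a circular convolution, and its equivariance (a cyclic shift of the input produces the same cyclic shift of the output) can be checked directly. This sketch shows the scalar i = j = 1 case; the 2π/k quadrature factor is an illustrative normalization:

```python
import numpy as np

def per_particle_reg_map(kappa, f):
    """Equivariant per-particle map rho_reg -> rho_reg as a circular convolution,
    O(theta) = integral over S^1 of kappa(theta - phi) I(phi) dphi,
    discretized on k_reg even samples of the circle."""
    k = len(f)
    # O[i] = sum_j kappa[(i - j) mod k] * f[j] * (2*pi/k)
    idx = (np.arange(k)[:, None] - np.arange(k)[None, :]) % k
    return (kappa[idx] @ f) * (2 * np.pi / k)

rng = np.random.RandomState(0)
kappa = rng.randn(8)   # trainable weights, one per grid angle
f = rng.randn(8)       # a rho_reg feature vector of samples

# Equivariance: rotating the input rotates the output the same way.
assert np.allclose(per_particle_reg_map(kappa, np.roll(f, 3)),
                   np.roll(per_particle_reg_map(kappa, f), 3))
```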

A.4 ENCODING INDIVIDUAL PARTICLE PAST BEHAVIOR

We can encode these individual attributes using a per-vehicle LSTM (Hochreiter & Schmidhuber, 1997). Let X_t^(i) denote the position of car i at time t. Denote a fully connected LSTM cell by h_t, c_t = LSTM(X_t^(i), h_{t−1}, c_{t−1}), and define h_0 = c_0 = 0. We then use the concatenation of the hidden states [h_{t_in}^(1), …, h_{t_in}^(n)] of all particles as Z ∈ R^N ⊗ R^k, the encoded per-vehicle latent features.

A.5 ENCODING PAST INTERACTIONS

In addition, we encode the past interactions of particles by introducing a continuous convolution LSTM. Similar to convLSTM (Xingjian et al., 2015), we replace the fully connected layers of the original LSTM above with another operation. While convLSTM is well suited for capturing spatially local interactions over time, it requires gridded information. Since the particle systems we consider are distributed in continuous space, we replace the standard convolution with rotation-equivariant continuous convolutions. We can now define H_t, C_t = CtsConvLSTM(X_t, H_{t−1}, C_{t−1}), an LSTM cell using equivariant continuous convolutions throughout. Note that in this case X_t, H_{t−1}, C_{t−1} are all particle feature fields, that is, functions {1, …, n} → R^k. Define CtsConvLSTM by

i_t = σ(W^{ix} ∗cts X_t + W^{ih} ∗cts h_{t−1} + W^{ic} ∘ c_{t−1} + b_i)
f_t = σ(W^{fx} ∗cts X_t + W^{fh} ∗cts h_{t−1} + W^{fc} ∘ c_{t−1} + b_f)
c_t = f_t ∘ c_{t−1} + i_t ∘ tanh(W^{cx} ∗cts X_t + W^{ch} ∗cts h_{t−1} + b_c)
o_t = σ(W^{ox} ∗cts X_t + W^{oh} ∗cts h_{t−1} + W^{oc} ∘ c_t + b_o)
h_t = o_t ∘ tanh(c_t),

where ∗cts denotes CtsConv and ∘ denotes the elementwise product. We then use H_{t_in} as the input feature for the prediction network.

A.6 EQUIVARIANCE ERROR

We prove the proposition in Section 4.5.

Proposition. Let α = 2π/k_θ, let θ̄ be θ rounded to the nearest value in Zα, and set Δθ = |θ − θ̄|. Assume n particles are sampled uniformly in a ball of radius R with features f ∈ ρ₁^c. Let f and K have entries sampled uniformly in [−a, a]. Let the bullseye have radius 0 < R_e < R. Let F = CtsConv_{K,R} and T_θ = ρ₁(Rot_θ). Then the expected EE is bounded:

E_{K,f,x}[‖T(F(f, x)) − F(T(f), T(x))‖] ≤ |sin(Δθ)| C ≤ 2πC/k_θ,

where C = 4cna²(1 − R_e²/R²).

The TrajNet++ Real dataset contains 200K samples. The tracking in this dataset is captured in both indoor and outdoor locations, for example, a university, a hotel, Zara, and train stations. Every sample contains 21 timestamps, and the goal is to predict the 2D spatial positions of each pedestrian over the future 12 timestamps.

A.8 IMPLEMENTATION DETAILS

The Argoverse dataset is not fully observed, so we only use cars with complete observations as input. Since not every sample includes the same number of cars, we only choose scenes with at most 60 cars and insert dummy cars to keep the car count consistent. The TrajNet++ Real dataset is also not fully observed; here we keep the pedestrian number consistent at 160. Moreover, for each car, we use the average velocity in the past 0.1 second as an approximation to the current instantaneous velocity, i.e. v_t = (p_t − p_{t−1})/2. As for map information, we only include center lanes with lane directions as features. We also introduce dummy lane nodes into each scene to make the number of lanes consistently equal to 650. In the TrajNet++ task, no map information is included. Since pedestrians do not have speedometers to tell them exactly how fast they are moving, as drivers do, and instead depend more on their relative velocities and relative positions to other pedestrians, we tried different combinations of features in the ablative study besides only using history velocities.
Our models are all trained with the Adam optimizer with base learning rate 0.001, and the gamma of the learning-rate scheduler is set to 0.95. All models without map information are trained for 15K iterations with batch size 16, with the learning rate updated every 300 iterations; models with map information are trained for 30K iterations with batch size 16, with the learning rate updated every 600 iterations. For CtsConv, we set the layer sizes to 32, 64, 64, 64 and the kernel size to 4 × 4 × 4; for ρ_1-ECCO, the layer sizes are 16, 32, 32, 32, k_θ is 16, and k_r is 3; for ρ_reg-ECCO, we choose layer sizes 8, 16, 8, 8, k_θ = 16, k_r = 3, and a regular feature dimension of 8. For the Argoverse task we set the CtsConv radius to 40, and for the TrajNet++ task to 6.
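Under one plausible reading of the schedule above (multiply the learning rate by gamma at every update step), the learning rate at any iteration is:

```python
def lr_at(iteration, base_lr=0.001, gamma=0.95, step_every=300):
    """Learning rate under a step-decay schedule: multiply by
    `gamma` every `step_every` iterations. Assumed interpretation
    of the scheduler described above."""
    return base_lr * gamma ** (iteration // step_every)
```

For the map-augmented models, `step_every` would be 600 over 30K iterations instead.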

A.9 ABLATIVE STUDY

We perform an ablation study for ECCO to further examine different encoders, the use of HD maps, and other model design choices.

Choice of encoders. Unlike fluid simulations (Ummenhofer et al., 2019), where the dynamics are Markovian, human behavior exhibits long-term dependencies. We experiment with three different encoders, referred to as Enc, to model such dependencies: (1) concatenating the velocities from the past m frames as the input feature, (2) passing the past velocities of each particle through a shared LSTM to encode the individual behavior of each particle, and (3) a continuous convolution LSTM to encode past particle interactions. Our continuous convolution LSTM is similar to convLSTM (Xingjian et al., 2015) but uses continuous convolutions instead of discrete gridded convolutions. We use the different encoders to time-aggregate features and compare their performance (Table 3).

Use of HD Maps

In Table 4, we compare performance with and without map input features.

Choice of features for pedestrians. Unlike vehicles, people have no speedometer telling them how fast they actually walk; they tend to adjust their velocity based on the relative velocities and relative positions of others. We experiment with different combinations of features (Table 5) and find that using relative velocities and relative positions as features performs best.

A.10 QUALITATIVE RESULTS FOR TRAJNET++

Figure 8 shows qualitative results for TrajNet++. Note that the non-equivariant baseline (2nd column) depends strongly on the global orientation, whereas the ground truth and the equivariant models do not.
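The relative-feature choice compared in Table 5 can be sketched as follows. This is one plausible instance (names and the mean-aggregation are illustrative assumptions; the paper aggregates neighbour information through CtsConv rather than a plain mean):

```python
def relative_features(positions, velocities):
    """For each pedestrian, the mean relative position and mean
    relative velocity of all others, as a 4-tuple
    (dx, dy, dvx, dvy). Illustrative sketch of using relative
    rather than absolute motion features."""
    n = len(positions)
    feats = []
    for i in range(n):
        rel_p = [(positions[j][0] - positions[i][0],
                  positions[j][1] - positions[i][1])
                 for j in range(n) if j != i]
        rel_v = [(velocities[j][0] - velocities[i][0],
                  velocities[j][1] - velocities[i][1])
                 for j in range(n) if j != i]
        mp = (sum(p[0] for p in rel_p) / (n - 1),
              sum(p[1] for p in rel_p) / (n - 1))
        mv = (sum(v[0] for v in rel_v) / (n - 1),
              sum(v[1] for v in rel_v) / (n - 1))
        feats.append(mp + mv)
    return feats
```

Note that such relative features are translation invariant by construction, which matches the intuition that pedestrians respond to neighbours rather than to absolute speed.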



Figure 1: Car trajectories in two scenes. Though the entire scenes are not related by a rotation, the circled areas are. ECCO exploits this symmetry to improve generalization and sample efficiency.

Figure 2: Overview of the model architecture. Past velocities are aggregated by an encoder Enc. Together with map information, this is encoded by 3 CtsConv layers into ρ_reg features. Then l + 1 CtsConv layers are used to predict ∆x. The predicted position is x_{t+1} = ∆x + x̃, where x̃ is numerically extrapolated using velocity and acceleration. Since ∆x is translation invariant, x_{t+1} is equivariant.

Figure 3: Left: A torus kernel field K from a ρ_reg-field to a ρ_reg-field. The kernel is itself a field: at each point x in space the kernel K(x) yields a different matrix. We denote the (φ_2, φ_1) entry of the matrix at x = (θ, r) by K(θ, r)(φ_2, φ_1). The matrices along the red sector are freely trainable. The matrices at all white sectors are determined by those in the red sector according to the circular shifting rule illustrated above. The matrix at the red bullseye is trainable but constrained to be circulant, i.e. preserved by the circular shifting rule. Right: The torus kernel acts on features which are functions on the circle. By cutting open the torus and features along the red and orange lines we can identify the operation at each point with matrix multiplication.

Figure 6: The x,y-axes are the position (m). The dashed line represents the 2s past trajectory. The solid line represents the 3s prediction. Red represents the agent. Top row: The predictions are made on the original data. Bottom row: We rotate the whole scene by 160° and make predictions on the rotated data. From left to right are visualizations of ground truth, CtsConv, ρ_1-ECCO, ρ_reg-ECCO.

Figure 8: The x,y-axes are the position (m). The dashed line represents the 2s past trajectory. The solid line represents the 3s prediction. Red represents the agent. Top row: The predictions are made on the original data. Bottom row: We rotate the whole scene by 160° and make predictions on the rotated data. From left to right are visualizations of ground truth, CtsConv, ρ_1-ECCO, ρ_reg-ECCO.



Parameter efficiency and accuracy comparison. Number of parameters for each model and their detailed forecasting accuracy (DE@t). CtsConv (Aug.) is CtsConv trained with rotation-augmented data.

Ablation study on encoders for Argoverse and TrajNet++. Markovian: use the velocity from the most recent time step as the input feature. LSTM: use an LSTM to encode velocities over 20 timestamps. CtsConvLSTM: replace the gate functions of the LSTM with CtsConv instead of dense layers. CtsConvDLSTM: replace the gate functions with CtsConv + Dense. D-Concat (20t feats): stack velocities of 20 time steps as the input.

Ablation study on HD maps for Argoverse. Prediction accuracy comparison with and without HD maps.

Ablation study on features for TrajNet++. Acceleration indicates whether acceleration was used for the numerically extrapolated position.

ACKNOWLEDGEMENT

This work was supported in part by Google Faculty Research Award, NSF Grant #2037745, and the U. S. Army Research Office under Grant W911NF-20-1-0334. Walters is supported by a Postdoctoral Fellowship from the Institute for Experiential AI at the Roux Institute.

AVAILABILITY

https://github.com/Rose

A APPENDIX

A.1 CONTINUOUS CONVOLUTION INVOLVING ρ_reg

This section is a more detailed version of Section 4.4. Define the input f to be a ρ_reg-field, that is, a distribution over R^2 valued in ρ_reg. Define K : R^2 → ρ_reg ⊗ ρ_reg. After identifying SO(2) with its underlying manifold S^1, we can identify K(x) with a map S^1 × S^1 → R and f(x) with a map S^1 → R. We then define an integral transform with kernel κ; this operation parameterizes linear maps ρ_reg → ρ_reg and is thus analogous to matrix multiplication. If we restrict our choice of κ to κ(φ_2, φ_1) = κ̄(φ_2 − φ_1) for some function κ̄ : S^1 → R, then this becomes the circular convolution operation. The SO(2)-action on ρ_reg by Rot_θ(f)(φ) = f(φ − θ) induces an action on κ, which in turn gives an action on the torus-field K. Thus Equation 3, the convolutional kernel constraint, implies that K is equivariant if and only if it satisfies the circular shifting rule illustrated in Figure 3. We use this to define a weight-sharing scheme as described in Section 3.2. The cases of continuous convolution ρ_1 → ρ_reg and ρ_reg → ρ_1 may be derived similarly.
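The restriction to κ(φ_2, φ_1) = κ̄(φ_2 − φ_1) can be checked concretely on a discretised circle. The sketch below (illustrative, with the circle discretised into k bins) applies a torus-kernel slice as a matrix, builds the circulant restriction, and makes it easy to verify that the circulant case commutes with cyclic shifts, i.e. it is circular convolution:

```python
def apply_torus_kernel(K, f):
    """Linear action of one torus-kernel slice K (a k x k matrix
    over the discretised circle) on a rho_reg feature f (length k):
    out[phi2] = sum_phi1 K[phi2][phi1] * f[phi1]."""
    k = len(f)
    return [sum(K[p2][p1] * f[p1] for p1 in range(k)) for p2 in range(k)]

def circulant(kbar):
    """Restrict kappa(phi2, phi1) = kbar(phi2 - phi1): the matrix
    becomes circulant and its action is circular convolution."""
    k = len(kbar)
    return [[kbar[(p2 - p1) % k] for p1 in range(k)] for p2 in range(k)]
```

A circulant K commutes with the cyclic-shift action of the discretised SO(2) on f, which is exactly the equivariance property the circular shifting rule enforces at the bullseye.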

A.2 COMPLEXITY OF CONVOLUTION WITH TORUS KERNEL

The complexity of convolution with the torus kernel scales with n, the number of particles; k_reg, the number of pieces into which the regular representation is discretized; and c_in and c_out, the numbers of copies of the regular representation in the input and output respectively. We do not count the complexity of the interpolation operation for looking up K(θ, r).

A.3 EQUIVARIANT PER-PARTICLE LINEAR LAYERS

Since this operation is pointwise, unlike a positive-radius continuous convolution, it cannot map between different irreducible representations of SO(2). Consider as input a ρ_in-field I and as output a ρ_out-field O, where ρ_in and ρ_out are finite-dimensional representations of SO(2). We define O(i) = W I(i) using the same W, an equivariant linear map, for each particle 1 ≤ i ≤ N. Denote the decompositions of ρ_in and ρ_out into irreducible representations of SO(2) by ρ_in ≅ ρ_1^{i_1} ⊕ ... ⊕ ρ_n^{i_n} and ρ_out ≅ ρ_1^{j_1} ⊕ ... ⊕ ρ_n^{j_n} respectively. By Schur's lemma, the equivariant linear map W : ρ_in → ρ_out is given by a block-diagonal matrix with blocks {W_k}_{k=1}^n, where W_k is an i_k × j_k matrix. That is, maps between different irreducible representations are zero, and each map ρ_k → ρ_k is given by a single scalar.

Per-particle linear mappings ρ_1 → ρ_reg and ρ_reg → ρ_1. Since the input and output features are ρ_1-fields, but the hidden features may be represented by ρ_reg, we need mappings between ρ_1 and ρ_reg. Since we pair continuous convolutions with dense per-particle mappings throughout, we must describe per-particle mappings between ρ_1 and ρ_reg.

Proof (of the Proposition in A.6). We may compute for a single particle x = (ψ, r) and multiply the result by n by linearity. We separate two cases: x in the bullseye, with probability R_e^2/R^2, and x in an angular slice, with probability 1 − R_e^2/R^2. If x is in the bullseye, there is no equivariance error since K(x) is a scalar matrix. Assume x is in an angular sector. For nearest interpolation, the equivariance error is then given by Equation 7, where β = ±∆θ. We consider only a single factor of ρ_1 in f; the result is then multiplied by c. We can factor out an a from K(x) and an a from f and assume k_ij, f_i are sampled from Uniform([−1, 1]). One may then directly compute the value of Equation 7.
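The Schur-constrained per-particle linear map of A.3 can be illustrated for copies of ρ_1, where each block entry is a single scalar multiplying a whole 2-vector copy. The sketch below (illustrative names; other irreps work the same way) verifies that such a map commutes with rotation:

```python
import math

def rot(theta, v):
    """Action of Rot_theta on a rho_1 feature (a 2-vector)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def equiv_linear(W, feats):
    """Equivariant linear map between copies of rho_1: the input is
    i copies (2-vectors), the output j copies; the block W is a
    j x i matrix of scalars, each scalar scaling a whole copy.
    Per Schur's lemma, no cross-irrep mixing is allowed."""
    out = []
    for row in W:
        x = sum(w * v[0] for w, v in zip(row, feats))
        y = sum(w * v[1] for w, v in zip(row, feats))
        out.append((x, y))
    return out
```

Because each scalar acts on a full 2-vector copy, rotating the inputs and then applying the map gives the same result as applying the map and then rotating, which is the defining equivariance property.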

