EFFICIENT LONG-RANGE CONVOLUTIONS FOR POINT CLOUDS

Abstract

The efficient treatment of long-range interactions for point clouds is a challenging problem in many scientific machine learning applications. To extract global information, one usually needs a large window size, a large number of layers, and/or a large number of channels. This can often significantly increase the computational cost. In this work, we present a novel neural network layer that directly incorporates long-range information for a point cloud. This layer, dubbed the long-range convolutional (LRC)-layer, leverages the convolution theorem coupled with the non-uniform Fourier transform. In a nutshell, the LRC-layer mollifies the point cloud to an adequately sized regular grid, computes its Fourier transform, multiplies the result by a set of trainable Fourier multipliers, computes the inverse Fourier transform, and finally interpolates the result back to the point cloud. The resulting global all-to-all convolution operation can be performed in nearly-linear time with respect to the number of input points. The LRC-layer is a particularly powerful tool when combined with local convolution, as together they offer efficient and seamless treatment of both short- and long-range interactions. We showcase this framework by introducing a neural network architecture that combines LRC-layers with short-range convolutional layers to accurately learn the energy and force associated with an N-body potential. We also exploit the induced two-level decomposition and propose an efficient strategy to train the combined architecture with a reduced number of samples.

1. INTRODUCTION

Point-cloud representations provide detailed information about objects and environments. The development of novel acquisition techniques, such as laser scanning, digital photogrammetry, light detection and ranging (LIDAR), 3D scanners, and structure-from-motion (SFM), among others, has increased interest in using point-cloud representations in various applications such as digital preservation, surveying, autonomous driving (Chen et al., 2017), 3D gaming, robotics (Oh & Watanabe, 2002), and virtual reality (Park et al., 2008). In turn, this new interest has fueled the development of machine learning frameworks that use point clouds as input. Historically, early methods used a preprocessing stage that extracted meticulously hand-crafted features from the point cloud, which were subsequently fed to a neural network (Chen et al., 2003; Rusu et al., 2008; Rusu et al., 2009; Aubry et al., 2011), or they relied on voxelization of the geometry (Savva et al., 2016; Wu et al., 2015; Riegler et al., 2017; Maturana & Scherer, 2015). The PointNet architecture (Qi et al., 2017) was the first to handle raw point cloud data directly and learn features on the fly. This work has spawned several related approaches aiming to attenuate drawbacks of the original methodology, such as PointNet++ (Qi et al., 2017), or to increase the accuracy and range of application (Wang et al., 2019; Zhai et al., 2020; Li et al., 2018; Liu et al., 2019). Even though such methods have been quite successful for machine learning problems, they rely on an assumption of locality, which may produce large errors when the task at hand exhibits long-range interactions (LRIs). To capture such interactions using standard convolutional layers, one can use wider window sizes, deeper networks, and/or a larger number of features, which may increase the computational cost significantly.
Several approaches have been proposed to efficiently capture such interactions in tasks such as semantic segmentation, whose main ideas we briefly summarize below. In the multi-scale type of approaches, features are progressively processed and merged. Within this family there exist several variants, where the underlying neural networks can be recursive neural networks (Ye et al., 2018), convolutional layers (Xu et al., 2019; Xu et al., 2018), or autoencoders (Yang et al., 2018; Deng et al., 2018). Some works have proposed skip connections, following a U-net (Ronneberger et al., 2015) type architecture (Zhou & Tuzel, 2018; Qi et al., 2017), while others have focused on using a tree structure for the clustering of the points (Klokov & Lempitsky, 2017; Zeng & Gevers, 2018; Gadelha et al., 2018), or on using reference permutohedral lattices to compute convolutions (Jampani et al., 2016) whose results are interpolated back to the point cloud (Su et al., 2018). Although these methods have been shown to be successful in a range of applications, when the task at hand presents symmetries, such as rotation, translation, and permutation invariance, there is no systematic framework to embed those symmetries into the algorithmic pipelines. Another line of work relies on interpreting the point cloud as a graph and using spectral convolutions (Bruna et al.; Defferrard et al., 2016), whose cost can scale super-linearly when dealing with LRIs. In applications of machine learning to scientific computing, several classical multilevel matrix factorizations have been rewritten in the context of machine learning (Kondor et al., 2014); these have been adapted to handle long-range interactions in the context of end-to-end maps using voxelized geometries in (Fan et al., 2019b; a; Khoo & Ying, 2019; Fan & Ying, 2019), resulting in architectures similar to U-nets (Ronneberger et al., 2015), which have been extended to point clouds in (Li et al., 2020).
Due to the underlying voxelization of the geometry, it may be difficult for these networks to generalize when the resolution of the voxelization changes. The efficient treatment of LRIs for point clouds is also a prominent problem in many physical applications such as molecular modeling and molecular dynamics simulation. While long-range electrostatic interactions are omnipresent, it has been found that effectively short-ranged models can already describe the N-body potential and the associated force field (Behler & Parrinello, 2007; Zhang et al., 2018a; b) for a wide range of physical systems. There have also been a number of recent works aiming at more general systems beyond this regime of effective short-range interactions, such as the work of Ceriotti and co-workers (Grisafi & Ceriotti, 2019; Grisafi et al.; Nigam et al., 2020; Rossi et al., 2020), as well as the works of (Yao et al., 2018; Ko et al., 2009; Hirn et al., 2017; Rupp et al., 2012; Huo & Rupp; Deng et al., 2019; Bereau et al., 2018; Zhang et al., 2019). The general strategy is to build parameterized long-range interactions into the kernel methods or neural network models, so that the resulting model can characterize both short-range and long-range electrostatic interactions. In the neural network context, the computational cost of treating the LRIs with these methods can grow super-linearly with the system size. The idea of this work is aligned with the approaches in the molecular modeling community: we construct a neural network layer to directly describe the LRI. In particular, we present a new long-range convolutional (LRC)-layer, which performs a global convolution operation in nearly-linear time with respect to the number of units in the layer.
By leveraging the non-uniform fast Fourier transform (NUFFT) (Dutt & Rokhlin, 1993; Greengard & Lee, 2004; Barnett et al., 2019), the LRC-layer implements a convolution with a point-wise multiplication in the frequency domain with trainable weights known as Fourier multipliers. The NUFFT is based on the regular fast Fourier transform (FFT) (Cooley & Tukey, 1965) combined with a fast gridding algorithm, allowing fast convolution on unstructured data. This new LRC-layer provides a new set of descriptors that can seamlessly satisfy relevant symmetries. For instance, when the kernel of the LRI is rotationally invariant, such symmetry can be directly built into the parameterization of the Fourier kernel. Such descriptors can be used in tandem with the descriptors provided by short-range convolutional layers to improve the performance of the neural network. Efficient training of a neural network with the LRC-layer for capturing the information of LRIs is another challenging problem. Short-range models can often be trained with data generated with a relatively small computational box (called the small-scale data), and they can be seamlessly deployed in large-scale systems without significantly increasing the generalization error. On the other hand, long-range models need to be trained directly with data generated in a large computational box (called the large-scale data), and the generation process of such large-scale data can be very expensive. For instance, in molecular modeling, the training data is often generated with highly accurate quantum mechanical methods, whose cost can scale steeply as $O(N^\alpha)$, where $N$ is the system size and $\alpha \ge 3$. It is therefore desirable to minimize the number of samples with a large system size. In many applications, the error of the effective short-range model is already modestly small. This motivates us to propose the following two-scale training strategy.
We first generate many small-scale data (cheaply and possibly in parallel), and train the network without the LRC-layer. Then we use a small number of large-scale data, and perform training with both the short- and long-range convolutional layers. In order to demonstrate the effectiveness of the LRC-layer and the two-scale training procedure, we apply our method to evaluate the energy and force associated with a model N-body potential that exhibits tunable short- and long-range interactions in one, two, and three dimensions. The input point cloud consists of the atomic positions, and the output data include the N-body potential, the local potential, and the force (derivative of the N-body potential with respect to the atomic positions). In particular, the local potential and the force can be viewed as point clouds associated with the atomic positions. The evaluation of the N-body potential is a foundational component in molecular modeling, and LRIs play an important role in the description of ionic systems, macroscopically polarized interfaces, electrode surfaces, and many other problems in the nanosciences (French et al., 2010). Our results verify that the computational cost of the long-range layer can be reduced from $O(N^2)$ using a direct implementation, to $O(N)$ (up to logarithmic factors) using the NUFFT. Furthermore, we demonstrate that the force, i.e., the derivatives of the potential with respect to all inputs, can be evaluated with $O(N)$ cost (up to logarithmic factors). In terms of sample efficiency, we find that for the model problem under study here, the two-scale training strategy can effectively reduce the number of large-scale samples by over an order of magnitude to reach the target accuracy. This can be particularly valuable in the context of molecular modeling, where accurate data are often obtained from first-principles electronic structure calculations.
Such calculations are often very expensive for large-scale systems, and the number of large-scale samples is thus limited.

2. LONG-RANGE CONVOLUTIONAL LAYER

Convolutional layers are perhaps the most important building block in machine learning, owing to their great success in image processing and computer vision. A convolutional layer convolves the input, usually an array, with a rectangular mask containing the trainable parameters. When the mask can be kept small (for example, while extracting localized features), the convolutional layer is highly efficient and effective. A different way of computing a convolution is to use the convolution theorem as follows: (1) compute the Fourier transform of the input, (2) multiply it with the Fourier transform of the mask, i.e., the Fourier multiplier, and (3) apply the inverse Fourier transform. In this case, the trainable parameters are the degrees of freedom of the Fourier multipliers, and the Fourier transforms are computed using the fast Fourier transform (FFT). This alternative approach is particularly attractive for smooth kernels with large support (i.e., smooth long-range interactions), because the computational cost does not increase with the size of the mask. To the best of our knowledge, this direction has not been explored for LRIs, and below we detail how to apply it to point clouds. Given a point cloud $\{x_i\}_{i=1}^N \subset \mathbb{R}^d$ and scalar weights $\{f_i\}_{i=1}^N$, we consider the problem of computing the quantity $u_i := \sum_{j=1}^N \phi_\theta(x_i - x_j) f_j$ at each $x_i$. Here the function $\phi_\theta(\cdot)$ is the kernel with a generic trainable parameter $\theta$. At first glance, the cost of this operation scales as $O(N^2)$: we need to evaluate $u_i$ for each point $x_i$, which requires $O(N)$ work per evaluation. By introducing the generalized function $f(y) = \sum_i f_i\, \delta(y - x_i)$ and defining the function $u(x) = \int \phi_\theta(x - y) f(y)\, dy$, one notices that $u_i$ is the value of $u(x)$ at $x = x_i$. The advantage of this viewpoint is that one can now invoke the connection between convolution and the Fourier transform, $\hat{u}(k) = \hat{\phi}_\theta(k) \cdot \hat{f}(k)$, where $\hat{\phi}_\theta(k)$ is a trainable Fourier multiplier.
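As a point of reference, the quadratic-cost direct evaluation of $u_i$ can be sketched as follows; the Gaussian kernel and the random point data are illustrative choices for this sketch, not the kernels used later in the paper:

```python
import numpy as np

def direct_convolution(x, f, phi):
    """Direct all-to-all evaluation u_i = sum_j phi(x_i - x_j) f_j.

    x: (N, d) point cloud, f: (N,) weights, phi: kernel acting on
    difference vectors.  Cost is O(N^2), which the LRC-layer avoids.
    """
    diffs = x[:, None, :] - x[None, :, :]   # (N, N, d) pairwise differences
    return phi(diffs) @ f                   # (N,) weighted sum over j

# Illustrative usage with a smooth Gaussian kernel:
rng = np.random.default_rng(0)
x = rng.random((50, 2))
f = np.ones(50)
phi = lambda r: np.exp(-np.sum(r ** 2, axis=-1))
u = direct_convolution(x, f, phi)
```

This reference implementation is what the LRC-layer replaces with a nearly-linear-time computation.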
This approach is suitable for point clouds since the trainable parameters are decoupled from the geometry of the point cloud. To make this approach practical, one needs to address two issues: (1) the non-uniform distribution of the point cloud and (2) how to represent the multiplier $\hat{\phi}_\theta(k)$.

Non-uniform distribution of the point cloud. Equation 1 suggests that one can compute the convolution directly using the convolution theorem, which typically relies on the FFT to obtain a low-complexity algorithm. Unfortunately, $\{x_i\}_{i=1}^N$ do not form a regular grid, so the FFT cannot be used directly. We overcome this difficulty by invoking the NUFFT (Dutt & Rokhlin, 1993), which serves as the cornerstone of our instance of the LRC-layer.

Algorithm 1 Long-range convolutional layer
Input: $\{x_i\}_{i=1}^N$, $\{f_i\}_{i=1}^N$
Output: $\{x_i\}_{i=1}^N$, $\{u_i\}_{i=1}^N$, where $u_i = \sum_{j=1}^N f_j \phi_\theta(x_i - x_j)$.
1: Define the generalized function: $f(x) = \sum_{j=1}^N f_j \delta(x - x_j)$
2: Mollify the Dirac deltas: $f_\tau(x) = \sum_{j=1}^N f_j g_\tau(x - x_j)$, where $g_\tau$ is defined in Appendix C.2
3: Sample on a regular grid: $f_\tau(x_\ell) = \sum_{j=1}^N f_j g_\tau(x_\ell - x_j)$ for $x_\ell$ in a grid with $L_{FFT}$ points in each dimension
4: Compute the FFT: $F_\tau(k) = \mathrm{FFT}(f_\tau)(k)$
5: Re-scale the signal: $F(k) = \sqrt{\pi/\tau}\, e^{|k|^2 \tau} F_\tau(k)$
6: Multiply by the Fourier multipliers: $\hat{v}(k) = \hat{\phi}_\theta(k) \cdot F(k)$
7: Re-scale the signal: $v_{-\tau}(k) = \sqrt{\pi/\tau}\, e^{|k|^2 \tau}\, \hat{v}(k)$
8: Compute the IFFT: $u_{-\tau}(x_\ell) = \mathrm{IFFT}(v_{-\tau})(x_\ell)$ for $x_\ell$ on the regular grid
9: Interpolate back to the point cloud: $u_i = u(x_i) = (u_{-\tau} * g_\tau)(x_i)$

The LRC-layer is summarized in Alg. 1, where $\tau$ is chosen following Dutt & Rokhlin (1993). The inputs of this layer are the point cloud $\{x_i\}_{i=1}^N$ and the corresponding weights $\{f_i\}_{i=1}^N$. The outputs are $u_i \equiv u(x_i)$ for $i = 1, \dots, N$. The number of elements of the underlying grid, $N_{FFT} = L_{FFT}^d$, is chosen such that the kernel is adequately sampled and the complexity remains low. As shown in Appendix C.5, a relatively small $L_{FFT}$ suffices.
Even though the precise number is problem-specific, given that the goal is to approximate LRIs that are supposedly smooth, the kernel can be captured with a relatively small number of Fourier modes. The LRC-layer is composed of three steps: (1) it computes the Fourier transform from the point cloud to a regular grid using the NUFFT algorithm (lines 2-5 in Alg. 1, showcased in Fig. 2); (2) it multiplies the result by a set of trainable Fourier multipliers (line 6 in Alg. 1); (3) it computes the inverse Fourier transform from the regular grid back to the point cloud (lines 7-9 in Alg. 1). Within the LRC-layer in Alg. 1, the only trainable component is the parameter $\theta$ of the Fourier multiplier $\hat{\phi}_\theta(k)$. The remaining components, including the mollifier $g_\tau(\cdot)$ and the Cartesian grid size, are taken to be fixed. One could, in principle, train them as well, but at a much higher cost. Among the steps of Alg. 1, the sampling operator, the rescaling operators, the interpolation operator, and the Fourier transforms are all linear and non-trainable; the derivative computations of backpropagation therefore go through them directly. Alg. 1 is presented in terms of a single channel or feature dimension, i.e., $f_j \in \mathbb{R}$ and $u_i \in \mathbb{R}$. However, it can easily be generalized to multiple channels, for example $f_j \in \mathbb{R}^{d_1}$ and $u_i \in \mathbb{R}^{d_2}$. In this case, the Fourier multiplier $\hat{\phi}_\theta(k)$ at each point $k$ is a $d_2 \times d_1$ matrix, and all Fourier transforms are applied component-wise.

Representation of the Fourier multiplier. A useful feature of the LRC-layer is that it is quite easy to impose symmetries on the Fourier multipliers. For example, if the convolution kernel $\phi_\theta(\cdot)$ is constrained to have parity symmetry, rotational symmetry, smoothness, or decay properties, these constraints can be imposed directly on the coefficients of the Fourier multipliers $\hat{\phi}_\theta(k)$.
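A minimal 1D sketch of Alg. 1 on a periodic domain, assuming a Gaussian mollifier $g_\tau(x) = e^{-x^2/(4\tau)}$ with illustrative choices of grid size and $\tau$ (no oversampling or mode truncation is performed, so accuracy is limited); the multiplier passed in is a hypothetical stand-in for the trainable one:

```python
import numpy as np

def lrc_layer_1d(x, f, multiplier, L=1.0, M=64, tau=None):
    """Sketch of Alg. 1 in 1D on the periodic domain [0, L).

    Mollify the weighted deltas onto a regular grid with a Gaussian
    g_tau, FFT, deconvolve the Gaussian, multiply by the Fourier
    multiplier, deconvolve again, IFFT, and interpolate back.
    """
    if tau is None:
        tau = 2.0 * (L / M) ** 2            # spreading width ~ grid spacing
    grid = np.arange(M) * (L / M)
    # Steps 2-3: mollified deltas sampled on the grid (periodic images
    # of the Gaussian are neglected -- a simplification).
    f_tau = np.zeros(M)
    for xj, fj in zip(x, f):
        f_tau += fj * np.exp(-(grid - xj) ** 2 / (4.0 * tau))
    # Steps 4-5: FFT and deconvolution of the mollifier.
    k = 2.0 * np.pi * np.fft.fftfreq(M, d=L / M)
    deconv = np.sqrt(np.pi / tau) * np.exp(k ** 2 * tau)
    F = deconv * np.fft.fft(f_tau)
    # Steps 6-7: Fourier multiplier, then the second deconvolution.
    v = deconv * multiplier(k) * F
    # Step 8: back to the regular grid.
    u_grid = np.real(np.fft.ifft(v))
    # Step 9: interpolate to the point cloud by convolving with g_tau.
    return np.array([np.sum(u_grid * np.exp(-(grid - xi) ** 2 / (4.0 * tau)))
                     * (L / M) for xi in x])
```

Every step is linear in the weights $f$, so backpropagation through the non-trainable pieces is immediate; only `multiplier` would carry parameters in a trained layer.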
When the size of the training data is limited, it is often necessary to reduce the number of trainable parameters in order to regularize the kernel. For example, we may parameterize the Fourier multiplier as a linear combination of several predetermined functions on the Fourier grid. This is the procedure used in molecular modeling (Grisafi & Ceriotti, 2019; Yao et al., 2018; Ko et al., 2009), and also in our numerical examples in equation 7. We also remark that the LRC-layer described here can be applied to point clouds in a way similar to a standard convolutional layer applied to images, and multiple LRC-layers can be composed on top of each other. (Footnote: ... the kernel is fixed. However, in order for the kernel to be trainable, this would require a different algorithm for each iteration, including the computation of the derivatives, thus increasing the computational cost and rendering the implementation significantly more cumbersome.)

(Fig. 2 illustrates the successive NUFFT stages: the weighted deltas $\{(f_i, x_i)\}_{i=1}^N$, the mollified field $f_\tau(x) = \sum_{j=1}^N f_j g_\tau(x - x_j)$, its samples $f_\tau(x_\ell)$ on the regular grid, the transform $F_\tau(k) = \mathrm{FFT}(f_\tau)(k)$, and the rescaled signal $F(k) = \sqrt{\pi/\tau}\, e^{|k|^2 \tau} F_\tau(k)$.)

3. LEARNING THE N -BODY POTENTIAL

To demonstrate the effectiveness of the LRC-layer, we consider the problem of learning the energy and force associated with a model N-body potential in the context of molecular modeling. As mentioned in Section 1, the potential evaluation often invokes expensive ab-initio calculations that one would like to bypass for efficiency reasons. The setup of this learning problem is as follows. First, we assume access to a black-box model potential, which consists of both short- and long-range interactions; the internal parameters of the potential, however, are inaccessible to the training architecture and algorithm. A set of training samples is generated by the model, where each sample consists of a configuration of the points $\{x_i\}$ along with the potential and force. Second, we set up a deep neural network that includes (among other components) the LRC-layer for addressing the long-range interaction. This network is trained with stochastic gradient-type algorithms using the collected dataset, and the trained network can be used for predicting the potential and forces of new point cloud configurations. These two components are described in detail in the following two subsections.

3.1. MODEL PROBLEM AND DATA GENERATION

Model. We suppose that $\Omega = [0, L]^d$, and we denote the point cloud by $x = \{x_i\}_{i=1}^N \subset \Omega \subset \mathbb{R}^d$, for $d = 1, 2$, or $3$. We define the total energy, the local potential, and the force acting on particle $j$ by
$U = \sum_{1 \le i < j \le N} \psi(x_i - x_j)$, $\quad U_j(x) = \sum_{i \ne j} \psi(x_i - x)$, $\quad F_j = -\partial_x U_j(x)|_{x = x_j}$,
respectively, where the interaction kernel $\psi(r)$ is a smooth function, apart from a possible singularity at the origin, that decays as $r \to \infty$.

Sampling. We define a snapshot as one configuration of particles, $x^{[\ell]} = \{x_j^{[\ell]}\}_{j=1}^N$, together with the global energy $U^{[\ell]}$ and the forces $F^{[\ell]}$, where $\ell$ is the index within the training/testing set. We sample the configuration of particles $x^{[\ell]}$ randomly, with the restriction that two particles cannot be closer than a predetermined value $\delta_{min}$, in order to avoid the singularity. Once an admissible configuration is obtained, we generate the energy and forces following Appendix B. This process is repeated until $N_{sample}$ snapshots are obtained.
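The sampling loop above can be sketched in one dimension as follows; the quadratic pair potential `psi(r) = r**2` used in the test is purely a stand-in with a known closed-form force (the paper's kernels are exponential or screened-Coulomb, defined in Appendix B):

```python
import numpy as np

def sample_snapshot(N, L, delta_min, psi, dpsi, rng):
    """Generate one snapshot: positions x, total energy U, forces F.

    psi is the pair kernel and dpsi its derivative.  Rejection sampling
    enforces the minimum separation delta_min that keeps particles away
    from the kernel singularity at the origin.
    """
    while True:
        x = rng.uniform(0.0, L, size=N)
        dists = np.abs(x[:, None] - x[None, :]) + L * np.eye(N)
        if dists.min() >= delta_min:            # admissible configuration
            break
    iu, ju = np.triu_indices(N, k=1)
    U = np.sum(psi(x[iu] - x[ju]))              # U = sum_{i<j} psi(x_i - x_j)
    # F_j = -d/dx_j U = -sum_{i != j} dpsi(x_j - x_i) for an even kernel psi
    F = np.array([-sum(dpsi(x[j] - x[i]) for i in range(N) if i != j)
                  for j in range(N)])
    return x, U, F
```

Repeating this until $N_{sample}$ admissible configurations are collected yields the training/testing set.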

3.2. ARCHITECTURE

Our network architecture consists of separate descriptors for the short-range and long-range interactions, respectively. To capture the short-range interaction, we compute a local convolution using, for each point, only its neighboring points within a ball of predetermined radius. For the long-range interactions, we compute an all-to-all convolution using the LRC-layer introduced in Section 2, whose output is distributed to each particle and then fed to a sequence of subsequent layers.

Short-range descriptor. For a given particle $x_i$ and an interaction radius $R$, we define $I_i$, the interaction list of $x_i$, as the set of indices $j$ such that $\|x_i - x_j\| < R$, i.e., the indices of the particles inside a ball of radius $R$ centered at $x_i$. For each particle $x_i$ we build the generalized coordinates $s_{i,j} = x_i - x_j$ and the short-range descriptor $D_i^{sr} = \sum_{j \in I_i} f_\theta(s_{i,j})$, where $f_\theta : \mathbb{R}^d \to \mathbb{R}^{m_{sr}}$ is a function represented by a neural network specified in Appendix C.1, and $m_{sr}$ is the number of short-range features. By construction, $f_\theta(s)$ is smooth and satisfies $f_\theta(s) = 0$ for $\|s\| > R$.

Long-range descriptor. We feed the LRC-layer with the raw point cloud represented by $\{x_i\}_{i=1}^N$ with weights $\{f_i\}_{i=1}^N$, which for simplicity can be assumed to equal one here, i.e., $f_i = 1$ for $i = 1, \dots, N$. The output of the layer is a two-dimensional tensor $u_k(x_i)$ with $i = 1, \dots, N$ and $k = 1, \dots, K_{chnls}$. Then, for each $x_i$, the corresponding slice $[u_1(x_i), u_2(x_i), \dots, u_{K_{chnls}}(x_i)]$ is fed to a function $g_\theta : \mathbb{R}^{K_{chnls}} \to \mathbb{R}^{m_{lr}}$, which is represented by a neural network with non-linear activation functions. Here $\theta$ is a generic set of trainable parameters and $m_{lr}$ is the number of long-range features.
The descriptor for particle $x_i$, which depends on all the other particles thanks to the LRC-layer, is defined by $D_i^{lr} = g_\theta(u_1(x_i), u_2(x_i), \dots, u_{K_{chnls}}(x_i))$.

Short-range network. When only the short-range interaction is present, the short-range descriptor for each particle is fed particle-wise to a fitting network $\mathcal{F}_{sr} : \mathbb{R}^{m_{sr}} \to \mathbb{R}$. In this case, $\mathcal{F}_{sr}(D_i^{sr})$ depends only on particle $x_i$ and its neighbors. Finally, the contributions from each particle are accumulated, so the short-range neural network (NN) energy and forces are given by
$U^{NN}_{sr} = \sum_{i=1}^N \mathcal{F}_{sr}(D_i^{sr})$ and $(F^{NN}_{sr})_j = -\partial_{x_j} U^{NN}_{sr}$, (5)
respectively (see Fig. 2, left). The derivatives are computed directly with TensorFlow (Abadi et al., 2015). As shown by Zhang et al. (2018b), this network is rotation, translation, and permutation invariant (Zaheer et al., 2017). We point out that this architecture can be understood as a non-linear local convolution: for each particle $i$ one applies the same function $f_\theta$ to each of its neighbors. The result is pooled into the descriptor $D_i^{sr}$, then processed locally by $\mathcal{F}_{sr}$ (akin to a non-linear convolution with a filter of width one), and finally pooled globally into $U^{NN}_{sr}$.

Full-range network. When both the short-range and long-range interactions are present, the long-range descriptor and the local descriptor are combined and fed particle-wise to a fitting network $\mathcal{F} : \mathbb{R}^{m_{sr} + m_{lr}} \to \mathbb{R}$ to produce the overall neural network (NN) energy and forces
$U^{NN} = \sum_{i=1}^N \mathcal{F}(D_i^{sr}, D_i^{lr})$, $\quad (F^{NN})_j = -\partial_{x_j} U^{NN}$, (6)
respectively (see Fig. 2, right). Following Section 2, the long-range descriptor is translation invariant by design and can easily be made rotation invariant. Furthermore, it is well known (Zaheer et al., 2017) that this construction is permutation invariant. Further details on the implementation of the network can be found in Appendix C.3. From the structures shown in Fig. 2, it is clear that we can recover the first architecture from the second by zeroing some entries of the fitting network and removing the LRC-layer.

4. NUMERICAL RESULTS

The loss function is the mean squared error of the forces,
$\frac{1}{N_{sample}} \sum_{\ell=1}^{N_{sample}} \sum_{i=1}^{N} \| F^{NN}_\theta(x_i^{[\ell]}) - F_i^{[\ell]} \|^2$,
where the index $i$ runs over the points of each snapshot and $\ell$ runs over the samples. We also generate 100 snapshots of data to test the performance of the network. This particular loss can leave the potential energy shifted by a global constant, which can subsequently be fixed by including the error of the energy in the loss (Zhang et al., 2018b). For the testing stage we use the relative $\ell^2$ error of the forces as metric, defined as
$\epsilon_{rel} := \left( \sum_{\ell, i} \| F_i^{[\ell]} - F^{NN}_\theta(x_i^{[\ell]}) \|^2 \,/\, \sum_{\ell, i} \| F_i^{[\ell]} \|^2 \right)^{1/2}$.
The standard training parameters are listed in Appendix C.4. The experiments shown in the sequel are designed to provide a fair comparison with state-of-the-art methods for localized interactions. They showcase that, by adding a single LRC-layer, one can outperform these methods significantly. The kernels $\psi$ used in the experiments typically exhibit two interaction lengths: $\psi(\cdot) \equiv \alpha_1 \psi^{\mu_1}(\cdot) + \alpha_2 \psi^{\mu_2}(\cdot)$, where each of $\psi^{\mu_1}$ and $\psi^{\mu_2}$ is either a simple exponential kernel or a screened-Coulomb kernel (also known as the Yukawa kernel). For each of $\psi^{\mu_1}$ and $\psi^{\mu_2}$, the superscript denotes the reciprocal of the interaction length, i.e., the length scale is $\sim \mu_1^{-1}$ or $\sim \mu_2^{-1}$. Without loss of generality, $\mu_1 > \mu_2$, so that $\mu_1$ corresponds to the short-range scale and $\mu_2$ to the long-range scale. We also assume that $0 \le \alpha_2 \le \alpha_1$ and $\alpha_1 + \alpha_2 = 1$, so that the effect of the long-range interaction can be smaller in magnitude than that of the short-range interaction. In the special case $\alpha_2 = 0$, the kernel exhibits only a single scale $\sim \mu_1^{-1}$. The precise definition of the kernel depends on the spatial dimension and boundary conditions, which are explained in Appendix B.
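The testing metric can be sketched directly, reading $\epsilon_{rel}$ as the square root of the ratio of summed squared force norms (one plausible reading of the definition):

```python
import numpy as np

def relative_force_error(F_true, F_pred):
    """epsilon_rel = sqrt( sum ||F - F_NN||^2 / sum ||F||^2 ),
    with the sums running over snapshots and particles.
    Arrays have shape (N_sample, N, d)."""
    num = np.sum((F_true - F_pred) ** 2)
    den = np.sum(F_true ** 2)
    return np.sqrt(num / den)
```

Being relative, the metric is insensitive to the overall magnitude of the forces, which makes results comparable across kernel parameters.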
For a fixed set of kernel parameters $(\mu_1, \mu_2, \alpha_1, \alpha_2)$, we consider two types of data, large- and small-scale, generated in the domains $\Omega_{lr}$ and $\Omega_{sr}$ respectively (details are given in each experiment). The Fourier multiplier within the LRC-layer is parameterized as
$$\hat\phi_{\beta,\lambda}(k) = \frac{4\pi\beta}{\|k\|^2 + \lambda^2},$$
where $\beta$ and $\lambda$ are trainable parameters. This is a simple parameterization; a more complex model can be used with minimal changes to the procedure. For all experiments shown below, two kernel channels are used, so there are only four trainable parameters in the LRC-layer. The numerical results aim to demonstrate two properties: i) the LRC-layer is able to efficiently capture LRIs, and ii) the two-scale training strategy can significantly reduce the amount of large-scale data. To demonstrate the first property, we gradually increase the interaction length of the kernel; the accuracy of the short-range network with a fixed interaction radius is expected to decrease rapidly, while using the LRC-layer improves the accuracy significantly. To show the second property, we generate data with two interaction lengths and train the full-range network using the one- and two-scale strategies. Finally, we also aim to demonstrate that the LRC-layer is competitive against a direct convolution in which the all-to-all computation is performed explicitly.

Table 1: Relative testing error for trained screened-Coulomb type 1D models with α1 = 1, α2 = 0, and varying µ1. Notice that µ2 can be arbitrary here given that α2 = 0.

µ1                  | 0.5     | 1.0     | 2.0     | 5.0     | 10.0
short-range network | 0.05119 | 0.02919 | 0.00597 | 0.00079 | 0.00032
full-range network  | 0.00828 | 0.00602 | 0.00336 | 0.00077 | 0.00054
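The multiplier parameterization above is simple enough to evaluate directly; the following sketch (function and argument names are ours) applies $\hat\phi_{\beta,\lambda}$ to an array of frequency vectors, in any dimension.

```python
import numpy as np

def fourier_multiplier(k, beta, lam):
    """Trainable Fourier multiplier of the LRC-layer:
    phi_hat(k) = 4*pi*beta / (|k|^2 + lam^2).
    k: array of shape (..., d) of frequency vectors."""
    k = np.atleast_2d(np.asarray(k, dtype=float))
    return 4.0 * np.pi * beta / (np.sum(k ** 2, axis=-1) + lam ** 2)
```

In a trained network, `beta` and `lam` would be optimized jointly with the descriptor and fitting networks; with two channels this gives the four trainable parameters mentioned above.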

1D

In the first set of experiments, the domain Ω = [0, 5], N = 20, and N_sample = 1000, where N_sample is the number of snapshots and N is the total number of points in each snapshot. For the kernel, we set α2 = 0 and vary µ1 to generate datasets at different interaction lengths. For each dataset we train both the short-range and full-range networks using the one-scale data. The results are summarized in Table 1, where we observe that as the characteristic interaction length increases, the accuracy of the short-range network decreases, while using the full-range network restores the accuracy. This experiment shows that local networks are often highly accurate when the interactions are localized, but the accuracy quickly deteriorates as the interaction length increases (i.e., as µ1 decreases). For the second set of experiments we used two sets of kernel parameters: one heavily biased towards a localized interaction length, and another in which both interaction lengths are equally weighted. For each set of kernel parameters, we generate 10,000 small-scale snapshots using Ω_sr = [0, 5] and N = 20, and a large number of large-scale snapshots using Ω_lr = [0, 50] and N = 200 particles. The interaction radius R = 1.5, δ_min = 0.05, and N_FFT = 501. We train the network with the one- and two-scale training strategies described in the prequel. Fig. 3 (left) depicts the advantage of the two-scale training strategy: we obtain roughly the same accuracy with a fraction of the number of large-scale training samples. We observe that when the number of large-scale training samples is sufficiently large, the resulting test accuracy is independent of the training strategy. We also observe that the training dynamics are stable with respect to different random seeds. Finally, we compare the LRC-layer with a direct all-to-all computation. We benchmark the wall time of both layers with an increasing number of particles. To account for implementation effects we normalize the wall times in Fig. 3 (right); the results corroborate the complexity claims made in Section 2.

2D

We perform the same experiments as in the one-dimensional case. We fix Ω = [0, 15]², N = 450, and N_sample = 10000. The results are summarized in Table 2, which shows that as µ1 decreases, the full-range network outperforms the short-range one.

Table 2: Relative testing error for trained screened-Coulomb type 2D models with α1 = 1, α2 = 0, and varying µ1. Again, µ2 can be arbitrary given that α2 = 0.

µ1                  | 1.0     | 2.0     | 5.0     | 10.0
short-range network | 0.07847 | 0.02332 | 0.00433 | 0.00242
full-range network  | 0.00785 | 0.00526 | 0.00363 | 0.00181

For the second set of experiments, R = 1.5, δ_min = 0.05, and N_FFT = 31². For the small-scale data, Ω_sr = [0, 3]², N = 18, and N_sample = 10,000. For the large-scale data, Ω_lr = [0, 15]² and N = 450. Similarly to the 1D case, we train the networks with both strategies using different amounts of large-scale data. The results, summarized in Fig. 4, show that the two-scale strategy efficiently captures the long-range interactions with only a small number of the long-range training samples. Tables 1, 2, and 3 demonstrate that the full-range network is capable of maintaining good accuracy for a wide range of characteristic interaction lengths.

5. CONCLUSION

We have presented an efficient long-range convolutional (LRC) layer, which leverages the non-uniform fast Fourier transform (NUFFT) to reduce the cost from quadratic to nearly-linear with respect to the number of degrees of freedom. We have also introduced a two-scale training strategy to effectively reduce the number of large-scale samples. This can be particularly important when the generation of these large-scale samples dominates the computational cost. While this paper demonstrates the effectiveness of the LRC-layer for computing the energy and force associated with a model N-body potential, we expect the LRC-layer to become a useful component in designing neural networks for modeling real chemical and materials systems, where the LRIs cannot be accurately captured by short-range models. We also expect the LRC-layer to be a useful tool for a wide range of machine learning tasks, such as regression and classification.

A NOTATION

A table of notations is summarized in Table 4 .

B DATA GENERATION

We provide further details about the data generation process and how the parameter µ dictates the characteristic interaction length.

Exponential kernel: Let Ω be the torus [0, L]^d and x = {x_i}_{i=1}^N ⊂ Ω ⊂ R^d for d = 1, 2, or 3. The exponential kernel is defined as
$$\psi^{\mu}(x - y) = e^{-\mu \|x - y\|},$$
where ‖·‖ is the Euclidean distance over the torus. Following Section 3.1 we define the total energy and the potential as
$$U = \sum_{i<j} e^{-\mu\|x_i - x_j\|} \quad \text{and} \quad U_j(x) = \sum_{i \neq j} e^{-\mu\|x_i - x\|},$$
respectively. The forces are given by
$$F_j = -\partial_{x_j} U_j(x_j) = -\sum_{i \neq j} \frac{x_i - x_j}{\|x_i - x_j\|}\, \mu\, e^{-\mu\|x_i - x_j\|}.$$

Screened-Coulomb kernel: In 3D, the screened-Coulomb potential with free-space boundary conditions is given by
$$\psi^{\mu}(x - y) = \frac{1}{4\pi\|x - y\|}\, e^{-\mu\|x - y\|}.$$
Over the torus [0, L]^d, the kernel ψ^µ(x − y) is the Green's function G^µ(x, y) defined via
$$\Delta G^{\mu}(x, y) - \mu^2 G^{\mu}(x, y) = -\delta_y(x),$$
with periodic boundary conditions. In order to compute the screened-Coulomb potential numerically, a spectral method is used; in particular,
$$\psi^{\mu}(x - y) = G^{\mu}(x, y) = \mathcal{F}^{-1}\!\left[\frac{e^{ik\cdot y}}{\|k\|^2 + \mu^2}\,\chi(k)\right],$$
where $\mathcal{F}^{-1}$ stands for the inverse Fourier transform and χ(k) is a smoothing factor, usually a Gaussian, used to avoid the Gibbs phenomenon numerically. As in the exponential case, the parameter µ controls the localization of the potential. In addition, the derivatives are taken numerically in the Fourier domain.

Visualization: To visualize the relation between µ and the characteristic interaction length in 1D, consider a given particle, e.g., x_100, and compute the force contribution from the other particles. Fig. 5 shows that the force contribution is extremely small outside a small interaction region for µ = 5.0, while the interaction region for µ = 0.5 is much larger.
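As a sanity check of the exponential-kernel formulas above, the 1D energy and forces can be evaluated by direct summation; the following is a minimal sketch with our own function name, using the minimum-image convention for the torus distance.

```python
import numpy as np

def exponential_energy_forces(x, mu, L):
    """Total energy U = sum_{i<j} exp(-mu*d_ij) and forces
    F_j = -sum_{i!=j} (x_i - x_j)/d_ij * mu * exp(-mu*d_ij)
    for 1D points on the torus [0, L]. Direct O(N^2) summation."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    U = 0.0
    F = np.zeros(N)
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            dx = x[i] - x[j]             # signed displacement
            dx -= L * np.round(dx / L)   # minimum image on the torus
            d = abs(dx)
            e = np.exp(-mu * d)
            if i < j:
                U += e                   # count each pair once
            F[j] -= (dx / d) * mu * e    # force on particle j
    return U, F
```

Note that the pairwise forces are equal and opposite, so the total force sums to zero, which is a useful consistency check on generated data.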

C DETAILS OF ARCHITECTURE AND TRAINING C.1 SHORT-RANGE DESCRIPTOR

Here we specify the structure of $D_i$ introduced in Section 3.2. For a given particle $x_i$ and an interaction radius R, define the interaction list $I_i$ of $x_i$ as the set of indices j such that $\|x_i - x_j\| < R$, where ‖·‖ stands for the distance over the torus [0, L]^d. To simplify the discussion, we assume that there exists a maximal number of neighbors N_maxNeigh for each $x_i$. We stack the neighbors in a tensor whose dimensions are constant across different particles; this value is chosen sufficiently large to cover the number of elements in the interaction list. If the cardinality of $I_i$ is less than N_maxNeigh, we pad the tensor with dummy values. In the 1D case the generalized coordinates are defined as
$$s_{i,j} = x_i - x_j, \qquad r_{i,j} = \frac{1}{|x_i - x_j|} \qquad \text{for } j \in I_i.$$
We introduce two fully-connected neural networks $f_{\theta_1}, f_{\theta_2}: \mathbb{R}^+ \to \mathbb{R}^{m_{sr}/2}$, each consisting of five layers with the number of units doubling at each layer, ranging from 2 to 32. The activation function after each layer is tanh and the initialization follows the Glorot normal distribution. For particle $x_i$ the short-range descriptor is defined as the concatenation of
$$D^i_{1,sr} = \sum_{j \in I_i} f_{\theta_1}(\hat s_{i,j})\, \hat r_{i,j} \quad \text{and} \quad D^i_{2,sr} = \sum_{j \in I_i} f_{\theta_2}(\hat r_{i,j})\, \hat r_{i,j},$$
where $\hat r_{i,j}$ and $\hat s_{i,j}$ are normalized copies of $r_{i,j}$ and $s_{i,j}$ with mean zero and standard deviation one. The mean and standard deviation are estimated using a small number of snapshots. We multiply the network's output $f_\theta$ by $\hat r_{i,j}$ (which is zero if j is a dummy particle); this procedure enforces a zero output for particles not in the interaction list. The construction satisfies the design requirements mentioned in Section 3.2. In the short-range network, one concatenates the two descriptors above and feeds them particle-wise to the short-range fitting network. The fitting network $\mathcal{F}^{sr}: \mathbb{R}^{m_{sr}} \to \mathbb{R}$ is a residual neural network (ResNet) with six layers, each with 32 units.
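The pooled-descriptor construction above can be sketched as follows. Here `f1` and `f2` stand in for the trained networks $f_{\theta_1}$ and $f_{\theta_2}$ (plain callables for illustration), and, for brevity, the normalization of s and r and the fixed-size padding with dummy neighbors are omitted.

```python
import numpy as np

def short_range_descriptor(x, i, R, L, f1, f2):
    """Sketch of the 1D short-range descriptor for particle i:
    D1 = sum_j f1(s_ij) * r_ij and D2 = sum_j f2(r_ij) * r_ij,
    summing over neighbors j with torus distance < R."""
    x = np.asarray(x, dtype=float)
    D1 = 0.0
    D2 = 0.0
    for j in range(len(x)):
        if j == i:
            continue
        s = x[i] - x[j]
        s -= L * np.round(s / L)     # torus displacement (minimum image)
        d = abs(s)
        if d >= R:
            continue                 # j is not in the interaction list I_i
        r = 1.0 / d                  # generalized coordinate r_ij
        D1 += f1(s) * r
        D2 += f2(r) * r
    return D1, D2
```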
The activation function and initialization strategy are the same as those for the short-range descriptors. Fig. 6 shows the detailed architecture of the short-range network, for which
$$U^{NN}_{sr} = \sum_{i=1}^{N} \mathcal{F}(D_i^{sr}) = \sum_{i=1}^{N} \mathcal{F}(D^i_{1,sr}, D^i_{2,sr}).$$
In 2D and 3D, there is a slight difference in the generalized coordinates: we compute
$$s_{i,j} = \frac{x_i - x_j}{\|x_i - x_j\|}, \qquad r_{i,j} = \frac{1}{\|x_i - x_j\|},$$
where $s_{i,j}$ is now a vector. The local descriptors are defined as
$$D^i_{1,sr} = \sum_{j \in I_i} f_{\theta_1}(s_{i,j})\, r_{i,j} \quad \text{and} \quad D^i_{2,sr} = \sum_{j \in I_i} f_{\theta_2}(r_{i,j})\, r_{i,j}.$$

C.2 NUFFT

In this section we provide further details of the NUFFT implementation. Suppose that the input of the NUFFT is given by $\{x_i\}_{i=1}^N \subset \mathbb{R}^d$, where each point has an associated weight $f_i$; we point out that in some of the experiments $f_j$ simply equals 1. One then defines a periodic Gaussian convolution kernel
$$g_\tau(x) = \sum_{\ell \in \mathbb{Z}^d} e^{-\|x - \ell L\|^2/4\tau},$$
where L is the length of the interval and τ determines the size of the mollification. In practice a good choice is $\tau = 12\left(\frac{L}{2\pi L_{FFT}}\right)^2$ (Dutt & Rokhlin, 1993), where $L_{FFT}$ is the number of points in each dimension and $N_{FFT} = L_{FFT}^d$ is the total number of Fourier modes. We define
$$f_\tau(x) = f * g_\tau(x) = \int_{[0,L]^d} f(y)\, g_\tau(x - y)\, dy = \sum_{j=1}^{N} f_j\, g_\tau(x - x_j).$$
With the Fourier transform defined as $F_\tau(k) = \frac{1}{L^d}\int_{[0,L]^d} f_\tau(x)\, e^{-i2\pi k \cdot x / L}\, dx$ for $k \in \mathbb{Z}^d$, we compute its discrete counterpart
$$F_\tau(k) \approx \frac{1}{N_{FFT}} \sum_{m \in [0, L_{FFT}-1]^d} f_\tau(Lm/L_{FFT})\, e^{-i2\pi k \cdot m / L_{FFT}} \qquad (23)$$
$$\approx \frac{1}{N_{FFT}} \sum_{m \in [0, L_{FFT}-1]^d} \sum_{j=1}^{N} f_j\, g_\tau(Lm/L_{FFT} - x_j)\, e^{-i2\pi k \cdot m / L_{FFT}}. \qquad (24)$$
This operation can be done in $O(N_{FFT} \log(N_{FFT}))$ steps, independently of the number of inputs. Once this is computed, one can recover the Fourier transform of f at each frequency point by
$$F(k) = \left(\frac{\pi}{\tau}\right)^{d/2} e^{\|k\|^2 \tau}\, F_\tau(k).$$
Once the Fourier transform of the Dirac delta train is ready, we multiply it by the Fourier multiplier $\hat\phi(k)$, which is the Fourier transform of φ:
$$\hat v(k) = \hat\phi(k)\, F(k).$$

For the two-dimensional case, a screened-Coulomb type potential is tested with µ1 = 10.0, µ2 = 1.0, α1 = 0.9, α2 = 0.1. Here Ω = [0, 5]², N = 50, and N_sample = 1000. Starting with N_FFT = 21², we steadily increase its value and repeat the same training procedure. The results are summarized in Table 6, where one observes the same trend as in the one-dimensional case. In addition, we recall that the Fourier multipliers are parameterized following
$$\hat\phi_{\beta,\lambda}(k) = \frac{4\pi\beta}{\|k\|^2 + \lambda^2}, \qquad (38)$$
where β and λ are two trainable parameters, with λ providing a measure of the decay in space. Therefore, N_FFT only determines the number of Fourier modes and not the parameters of the ansatz. As long as the Fourier kernel is properly sampled, the method is able to compute the correct characteristic interaction length. One can observe this phenomenon in the experiment above, in which we extract the terminal values after training of the parameters λ1 and λ2 that correspond to the two channels of the LRC-layer, as summarized in Table 7. We observe that the value of λ2 is very close to that of µ2, which is responsible for the LRIs, even for small values of N_FFT.
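As a concrete illustration of the NUFFT pipeline described in Appendix C.2 (mollify onto a regular grid, FFT, deconvolve, multiply by the Fourier multiplier, and return to the points), the following sketch implements the 1D case. Names are ours, and for clarity the Gaussian gridding is done densely in O(N·M); a real implementation would spread each point over only a few neighboring grid nodes.

```python
import numpy as np

def lrc_convolve_1d(x, f, phi_hat_fn, L, M):
    """Sketch of the 1D LRC-layer core: computes u_i ~= sum_j phi(x_i - x_j) f_j
    for a periodic kernel phi given by its Fourier coefficients phi_hat_fn(k)."""
    x = np.asarray(x, dtype=float)
    f = np.asarray(f, dtype=float)
    tau = 12.0 * (L / (2.0 * np.pi * M)) ** 2         # mollification width
    grid = L * np.arange(M) / M
    diff = grid[None, :] - x[:, None]
    diff -= L * np.round(diff / L)                    # nearest periodic image
    G = np.exp(-diff ** 2 / (4.0 * tau))              # Gaussian gridding weights

    # Type-1 step: F(k) ~= sum_j f_j exp(-2i*pi*k*x_j/L) via FFT + deconvolution
    f_tau = f @ G                                     # mollified grid samples
    k = np.fft.fftfreq(M, d=1.0 / M)                  # integer frequencies
    deconv = (L / M) * np.exp((2 * np.pi * k / L) ** 2 * tau) / np.sqrt(4 * np.pi * tau)
    F = np.fft.fft(f_tau) * deconv

    # Multiply by the (trainable) Fourier multiplier
    v = phi_hat_fn(k) * F

    # Type-2 step: u(x_i) = sum_k v(k) exp(+2i*pi*k*x_i/L), again via
    # deconvolution, inverse FFT, and Gaussian interpolation back to the points
    w_hat = v * np.exp((2 * np.pi * k / L) ** 2 * tau) / np.sqrt(4 * np.pi * tau)
    w = np.fft.ifft(w_hat) * M
    u = (L / M) * (G @ w)
    return u.real
```

The result can be checked against a direct O(N²) summation with the same band-limited kernel; accuracy is limited by the Gaussian aliasing error, which the choice of τ keeps small for the resolved frequencies.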



Footnotes:
1. See Appendix C.2 for further details.
2. One could in practice use a fast summation algorithm, such as the fast multipole method (FMM) introduced by Greengard & Rokhlin (1987), to evaluate u_i; this would result in the same complexity.
3. For the sake of clarity, we suppose that the number of particles in each configuration is the same.
4. We provide more detailed schematics in Fig. 6 and Fig. 7 in Appendix C.1.
5. See Appendix C.2 and C.3 for further details.




Figure 1: Diagram of the NUFFT. Starting from the point cloud $\{x_i\}_{i=1}^N$, we form the mollified function $f_\tau$, sample it on a regular grid, and compute the Fourier transform $F_\tau(k)$ of the sampled function. Finally, to obtain $F(k)$, we rescale the signal to undo the spatial convolution.

Figure 2: (left) The short-range network architecture. (right) The full-range network architecture.

Figure 3: (left) Testing error of the trained 1D model with respect to the number of snapshots, using the one- and two-scale training strategies, for data generated with the screened-Coulomb potential and parameters µ1 = 5.0, µ2 = 0.5. (right) Normalized wall time for the LRC-layer and the direct all-to-all computation.

Figure 4: Testing error of the trained 2D model with respect to the number of snapshots, using the one- and two-scale training strategies, for both screened-Coulomb and exponential potentials with µ1 = 10, µ2 = 1: (left) α1 = 0.9 and α2 = 0.1; (right) α1 = 0.5 and α2 = 0.5.

Figure 5: The force contribution to particle x 100 from other particles. Results are shown for two different characteristic interaction lengths.

Figure 6: The structure of the short-range network for the 1D case.

Figure 7: The structure of the full-range network.

Table 3: Relative testing error for trained exponential type 3D models with α1 = 1, α2 = 0, and varying µ1. Again, µ2 can be arbitrary given that α2 = 0.

Yao, J. E. Herr, D. W. Toth, R. Mckintyre, and J. Parkhill. The tensorMol-0.1 model chemistry: a neural network augmented with long-range physics. Chem. Sci., 9:2261-2269, 2018.
X. Ye, J. Li, H. Huang, L. Du, and X. Zhang. 3D recurrent neural networks with context fusion for point cloud semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
L. Zhang, J. Han, H. Wang, R. Car, and W. E. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett., 120:143001, Apr 2018a.
L. Zhang, J. Han, H. Wang, W. Saidi, R. Car, and W. E. End-to-end symmetry preserving interatomic potential energy model for finite and extended systems. In Advances in Neural Information Processing Systems 31, pp. 4441-4451, 2018b.
L. Zhang, M. Chen, X. Wu, H. Wang, W. E, and R. Car. Deep neural network for the dielectric response of insulators. arXiv:1906.11434, 2019.
Y. Zhou and O. Tuzel. VoxelNet: end-to-end learning for point cloud based 3D object detection. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4490-4499, 2018.

Table 4: Symbols introduced in the current paper with their corresponding meaning.

Table 6: Relative testing error with respect to N_FFT in the 2D case.

Relative testing error | 0.01872 | 0.00202 | 0.00168 | 0.00153 | 0.00177

Table 7: Values of the parameters λ1 and λ2 after training, with respect to N_FFT.


In the next stage, one needs to compute the inverse transform and evaluate it at the target points $\{x_i\}$. First we deconvolve the signal and compute the inverse Fourier transform; next, we interpolate to the point cloud. Even though in the current implementation all the parameters of the NUFFT are fixed, they can in principle be trained along with the rest of the networks. This training, if done naively, increases the computational cost significantly. How to perform this operation efficiently is a direction of future research.

Derivatives. For the computation of the forces in equation 5 one needs to compute the derivatives of the total energy $U^{NN}$ with respect to the inputs in nearly-linear time. The main obstacle is how to compute the derivatives of the LRC-layer with respect to the point cloud efficiently. To simplify the notation, we only discuss the case d = 1, but the argument extends seamlessly to d > 1. Recall that $u_i = \sum_{j=1}^{N} \phi_\theta(x_i - x_j) f_j$; the Jacobian of the vector u with respect to the inputs is then given by
$$(\nabla u)_{i,j} = -\phi_\theta'(x_i - x_j)\, f_j \ \ \text{for } j \neq i, \qquad (\nabla u)_{i,i} = \sum_{k \neq i} \phi_\theta'(x_i - x_k)\, f_k.$$
As will be explained in the sequel, for the computation of the forces in equation 5 one needs to apply the Jacobian of u to a vector. For a fixed vector $v \in \mathbb{R}^N$, the product $(\nabla u)\cdot v$ can be written component-wise as
$$\big((\nabla u)\cdot v\big)_i = -\sum_{j=1}^{N} \phi_\theta'(x_i - x_j)\, f_j\, v_j + v_i \sum_{j=1}^{N} \phi_\theta'(x_i - x_j)\, f_j,$$
where we have added $\pm v_i f_i \phi_\theta'(0)$ and then distributed it within both sums. Let us define the following two long-range convolutions,
$$a_i = \sum_{j=1}^{N} \phi_\theta'(x_i - x_j)\, f_j\, v_j \quad \text{and} \quad b_i = \sum_{j=1}^{N} \phi_\theta'(x_i - x_j)\, f_j,$$
each of which can be performed in $O(N + N_{FFT} \log N_{FFT})$ steps using the NUFFT algorithm combined with the convolution theorem. In this case the derivative of φ can be computed numerically in the Fourier domain to very high accuracy. One can now leverage the expressions above to rewrite
$$\big((\nabla u)\cdot v\big)_i = -a_i + v_i\, b_i, \qquad (33)$$
which can then be computed in nearly-linear time. The same is also true for $v \cdot (\nabla u)$.
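The two-convolution rewriting of the Jacobian-vector product can be checked with a short sketch. Here the convolutions a and b are evaluated by direct O(N²) summation for clarity (in the actual method the NUFFT would replace the dense matrix products), and the function names are ours.

```python
import numpy as np

def jvp_via_convolutions(x, f, v, phi_prime):
    """Jacobian-vector product (grad u) @ v for u_i = sum_j phi(x_i - x_j) f_j
    in 1D, rewritten as two long-range convolutions a and b; each could be
    evaluated with the NUFFT in O(N + N_FFT log N_FFT) steps."""
    x = np.asarray(x, dtype=float)
    f = np.asarray(f, dtype=float)
    v = np.asarray(v, dtype=float)
    D = x[:, None] - x[None, :]        # D[i, j] = x_i - x_j
    P = phi_prime(D)                   # phi'(x_i - x_j) for all pairs
    a = P @ (f * v)                    # a_i = sum_j phi'(x_i - x_j) f_j v_j
    b = P @ f                          # b_i = sum_j phi'(x_i - x_j) f_j
    return -a + v * b                  # the +/- v_i f_i phi'(0) terms cancel
```

A finite-difference check against the directly evaluated u confirms the identity for a smooth test kernel.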

C.3 LONG-RANGE DESCRIPTOR

As mentioned before, the output of the LRC-layer is given by $\{u(x_i)\}_{i=1}^N$. For each particle we feed the output $u(x_i)$ to the long-range descriptor network $h_\theta: \mathbb{R} \to \mathbb{R}^{m_{lr}}$, whose structure is the same as that of the local descriptor $f_\theta$ in Appendix C.1, except that the activation function is taken to be ReLU. The long-range descriptor for particle $x_i$, defined as $D_i^{lr} = h_\theta(u(x_i))$, is concatenated with the corresponding short-range descriptor (which is itself the concatenation of two short-range descriptors), and they are fed together to the total fitting network $\mathcal{F}: \mathbb{R}^{m_{sr}+m_{lr}} \to \mathbb{R}$. The results are then added together to obtain the total energy
$$U^{NN} = \sum_{i=1}^{N} \mathcal{F}(D_i^{sr}, D_i^{lr}).$$
It is clear that the energy can be evaluated with nearly-linear complexity. In what follows we show that the force computation is also of nearly-linear complexity. For simplicity we focus on the one-dimensional network and assume that the depth of the neural networks is O(1). As defined in the prequel, the forces are given by $F^{NN} = -\nabla_x U^{NN}$, which can be written component-wise by the chain rule, or in a more compact fashion as
$$F^{NN} = -(D_{sr})^{T} v_{sr} - (\nabla u)^{T} v_{lr}. \qquad (37)$$
Here $v_{sr}$ and $v_{lr}$ are vectors defined component-wise as
$$(v_{sr})_i = \partial_{D_i^{sr}} \mathcal{F}(D_i^{sr}, D_i^{lr}) \quad \text{and} \quad (v_{lr})_i = \partial_{u_i} \mathcal{F}(D_i^{sr}, D_i^{lr}),$$
where the derivative with respect to $u_i$ passes through the long-range descriptor $h_\theta$. In addition, $(D_{sr})_{i,j} = \partial_{x_j} D_i^{sr}$ and $\nabla u$ is defined above. The first term on the right-hand side is easy to compute, given that $D_{sr}$ is sparse: the (i, j) entry is non-zero only if particle $x_i$ is in the interaction list of $x_j$. Given that the cardinality of the interaction list is bounded, $D_{sr}$ has O(N) non-zero entries, each requiring O(1) work; thus the first term on the right-hand side of equation 37 can be computed in O(N). At first glance the complexity of the second term seems much higher. However, as discussed above, by using equation 33 we can apply the matrix $\nabla u$ (or its transpose) to a vector in $O(N + N_{FFT} \log N_{FFT})$ time, and the computation of the vector $v_{lr}$ requires O(1) work per entry, resulting in a complexity of $O(N + N_{FFT} \log N_{FFT})$ for the second term in equation 37.
Finally, adding both contributions results in an overall $O(N + N_{FFT} \log N_{FFT})$ complexity for the forces. To summarize, both the energy and the forces can be computed in nearly-linear time with respect to the number of points.

C.4 TRAINING

We use the Adam optimizer (Kingma & Ba, 2015) along with an exponential scheduler: the initial learning rate is 0.001 and, every 10 epochs, it decreases by a factor of 0.95. In order to balance computational time and accuracy, a multi-stage training is adopted, where at each stage we modify the batch size and the number of epochs. In particular, four stages are used: we start with a batch size of 8 snapshots and train the network for 200 epochs, and at each subsequent stage we double both the batch size and the number of epochs. In the two-scale training strategy, the same training parameters defined above are used for each stage.
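The schedule described above can be summarized in a few lines; this is a sketch with our own names, mirroring the stated hyperparameters (initial rate 0.001, decay factor 0.95 every 10 epochs, four stages starting at batch size 8 and 200 epochs).

```python
def training_schedule(lr0=0.001, decay=0.95, decay_every=10,
                      batch0=8, epochs0=200, n_stages=4):
    """Returns (lr_at, stages): a stepwise-exponential learning-rate
    function of the epoch, and a list of (batch_size, n_epochs) pairs
    with both quantities doubling at each stage."""
    def lr_at(epoch):
        return lr0 * decay ** (epoch // decay_every)
    stages = [(batch0 * 2 ** s, epochs0 * 2 ** s) for s in range(n_stages)]
    return lr_at, stages
```

In TensorFlow this stepwise decay corresponds to `tf.keras.optimizers.schedules.ExponentialDecay` with `staircase=True`, passed to the Adam optimizer.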

C.5 DEPENDENCY ON N FFT

We measure the impact of N_FFT on the approximation error using examples in the one- and two-dimensional settings. For the one-dimensional case, we test a screened-Coulomb type potential with parameters µ1 = 5.0, µ2 = 0.5, α1 = 0.5, α2 = 0.5, and N_sample = 1000. The domain Ω is [0, 50] and N = 200. We run the one-scale training procedure with varying N_FFT (the number of Fourier multipliers), starting from N_FFT = 63 and doubling it until N_FFT = 501. Table 5 shows that the errors are relatively insensitive to the value of N_FFT. The accuracy achieved by the architecture without the LRC-layer (denoted as None in Table 5) is included to demonstrate that the architecture is indeed capturing the LRIs.

