Graph Convolution with Low-rank Learnable Local Filters

Abstract

Geometric variations like rotation, scaling, and viewpoint changes pose a significant challenge to visual understanding. One common solution is to directly model certain intrinsic structures, e.g., using landmarks. However, it then becomes non-trivial to build effective deep models, especially when the underlying non-Euclidean grid is irregular and coarse. Recent deep models using graph convolutions provide an appropriate framework to handle such non-Euclidean data, but many of them, particularly those based on global graph Laplacians, lack expressiveness to capture local features required for representation of signals lying on the non-Euclidean grid. The current paper introduces a new type of graph convolution with learnable low-rank local filters, which is provably more expressive than previous spectral graph convolution methods. The model also provides a unified framework for both spectral and spatial graph convolutions. To improve model robustness, regularization by local graph Laplacians is introduced. The representation stability against input graph data perturbation is theoretically proved, making use of the graph filter locality and the local graph regularization. Experiments on spherical mesh data, real-world facial expression recognition/skeleton-based action recognition data, and data with simulated graph noise show the empirical advantage of the proposed model.

1. Introduction

Deep methods have achieved great success in visual cognition, yet they still lack the capability to tackle severe geometric transformations such as rotation, scaling and viewpoint changes. This problem is often handled by conducting data augmentation with these geometric variations included, e.g. by randomly rotating images, so as to make the trained model robust to these variations. However, this remarkably increases the cost in training time and model parameters. Another way is to make use of certain underlying structures of objects, e.g. facial landmarks (Chen et al., 2013) and human skeleton landmarks (Vemulapalli et al., 2014a), c.f. Fig. 1 (right). Nevertheless, these methods then adopt hand-crafted features based on landmarks, which greatly constrains their ability to obtain rich features for downstream tasks. One of the main obstacles for feature extraction is the non-Euclidean property of the underlying structures; in particular, it prohibits the direct usage of prevalent convolutional neural network (CNN) architectures (He et al., 2016; Huang et al., 2017). While there are recent CNN models designed for non-Euclidean grids, e.g., for spherical mesh (Jiang et al., 2019; Cohen et al., 2018; Coors et al., 2018) and manifold mesh in computer graphics (Bronstein et al., 2017; Fey et al., 2018), they mainly rely on partial differential operators which can only be calculated precisely on a fine and regular mesh, and may not be applicable to landmarks, which are irregular and coarse. Recent works have also applied Graph Neural Network (GNN) approaches to coarse non-Euclidean data, yet methods using GCN (Kipf & Welling, 2016) may fall short of model capacity, and other methods adopting GAT (Veličković et al., 2017) are mostly heuristic and lack theoretical analysis. A detailed review is provided in Sec. 1.1. In this paper, we propose a graph convolution model, called L3Net, originating from low-rank graph filter decomposition, c.f. Fig. 1 (left).
The model provides a unified framework for graph convolutions, including ChebNet (Defferrard et al., 2016), GAT, EdgeNet (Isufi et al., 2020) and CNN/geometrical CNN with low-rank filters as special cases. In addition, we theoretically prove that L3Net is strictly more expressive in representing graph signals than spectral graph convolutions based on global adjacency/graph Laplacian matrices, which is then empirically validated, c.f. Sec. 3.1. We also prove a Lipschitz-type representation stability of the new graph convolution layer using perturbation analysis. Because our model allows neighborhood-specialized local graph filters, regularization may be needed to prevent over-fitting, so as to handle changing underlying graph topology and other graph noise, e.g., inaccurately detected landmarks or missing landmark points due to occlusions. Therefore, we also introduce a regularization scheme based on local graph Laplacians, motivated by the eigen property of the latter. This further improves the aforementioned representation stability. The improved performance of L3Net compared to other GNN benchmarks is demonstrated in a series of experiments, and with the proposed graph regularization, our model shows robustness to a variety of graph data noise. In summary, the contributions of the work are the following: • We propose a new graph convolution model by a low-rank decomposition of graph filters over a trainable local basis, which unifies several previous models of both spectral and spatial graph convolutions. • Regularization by local graph Laplacians is introduced to improve the robustness against graph noise. • We provide theoretical proof of the enlarged expressiveness for representing graph signals and the Lipschitz-type input-perturbation stability of the new graph convolution model. • We demonstrate applications to object recognition of spherical data and facial expression/skeleton-based action recognition using landmarks.
Model robustness against graph data noise is validated on both real-world and simulated datasets.

1.1. Related Works

Modeling on face/body landmark data. Many applications in computer vision, such as facial expression recognition (FER) and skeleton-based action recognition, need to extract high-level features from landmark data which are sampled at irregular grid points on the human face or at body joints. While CNN methods (Guo et al., 2016; Ding et al., 2017; Meng et al., 2017) prevail in the FER task, landmark methods have the potential advantage of lighter model size as well as more robustness to previously mentioned geometric transformations like pose variation. Earlier methods based on facial landmarks used hand-crafted features (Jeong & Ko, 2018; Morales-Vargas et al., 2019) rather than deep networks. Skeleton-based methods in action recognition have been developed intensively recently (Ren et al., 2020), including non-deep methods (Vemulapalli et al., 2014b; Wang et al., 2012) and deep methods (Ke et al., 2017; Kim & Reiter, 2017; Liu et al., 2016; Yan et al., 2018). Facial and skeleton landmarks only give a coarse and irregular grid, so mesh-based geometrical CNN's are hardly applicable, while previous GNN models on such tasks may lack sufficient expressive power. Graph neural networks. A systematic review can be found in several places, e.g. Wu et al. (2020). Spectral graph convolution was proposed using full eigen decomposition of the graph Laplacian in Bruna et al. (2013), by Chebyshev polynomials in ChebNet (Defferrard et al., 2016), and by Cayley polynomials in Levie et al. (2018). GCN (Kipf & Welling, 2016), the most-used GNN, is a variant of ChebNet using a degree-1 polynomial. Liao et al. (2019) accelerated the spectral computation by the Lanczos algorithm. Graph scattering transforms have been developed using graph wavelets (Zou & Lerman, 2020; Gama et al., 2019b), which can be constructed in the spectral domain (Hammond et al., 2011) and by diffusion wavelets (Coifman & Maggioni, 2006). The scattering transform enjoys theoretical properties of the representation but lacks adaptivity compared to trainable neural networks.
Spatial graph convolution has been performed by summing up neighbor nodes' transformed features in NN4G (Scarselli et al., 2008), and by a graph diffusion process in DCNN (Atwood & Towsley, 2016), where the graph propagation across nodes is by the adjacency matrix. Graph convolution with trainable filters has also been proposed in several settings: MPNN (Gilmer et al., 2017) enhanced model expressiveness by message passing and sub-networks; GraphSage (Hamilton et al., 2017) used trainable differentiable local aggregator functions in the form of LSTM or mean/max-pooling; GAT (Veličković et al., 2017) and variants (Li et al., 2018; Zhang et al., 2018; Liu et al., 2019) introduced attention mechanisms to achieve adaptive graph affinity, which remains non-negative valued; EdgeNet (Isufi et al., 2020) developed adaptive filters by taking products of trainable local filters. Our model learns local filters which can take negative values and contains GAT and EdgeNet as special cases. Theoretically, the expressive power of GNNs has been studied in Morris et al. (2019); Xu et al. (2019); Maron et al. (2019a; b); Keriven & Peyré (2019), mainly focusing on distinguishing graph topologies, while our primary concern is to distinguish signals lying on a graph. CNN and geometrical CNN. Standard CNN applies local filters translated and shared across locations on a Euclidean domain. To extend CNN to non-Euclidean domains, convolution on a regular spherical mesh using geometrical information has been studied in S2CNN (Cohen et al., 2018), SphereNet (Coors et al., 2018), SphericalCNN (Esteves et al., 2018), and UGSCNN (Jiang et al., 2019), and applied to 3D object recognition, for which other deep methods include 3D convolutional (Qi et al., 2016) and non-convolutional architectures (Qi et al., 2017a; b). CNN's on manifolds construct weight-sharing across local atlases making use of a mesh, e.g., by the patch operator in Masci et al.
(2015) , anisotropic convolution in ACNN (Boscaini et al., 2016) , mixture model parametrization in MoNet (Monti et al., 2017) , spline functions in SplineCNN (Fey et al., 2018) , and manifold parallel transport in Schonsheck et al. (2018) . These geometric CNN models use information of non-Euclidean meshes which usually need sufficiently fine resolution. 

2. Method

Y(u, c) = σ( Σ_{u'∈V, c'∈[C']} M(u', u; c', c) X(u', c') + bias(c) ),  u ∈ V, c ∈ [C].  (1)

The spatial and spectral graph convolutions correspond to different ways of specifying M, c.f. Sec. 2.3. The proposed graph convolution is defined as

M(u', u; c', c) = Σ_{k=1}^{K} a_k(c', c) B_k(u', u),  a_k(c', c) ∈ R,  (2)

where B_k(u', u) is non-zero only when u' ∈ N_u^{(d_k)}, N_u^{(d)} denoting the d-th order neighborhood of u (i.e., the set of d-neighbors of u), and K is a fixed number. In other words, the B_k's are K basis local filters around each u, and the order d_k can differ with 1 ≤ k ≤ K. Both a_k and B_k are trainable, so the number of parameters is K·CC' + Σ_{k=1}^{K} Σ_{u∈V} |N_u^{(d_k)}| ∼ K·CC' + Knp, where p stands for the average local patch size. In our experiments we use K up to 5, and d_k up to 3. We provide the matrix notation of (2) in Appendix A.1. The construction (2) can be used as a layer type in larger GNN architectures. Pooling of graphs can be added between layers; see Appendix C.5 for further discussion on the multiscale model. The choice of K and the neighborhood orders (d_1, ..., d_K) can also be adjusted accordingly. The model may be extended in several ways, as discussed in the last section.
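To make (1)-(2) concrete, here is a minimal NumPy sketch of one L3Net layer (our own illustrative code, not the authors' implementation; function and variable names are our assumptions):

```python
import numpy as np

def l3net_layer(X, a, B, bias, sigma=lambda z: np.maximum(z, 0.0)):
    """One L3Net layer, Eq. (1)-(2) (sketch).

    X    : (n, C_in)        input node features X(u', c')
    a    : (K, C_in, C_out) channel-mixing coefficients a_k(c', c)
    B    : (K, n, n)        basis filters; B[k, u_prime, u] should be
           nonzero only when u_prime lies in the d_k-order neighborhood of u
    bias : (C_out,)
    """
    n, C_in = X.shape
    K, _, C_out = a.shape
    Y = np.zeros((n, C_out))
    for k in range(K):
        # (B_k^T X)(u, c') aggregates neighbor features; a_k mixes channels
        Y += (B[k].T @ X) @ a[k]
    return sigma(Y + bias)
```

A dense (n, n) array is used for each B_k for clarity; in practice the filters are stored only on their local supports, giving the K·CC' + Knp parameter count above.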

2.2. Regularization by local graph Laplacian

The proposed L3Net layer enlarges the model capacity by allowing K basis filters at each location, and a natural way to regularize the trainable filters is by the graph geometry, where, by construction, only the local graph patch is concerned. We introduce the following regularization penalty on the basis filters B_k:

R({B_k}_k) = Σ_{k=1}^{K} Σ_{u∈V} (b_u^{(k)})^T L_u^{(k)} b_u^{(k)},  b_u^{(k)}(v) := B_k(v, u),  b_u^{(k)} : N_u^{(d_k)} → R,  (3)

where L_u^{(k)} is the local Dirichlet graph Laplacian on the patch N_u^{(d_k)}. The training objective is then

L({a_k, B_k}_k) + λ R({B_k}_k),  λ ≥ 0,  (4)

where L is the classification loss. As L encourages the diversity of the B_k's, the K-rankness usually remains a tight constraint in training, unless λ is very large; see also Proposition 3.
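A sketch of the penalty (3), under the assumption (consistent with Appendix A) that L_u^{(k)} is the full-graph Laplacian D − A restricted to the rows/columns of the patch (the Dirichlet Laplacian); names are ours:

```python
import numpy as np

def local_laplacian_penalty(B, A, neighborhoods):
    """R({B_k}) = sum_{k,u} b_u^{(k)T} L_u^{(k)} b_u^{(k)}, Eq. (3) (sketch).

    B[k]                : (n, n) basis filter; column u holds b_u^{(k)}
    A                   : (n, n) adjacency matrix of the whole graph
    neighborhoods[k][u] : list of node indices in N_u^{(d_k)}
    """
    L = np.diag(A.sum(axis=1)) - A  # full graph Laplacian
    R = 0.0
    for k, nbhds in enumerate(neighborhoods):
        for u, patch in enumerate(nbhds):
            idx = np.asarray(patch)
            b = B[k][idx, u]              # filter values on the patch
            L_loc = L[np.ix_(idx, idx)]   # local Dirichlet Laplacian
            R += float(b @ L_loc @ b)
    return R
```

The quadratic form vanishes on locally constant filters and grows with the filter's high-frequency content on the patch, which is the smoothing effect the text describes.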

2.3. A unified framework for graph convolutions

Graph convolutions basically fall into two categories, the spatial and spectral constructions (Wu et al., 2020). The proposed L3Net belongs to the spatial construction, and here we show that the model (2) is a unified framework for various graph convolutions, both spatial and spectral. Details and proofs are given in Appendix A. • ChebNet (Defferrard et al., 2016), GAT (Veličković et al., 2017), EdgeNet (Isufi et al., 2020): In ChebNet, M for each channel pair (c', c) equals a degree-(L−1) polynomial of the graph Laplacian matrix, where the polynomial coefficients are trainable. GCN (Kipf & Welling, 2016) can be viewed as ChebNet with polynomial degree 1 and tied coefficients. The attention mechanism in GAT enhances model expressiveness by incorporating adaptive kernel-based non-negative affinities. In EdgeNet, the graph convolution operator is a product of trainable local filters supported on order-1 neighborhoods. We have the following proposition: Proposition 1. L3Net (2) includes the following models as special cases: (1) ChebNet (GCN) when K ≥ L (K ≥ 2), L being the polynomial degree. (2) GAT when K ≥ R, R being the number of attention branches. (3) EdgeNet when K ≥ L, L being the order of graph convolutions. • CNN: When nodes lie on a geometrical domain that allows translation (u' − u), setting B_k(u', u) = b_k(u' − u) in (2) for some b_k(·) enforces spatial convolution, and the convolutional kernel can be decomposed as Σ_k a_k(c', c) b_k(·), the low-rank decomposed filter setting of Qiu et al. (2018). Extension to CNN on manifold mesh is also possible as in Masci et al. (2015); Fey et al. (2018). We have the following: Proposition 2. Mesh-based geometrical CNN's defined by linear patch operators, including standard CNN on R^d, with low-rank decomposed filters are special cases of L3Net (2). We also note that L3Net reduces from locally connected GNN (Coates & Ng, 2011; Bruna et al., 2013), the largest class of spatial GNN, only by the low-rankness imposed by a small number K in (2).
Locally connected GNN can be viewed as (1) with the requirement that, for each (c, c'), M(u', u; c', c) is non-zero only when u' is locally connected to u. The complexities of the various models are summarized in Fig. 2 (table), where L3Net reduces the np·CC' complexity of the locally-connected net to the additive (np + CC') times K. When the numbers of channels C, C' are large, e.g. ∼10² in deep layers, and the graph size is not large, e.g. np ≪ CC' in landmark data applications, the complexity is dominated by KCC', which is comparable with ChebNet (GAT) if K ≈ L (R). The computational cost is also comparable, as shown in the experiments in Sec. 4. Furthermore, we have: Proposition 3. Suppose the subgraphs on N_u^{(d_k)} are all connected. Given α_{u,k} > 0 for all u, k, the minimum of (3) under the constraints ‖b_u^{(k)}‖_2 ≥ α_{u,k} is achieved when b_u^{(k)} equals the first Dirichlet eigenvector on N_u^{(d_k)}, which does not change sign on N_u^{(d_k)}. The proposition shows that in the strong regularization limit λ → ∞ in (4), L3Net reduces to be ChebNet-like. The constraint with constants α_{u,k} is included because otherwise the minimizer is B_k ≡ 0. The first Dirichlet eigenvector is envelope-like (Fig. 2), so B_k(·, u) becomes an averaging operator on the local patch. Thus the regularization parameter λ can be viewed as trading off the expressiveness of the learnable B_k against the stability of averaging local filters, similar to ChebNet and GCN.
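The per-layer parameter counts compared above can be written out as a small helper (illustrative only; the formulas follow the text and Fig. 2's table, and the example numbers are our assumptions):

```python
def param_counts(n, p, C_in, C_out, K, L):
    """Per-layer parameter counts (sketch).

    n: #nodes, p: average local patch size, C_in/C_out: channel counts,
    K: number of L3Net bases, L: ChebNet polynomial degree.
    """
    return {
        "locally_connected": n * p * C_in * C_out,   # np * CC'
        "L3Net": K * C_in * C_out + K * n * p,       # K * (CC' + np)
        "ChebNet": L * C_in * C_out,                 # L * CC'
    }
```

For a landmark-sized graph (n = 15, p = 5) with deep-layer channels (64 → 128), `param_counts(15, 5, 64, 128, 4, 4)` shows the KCC' term dominating L3Net's count, far below the locally connected net's np·CC'.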

3. Analysis

We analyze the representation expressiveness and stability (defined below) of the proposed L3Net model. All proofs are in Appendix A, and experimental details in Appendix B.

3.1. Representation expressiveness of graph signals

The theoretical question of graph signal representation expressiveness concerns the ability of GNN deep features to distinguish graph signals. While related, this problem differs from the graph isomorphism test problem which has been intensively studied in the GNN expressiveness literature. Here we prove that L3Net is strictly more expressive than certain spectral GNNs, and support the theoretical prediction by experiments. We have shown that the L3Net model contains ChebNet (Proposition 1), and the following proposition proves strictly greater expressiveness for graph signal classification. We call B a graph local filter if B(u, v) is non-zero only when v is in the neighborhood of u. In a spectral GNN, the graph convolution takes the form x ↦ f(A)x, where f is a function on R and A is the (possibly normalized) adjacency matrix. Proposition 4. There is a graph and 1) a local filter B on it such that B cannot be expressed by any spectral graph convolution, but can be expressed by L3Net with K = 1; 2) two data distributions on the graph (two classes) such that, with a permutation group invariant operator in the last layer, the deep feature of any spectral GNN cannot distinguish the two classes, but that of L3Net with 1 layer and K = 1 can. The fundamental argument is that spectral GNNs are permutation equivariant (see e.g. Gama et al. (2019a), reproduced as Lemma A.1), and the local filters in L3Net break this symmetry to obtain more discriminative power. The constructive example used in the proof is on a ring graph (Fig. A.1 shows A and the basis B), with the two data distributions shown in Fig. 3. Proposition 4 gives that, on the ring graph and using a GNN with global pooling in the last layer, an L3Net layer with K = 1 can have classification power while a ChebNet of any order cannot.
On a chain graph (removing the connection between the two end points of a ring graph), which does not exactly follow the theory's assumption, we expect the task to remain difficult for ChebNet but not for L3Net, since the two graphs only differ by one edge. To verify the theory, we conduct experiments using a two-layer GNN, and the results are in Fig. 3 (table). The setting reduces L3Net to a 1D convolutional layer, and the learned basis shows a "difference" shape (right plot) which explains its classification power. Results are similar using a 1-layer GNN (Tab. A.1). The argument in Proposition 4 extends to other graphs and network types. Generally, when a GNN based on a global graph adjacency or Laplacian matrix applies linear combinations of local averaging filters, certain graph filters may be difficult to express. We experimentally examine GAT, WLN and MPNN, which underperform on the binary classification task, as shown in Fig. 3 (table).
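The symmetry argument behind Proposition 4 can be checked numerically: every polynomial f(A) of the ring adjacency commutes with a mirror-flip permutation of the ring, while the "difference" filter B does not (our own sketch; the polynomial coefficients below are arbitrary):

```python
import numpy as np

def ring_adjacency(n):
    """0/1 adjacency of the ring graph on n nodes."""
    A = np.zeros((n, n))
    for u in range(n):
        A[u, (u + 1) % n] = A[u, (u - 1) % n] = 1.0
    return A

def mirror_flip(n, u=0):
    """Permutation matrix of v -> (2u - v) mod n, the flip around node u."""
    P = np.zeros((n, n))
    for v in range(n):
        P[(2 * u - v) % n, v] = 1.0
    return P

n = 8
A = ring_adjacency(n)
P = mirror_flip(n)

# Any polynomial of A inherits the ring's symmetry: f(A) P = P f(A)
fA = 0.5 * np.eye(n) + 0.3 * A - 0.2 * (A @ A)
assert np.allclose(fA @ P, P @ fA)

# The "difference" filter B(u', u): +1 at u' = u, -1 at u' = u+1, breaks it
B = np.eye(n)
for u in range(n):
    B[(u + 1) % n, u] = -1.0
assert not np.allclose(B @ P, P @ B)
```

The forward difference is orientation-dependent, so the flip reverses it; no spectral filter f(A) can reproduce B, while L3Net takes B directly as a basis with K = 1.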

3.2. Representation stability

We derive perturbation bounds of the GNN feature representation, which is important for robustness against data noise. The analysis implies a trade-off between de-noising and keeping high-frequency information, consistent with the experimental observations in Sec. 4. Consider the change in the GNN layer output Y defined in (1)(2) when the input X changes. For simplicity, let C = C' = 1; the argument extends. For any graph signal x : V → R and V' ⊂ V, define ‖x‖_{2,V'} := (Σ_{u∈V'} x(u)²)^{1/2} and ⟨x, y⟩_{V'} := Σ_{u∈V'} x(u)y(u). The following perturbation bound holds for the L3Net layer with/without regularization.

Theorem 1. Suppose that X = {X(u)}_{u∈V} is perturbed to be X̃ = X + ΔX, the activation function σ : R → R is non-expansive, and sup_{u∈V} Σ_{k=1}^{K} |N_u^{(d_k)}| ≤ Kp. Then the change in the output {Y(u)}_{u∈V} in 2-norm is bounded by

‖ΔY‖_{2,V} ≤ β^{(1)} · ‖a‖_2 √(Kp) ‖ΔX‖_{2,V},  β^{(1)} := sup_{k,u} ‖B_k(·, u)‖_{2, N_u^{(d_k)}}.

Note that p indicates the average size of the d_k-order local neighborhoods. The theorem implies that when K is O(1), and the local bases B_k have O(1) 2-norms on all local patches uniformly bounded by β^{(1)}, then the Lipschitz constant of the GNN layer mapping is O(1), i.e., the product of ‖a‖_2, β^{(1)} and √(Kp), which does not scale with n. This generalizes the 2-norm bound of a convolutional operator, which only involves the norm of the convolutional kernel, and is possible due to the local receptive fields in the spatial construction of L3Net. The local Dirichlet graph Laplacian L_u^{(k)} is positive definite whenever the subgraph is connected and not isolated from the whole graph. We then define the weighted 2-norm on a local patch ‖x‖_{L_u^{(k)}} := ⟨x, L_u^{(k)} x⟩_{N_u^{(d_k)}}^{1/2}, and similarly ‖x‖_{(L_u^{(k)})^{-1}}.

Theorem 2. Notation and setting as in Theorem 1. If, furthermore, all the subgraphs on N_u^{(d_k)} are connected within themselves and to the rest of the graph, and there is ρ ≥ 0 s.t.
∀u, k, ‖ΔX‖_{(L_u^{(k)})^{-1}} ≤ ρ ‖ΔX‖_{2, N_u^{(d_k)}}, then

‖ΔY‖_{2,V} ≤ ρ β^{(2)} · ‖a‖_2 √(Kp) ‖ΔX‖_{2,V},  β^{(2)} := sup_{k,u} ‖B_k(·, u)‖_{L_u^{(k)}}.

The bound improves over Theorem 1 when ρβ^{(2)} < β^{(1)}, and regularizing by R = Σ_{u,k} ‖B_k(·, u)‖²_{L_u^{(k)}} leads to a smaller β^{(2)}. Meanwhile, on each N_u^{(d_k)} the Dirichlet eigenvalues increase as 0 < λ_1 ≤ λ_2 ≤ ... ≤ λ_{p_{u,k}}, p_{u,k} := |N_u^{(d_k)}|, so weighting by λ_l^{-1} in ‖·‖_{(L_u^{(k)})^{-1}} decreases the contribution from high-frequency eigenvectors. As a result, ρ will be small if ΔX contains a significant high-frequency component on the local patch, e.g., additive Gaussian noise or missing values. Note that in the weighted 2-norm of ΔX by (L_u^{(k)})^{-1}, only the relative amount of high-frequency component in ΔX matters (because any constant normalization of L_u^{(k)} cancels in the product of ρ and β^{(2)}). The benefits of local graph regularization in the presence of noise in graph data will be shown in experiments.
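A numerical sanity check of the Theorem 1 bound, on a ring graph where each basis column is supported on a patch of p = 2r+1 nodes (an illustrative construction of ours, with C = C' = 1 and zero bias):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K, r = 20, 3, 2
p = 2 * r + 1                    # patch size of each d_k-neighborhood
relu = lambda z: np.maximum(z, 0.0)

# Local bases on a ring: column u of B_k supported on {u-r, ..., u+r}
B = np.zeros((K, n, n))
for k in range(K):
    for u in range(n):
        for j in range(-r, r + 1):
            B[k, (u + j) % n, u] = rng.standard_normal()
a = rng.standard_normal(K)

def layer(x):
    # Eq. (1)-(2) with scalar channels
    return relu(sum(a[k] * (B[k].T @ x) for k in range(K)))

x = rng.standard_normal(n)
dx = 0.1 * rng.standard_normal(n)
dY = np.linalg.norm(layer(x + dx) - layer(x))

# Theorem 1 bound: ||dY|| <= beta1 * ||a||_2 * sqrt(K p) * ||dx||
beta1 = max(np.linalg.norm(B[k][:, u]) for k in range(K) for u in range(n))
bound = beta1 * np.linalg.norm(a) * np.sqrt(K * p) * np.linalg.norm(dx)
assert dY <= bound + 1e-12
```

Because every node belongs to exactly p patches per basis on the ring, the bound holds deterministically here, not just for this random draw.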

4. Experiment

We test the proposed L3Net model on several datasets.

Figure 4 (caption): Testing accuracies of sphere MNIST under different mesh settings; (l1; l2; l3) stands for the mesh level used in each GNN layer. L3Net uses K=4, and neighborhood orders (1;1;2;3). S2CNN (Cohen et al., 2018) on mesh (4;3;2) has accuracy 96.0.

4.1. Object recognition of data on spherical mesh

We first classify data on a spherical mesh: sphere MNIST and sphere ModelNet-40, following the settings in the literature. Though a regular mesh on the sphere is not the primary application scenario that motivates our model, we include these experiments to compare with benchmarks and to test the efficiency of L3Net on such regular meshes. Following UGSCNN (Jiang et al., 2019), we implement different mesh resolutions on a sphere, indicated by "mesh level" (Fig. 4), where the number of nodes varies from 2562 (level 4) to 12 (level 0). All networks consist of three convolutional layers; see more details in Appendix C.1. Using the original mesh level (4;3;2), the finest resolution as in UGSCNN, L3Net gives among the best accuracies for sphere MNIST. On ModelNet-40, L3Net achieves a testing accuracy of 90.24, outperforming ChebNet and GCN, and is comparable to UGSCNN which uses spherical mesh information (Tab. A.2). When the mesh becomes coarser, as shown in Fig. 4 (table), L3Net improves over GCN and ChebNet (L=4) and is comparable with UGSCNN under nearly all mesh settings. We observe that in some settings ChebNet can benefit from larger L, but the overall accuracy is still inferior to L3Net. The rightmost two columns give two cases of coarse meshes where L3Net shows the most significant advantage.

4.2. Facial expression recognition (FER)

We test on two FER datasets, Extended Cohn-Kanade (CK+) (Lucey et al., 2010) and FER13 (Goodfellow et al., 2013). We use 15 facial landmarks, see Fig. 1, and pixel values on a patch around each landmark point as node features. Details about the dataset and model setup are in Appendix C.2. Unlike spherical mesh, facial and body landmarks form coarse irregular grids where no clear pre-defined mesh operation is applicable. We benchmark L3Net against other GNN approaches, as shown in Table 1. The local graph regularization strategy is applied on FER13, due to the severe outlier data of landmark detection caused by occlusion. On CK+, L3Net leads all non-CNN models by a large margin, and the best model (1,1,2,3) uses a comparable number of parameters to the best ChebNet (L=4). On FER13, L3Net has lower performance than ChebNet and EdgeNet (Isufi et al., 2020), but outperforms both after adding regularization. The running times of the best ChebNet and L3Net models are comparable, and are much less than GAT's.

4.3. Action recognition

We test on two skeleton-based action recognition datasets, NTU-RGB+D (Shahroudy et al., 2016) and Kinetics-Motion (Kay et al., 2017). The irregular mesh is the 18/25-point body landmark graph, with graph edges defined by body joints, shown in Fig. 1.

4.4. Robustness to graph noise

To examine the robustness to graph noise, we experiment on down-sampled MNIST data on a 2D regular grid with a 4-nearest-neighbor graph. With no noise, on 28×28 data (Tab. A.3), 14×14 data (Tab. A.4), and 7×7 data (Tab. 3, "original" column), the performance of L3Net is comparable to ChebNet (Defferrard et al., 2016) and EdgeNet (Isufi et al., 2020) and better than other GNN methods. We consider three types of noise: Gaussian noise added to the pixel values, missing nodes (equivalently, missing values in the image input), and permutation of the node indices; details in Appendix C.4. The results of adding different levels of Gaussian noise and permutation noise are shown in Tab. 3, while results for missing-value noise are provided in Appendix C.4. The results show that our regularization scheme improves the robustness to all three types of graph noise, supporting the theory in Sec. 3.2. Specifically, L3Net without regularization may underperform ChebNet, but catches up after adding regularization, which is consistent with Proposition 3.

(3) Incorporation of edge features. Edge features can be transformed into extra channels of node features by an additional layer at the bottom, and the low-rank graph operation can be similarly employed there. (4) Theoretically, the representation robustness analysis is to be extended to more general types of graph perturbation. Generally, one can work to extend to other types of graph data and tasks.

Published as a conference paper at ICLR 2021

A.1.2 ChebNet/GCN, GAT and EdgeNet

• ChebNet/GCN. In view of (1), ChebNet (Defferrard et al., 2016) makes use of the graph adjacency matrix to construct M. Specifically, A_sym := D^{-1/2} A D^{-1/2} is the symmetrized graph adjacency matrix (possibly including self-edges, in which case A equals the original A plus I), and L_sym := I − A_sym has spectral decomposition L_sym = Ψ Λ Ψ^T. Let L̃ = α_1 I + α_2 L_sym be the rescaled and re-centered graph Laplacian such that the eigenvalues lie in [−1, 1], α_1, α_2 fixed constants. Then, written in n-by-n matrix form,

M_{c',c} = Σ_{l=0}^{L-1} θ_l(c', c) T_l(L̃),  θ_l(c', c) ∈ R,  (5)

where T_l(·) is the Chebyshev polynomial of degree l. As A_sym and thus L̃ are given by the graph, only the θ_l's are trainable, so the number of parameters is L·CC'. GCN (Kipf & Welling, 2016) is a special case of ChebNet: take L = 2 in (5), and tie the choices of θ_0 and θ_1,

M_{c',c} = θ(c', c)(α_1 I + α_2 A_sym) =: θ(c', c) Ã,  α_1, α_2 fixed constants,

where θ(c', c) is trainable. This factorized form leads to the linear part of the layer-wise mapping Y = Ã X Θ in matrix form, where Ã is the n-by-n matrix defined above, X (Y) is an n-by-C' (n-by-C) array, and Θ is a C'-by-C matrix. The model complexity is CC', the number of parameters in Θ.

• GAT. In GAT (Veličković et al., 2017), with R the number of attention heads, the graph convolution operator in one GNN layer can be written as (omitting bias and non-linear mapping)

Y = Σ_{r=1}^{R} A^{(r)} X Θ_r,  A^{(r)}_{u,v} = exp(c^{(r)}_{uv}) / Σ_{v'∈N_u^{(1)}} exp(c^{(r)}_{uv'}),  c^{(r)}_{uv} = σ((a^{(r)})^T [W^{(r)} X_u, W^{(r)} X_v]),  (6)

where {W^{(r)}, a^{(r)}} are the trainable parametrization of the attention graph affinity mechanism A^{(r)}, which constructs non-negative affinities between graph nodes u and v adaptively from the input graph node features X. In particular, A^{(r)} shares the sparsity pattern of the graph topology, that is, A^{(r)}(u, u') ≠ 0 only when u' ∈ N_u^{(1)}.
In the original GAT, Θ_r = W^{(r)} C^{(r)}, where the C^{(r)}'s are fixed matrices such that the output from the r-th head is concatenated into the output Y across r = 1, ..., R. Variants of GAT adopt channel mixing across heads, e.g. a generalization of GAT in Isufi et al. (2020) uses an extra trainable Θ_r in (6) independent from W^{(r)}. Isufi et al. (2020) also proposed higher-order GAT by considering powers of the affinity matrix A^{(r)}, as well as the edge-varying version (c.f. Eqn. (36)(39) in Isufi et al. (2020)). As this higher-order GAT and the edge-varying counterpart are special cases of the edge-varying GNN, we cover this case in Proposition 1 3). The model complexity of GAT: in the original GAT where Θ_r is tied with W^{(r)}, the number of parameters in one layer is R(C_0 C' + 2C_0), where R is the number of attention heads, C = C_0 R, and W^{(r)} : R^{C'} → R^{C_0}. When the Θ_r are free from {W^{(r)}, a^{(r)}} in (6), the number of parameters is R(CC' + C_0 C' + 2C_0) ≤ R(2CC' + 2C), where W^{(r)} maps to dimension C_0 and Θ_r maps to dimension C.

• EdgeNet (edge-varying GCN). Per Eqn. (1)(8) in Isufi et al. (2020), the edge-varying GNN layer mapping can be written as

Y = Σ_{r=0}^{L-1} ( Π_{k=0}^{r} Φ^{(k)} ) X Θ_r.

A.1.3 CNN and geometrical CNN

• CNN. On a Euclidean domain, the standard CNN layer applies translated filters,

y(u, c) = Σ_{c'∈[C']} Σ_{u'} w_{c',c}(u' − u) x(u', c').  (10)

More generally, CNN's on non-Euclidean domains are constructed when spatial points are sampled on an irregular mesh in R^d, e.g., a 2D surface in R^3. The generalization of (10) is by defining the "patch operator" (Masci et al., 2015) which pushes a template filter w on a regular mesh on R^d, d being the intrinsic dimension of the sampling domain, to the irregular mesh in the ambient space with coordinates on local charts. Specifically, for a mesh of a 2D surface in 3D, d = 2, and w is a template convolutional filter on R^2. For any local cluster of 3D mesh points N_u around a point u, the patch operator P_u provides (P_u w)(u') for u' ∈ N_u by a certain interpolation scheme on the local chart. The operator P_u is linear in w, and possibly trainable.
As a result, in mesh-based geometrical CNN,

y(u, c) = Σ_{c'∈[C']} Σ_{u'} (P_u w_{c',c})(u') x(u', c'),  (11)

and one can see that in Euclidean space taking (P_u w)(u') = w(u' − u) reduces (11) to the standard CNN as in (10). In both (10) and (11), a spatial low-rank decomposition of the filters w_{c',c} can be imposed (Qiu et al., 2018). This introduces a set of bases {b_k}_k over space that linearly span the filters:

y(u, c) = Σ_{c'∈[C']} Σ_{u'} Σ_{k=1}^{K} β_{k,(c',c)} (P_u b_k)(u') x(u', c'),  (12)

and similarly for (10). The trainable parameters in (12) are β_{k,(c',c)} and the basis filters b_k; the former has KCC' parameters, and the latter has Σ_k p_k, where p_k is the size of the support of b_k in R^d. Supposing the average size is p, the number of parameters for the bases is Kp. This gives the total number of parameters as KCC' + Kp.

Proof of Proposition 2. Since standard CNN is a special case of geometrical CNN (11), we only consider the latter. Assuming the low-rank filter decomposition, the convolutional mapping is (12). Comparing to the GNN layer mapping defined in (1), one sees that M(u', u; c', c) = Σ_{k=1}^{K} β_{k,(c',c)} (P_u b_k)(u'), which equals (2) upon setting B_k(u', u) = (P_u b_k)(u') and a_k(c', c) = β_{k,(c',c)}.
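Returning to the ChebNet construction (5) in A.1.2: the filter can be assembled with the standard Chebyshev recursion T_0 = I, T_1 = L̃, T_l = 2 L̃ T_{l−1} − T_{l−2}. A sketch for one channel pair, with the particular choice α_1 = −1, α_2 = 1 (our assumption for illustration), so that L̃ = L_sym − I = −A_sym has spectrum in [−1, 1]:

```python
import numpy as np

def chebnet_operator(A, theta):
    """M = sum_{l=0}^{L-1} theta_l T_l(L_tilde), as in Eq. (5) (sketch).

    A     : (n, n) 0/1 adjacency matrix, no self-loops assumed
    theta : length-L sequence of scalar coefficients (one channel pair)
    """
    n = A.shape[0]
    d = A.sum(axis=1)
    A_sym = A / np.sqrt(np.outer(d, d))    # D^{-1/2} A D^{-1/2}
    L_tilde = -A_sym                       # L_sym - I, eigenvalues in [-1, 1]
    T_prev, T_curr = np.eye(n), L_tilde    # T_0, T_1
    M = theta[0] * T_prev
    if len(theta) > 1:
        M += theta[1] * T_curr
    for _ in range(2, len(theta)):
        # Chebyshev recursion: T_l = 2 L_tilde T_{l-1} - T_{l-2}
        T_prev, T_curr = T_curr, 2 * L_tilde @ T_curr - T_prev
        M += theta[len(theta) - 1] * 0 + theta[_] if False else theta[_] * T_curr
    return M
```

For example, `chebnet_operator(A, [1.0])` returns the identity, and `chebnet_operator(A, [0.0, 1.0])` returns −A_sym; in L3Net terms, each T_l(L̃) plays the role of one (fixed, graph-determined) basis B_k.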

A.1.4 Strong regularization limit

Proof of Proposition 3. The constrained minimization of R defined in (3) separates over each pair (u, k), and the minimization over b_u^{(k)} is given by

min_{w : N_u^{(d_k)} → R} w^T L_u^{(k)} w,  s.t. ‖w‖_2 ≥ α_{u,k} > 0.  (13)

For each u, k, the local Dirichlet graph Laplacian L_u^{(k)} has eigen-decomposition L_u^{(k)} = Ψ_u^{(k)} Λ_u^{(k)} (Ψ_u^{(k)})^T, where (Ψ_u^{(k)})^T Ψ_u^{(k)} = I, and the diagonal entries of Λ_u^{(k)} are the eigenvalues of L_u^{(k)}, all ≥ 0 and sorted in increasing order. By the variational property of eigenvalues, the minimum of (13) is achieved when w = Ψ_u^{(k)}(·, 1), i.e., the eigenvector associated with the smallest eigenvalue of L_u^{(k)}. Because the local subgraph is connected, this smallest eigenvalue has single multiplicity, and the eigenvector is the Perron-Frobenius vector which does not change sign. The claim holds for arbitrary α_{u,k} > 0 since an eigenvector is defined up to a constant multiple.

Proof of Proposition 4. Part 1): Let the graph be the ring graph with n nodes, each node having 2 neighbors, n = 8 as shown in Fig. A.1. We index the nodes as u = 0, ..., n−1 and allow addition/subtraction u − v (mod n). Let B be the "difference" filter, B(u', u) = 1 when u' = u and −1 when u' = u + 1. We show that B ≠ f(A) for any f, while, in contrast, setting this B as the basis in (2) expresses the filter with K = 1. To prove that B ≠ f(A) for any f, let π_u be the permutation of the n nodes such that π_u(u + v) = (u − v) for all v, i.e., the mirror flip of the ring around node u. By construction, the graph topology of the ring graph is preserved under π_u, that is, A_{π_u} := π_u A π_u^T = A, whether A is the 0/1-valued adjacency matrix or the symmetrically normalized one A_sym = D^{-1/2} A D^{-1/2} (D is constant on the diagonal) or another normalized version, as long as the relation A_{π_u} = A holds. By Lemma A.1 1), for any f : R → R, f(A) commutes with π_u, while the difference filter B does not; hence B ≠ f(A). Part 2): Same as in 1), by construction A_{π_u} = A.
Let $F^{(L)}$ be the mapping to the $L$-th layer spectral GNN feature. For $x_{up}$ an upwind signal, Lemma A.1 2) gives that $f(A)\pi_u = f(A^{\pi_u})\pi_u = \pi_u f(A)$, and $F^{(L)}[A]\pi_u x_{up} = F^{(L)}[A^{\pi_u}]\pi_u x_{up} = \pi_u F^{(L)}[A]\, x_{up}$. The last layer applies a group-invariant operator $U$, so $U F^{(L)}[A]\pi_u x_{up} = U \pi_u F^{(L)}[A]\, x_{up} = U F^{(L)}[A]\, x_{up}$. This gives

$$U F^{(L)}[A]\, x_{down} \overset{dist.}{=} U F^{(L)}[A]\, \pi_u x_{up} = U F^{(L)}[A]\, x_{up},$$

which means that the final deep features produced by $U F^{(L)}[A]$ are statistically the same for input signals from the two classes. Meanwhile, the difference local filter $B$ from the proof of 1) can extract a feature that differentiates the two classes: with the ReLU activation, the output feature after one convolutional layer and a global pooling (which is permutation invariant) can be made strictly positive for one class and zero for the other. Thus, L3Net with 1 layer and 1 basis suffices to distinguish the $X_{up}$ and $X_{down}$ signals.
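The symmetry argument above can be verified numerically on the ring graph with $n = 8$: the mirror flip preserves $A$ and hence commutes with any polynomial $f(A)$, while the difference filter $B$ does not commute with it. A NumPy sketch:

```python
import numpy as np

n = 8
A = np.zeros((n, n))
for u in range(n):
    A[u, (u + 1) % n] = A[u, (u - 1) % n] = 1.0

# Mirror flip around node u0 = 3: pi(u0 + v) = u0 - v (mod n)
u0 = 3
P = np.zeros((n, n))
for v in range(n):
    P[(u0 - v) % n, (u0 + v) % n] = 1.0

# The ring topology is preserved under the flip ...
assert np.allclose(P @ A @ P.T, A)
# ... so any f(A) commutes with P, e.g. a polynomial of A:
fA = 0.5 * A + 0.1 * (A @ A)
assert np.allclose(fA @ P, P @ fA)

# The "difference" filter, with 1 on the diagonal and -1 at u' = u + 1,
# breaks the mirror symmetry, hence cannot be written as f(A).
B = np.eye(n)
for u in range(n):
    B[(u + 1) % n, u] -= 1.0
assert not np.allclose(B @ P, P @ B)
```

The last assertion is the numerical counterpart of "$B\pi_u \ne \pi_u B$" in the proof.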

C.2 Facial Expression Recognition

Landmarks setting. 15 landmarks are selected from the standard 68 facial landmarks defined in AAM (Cootes et al., 2001), and edges are connected according to prior knowledge of the human face, e.g., nearby landmarks on the eye are connected; see Fig. 1 (left).

Dataset setup

• CK+: The CK+ dataset (Lucey et al., 2010) is the most widely used laboratory-controlled FER dataset (downloaded from http://www.jeffcohn.net/resources/). It contains 327 video sequences from 118 subjects with seven basic expression labels (anger, contempt, disgust, fear, happiness, sadness, and surprise). Every sequence shows a shift from the neutral face to the peak expression. Following the commonly used "(static) image-based" methods (Li & Deng, 2020), we extract the one to three frames of each expression sequence that carry the peak expression information, forming a dataset with 981 image samples. Every facial image is aligned and resized to (120, 120) with the face alignment model of Bulat & Tzimiropoulos (2017), and we use this model again to obtain the facial landmarks. As described in Sec. 4.2, we select 15 of the 68 facial landmarks and build a graph on them. The input feature for each node is the (20, 20) image patch centered at the landmark, concatenated with the landmark's coordinates, so the total input feature dimension is 402.

• FER13: The FER13 dataset (Goodfellow et al., 2013).

Network architectures.

• CK+: GraphConv(402,64)-BN-ReLU-GraphConv(64,128)-BN-ReLU-FC(7),
• FER13: GraphConv(66,64)-BN-ReLU-GraphConv(64,128)-BN-ReLU-GraphConv(128,256)-BN-ReLU-FC(7),

where GraphConv(feat_in, feat_out) can be any type of graph convolution layer, including our L3Net.

Training details.

• CK+: We use 10-fold cross validation as in Ding et al. (2017). Batch size is 16; the learning rate is 0.001 and decays by 0.1 if the validation loss does not improve for 15 epochs. We use the Adam optimizer and train 100 epochs for each fold.
• FER13: We report results on the test set. Batch size is 32; the learning rate is 0.0001 and decays by 0.1 if the validation loss does not improve for 20 epochs. We use the Adam optimizer and train the models for 150 epochs.

Runtime analysis details.
In Sec. 4.2, we report the running times on the CK+ dataset of our L3Net (orders 1,1,2,3), 13.02 ms, and of the best ChebNet, 12.56 ms, which are comparable. Here we provide more details. The time compared is the time for the model to finish inference on the validation set with batch size 16. For each model, we record the validation times over all folds and report their average. The runtime analysis is performed on a single NVIDIA TITAN V GPU.

C.3 Skeleton-based Action Recognition

Dataset setup.

• NTU-RGB+D: NTU-RGB+D (Shahroudy et al., 2016).
• Kinetics: Following Yan et al. (2018), we obtain 18-point body joints from each frame using the OpenPose (Cao et al., 2017) toolkit. The input feature of each joint is (x, y, p), where x, y are the 2D coordinates of the joint and p is the confidence of localizing the joint. To eliminate the effect of skeleton-based models' inability to recognize objects in clips, we focus on action classes that require only body movements. Thus, we conduct our experiments on Kinetics-Motion, proposed by Yan et al. (2018), a small dataset containing 30 action classes strongly related to body motion. Note that there is a severe data-missing problem in the landmark coordinates of the Kinetics data, so we also use our regularization scheme in this experiment.

Network architectures.

On G1, the 10% of pixels with the lowest total pixel intensity over the dataset are removed; those nodes are located near the boundary of the canvas. For each node x_i in the coarse-grained graph G2, a neighborhood consisting of nodes in G1 is constructed, called N(x_i; G1). A pooling operator computes the feature on x_i from those on N(x_i; G1), and the pooled feature is used as the input to the graph convolution on G2. A similar graph pooling layer is used from G2 to G3. The graph topology and local neighborhoods are determined by the grid point locations. Using a two-layer convolution with graph poolings in between from G1 to G3, with the other settings the same as in Table A.4, L3Net obtains 97.33 ± 0.15 test accuracy (basis order 1;1;2, with regularization 0.001). We have also applied graph pooling layers on the regular image grid of the 28×28 MNIST dataset. The results, reported in Table A.3, show that the multi-scale convolution in L3Net not only improves the classification accuracy but also reduces the number of parameters. A graph up-sampling layer can be used similarly.
These multi-scale approaches apply generally to graph convolution models; see, e.g., the hierarchical construction originally proposed for locally-connected GNNs (Coates & Ng, 2011; Bruna et al., 2013). There is also flexibility in defining the graph down/up-sampling schemes, and the choice depends on the application. An example of a graph sampling operator on face mesh data is given in Ranjan et al. (2018). Finally, apart from using separate down/up-sampling layers, it is also possible to extend the L3Net model (2) to implement graph down/up-sampling directly, which would be similar to the convolution-with-stride (conv-t) operator in standard CNN. Specifically, between $G_l$ and $G_{l+1}$, the local basis filter $B_k(u, u')$ is defined for $u' \in G_l$ and $u \in G_{l+1}$, and $B_k(u, u') \ne 0$ only when $u'$ is in a local neighborhood of $u$. In matrix notation, $B_k$ is of size $|G_{l+1}|$-by-$|G_l|$, and is sparse according to the local neighborhood relation between $G_l$ and $G_{l+1}$.
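A minimal sketch of such a down-sampling L3Net layer, with a hypothetical 4×4-to-2×2 grid assignment of neighborhoods (the patch assignment and sizes are illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Each of the 4 coarse nodes pools a 2x2 patch of a 4x4 fine grid
# (hypothetical neighborhood assignment for illustration)
n_fine, n_coarse = 16, 4
patches = {0: [0, 1, 4, 5], 1: [2, 3, 6, 7],
           2: [8, 9, 12, 13], 3: [10, 11, 14, 15]}

K, C_in, C_out = 2, 3, 5
# Basis filters B_k are |G_{l+1}|-by-|G_l| and sparse on the neighborhoods
B = np.zeros((K, n_coarse, n_fine))
for k in range(K):
    for u, nbrs in patches.items():
        B[k, u, nbrs] = rng.standard_normal(len(nbrs))
A_coef = rng.standard_normal((K, C_in, C_out))

X = rng.standard_normal((n_fine, C_in))
# Down-sampling L3Net layer: Y = sum_k B_k X A_k maps G_l features to G_{l+1}
Y = sum(B[k] @ X @ A_coef[k] for k in range(K))
assert Y.shape == (n_coarse, C_out)
```

An up-sampling layer would transpose the roles of the two graphs, with $B_k$ of size $|G_l|$-by-$|G_{l+1}|$.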



Code is available at https://github.com/ZichenMiao/L3Net.



Figure 1: (a) K-rank graph local filters. Notation as in Sec. 2.1; specifically, u is the node index, c the channel index, k the basis index, and K the number of bases. M is the tensor in the GNN linear mapping (1)(2), decomposed into learnable local bases B_k combined by learnable coefficients a_k. (b) The first two figures show the desirable property of landmarks of being invariant to pose and camera viewpoint changes. The third figure illustrates the graph we build on facial landmarks.
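The decomposition in panel (a) can be sketched numerically: the layer applies $M(u', u; c', c) = \sum_k B_k(u', u)\, a_k(c', c)$, equivalently $Y = \sum_k B_k X A_k$ in matrix form. A NumPy toy example on a ring graph (illustrative sizes and random weights, not the released implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, C_in, C_out, K = 6, 2, 3, 2

# Ring graph: the 1-neighborhood of u is {u-1, u, u+1}
nbhd = {u: [(u - 1) % n, u, (u + 1) % n] for u in range(n)}

# Learnable local basis filters B_k, with row u supported on the neighborhood
# of u, and channel-mixing coefficients a_k
B = np.zeros((K, n, n))
for k in range(K):
    for u in range(n):
        B[k, u, nbhd[u]] = rng.standard_normal(len(nbhd[u]))
a = rng.standard_normal((K, C_in, C_out))

X = rng.standard_normal((n, C_in))

# Elementwise form: Y(u,c) = sum_{u',c'} M(u,u';c',c) X(u',c'), where
# M = sum_k B_k(u,u') a_k(c',c)  (indexed here as M[u, u', c', c])
M = np.einsum('kuv,kij->uvij', B, a)
Y_elem = np.einsum('uvij,vi->uj', M, X)

# Matrix form: Y = sum_k B_k X A_k
Y_mat = sum(B[k] @ X @ a[k] for k in range(K))
assert np.allclose(Y_elem, Y_mat)
```

The parameter count matches the discussion in Sec. 2.3: the $B_k$ contribute roughly $npK$ parameters ($p$ the average neighborhood size) and the $a_k$ contribute $KC'C$.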

Decomposed local filters. Consider an undirected graph G = (V, E), |V| = n. A graph convolution layer maps from input node features X(u', c') to output Y(u, c), where u, u' ∈ V, c' ∈ [C'] (c ∈ [C]) is the input (output) channel index, and the notation [m] means {1, · · · , m}.

Figure 2: Plots: (a) Local graph Laplacian $L_u := D - A$ on a neighborhood around node u. (b) Plots of the Dirichlet eigenvectors on the local graph. The first Dirichlet eigenvector does not change sign on $N_u$ and is envelope-like. (Table) Model complexity measured by the number of parameters, with C' and C the numbers of input and output channels, and p ($p^{(1)}$) the average patch size of local neighborhoods (local 1-neighborhoods); see more in Sec. 2.3.
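The envelope-like bottom eigenvector in panel (b) can be reproduced in a few lines (a sketch assuming a path-graph neighborhood; the Dirichlet Laplacian is the full-graph Laplacian restricted to the interior nodes):

```python
import numpy as np

# A path graph with 7 nodes; the local neighborhood is the 5 interior nodes
n = 7
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(1)) - A

# Dirichlet restriction: keep only rows/columns of the interior nodes
L_dir = L[1:6, 1:6]

# The bottom eigenvector of L_dir minimizes w^T L_dir w over unit vectors
evals, evecs = np.linalg.eigh(L_dir)          # ascending eigenvalues
w = evecs[:, 0]
w = w * np.sign(w[np.argmax(np.abs(w))])      # fix the arbitrary sign

# Connected subgraph => simple bottom eigenvalue, Perron-Frobenius
# eigenvector that does not change sign and peaks in the middle
assert evals[0] > 0 and evals[1] > evals[0]
assert np.all(w > 0)
assert np.argmax(w) == 2
```

For this tridiagonal case the bottom eigenvector is the discrete half-sine $\sin(\pi j / 6)$, $j = 1, \ldots, 5$, which is exactly the envelope shape shown in the figure.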

$(D - A)$ restricted to the subgraph on $N_u^{(d_k)}$, is the Dirichlet local graph Laplacian on $N_u^{(d_k)}$ (Chung & Graham, 1997) (Fig. 2). The training objective is

Figure 3: Up/down-wind classification. Plots: (a) Example data from the two classes. (b) Learned shared basis on the graph neighborhood of node 3, corresponding to the last row in the table. (Table) Test accuracy of MPNN (Gilmer et al., 2017), WLN (Morris et al., 2019), ChebNet up to L=30, and L3Net with K=1 and 3, as well as GAT with different numbers of heads. Last row, order 1 with star: L3Net with shared basis B(·, u) across all locations u.

The local graph regularization introduced in Sec. 2.2 improves the stability of Y w.r.t. ΔX by suppressing the response to local high-frequency perturbations in ΔX. Specifically, the local graph Laplacian $L_u^{(k)}$ on the subgraph on $N_u^{(d_k)}$

Figure 4: (Plot) Icosahedral spherical meshes at levels 2 and 1. (Table) Test accuracies of SphereMNIST under different mesh settings; (l1; l2; l3) stands for the mesh level used in each GNN layer. L3Net uses K=4 and neighborhood orders (1;1;2;3). S2CNN (Cohen et al., 2018) on mesh (4;3;2) has accuracy 96.0.

$w_{c',c}$. For standard CNN on $\mathbb{R}^d$, the $b_k$ are basis filters on $\mathbb{R}^d$, and for geometrical CNN they are defined on the reference domain in $\mathbb{R}^d$, the same as $w_{c',c}$, where $d$ is the intrinsic dimension. Suppose $w_{c',c} = \sum_{k=1}^{K} \beta_{k,(c',c)} b_k$ for coefficients $\beta_{k,(c',c)}$; by linearity, (11) becomes

Figure A.1: A ring graph with 8 nodes. Polynomials of the graph adjacency matrix A (or Laplacian matrix) preserve the symmetry of mirroring around any node, e.g., node 3, and cannot express a local filter B.

This means that if $B = f(A)$ for some $f$, then $B\pi_u = \pi_u B$, which contradicts the construction of $B$. Part 2): Consider two distributions of graph signals on the ring graph of 1), which we call "upwind/downwind" signals: $X_{up}$ consists of finite superpositions of functions on the ring graph which are periodic, smoothly increasing from 0 to 1 and then dropping to zero. Signals in $X_{up}$ follow a certain distribution, and $X_{down}$ consists of the signals produced by mirror-flipping the upwind signals. That is, denoting by $x_{up}$ ($x_{down}$) an upwind (downwind) signal and by $\pi_u$ the permutation of 1) around any node $u$, we have $\pi_u x_{up} \overset{dist.}{=} x_{down}$, where $\overset{dist.}{=}$ means equality in distribution. Example signals of the two classes are illustrated in Fig. 3.

Figure A.2: Illustration of 25-point body joints and graph.

Figure A.3: Three levels of graphs on a 14×14 image grid. From left to right: the top level G3 (the coarsest) to the bottom level G1 (the finest). The red square (pink circle) indicates the node x on G3 (G2) and its local neighborhood on G2 (G1), on which the graph pooling is applied.



(Table) In the last row, we further impose a shared basis across all nodes.

Results on CK+ and FER13, with comparison to CNN† (Ding et al., 2017), CNN‡ (Guo et al., 2016), a landmark method using handcrafted features (Morales-Vargas et al., 2019), and various GNN methods. Specifically, we compare to GAT (Veličković et al., 2017) with different numbers of heads (h) and features (f).

and Fig. A.2. We adopt ST-GCN (Yan et al., 2018) as the base architecture and substitute the GCN layer with the new L3Net layer, called ST-L3Net. On Kinetics-Motion, we adopt the regularization mechanism to overcome the severe data missing caused by camera out-of-view. See more experimental details in Appendix C.3. We benchmark performance against ST-GCN (Yan et al., 2018), ST-GCN (our implementation without using geometric information) and ST-ChebNet (replacing GCN with a ChebNet layer), shown in Table 2. L3Net shows significant advantages on the two NTU tasks, the cross-view and cross-subject settings. On Kinetics-Motion, L3Net regains superiority over the other models after applying regularization. The results in both Tables 1 and 2 indicate that stronger regularization sacrifices expressiveness on clean data and gains stability on noisy data, which is consistent with the theory in Sec. 3.2.

Results on MNIST with grid size 7 × 7 with different levels of Gaussian noise and permutation noise.

… small underlying graphs, like the face/body landmark data in the FER and action recognition applications. Limitations and extensions: (1) Scalability to larger graphs. When |V| = n is large, the complexity increase from the npK term becomes significant. In practice, the issue can be remedied by mixing layer types, e.g., adopting L3Net layers only at upper mesh levels, which are of reduced size. (2) Dynamically changing underlying graphs across samples. For more severe changes of the underlying graph, we can benefit from solutions such as node registration or other preprocessing techniques, possibly by another neural network. Related is the question of reducing model dependence on the graph topology, possibly under a statistical model of the underlying graphs; this includes transferability to larger networks.

Table A.2: Results on SphereMNIST and SphereModelNet-40, following the setup in Jiang et al. (2019).

It contains 28,709 training images, 3,589 validation images, and 3,589 test images of size (48, 48), with seven common expression labels as in CK+. We align the facial images, obtain facial landmarks, and select nodes and build the graph the same way as for CK+. The input features are the (8, 8) local image patch centered at each landmark and the landmark's coordinates, so the total input feature dimension is 66.

is a large skeleton-based action recognition dataset with three-dimensional coordinates given for every body joint (downloaded from http://rose1.ntu.edu.sg/datasets/requesterAdd.asp?DS=3). It comprises 60 action classes and 56,000 action clips in total. Every clip is captured by three fixed Kinect v2 sensors in a lab environment, performed by one of 40 different subjects. The three sensors are set at the same height but at different horizontal views, −45°, 0°, and 45°. There are 25 joints tracked, as shown in Fig. A.2. Two experiment settings are proposed by Shahroudy et al. (2016): cross-view (X-view) and cross-subject (X-sub). X-view consists of 37,920 clips for training and 18,960 for testing, where the training clips are from the sensors at 0° and 45° and the testing clips from the sensor at −45°. X-sub has 40,320 clips for training and 16,560 clips for testing, where the training clips are from 20 subjects and the testing clips are from the other 20 subjects. We test our model in both settings.

Table A.3: Results on MNIST with grid size 28×28; L3Net-pooling uses graph pooling between convolutional layers.
Table A.4: Results on MNIST with grid size 14 × 14.
Table A.5: Results on MNIST with grid size 7 × 7 with different levels of missing values.

Acknowledgements

The work is supported by NSF (DMS-1820827). XC is partially supported by NIH and the Alfred P. Sloan Foundation. ZM and QQ are partially supported by NSF and the DARPA TAMI program.

A Proofs

A.1 Details and proofs in Sec. 2.3

To facilitate comparison with the literature, we provide a summary of various graph convolution models in matrix notation; the precise definitions are detailed below. For simplicity, only the linear transform part is shown, and the addition of bias and the point-wise non-linearity are omitted. With notation as in Section 2.1, suppose $X \in \mathbb{R}^{n \times C'}$ is the input node feature and $Y \in \mathbb{R}^{n \times C}$ the output feature:

• L3Net (ours): $Y = \sum_{k=1}^{K} B_k X A_k$, where $B_k \in \mathbb{R}^{n \times n}$ is the local basis filter and $A_k \in \mathbb{R}^{C' \times C}$ are the coefficients; both $B_k$ and $A_k$ are learnable.
• ChebNet/GCN: $Y = \sum_{l=0}^{L-1} T_l(\tilde{L}) X \Theta_l$, where the $T_l(\cdot)$ are Chebyshev polynomials, $\tilde{L}$ is the rescaled and re-centered graph Laplacian, $T_l(\tilde{L}) \in \mathbb{R}^{n \times n}$, and $\Theta_l \in \mathbb{R}^{C' \times C}$ are trainable.
• GAT: $Y = \sum_{r=1}^{R} A^{(r)} X \Theta_r$, where $A^{(r)} \in \mathbb{R}^{n \times n}$ is the graph attention affinity computed adaptively from the input features, and $\Theta_r \in \mathbb{R}^{C' \times C}$ are trainable and weight-shared with the parameters in $A^{(r)}$; see more below.
• EdgeNet: $Y = \sum_{r=0}^{L-1} P_r X \Theta_r$, where $P_r = \prod_{k=0}^{r} \Phi_k$ for a sequence of trainable local filters $\Phi_k$, and $\Theta_r \in \mathbb{R}^{C' \times C}$ are trainable.

From the matrix formulation, it can be seen that when the $B_k$ are the classical graph filtering operators, e.g., polynomials of $\tilde{L}$, and the $A_k$ the trainable $\Theta_k$, L3Net recovers the above graph convolution models from the literature (c.f. Proposition 1). Below we give more details, as well as the reduction to filter-decomposed CNN (c.f. Proposition 2).
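The reduction of ChebNet to the L3Net form can be checked numerically: fixing the bases $B_l = T_l(\tilde{L})$ and coefficients $A_l = \Theta_l$ reproduces the ChebNet layer output. A small NumPy sketch on a ring graph (the rescaling assumes $\lambda_{max} = 2$, a common choice for the normalized Laplacian):

```python
import numpy as np

rng = np.random.default_rng(3)
n, C_in, C_out, L_ord = 6, 2, 3, 3

# Ring graph; symmetric normalized Laplacian, re-centered assuming lambda_max = 2
A = np.zeros((n, n))
for u in range(n):
    A[u, (u + 1) % n] = A[u, (u - 1) % n] = 1.0
d = A.sum(1)
L = np.eye(n) - A / np.sqrt(np.outer(d, d))
L_tilde = L - np.eye(n)

Theta = rng.standard_normal((L_ord, C_in, C_out))
X = rng.standard_normal((n, C_in))

# ChebNet applies the Chebyshev recursion T_l = 2 x T_{l-1} - T_{l-2}
# directly to the features:
Z = [X, L_tilde @ X]
for l in range(2, L_ord):
    Z.append(2 * L_tilde @ Z[-1] - Z[-2])
Y_cheb = sum(Z[l] @ Theta[l] for l in range(L_ord))

# The same layer written as an L3Net layer with fixed bases B_l = T_l(L_tilde)
T = [np.eye(n), L_tilde]
for l in range(2, L_ord):
    T.append(2 * L_tilde @ T[-1] - T[-2])
Y_l3 = sum(T[l] @ X @ Theta[l] for l in range(L_ord))

assert np.allclose(Y_cheb, Y_l3)
```

The difference in L3Net is that the $B_k$ are free (local) learnable matrices rather than fixed polynomials of $\tilde{L}$.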

A.1.1 Locally connected GNN

Specifically, the construction in Coates & Ng (2011); Bruna et al. (2013) assumes that $u$ and $u'$ belong to graphs of different scales: $u'$ is on the fine graph, and $u$ is on a coarse-grained layer produced by clustering the indices of the graph of the input layer. If one generalizes the construction to allow overlapping receptive fields, and assumes no pooling or coarse-graining of the graph, then the number of non-zero parameters is $npC'C$, with $p$ the average neighborhood size.

The EdgeNet layer takes the form $Y = \sum_{r=0}^{L-1} P_r X \Theta_r$ with $P_r = \prod_{k=0}^{r} \Phi_k$, where $\Phi_0$ is an $n$-by-$n$ diagonal matrix and $\Phi_k$, $k = 1, \ldots, r$, are supported on the 1-neighborhood $N_u^{(1)}$ of each node $u$. The trainable parameters are $\{\Phi_k\}_{k=0}^{R}$ and $\{\Theta_r\}_{r=0}^{R}$, $\Theta_r : \mathbb{R}^{C'} \to \mathbb{R}^{C}$. Edge-varying GAT implements polynomials of averaging filters, and the general edge-varying GNN takes products of arbitrary 1-order filters. The proof shows that an EdgeNet layer is a special case of an L3Net layer, with $B_k$ restricted to the product form (9) rather than freely supported on $N_u$. The trainable parameter count: $\Theta_r$ has $LC'C$ many, $\Phi_0$ has $n$, and $\Phi_k$, $k = 1, \ldots, L-1$, each has $np^{(1)}$ many, $p^{(1)}$ being the average size of the 1-neighborhoods of nodes. Thus the total number of parameters is $LC'C + n + (L-1)np^{(1)} \sim L(C'C + np^{(1)})$.

Proof of Proposition 1. Part (1): Since GCN is a special case of ChebNet, it suffices to prove that (5) can be expressed in the form of L3Net (2) for some $K$. By the definition of $\tilde{L}$, (5) is, mathematically equivalently, a polynomial in $A_{sym}$ per $(c', c)$, where the coefficients $\beta_l$ are determined by the $\theta_l$. Since $A_{sym}^l$ propagates to the $l$-th order neighborhood of any node, setting $B_l = A_{sym}^l$ and $a_l(c', c) = \beta_l$ expresses (5) in the form (2).

Part (2): We consider (6) as the GAT model. Recall that $\Theta_r : \mathbb{R}^{C'} \to \mathbb{R}^{C}$; then (6) can be re-written in the form of (1), which is a special case of (2) with $R = K$, $A^{(k)} = B_k$ and $\Theta_k = a_k$. Since $A^{(r)}(u, u')$ as a function of $u'$ is supported on $u' \in N_u^{(1)}$, (6) belongs to the L3Net model (2), with the addition that $B_k$ must be of the attention-affinity form, i.e., built from the attention coefficients $c_{uv}^{(r)}$ computed from the input $X$ via the parameters $\{W^{(r)}, a^{(r)}\}$.
Part (3): Comparing with (1)-(2), we have that (7) is a special case of L3Net (2) by letting

A.1.3 Standard and geometrical CNN's

Standard CNN on $\mathbb{R}^d$, e.g., $d = 1$ for audio signals and $d = 2$ for image data, applies a discretized convolution to the input data in each convolutional layer, which can be written as (omitting the bias, which is added per $c$, and the non-linear activation)

$$y(u, c) = \sum_{c' \in [C']} \sum_{u' \in U} w_{c',c}(u' - u)\, x(u', c'), \qquad (10)$$

where $U$ is a grid on $\mathbb{R}^d$. We write it in the "anti-convolution" form, which has "$u' - u$" rather than "$u - u'$", but the definition is equivalent. For audio and image data, $U$ is usually a regular mesh with evenly sampled grid points, and proper boundary conditions are applied when computing $y(u, c)$ at a boundary grid point $u$; e.g., the boundary can be handled by standard padding as in CNN. As the convolutional filters $w_{c',c}$ are compactly supported, the summation over $u'$ is on a neighborhood of $u$.

Lemma A.1 (Permutation equivariance, Proposition 1 in Gama et al. (2019a)). Let $A$ be the (possibly normalized) graph adjacency matrix. For any input signal $x : V \to \mathbb{R}$ and any permutation $\pi \in S_n$ of the graph nodes,
1) the spectral graph convolution mapping $f(A)$ satisfies $f(\pi A \pi^T)\, \pi x = \pi f(A)\, x$;
2) letting $F^{(l)}[A]$ be the mapping to the $l$-th layer spectral GNN feature with graph adjacency $A$, $F^{(l)}[\pi A \pi^T]\, \pi x = \pi F^{(l)}[A]\, x$.

Proof of Lemma A.1. Proved in Gama et al. (2019a); we reproduce it in our notation for completeness. Part 1): Denote the $n$-by-$n$ permutation matrix also by $\pi$. By definition, $f(A) = U f(\Lambda) U^T$, where $A = U \Lambda U^T$ is the diagonalization and $U$ is an orthogonal matrix; thus $f(\pi A \pi^T) = \pi U f(\Lambda) U^T \pi^T = \pi f(A) \pi^T$, and this proves 1). Part 2): Each spectral GNN layer mapping adds the bias and the node-wise non-linear activation to the graph convolution linear operator, which preserves permutation equivariance. Recursively applying this to $L$ layers proves 2).

Proof of Theorem 1 (continued). We have (14), and observe that …, where we used the assumption on $Kp$ to obtain the last $\le$. Then (16) continues as ….

Proof of Theorem 2. Same as in the proof of Theorem 1, we have (14).
The eigendecomposition $L_u^{(k)} = \Psi_u^{(k)} \Lambda_u^{(k)} (\Psi_u^{(k)})^T$ satisfies $(\Psi_u^{(k)})^T \Psi_u^{(k)} = I$, and, under the connectivity condition on the subgraph, the diagonal entries of $\Lambda_u^{(k)}$ are positive, which gives the Cauchy-Schwarz inequality with the weighted 2-norm …. Then, similarly as in (16), using the definition of $\beta^{(2)}$ and the condition with $\rho$, we obtain …, and the rest of the proof is the same, which proves the claim.

B Up/down-wind Classification Experiment

B.1 Dataset Setup

We generate the up/down-wind dataset on both a ring graph and a chain graph with 64 nodes. Every node is assigned a probability drawn from the uniform distribution on (0, 1). A node with probability less than the threshold 0.1 is assigned a Gaussian bump with std = 1.5. Each Gaussian bump added is masked on one side: bumps with the left half masked form the 'Down Wind' class, and bumps with the right half masked form the 'Up Wind' class, as shown in the left plot of Fig. 3. We then sum up all the half-bumps from the different locations in each sample. We generate 5000 training samples and 5000 testing samples.
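A sketch of the data generation just described, under our reading of the masking convention (the function name and defaults are illustrative, not the released code):

```python
import numpy as np

def make_updown_sample(n=64, thresh=0.1, std=1.5, upwind=True, rng=None):
    """One sample of the up/down-wind dataset on a ring/chain with n nodes."""
    rng = rng or np.random.default_rng()
    grid = np.arange(n)
    x = np.zeros(n)
    # nodes whose uniform draw falls below the threshold receive a bump
    for c in np.nonzero(rng.random(n) < thresh)[0]:
        bump = np.exp(-0.5 * ((grid - c) / std) ** 2)
        # mask one side of the bump: the 'Up Wind' class has the right half
        # masked out, the 'Down Wind' class the left half (as described above)
        keep = grid <= c if upwind else grid >= c
        x += bump * keep
    return x

rng = np.random.default_rng(0)
x_up = make_updown_sample(upwind=True, rng=rng)
x_down = make_updown_sample(upwind=False, rng=rng)
assert x_up.shape == (64,) and np.all(x_up >= 0)
```

The two classes are mirror images of each other in distribution, which is what makes them indistinguishable to spectral filters on the ring graph (Proposition 4).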

B.2 Model architecture and training details

Network architectures.
• 2-gcn-layer model: GraphConv(1,32)-ReLU-MaxPool1d(2)-GraphConv(32,64)-ReLU-AvgPool(32)-FC(2),
• 1-gcn-layer model,
where GraphConv can be ChebNet or L3Net.

Training details. We choose the Adam optimizer with batch size 100, set the initial learning rate to 1 × 10⁻³, decay it by 0.1 at epoch 80, and train for 100 epochs.

B.3 Additional results

We report additional results using the 1-gcn-layer architecture in Tab. A.1. Our L3Net again shows stronger classification performance than ChebNet.

C Experimental Details

C.1 Classification of sphere mesh data

Spherical mesh. We conduct this experiment on the icosahedral spherical mesh (Baumgardner & Frederickson, 1985). Like S2CNN (Cohen et al., 2018), we project each digit image onto the sphere surface. Here, we detail the subdivision scheme of the icosahedral spherical mesh. Starting from a unit icosahedron, this sphere discretization progressively subdivides each face into four equal triangles, which makes the discretization uniform and accurate. In addition, this scheme provides a natural downsampling strategy for networks, as it defines the path for aggregating information from higher-level neighbor nodes to a lower-level center node. We adopt the following naming convention for mesh resolutions: starting with the level-0 (L0) mesh (i.e., the unit icosahedron), each level above is associated with one subdivision. For level-i ($L_i$), the properties of the spherical mesh are $N_f = 20 \cdot 4^i$, $N_e = 30 \cdot 4^i$, $N_v = 10 \cdot 4^i + 2$, in which $N_f$, $N_e$, $N_v$ denote the numbers of faces, edges, and vertices. To give a direct illustration of how many nodes each level of mesh has: L0 has 12 nodes.

Training details. For the SphereMNIST experiments, we use batch size 64, the Adam optimizer, and an initial learning rate of 0.01 that decays by 0.5 every 10 epochs; we train the model for 100 epochs in total. For the SphereModelNet-40 experiment, we use batch size 16, the Adam optimizer, and an initial learning rate of 0.005 that decays by 0.7 every 25 epochs; we train for 300 epochs in total.
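The level-$i$ mesh sizes can be sketched in a few lines; the vertex count follows from Euler's formula $N_v - N_e + N_f = 2$ on the sphere (the function name is illustrative):

```python
# Face/edge/vertex counts of the level-i icosahedral mesh: each subdivision
# splits every face into four (multiplying face and edge counts by 4), and
# Euler's formula on the sphere gives the vertex count.
def ico_counts(i):
    n_f = 20 * 4 ** i
    n_e = 30 * 4 ** i
    n_v = n_e - n_f + 2          # = 10 * 4**i + 2
    return n_v, n_e, n_f

assert ico_counts(0) == (12, 30, 20)     # L0: the unit icosahedron, 12 nodes
for i in range(6):
    n_v, n_e, n_f = ico_counts(i)
    assert n_v - n_e + n_f == 2          # Euler characteristic of the sphere
```

For example, the L2 mesh has $10 \cdot 16 + 2 = 162$ nodes.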

Results on fine mesh

Tab. A.2 shows the results of SphereMNIST and SphereModelNet-40 on fine meshes on the sphere. Specifically, the mesh used for SphereMNIST here has levels L4, L3, L2, and the SphereModelNet-40 mesh has levels L5, L4, L3, L2, the same as in Jiang et al. (2019).

• NTU-RGB+D: We follow the architecture in Yan et al. (2018): STGraphConv(3,64,9,s1)-STGraphConv(64,64,9,s1)-STGraphConv(64,64,9,s1)-STGraphConv(64,64,9,s1)-STGraphConv(64,128,9,s2)-STGraphConv(128,128,9,s1)-STGraphConv(128,128,9,s1)-STGraphConv(128,256,9,s2)-STGraphConv(256,256,9,s1)-STGraphConv(256,256,9,s1)-STAvgPool-FC(60).
• Kinetics: We also design a computation-efficient architecture for Kinetics-Motion.

Training Details

• NTU-RGB+D: We use batch size 32 and an initial learning rate of 0.001 that decays by 0.1 at epochs (30, 80), and train for 120 epochs in total with the SGD optimizer. We pad every sample temporally with zeros to 300 frames.
• Kinetics: We use batch size 32 and an initial learning rate of 0.01 that decays by 0.1 at epochs (40, 80), and train for 100 epochs in total with the SGD optimizer. We pad every sample temporally with zeros to 300 frames, and during training we perform data augmentation by randomly choosing 150 contiguous frames.

C.4 Details of experiment on MNIST

C.4.1 Simulated graph noise on 7 × 7 MNIST

Here we describe the three types of noise in our experiments.

Gaussian noise. Given a 7 × 7 image from MNIST, we sample 49 values from N(0, std²), where std controls the strength of the added noise. We conduct experiments under std = 0.1, 0.2, 0.3, as shown in Tab. 3. The amount of noise is also measured by PSNR, which is standard for image data.

Missing-value noise. Given an image, we randomly sample 49 values from U(0, 1) and select the nodes with probabilities less than a threshold. This threshold, called the noise level, controls the percentage of nodes affected. We then remove the pixel values at the selected nodes. Experiments with noise level = 0.1, 0.2, 0.3 are conducted.

Graph node permutation noise. For each sample, we randomly select a permutation center node that has exactly 4 neighbors. We then rotate its neighbors clockwise by 90 degrees, e.g., the top neighbor becomes the right neighbor, and update the indices of the permuted nodes.
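The three noise models can be sketched as follows (a NumPy illustration with hypothetical helper names; the permutation noise is applied here by swapping pixel values, which for the signal is equivalent to relabeling the node indices):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((7, 7))                     # stand-in for a 7x7 MNIST image

# Gaussian noise: add N(0, std^2) to every pixel
def gaussian_noise(img, std, rng):
    return img + rng.normal(0.0, std, img.shape)

# Missing-value noise: zero out pixels selected with probability `level`
def missing_value(img, level, rng):
    out = img.copy()
    out[rng.random(img.shape) < level] = 0.0
    return out

# Permutation noise: pick an interior node with exactly 4 grid neighbors and
# rotate its neighbors clockwise by 90 degrees (top->right, right->bottom, ...)
def permute_neighbors(img, rng):
    out = img.copy()
    i, j = rng.integers(1, 6, size=2)        # interior pixel of the 7x7 grid
    top, right, bot, left = out[i-1, j], out[i, j+1], out[i+1, j], out[i, j-1]
    out[i-1, j], out[i, j+1], out[i+1, j], out[i, j-1] = left, top, right, bot
    return out

noisy = gaussian_noise(img, 0.2, rng)
assert noisy.shape == img.shape
assert missing_value(img, 0.3, rng).shape == img.shape
# a permutation only moves values around, so the total intensity is preserved
assert np.isclose(permute_neighbors(img, rng).sum(), img.sum())
```

Note that the permutation noise changes the graph signal but not the graph topology used by the convolution, which is why it probes stability rather than expressiveness.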

C.4.2 Network architecture and training details

We use the same architecture for the different experiment settings: GraphConv(1,32)-BN-ReLU-GraphConv(32,64)-BN-ReLU-FC(10), where GraphConv can be a different type of graph convolution layer. We set the batch size to 100, use the Adam optimizer, and set the initial learning rate to 1e-3; the learning rate drops by a factor of 10 if the lowest validation loss does not improve for 15 epochs. We train for 200 epochs in total and use 10,000 images for training. We also adopt graph pooling layers in the above architecture: GraphConv(1,32)-BN-ReLU-GraphPooling-GraphConv(32,64)-BN-ReLU-GraphPooling-FC(10). More discussion of the graph pooling layer and multi-scale graph convolution is given in Appendix C.5.

C.4.3 Additional results

Here, we show experimental results on the 28 × 28 and 14 × 14 grids, as well as the 7 × 7 grid with missing values. Tab. A.3 shows results on the 28 × 28 image grid; our model has better performance than the other methods. Tab. A.4 shows results on the 14 × 14 image grid, where our L3Net is comparable with the best ChebNet (Defferrard et al., 2016). We show our results on the 7 × 7 image grid with missing values in Tab. A.5; with regularization, L3Net achieves the best performance at every noise level.

C.5 Multi-scale graph convolution

The proposed L3Net graph convolution model (2) is compatible with graph down/up-sampling schemes to achieve multi-scale feature extraction. The graph down/up-sampling is usually implemented as a separate layer between graph convolution layers. As an example,

