DATA-DRIVEN LEARNING OF GEOMETRIC SCATTERING NETWORKS

Abstract

Many popular graph neural network (GNN) architectures, which are often considered as the current state of the art, rely on encoding graph structure via smoothness or similarity between neighbors. While this approach performs well on a surprising number of standard benchmarks, the efficacy of such models does not translate consistently to more complex domains, such as graph data in the biochemistry domain. We argue that these more complex domains require priors that encourage learning of longer range features rather than oversmoothed signals of standard GNN architectures. Here, we propose an alternative GNN architecture, based on a relaxation of recently proposed geometric scattering transforms, which consists of a cascade of graph wavelet filters. Our learned geometric scattering (LEGS) architecture adaptively tunes these wavelets and their scales to encourage band-pass features to emerge in learned representations. This results in a simplified GNN with significantly fewer learned parameters compared to competing methods. We demonstrate the predictive performance of our method on several biochemistry graph classification benchmarks, as well as the descriptive quality of its learned features in biochemical graph data exploration tasks. Our results show that the proposed LEGS network matches or outperforms popular GNNs, as well as the original geometric scattering construction, while retaining certain mathematical properties of its handcrafted (nonlearned) design.

1. INTRODUCTION

Geometric deep learning has recently emerged as an increasingly prominent branch of machine learning in general, and deep learning in particular (Bronstein et al., 2017). It is based on the observation that many of the impressive achievements of neural networks come in applications where the data has an intrinsic geometric structure which can be used to inform network design and training procedures. For example, in computer vision, convolutional neural networks use the spatial organization of pixels to define convolutional filters that hierarchically aggregate local information at multiple scales that in turn encode shape and texture information in data and task-driven representations. Similarly, in time-series analysis, recurrent neural networks leverage memory mechanisms based on the temporal organization of input data to collect multiresolution information from local subsequences, which can be interpreted geometrically via tools from dynamical systems and spectral analysis. While these examples only leverage Euclidean spatiotemporal structure in data, they exemplify the potential benefits of incorporating information about intrinsic data geometry in neural network design and processing. Indeed, recent advances have further generalized the utilization of geometric information in neural network design to consider non-Euclidean structures, with particular interest in graphs that represent data geometry, either directly given as input or constructed as an approximation of a data manifold. At the core of geometric deep learning is the use of graph neural networks (GNNs) in general, and graph convolutional networks (GCNs) in particular, which ensure neuron activations follow the geometric organization of input data by propagating information across graph neighborhoods (Bruna et al., 2014; Defferrard et al., 2016; Kipf & Welling, 2016; Hamilton et al., 2017; Xu et al., 2019; Abu-El-Haija et al., 2019).
However, recent work has shown the difficulty in generalizing these methods to more complex structures, identifying common problems and phrasing them in terms of oversmoothing (Li et al., 2018), oversquashing (Alon & Yahav, 2020) or under-reaching (Barceló et al., 2020). Using graph signal processing terminology from Kipf & Welling (2016), these issues can be partly attributed to the limited construction of convolutional filters in many commonly used GCN architectures. Inspired by the filters learned in convolutional neural networks, GCNs consider node features as graph signals and aim to aggregate information from neighboring nodes. For example, Kipf & Welling (2016) presented a typical implementation of a GCN with a cascade of averaging (essentially low pass) filters. We note that more general variations of GCN architectures exist (Defferrard et al., 2016; Hamilton et al., 2017; Xu et al., 2019), which are capable of representing other filters, but as investigated in Alon & Yahav (2020), they too often have difficulty in learning long range connections. Recently, an alternative approach was presented to provide deep geometric representation learning by generalizing Mallat's scattering transform (Mallat, 2012), originally proposed to provide a mathematical framework for understanding convolutional neural networks, to graphs (Gao et al., 2019; Gama et al., 2019a; Zou & Lerman, 2019) and manifolds (Perlmutter et al., 2018). Similar to traditional scattering, which can be seen as a convolutional network with nonlearned wavelet filters, geometric scattering is defined as a GNN with handcrafted graph filters, typically constructed as diffusion wavelets over the input graph (Coifman & Maggioni, 2006), which are then cascaded with pointwise absolute-value nonlinearities. This wavelet cascade results in permutation equivariant node features that are typically aggregated via statistical moments over the graph nodes, as explained in detail in Sec. 2, to provide a permutation invariant graph-level representation. The efficacy of geometric scattering features in graph processing tasks was demonstrated in Gao et al. (2019), with both supervised learning and data exploration applications. Moreover, their handcrafted design enables rigorous study of their properties, such as stability to deformations and perturbations, and provides a clear understanding of the information extracted by them, which by design (e.g., the cascaded band-pass filters) goes beyond low frequencies to consider richer notions of regularity (Gama et al., 2019b; Perlmutter et al., 2019). However, while graph scattering transforms provide effective universal feature extractors, their rigid handcrafted design does not allow for the automatic task-driven representation learning that naturally arises in traditional GNNs. To address this deficiency, recent work has proposed a hybrid scattering-GCN model (Min et al., 2020) for obtaining node-level representations, which ensembles a GCN model with a fixed scattering feature extractor. In Min et al. (2020), integrating channels from both architectures alleviates the well-known oversmoothing problem and outperforms popular GNNs on node classification tasks. Here, we focus on improving the geometric scattering transform by learning, in particular, its scales. We focus on whole-graph representations with an emphasis on biochemical molecular graphs, where relatively large diameters and non-planar structures usually limit the effectiveness of traditional GNNs. Instead of the ensemble approach of Min et al. (2020), we propose a native neural network architecture for learned geometric scattering (LEGS), which directly modifies the scattering architecture from Gao et al. (2019); Perlmutter et al. (2019), via relaxations described in Sec. 3, to allow a task-driven adaptation of its wavelet configuration via backpropagation, as implemented in Sec. 4.
We note that other recent graph spectrum-based methods approach the learning of long range connections by approximating the spectrum of the graph with the Lanczos algorithm (Liao et al., 2019), or by learning in block Krylov subspaces (Luan et al., 2019). Such methods are complementary to the work presented here, in that their spectral approximation can also be applied in the computation of geometric scattering when considering very long range scales (e.g., via a spectral formulation of graph wavelet filters). However, we find that such approximations are not necessary in the datasets considered here and in other recent work focusing on whole-graph tasks, where direct computation of polynomials of the Laplacian is sufficient. The resulting learnable geometric scattering network balances the mathematical properties inherited from the scattering transform (as shown in Sec. 3) with the flexibility enabled by adaptive representation learning. The benefits of our construction over standard GNNs, as well as pure geometric scattering, are discussed and demonstrated on graph classification and regression tasks in Sec. 5. In particular, we find that our network maintains the robustness to small training sets present in graph scattering while improving classification on biological graph classification and regression tasks, and we show that in tasks where the graphs have a large diameter relative to their size, learnable scattering features improve performance over competing methods.

2. PRELIMINARIES: GEOMETRIC SCATTERING FEATURES

Let G = (V, E, w) be a weighted graph with V := {v_1, …, v_n} the set of nodes, E ⊂ {{v_i, v_j} ∈ V × V, i ≠ j} the set of (undirected) edges and w : E → (0, ∞) assigning (positive) edge weights to the graph edges. Note that w can equivalently be considered as a function of V × V, where we set the weights of non-adjacent node pairs to zero. We define a graph signal as a function x : V → R on the nodes of G and aggregate them in a signal vector x ∈ R^n with the i-th entry being x[v_i]. We define the weighted adjacency matrix W ∈ R^{n×n} of the graph G as

W[v_i, v_j] := w(v_i, v_j) if {v_i, v_j} ∈ E, and W[v_i, v_j] := 0 otherwise,

and the degree matrix D ∈ R^{n×n} of G as D := diag(d_1, …, d_n) with d_i := deg(v_i) := Σ_{j=1}^n W[v_i, v_j] being the degree of the node v_i. The geometric scattering transform (Gao et al., 2019) relies on a cascade of graph filters constructed from a column-stochastic diffusion matrix P := (1/2)(I_n + W D^{−1}), which corresponds to transition probabilities of a lazy random walk Markov process. The laziness of the process signifies that at each step it has equal probability of either staying at the current node or transitioning to a neighbor, where transition probabilities in the latter case are determined by (normalized) edge weights. Scattering filters are then defined via the graph-wavelet matrices Ψ_j ∈ R^{n×n} of scale j ∈ N_0, as

Ψ_0 := I_n − P,   Ψ_j := P^{2^{j−1}} − P^{2^j} = P^{2^{j−1}} (I_n − P^{2^{j−1}}),   j ≥ 1.   (1)

These diffusion wavelet operators partition the frequency spectrum into dyadic frequency bands, which are then organized into a full wavelet filter bank W_J := {Ψ_j, Φ_J}_{0≤j≤J}, where Φ_J := P^{2^J} is a pure low-pass filter, similar to the one used in GCNs. It is easy to verify that the resulting wavelet transform is invertible, since a simple sum of the filter matrices in W_J yields the identity. Moreover, as discussed in Perlmutter et al. (2019), this filter bank forms a nonexpansive frame, which provides energy preservation guarantees as well as stability to perturbations, and can be generalized to a wider family of constructions that encompasses the variations of scattering transforms on graphs from Gama et al. (2019a;b) and Zou & Lerman (2019). Given the wavelet filter bank W_J, node-level scattering features are computed by stacking cascades of bandpass filters and element-wise absolute value nonlinearities to form

U_p x := Ψ_{j_m} |Ψ_{j_{m−1}} ⋯ |Ψ_{j_2} |Ψ_{j_1} x|| ⋯ |,   (2)

indexed (or parametrized) by the scattering path p := (j_1, …, j_m) ∈ ∪_{m∈N} N_0^m that determines the filter scales captured by each scattering coefficient. Then, a whole-graph scattering representation is obtained by aggregating together node-level features via statistical moments over the nodes of the graph (Gao et al., 2019). This construction yields the geometric scattering features

S_{p,q} x := Σ_{i=1}^n |U_p x[v_i]|^q,   (3)

indexed by the scattering path p and moment order q. Finally, we note that it can be shown that the graph-level scattering transform S_{p,q} guarantees node-permutation invariance, while U_p is permutation equivariant (Perlmutter et al., 2019; Gao et al., 2019).
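To make the fixed construction concrete, the following is a minimal NumPy sketch of the lazy random walk, the dyadic wavelets of Eq. 1, and the scattering features of Eqs. 2-3. The function names are ours, not from the papers cited above; a full implementation would batch signals and precompute diffusion powers.

```python
import numpy as np

def lazy_walk(W):
    """P = (1/2)(I_n + W D^{-1}): lazy random walk diffusion matrix."""
    d = W.sum(axis=0)                    # node degrees (W symmetric)
    return 0.5 * (np.eye(len(W)) + W / d)  # W / d column-normalizes by degree

def wavelets(P, J):
    """Dyadic diffusion wavelets: Psi_0 = I - P, Psi_j = P^{2^{j-1}} - P^{2^j}."""
    Psi = [np.eye(len(P)) - P]
    for j in range(1, J + 1):
        Psi.append(np.linalg.matrix_power(P, 2 ** (j - 1))
                   - np.linalg.matrix_power(P, 2 ** j))
    return Psi

def scatter(x, Psi, path, q):
    """S_{p,q} x = sum_i |U_p x[v_i]|^q for scattering path p = (j_1, ..., j_m)."""
    u = x
    for j in path[:-1]:
        u = np.abs(Psi[j] @ u)           # inner filters pass through |.|
    u = Psi[path[-1]] @ u                # outermost filter, |.|^q applied below
    return np.sum(np.abs(u) ** q)
```

Note the invertibility property mentioned above: the wavelets plus the low-pass Φ_J = P^{2^J} telescope back to the identity, which is an easy sanity check for any implementation.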

3. RELAXED GEOMETRIC SCATTERING CONSTRUCTION TO ALLOW TRAINING

The geometric scattering construction, described in Sec. 2, can be seen as a particular GNN with handcrafted layers, rather than learned ones. This provides a solid mathematical framework for understanding the encoding of geometric information in GNNs, as shown in Perlmutter et al. (2019), while also providing effective unsupervised graph representation learning for data exploration, which has some advantages even in supervised learning tasks, as shown in Gao et al. (2019). While the handcrafted design in Perlmutter et al. (2019); Gao et al. (2019) is not a priori amenable to the task-driven tuning provided by end-to-end GNN training, we note that the cascade in Eq. 3 does conform to a neural network architecture suitable for backpropagation. Therefore, in this section, we show how and under what conditions a relaxation of the laziness of the random walk and of the selection of the scales preserves some of the useful mathematical properties established in Perlmutter et al. (2019). We then establish in Sec. 5 the empirical benefits of learning the diffusion scales over a purely handcrafted design. We first note that the construction of the diffusion matrix P that forms the lowpass filter used in the fixed scattering construction can be relaxed to encode adaptive laziness by setting P_α := αI_n + (1 − α) W D^{−1}, where α ∈ [1/2, 1) controls the reluctance of the random walk to transition from one node to another. Setting α = 1/2 gives an equal probability to stay at the same node as to transition to one of its neighbors. At this point, we note that one difference between the diffusion lowpass filter here and the one typically used in GCN and its variations is the symmetrization applied in Kipf & Welling (2016). However, Perlmutter et al. (2019) established that for the original construction, this is only a technical difference, since P can be regarded as self-adjoint under an appropriate measure which encodes degree variations in the graph.
This is then used to generate a Hilbert space L^2(G, D^{−1/2}) of graph signals with inner product ⟨x, y⟩_{D^{−1/2}} := ⟨D^{−1/2} x, D^{−1/2} y⟩. The following lemma shows that a similar property is retained for our adaptive lowpass filter P_α.

Lemma 1. The matrix P_α is self-adjoint on the Hilbert space L^2(G, D^{−1/2}) from Perlmutter et al. (2019).

We note that the self-adjointness shown here is interesting, as it links models that use symmetric and asymmetric versions of the Laplacian or adjacency matrix. Namely, Lemma 1 shows that the diffusion matrix P (which is column normalized but not row normalized) is self-adjoint as an operator, and can thus be considered as "symmetric" in a suitable inner product space, thus establishing a theoretical link between these design choices. As a second relaxation, we propose to replace the handcrafted dyadic scales in Eq. 1 with an adaptive monotonic sequence of integer diffusion time scales 0 < t_1 < ⋯ < t_J, which can be selected or tuned via training. Then, an adaptive filter bank is constructed as W_J := {Ψ_j, Φ_J}_{j=0}^{J−1}, (4) with

Φ_J := P_α^{t_J},   Ψ_0 := I_n − P_α^{t_1},   Ψ_j := P_α^{t_j} − P_α^{t_{j+1}},   1 ≤ j ≤ J − 1.   (5)

The following theorem shows that for any selection of scales, the relaxed construction of W_J yields a nonexpansive frame, similar to the result from Perlmutter et al. (2019) shown for the original handcrafted construction.

Theorem 1. There exists a constant C > 0 that only depends on t_1 and t_J such that for all x ∈ L^2(G, D^{−1/2}),

C ‖x‖²_{D^{−1/2}} ≤ ‖Φ_J x‖²_{D^{−1/2}} + Σ_{j=0}^{J−1} ‖Ψ_j x‖²_{D^{−1/2}} ≤ ‖x‖²_{D^{−1/2}},

where the norm considered here is the one induced by the space L^2(G, D^{−1/2}).

Intuitively, the upper (i.e., nonexpansive) frame bound implies stability in the sense that small perturbations in the input graph signal will only result in small perturbations in the representation extracted by the constructed filter bank.
Further, the lower frame bound ensures certain energy preservation by the constructed filter bank, thus indicating the nonexpansiveness is not implemented in a trivial fashion (e.g., by constant features independent of the input signal). In the next section we leverage the two relaxations described here to design a neural network architecture for learning the configuration α, t_1, …, t_J of this relaxed construction via backpropagation through the resulting scattering filter cascade. The following theorem establishes that for any such configuration, the features extracted from W_J via Eqs. 2-3 are permutation equivariant at the node level and permutation invariant at the graph level. This guarantees that the extracted (in this case learned) features indeed encode intrinsic graph geometry rather than a priori indexation.

Theorem 2. Let U_p and S_{p,q} be defined as in Eqs. 2 and 3 (correspondingly), with the filters from W_J with an arbitrary configuration 0 < α < 1, 0 < t_1 < ⋯ < t_J. Then, for any permutation Π over the nodes of G, and any graph signal x ∈ L^2(G, D^{−1/2}),

U_p Πx = Π U_p x   and   S_{p,q} Πx = S_{p,q} x,   for all p ∈ ∪_{m∈N} N_0^m, q ∈ N,

where geometric scattering implicitly considers here the node ordering supporting its input signal.

We note that the results in Lemma 1 and Theorems 1-2, as well as their proofs, closely follow the theoretical framework proposed by Perlmutter et al. (2019). We carefully account here for the relaxed learned configuration, which replaces the originally handcrafted configuration there. For completeness, the adjusted proofs appear in Sec. A of the Appendix.
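Theorem 1's frame bounds can be probed numerically for any admissible configuration. Below is a small NumPy sketch that builds the relaxed bank of Eqs. 4-5 and checks the upper (nonexpansive) bound on a toy graph; `relaxed_bank` and `frame_energy` are our own names, and the specific graph and scales are arbitrary choices.

```python
import numpy as np

def relaxed_bank(W, alpha, scales):
    """Adaptive filter bank: Psi_0 = I - P^{t_1}, Psi_j = P^{t_j} - P^{t_{j+1}},
    Phi_J = P^{t_J}, for integer scales 0 < t_1 < ... < t_J."""
    n = len(W)
    P = alpha * np.eye(n) + (1 - alpha) * W / W.sum(axis=0)
    powers = [np.linalg.matrix_power(P, t) for t in scales]
    Psi = [np.eye(n) - powers[0]]
    Psi += [powers[j] - powers[j + 1] for j in range(len(scales) - 1)]
    return Psi, powers[-1]

def frame_energy(x, Psi, Phi, D):
    """||Phi_J x||^2 + sum_j ||Psi_j x||^2 in L^2(G, D^{-1/2})."""
    Dm = np.diag(1.0 / np.sqrt(np.diag(D)))   # D^{-1/2}
    e = np.linalg.norm(Dm @ Phi @ x) ** 2
    e += sum(np.linalg.norm(Dm @ Ps @ x) ** 2 for Ps in Psi)
    return e
```

By Theorem 1, for any signal x the returned energy should fall between C‖x‖²_{D^{−1/2}} and ‖x‖²_{D^{−1/2}}; checking the upper bound on random signals is a cheap regression test for an implementation.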

4. LEARNABLE GEOMETRIC SCATTERING NETWORK ARCHITECTURE

In order to implement the relaxed geometric scattering construction (Sec. 3) via a trainable neural network, throughout this section we consider an input graph signal x ∈ R^n or, equivalently, a collection of graph signals X ∈ R^{n×N}. The propagation of these signals can be divided into three major modules. First, a diffusion module implements the Markov process that forms the basis of the filter bank and transform, while allowing learning of the laziness parameter α. Then, a scattering module implements the filters and the corresponding cascade, while allowing the learning of the scales t_1, …, t_J. Finally, the aggregation module collects the extracted features to provide a graph-level representation and produces the task-dependent output.

Building a diffusion process. We build a set of m ∈ N subsequent diffusion steps of the signal x by iteratively multiplying the diffusion matrix P_α to the left of the signal, resulting in P_α x, P_α² x, P_α³ x, …, P_α^m x. Since P_α is often sparse, for efficiency reasons these filter responses are implemented via an RNN structure consisting of m RNN modules. Each module propagates the incoming hidden state h_{t−1}, t = 1, …, m, with P_α, with the readout o_t equal to the produced hidden state: h_t := P_α h_{t−1}, o_t := h_t. Our architecture and theory enable the implementation of either trainable or nontrainable α, which we believe will be useful for future work as indicated, for example, in Gao & Ji (2019). However, in the applications considered here (see Sec. 5), we find that training α made training unstable and did not improve performance. Therefore, for simplicity, we leave it fixed as α = 1/2 for the remainder of this work. In this case, the RNN portion of the network contains no trainable parameters, thus speeding up the computation, but it still enables a convenient gradient flow back to the model input. Learning diffusion filter bank.
Next, we consider the selection of J ≤ m diffusion scales for the relaxed filter bank construction, with the wavelets defined according to Eq. 5. We found this was the most influential part of the architecture. We experimented with methods of increasing flexibility:

1. Selection of the t_j as dyadic scales (as in Sec. 2 and Eq. 1), fixed for all datasets (LEGS-FIXED),
2. Selection of each t_j using softmax and sorting by j, learnable per model (LEGS-FCN and LEGS-RBF, depending on the output layer explained below).

For the softmax selection, we use a selection matrix F ∈ R^{J×m}, where each row F_{(j,·)}, j = 1, …, J, is dedicated to identifying the diffusion scale of the wavelet P_α^{t_j} via a one-hot encoding. This is achieved by setting F := softmax(Θ) = [softmax(θ_1), softmax(θ_2), …, softmax(θ_J)]^T, where the θ_j ∈ R^m constitute the rows of the trainable weight matrix Θ. While this construction may not strictly guarantee an exact one-hot encoding, we assume that the softmax activations yield a sufficient approximation. Further, without loss of generality, we assume that the rows of F are ordered according to the position of the leading "one" activated in every row. In practice, this can be easily enforced by reordering the rows. We now construct the filter bank W_F := {Ψ_j, Φ_J}_{j=0}^{J−1} with the filters

Φ_J x = Σ_{t=1}^m F_{(J,t)} P_α^t x,   Ψ_0 x = x − Σ_{t=1}^m F_{(1,t)} P_α^t x,   Ψ_j x = Σ_{t=1}^m (F_{(j,t)} − F_{(j+1,t)}) P_α^t x,   1 ≤ j ≤ J − 1,

matching and implementing the construction of W_J from Eq. 4.

Aggregating and classifying scattering features. While many approaches may be applied to aggregate node-level features into graph-level features, such as max, mean, or sum pooling, or the more powerful TopK (Gao & Ji, 2019) or attention pooling (Veličković et al., 2018), we follow the statistical-moment aggregation explained in Secs. 2-3 (motivated by Gao et al., 2019; Perlmutter et al., 2019) and leave exploration of other pooling methods to future work.
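The diffusion and scale-selection modules described above can be sketched together in a few lines of NumPy. The function names are ours; in the actual trainable network, Θ would be a learned parameter (e.g., a PyTorch tensor) updated by backpropagation, and the diffusion loop is the parameter-free RNN described earlier.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def legs_filter_bank(x, P, Theta):
    """Apply the selection-matrix filter bank W_F to signal x.

    P is the n x n lazy diffusion matrix, Theta the J x m weight matrix;
    F = softmax(Theta) softly selects one diffusion scale per wavelet.
    Returns [Psi_0 x, ..., Psi_{J-1} x, Phi_J x].
    """
    J, m = Theta.shape
    diff, h = [], x
    for _ in range(m):                 # diffusion module: h_t = P h_{t-1}
        h = P @ h
        diff.append(h)
    diff = np.stack(diff)              # rows: P x, P^2 x, ..., P^m x
    F = softmax(Theta)                 # rows approximately one-hot over scales
    low = F @ diff                     # low[j] = sum_t F[j, t] P^t x
    feats = [x - low[0]]               # Psi_0 x
    feats += [low[j] - low[j + 1] for j in range(J - 1)]   # Psi_j x
    feats.append(low[-1])              # Phi_J x
    return feats
```

A useful invariant: because the wavelets telescope, the returned features always sum back to x, for any Θ, mirroring the invertibility of the handcrafted bank.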
As shown in Gao et al. (2019) on graph classification, this aggregation works particularly well in conjunction with support vector machines (SVMs) based on the radial basis function (RBF) kernel. Here, we consider two configurations for the task-dependent output layer of the network: either a small neural network with two fully connected layers, which we denote LEGS-FCN, or a modified RBF network (Broomhead & Lowe, 1988), which we denote LEGS-RBF, to produce the final classification. The latter configuration more accurately processes scattering features, as shown in Table 2. Our RBF network works by first initializing a fixed number of movable anchor points. Then, for every point, new features are calculated based on the radial distances to these anchor points. In previous work on radial basis networks, these anchor points were initialized independent of the data. We found that this led to training issues if the range of the data was not similar to the initialization of the centers. Instead, we first use a batch normalization layer to constrain the scale of the features and then pick anchors randomly from the initial features of the first pass through our data. This gives an RBF-kernel network with anchors that are always in the range of the data. Our RBF layer is then RBF(x) = φ(‖BatchNorm(x) − c‖) with φ(z) = e^{−z²}. Unlike other types of data, biochemical graph datasets do not exhibit the small-world structure of social datasets and may have large graph diameters for their size. Further, the connectivity patterns of biomolecules are very irregular due to 3D folding and long range connections, and thus ordinary local node aggregation methods may miss such connectivity differences.
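The anchor-based RBF output layer described above can be sketched as follows. This is a NumPy sketch without batch norm's learned affine parameters; `rbf_layer` is our name, and in practice the anchors are drawn from the normalized features of the first pass through the data, as described in the text.

```python
import numpy as np

def rbf_layer(X, anchors, eps=1e-5):
    """RBF(x) = phi(||BatchNorm(x) - c||) with phi(z) = exp(-z^2).

    X: (batch, features); anchors: (n_anchors, features), assumed to be
    drawn from already-normalized first-pass features so they lie in the
    range of the data.
    """
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + eps)   # batch-norm style scaling
    d = np.linalg.norm(Z[:, None, :] - anchors[None, :, :], axis=-1)
    return np.exp(-d ** 2)                             # (batch, n_anchors)
```

Each output column is a Gaussian response to one anchor, so a point sitting exactly on an anchor produces a response of 1 and distant points decay toward 0.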

5.1. WHOLE GRAPH CLASSIFICATION

We perform whole graph classification using eccentricity and clustering coefficient as node features, as is done in Gao et al. (2019). We compare against graph convolutional networks (GCN) (Kipf & Welling, 2016), GraphSAGE (Hamilton et al., 2017), graph attention network (GAT) (Veličković et al., 2018), graph isomorphism network (GIN) (Xu et al., 2019), Snowball network (Luan et al., 2019), fixed geometric scattering with a support vector machine classifier (GS-SVM) as in Gao et al. (2019), and a baseline consisting of a 2-layer neural network on the features averaged across nodes (disregarding graph structure). These comparisons are meant to inform when learnable graph scattering features are helpful in extracting whole-graph features. Specifically, we are interested in the types of graph datasets where existing graph neural network performance can be improved upon with scattering features. We evaluate these methods across 7 benchmark biochemical datasets: DD, ENZYMES, MUTAG, NCI1, NCI109, PROTEINS, and PTC, where the goal is to classify between two or more classes of compounds, with hundreds to thousands of graphs and tens to hundreds of nodes (see Table 1). For completeness, we also show results on six social network datasets in Table S2. For more specific information on individual datasets see Appendix B. We use 10-fold cross validation on all models, which is elaborated on in Appendix C. For an ensembling comparison to Scattering-GCN (Min et al., 2020), see Appendix D.

LEGS outperforms on biological datasets. A somewhat less explored domain for GNNs is biochemical graphs, which represent molecules and tend to be overall smaller and less connected (see Tables 1 and S1) than social networks. In particular, we find that LEGSNet outperforms other methods by a significant margin on biochemical datasets with relatively small but high-diameter graphs (NCI1, NCI109, ENZYMES, PTC), as shown in Table 2.
On extremely small graphs we find that GS-SVM performs best, which is expected, as other methods with more parameters can easily overfit the data. We reason that the performance increases exhibited by LEGSNet, and to a lesser extent GS-SVM, on these chemical and biological benchmarks are due to the ability of geometric scattering to compute complex connectivity features via its multiscale diffusion wavelets. Thus, methods that rely on a scattering construction would in general perform better, with the flexibility and trainability of LEGSNet giving it an edge on most tasks.

LEGS performs consistently on social network datasets. On the social network datasets, LEGSNet performs consistently well, although its benefits here are not as clear as in the biochemical datasets. Ignoring the fixed scattering transform GS-SVM, which was tuned in Gao et al. (2019) with a focus on these particular social network datasets, a version of LEGSNet is best on three out of the six social datasets and second best on the other three. Since the advantages are clearer in the biochemical domain, we focus on it in the remainder of this section. However, for completeness, we provide results on social network datasets in Table S2, and leave further discussion to Appendix B.1.

LEGS preserves enzyme exchange preferences while increasing performance. One advantage of geometric scattering over other graph embedding techniques lies in the rich information present within the scattering feature space. This was demonstrated in Gao et al. (2019), where it was shown that the embeddings created through fixed geometric scattering can be used to accurately infer inter-graph relationships. Scattering features of enzyme graphs within the ENZYMES dataset (Borgwardt et al., 2005) possessed sufficient global information to recreate the enzyme class exchange preferences observed empirically by Cuesta et al. (2015), using only linear methods of analysis, and despite working with a much smaller and artificially balanced dataset. We demonstrate here that LEGSNet retains similar descriptive capabilities, as shown in Figure 2 via chord diagrams where each exchange preference between enzyme classes (estimated as suggested in Gao et al., 2019) is represented as a ribbon of the corresponding size. Our results here (and in Table S5, which provides a complementary quantitative comparison) show that, with relaxations on the scattering parameters, LEGS-FCN achieves better classification accuracy than both LEGS-FIXED and GCN (see Table 1) while also retaining a more descriptive embedding that maintains the global structure of relations between enzyme classes. We ran two varieties of LEGSNet on the ENZYMES dataset: LEGS-FIXED and LEGS-FCN, which allows the diffusion scales to be learned. For comparison, we also ran a standard GCN whose graph embeddings were obtained via mean pooling. To infer enzyme exchange preferences from the embeddings, we followed Gao et al. (2019) in defining the distance from an enzyme e to the enzyme class EC_j as dist(e, EC_j) := ‖v_e − proj_{C_j}(v_e)‖, where v_e is the embedding of e, and C_j is the PCA subspace of the enzyme feature vectors within EC_j. The distance between the enzyme classes EC_i and EC_j is the average of the individual distances, mean{dist(e, EC_j) : e ∈ EC_i}. From here, the affinity between two enzyme classes is computed as pref(EC_i, EC_j) = w_i / min(D_{i,i}/D_{i,j}, D_{j,j}/D_{j,i}), where w_i is the percentage of enzymes in class i that are closer to another class than to their own, and D_{i,j} is the distance between EC_i and EC_j.

Robustness to reduced training set size.
We remark that, similar to the robustness shown in Gao et al. (2019) for handcrafted scattering, LEGSNet is able to maintain accuracy even when the training set size is shrunk to as low as 20% of the dataset, with a median decrease of 4.7% accuracy compared to when 80% of the data is used for training, as discussed in the supplement (see Table S3). We next evaluate learnable scattering on two graph regression tasks: the QM9 (Gilmer et al., 2017; Wu et al., 2018) graph regression dataset, and a new task from the critical assessment of structure prediction (CASP) challenge (Moult et al., 2018). On the CASP task, the main objective is to score protein structure prediction/simulation models in terms of the discrepancy between their predicted structure and the actual structure of the protein (which is known a priori). The accuracy of such 3D structure predictions is evaluated using a variety of metrics, but we focus on the global distance test (GDT) score (Modi et al., 2016). The GDT score measures the similarity between the tertiary structures of two proteins with amino-acid correspondence. A higher score means two structures are more similar. For a set of predicted 3D structures for a protein, we would like to score their quality as quantified by the GDT score.

5.2. GRAPH REGRESSION

For this task we use the CASP12 dataset (Moult et al., 2018) and preprocess the data similarly to Ingraham et al. (2019), creating a KNN graph over the amino acids of each protein based on their 3D coordinates. From this KNN graph we regress against the GDT score. We evaluate on 12 proteins from the CASP12 dataset and choose random (but consistent) splits with 80% train, 10% validation, and 10% test data out of 4000 total structures. Since we are only concerned with structure similarity, we use no non-structural node features.

LEGSNet outperforms on all CASP targets. Across all CASP targets, we find that LEGSNet significantly outperforms GNN and baseline methods (see Table S4). This performance improvement is particularly stark on the easiest structures (as measured by average GDT) but is consistent across all structures. In Figure 3 we show the relationship between the percent improvement of LEGSNet over the GCN model and the average GDT score across the target structures. We draw attention to target t0879, where LEGSNet shows the greatest improvement over other methods. This target has long range dependencies (Ovchinnikov et al., 2018), as it exhibits metal coupling (Li et al., 2015), creating long range connections over the sequence. Since other methods are unable to model these long range connections, LEGSNet is particularly important on these more difficult to model targets.

LEGSNet outperforms on the QM9 dataset. We evaluate the performance of LEGSNet on the quantum chemistry dataset QM9 (Gilmer et al., 2017; Wu et al., 2018), which consists of 130,000 molecules with ∼18 nodes per molecule. We use the node features from Gilmer et al. (2017), with the addition of eccentricity and clustering coefficient features, and ignore the edge features. We whiten all targets to have zero mean and unit standard deviation. We train each network against all 19 targets and evaluate the mean squared error on the test set, with mean and std. over four runs.
We find that learning the scales improves the overall MSE, and particularly improves results on difficult targets (see Table 4 for overall results and Table S7 for results by target). Indeed, on more difficult targets (i.e., those with large test error) LEGS-FCN performs better, whereas on easy targets GIN is best. Overall, scattering features offer a robust signal across many targets, and while perhaps less flexible (by construction), they achieve good average performance with significantly fewer parameters.

6. CONCLUSION

In this work we have established a relaxation from fixed geometric scattering, which has strong guarantees, to a more flexible network with better performance obtained by learning data-dependent scales. Allowing the network to choose data-driven diffusion scales leads to improved performance, particularly on biochemical datasets, while keeping strong guarantees on the extracted features. This parameterization has the advantage of representing the long-range connections necessary in complex biochemical data with a small number of weights. It also opens the possibility of further relaxation to enable node-specific or graph-specific tuning via attention mechanisms, which we regard as an exciting future direction, but out of scope for the current work.

APPENDIX A PROOFS FOR SECTION 3

A.1 PROOF OF LEMMA 1

Let $M_\alpha = D^{-1/2} P_\alpha D^{1/2}$; it can be verified that $M_\alpha$ is a symmetric conjugate of $P_\alpha$, and by construction it is self-adjoint with respect to the standard inner product of $L^2(G)$. Let $x, y \in L^2(G, D^{-1/2})$. Then we have
$$\langle P_\alpha x, y \rangle_{D^{-1/2}} = \langle D^{-1/2} P_\alpha x, D^{-1/2} y \rangle = \langle D^{-1/2} D^{1/2} M_\alpha D^{-1/2} x, D^{-1/2} y \rangle = \langle M_\alpha D^{-1/2} x, D^{-1/2} y \rangle$$
$$= \langle D^{-1/2} x, M_\alpha D^{-1/2} y \rangle = \langle D^{-1/2} x, D^{-1/2} D^{1/2} M_\alpha D^{-1/2} y \rangle = \langle D^{-1/2} x, D^{-1/2} P_\alpha y \rangle = \langle x, P_\alpha y \rangle_{D^{-1/2}},$$
which gives the result of the lemma.
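The lemma can also be checked numerically on a small graph. The sketch below assumes the lazy random walk $P_\alpha = \alpha I + (1-\alpha) W D^{-1}$ with $\alpha = 1/2$ on a 3-node path graph, and verifies that $\langle P_\alpha x, y \rangle_{D^{-1/2}} = \langle x, P_\alpha y \rangle_{D^{-1/2}}$ for a pair of test signals.

```python
import numpy as np

# adjacency and degrees of a small path graph (a hypothetical test graph)
W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
d = W.sum(axis=0)
D_inv_sqrt = np.diag(1 / np.sqrt(d))

alpha = 0.5
P = alpha * np.eye(3) + (1 - alpha) * W @ np.diag(1 / d)  # lazy random walk

def weighted_inner(a, b):
    # <a, b>_{D^{-1/2}} = <D^{-1/2} a, D^{-1/2} b>
    return float((D_inv_sqrt @ a) @ (D_inv_sqrt @ b))

x = np.array([1., -2., 3.])
y = np.array([0.5, 0., -1.])
lhs = weighted_inner(P @ x, y)   # <P x, y>_{D^{-1/2}}
rhs = weighted_inner(x, P @ y)   # <x, P y>_{D^{-1/2}}
```

Both quantities agree up to floating-point error, as the lemma predicts.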

A.2 PROOF OF THEOREM 1

As shown in the previous proof (Sec. A.1), $P_\alpha$ has a symmetric conjugate $M_\alpha$. Given the eigendecomposition $M_\alpha = Q \Lambda Q^T$, we can write $P_\alpha^t = D^{1/2} Q \Lambda^t Q^T D^{-1/2}$, giving the eigendecomposition of the propagated diffusion matrices. Furthermore, it can be verified that the eigenvalues on the diagonal of $\Lambda$ are nonnegative. Briefly, this results from the eigenvalues of the normalized graph Laplacian being within the range $[0, 2]$, which means those of $W D^{-1}$ are in $[-1, 1]$; combined with $1/2 \le \alpha \le 1$, this gives $\lambda_i := [\Lambda]_{ii} \in [0, 1]$ for every $i$. Next, given this decomposition we can write
$$\Phi_J = D^{1/2} Q \Lambda^{t_J} Q^T D^{-1/2}, \qquad \Psi_j = D^{1/2} Q (\Lambda^{t_j} - \Lambda^{t_{j+1}}) Q^T D^{-1/2}, \quad 0 \le j \le J-1,$$
where we set $t_0 = 0$ to simplify notations. Then, we have
$$\|\Phi_J x\|^2_{D^{-1/2}} = \langle \Phi_J x, \Phi_J x \rangle_{D^{-1/2}} = \langle D^{-1/2} D^{1/2} Q \Lambda^{t_J} Q^T D^{-1/2} x, \; D^{-1/2} D^{1/2} Q \Lambda^{t_J} Q^T D^{-1/2} x \rangle$$
$$= x^T D^{-1/2} Q \Lambda^{t_J} Q^T Q \Lambda^{t_J} Q^T D^{-1/2} x = (x^T D^{-1/2} Q \Lambda^{t_J})(\Lambda^{t_J} Q^T D^{-1/2} x) = \|\Lambda^{t_J} Q^T D^{-1/2} x\|_2^2.$$
Further, since $Q$ is orthogonal (as it is constructed from an eigenbasis of a symmetric matrix), if we consider the change of variable $y = Q^T D^{-1/2} x$, we have $\|x\|^2_{D^{-1/2}} = \|D^{-1/2} x\|_2^2 = \|y\|_2^2$, while $\|\Phi_J x\|^2_{D^{-1/2}} = \|\Lambda^{t_J} y\|_2^2$. Similarly, we can reformulate the operation of the other filters in terms of diagonal matrices applied to $y$, as $\|\Psi_j x\|^2_{D^{-1/2}} = \|(\Lambda^{t_j} - \Lambda^{t_{j+1}}) y\|_2^2$. Given this reformulation in terms of $y$ and the standard $L^2(G)$ norm, we can now write
$$\|\Lambda^{t_J} y\|_2^2 + \sum_{j=0}^{J-1} \|(\Lambda^{t_j} - \Lambda^{t_{j+1}}) y\|_2^2 = \sum_{i=1}^{n} y_i^2 \left( \lambda_i^{2 t_J} + \sum_{j=0}^{J-1} \left( \lambda_i^{t_j} - \lambda_i^{t_{j+1}} \right)^2 \right).$$
Then, since $0 \le \lambda_i \le 1$ and $0 = t_0 < t_1 < \cdots < t_J$, we have
$$\lambda_i^{2 t_J} + \sum_{j=0}^{J-1} \left( \lambda_i^{t_j} - \lambda_i^{t_{j+1}} \right)^2 \le \left( \lambda_i^{t_J} + \sum_{j=0}^{J-1} \left( \lambda_i^{t_j} - \lambda_i^{t_{j+1}} \right) \right)^2 = \left( \lambda_i^{t_J} + \lambda_i^{t_0} - \lambda_i^{t_J} \right)^2 = 1,$$
which yields the upper bound $\|\Lambda^{t_J} y\|_2^2 + \sum_{j=0}^{J-1} \|(\Lambda^{t_j} - \Lambda^{t_{j+1}}) y\|_2^2 \le \|y\|_2^2$.
On the other hand, since $t_1 > 0 = t_0$, we also have
$$\lambda_i^{2 t_J} + \sum_{j=0}^{J-1} \left( \lambda_i^{t_j} - \lambda_i^{t_{j+1}} \right)^2 \ge \lambda_i^{2 t_J} + \left( 1 - \lambda_i^{t_1} \right)^2,$$
and therefore, by setting $C := \min_{0 \le \xi \le 1} \left( \xi^{2 t_J} + (1 - \xi^{t_1})^2 \right) > 0$, whose positivity is not difficult to verify, we get the lower bound
$$\|\Lambda^{t_J} y\|_2^2 + \sum_{j=0}^{J-1} \|(\Lambda^{t_j} - \Lambda^{t_{j+1}}) y\|_2^2 \ge C \|y\|_2^2.$$
Finally, applying the reverse change of variable to $x$ and $L^2(G, D^{-1/2})$ yields the result of the theorem.
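As a sanity check, the upper frame bound of Theorem 1 can be verified numerically. The sketch below assumes dyadic-style scales $t = (0, 1, 2, 4)$ and the lazy random walk with $\alpha = 1/2$ on a hypothetical 3-node path graph, and confirms that the total wavelet energy is at most $\|x\|^2_{D^{-1/2}}$.

```python
import numpy as np

W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # path graph
d = W.sum(axis=0)
P = 0.5 * np.eye(3) + 0.5 * W @ np.diag(1 / d)            # lazy random walk

def wnorm2(v):
    # squared norm in L^2(G, D^{-1/2}): ||D^{-1/2} v||_2^2
    return float(((v / np.sqrt(d)) ** 2).sum())

ts = [0, 1, 2, 4]                                          # t_0 = 0 < t_1 < ... < t_J
powers = [np.linalg.matrix_power(P, t) for t in ts]
Psis = [powers[j] - powers[j + 1] for j in range(len(ts) - 1)]  # wavelets Psi_j
Phi = powers[-1]                                           # low-pass Phi_J

x = np.array([1., -2., 3.])
energy = wnorm2(Phi @ x) + sum(wnorm2(Psi @ x) for Psi in Psis)
```

By the theorem, `energy` is bounded above by `wnorm2(x)` and below by `C * wnorm2(x)` for some $C > 0$.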

A.3 PROOF OF THEOREM 2

Denote the permutation group on $n$ elements by $S_n$; for a permutation $\Pi \in S_n$, let $\overline{G} = \Pi(G)$ be the graph obtained by permuting the vertices of $G$ with $\Pi$. The corresponding permutation operation on a graph signal $x \in L^2(G, D^{-1/2})$ gives a signal $\Pi x \in L^2(\overline{G}, \overline{D}^{-1/2})$, which we implicitly considered in the statement of the theorem, without specifying these notations for simplicity. Rewriting the statement of the theorem more rigorously with the introduced notations, we aim to show that $\overline{U}_p \Pi x = \Pi U_p x$ and $\overline{S}_{p,q} \Pi x = S_{p,q} x$ under suitable conditions, where $\overline{U}_p$ denotes the operation on the permuted graph $\overline{G}$ corresponding to $U_p$ on $G$, and likewise $\overline{S}_{p,q}$ corresponds to $S_{p,q}$.

We start by showing $U_p$ is permutation equivariant. First, notice that $\overline{\Psi}_j \Pi x = \Pi \Psi_j x$ for every $0 \le j \le J$, as for $1 \le j \le J-1$
$$\overline{\Psi}_j \Pi x = \left( \Pi P^{t_j} \Pi^T - \Pi P^{t_{j+1}} \Pi^T \right) \Pi x = \Pi \left( P^{t_j} - P^{t_{j+1}} \right) x = \Pi \Psi_j x,$$
and similar reasoning also holds for $j \in \{0, J\}$. Further, notice that the element-wise nature of the absolute value nonlinearity yields $|\Pi x| = \Pi |x|$ for any permutation matrix $\Pi$. Using these two observations, it follows inductively that
$$\overline{U}_p \Pi x := \overline{\Psi}_{j_m} |\overline{\Psi}_{j_{m-1}} \cdots |\overline{\Psi}_{j_2} |\overline{\Psi}_{j_1} \Pi x|| \cdots | = \overline{\Psi}_{j_m} |\overline{\Psi}_{j_{m-1}} \cdots |\overline{\Psi}_{j_2} \Pi |\Psi_{j_1} x|| \cdots | = \cdots = \Pi \Psi_{j_m} |\Psi_{j_{m-1}} \cdots |\Psi_{j_2} |\Psi_{j_1} x|| \cdots | = \Pi U_p x.$$
To show $S_{p,q}$ is permutation invariant, first notice that for any statistical moment $q > 0$ we have $|\Pi x|^q = \Pi |x|^q$, and further, as sums are commutative, $\sum_j (\Pi x)_j = \sum_j x_j$. We then have
$$\overline{S}_{p,q} \Pi x = \sum_{i=1}^{n} |\overline{U}_p \Pi x[v_i]|^q = \sum_{i=1}^{n} |\Pi U_p x[v_i]|^q = \sum_{i=1}^{n} |U_p x[v_i]|^q = S_{p,q} x,$$
which, together with the previous result, completes the proof of the theorem.
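The permutation invariance shown above can likewise be checked numerically. The sketch below uses a simplified setting of our own choosing: first-order features $|\Psi_j x|$ only, moment $q = 2$, and the lazy random walk on a small path graph. The graph-level statistics come out identical for the graph and a vertex-permuted copy.

```python
import numpy as np

def scattering_moments(W, x, q=2):
    """Sum over nodes of |Psi_j x|^q for first-order wavelets (simplified sketch)."""
    d = W.sum(axis=0)
    P = 0.5 * np.eye(len(d)) + 0.5 * W @ np.diag(1 / d)      # lazy random walk
    powers = [np.linalg.matrix_power(P, t) for t in (0, 1, 2, 4)]
    psis = [powers[j] - powers[j + 1] for j in range(3)]      # wavelets
    return np.array([np.sum(np.abs(Psi @ x) ** q) for Psi in psis])

# a 4-node path graph and a test signal
W = np.array([[0., 1., 0., 0.], [1., 0., 1., 0.], [0., 1., 0., 1.], [0., 0., 1., 0.]])
x = np.array([1., -2., 3., 0.5])

Pi = np.eye(4)[[2, 0, 3, 1]]                                  # a permutation matrix
S = scattering_moments(W, x)
S_perm = scattering_moments(Pi @ W @ Pi.T, Pi @ x)            # permuted graph and signal
```

`S` and `S_perm` agree up to floating-point error, illustrating $\overline{S}_{p,q}\Pi x = S_{p,q}x$ for this restricted case.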

B DATASETS

In this section we further analyze the individual datasets, relating the composition of each dataset, as shown in Table S1, to the relative performance of our models as shown in Table S2.

DD Dobson & Doig (2003): A dataset of 1178 high-resolution proteins extracted from the protein data bank (PDB). The task is to distinguish between enzymes and non-enzymes. Since these are high-resolution structures, these graphs are significantly larger than those found in our other biochemical datasets, with a mean graph size of 284 nodes; the next largest biochemical dataset has a mean size of 39 nodes.

ENZYMES Borgwardt et al. (2005): A dataset of 600 enzymes divided into 6 balanced classes of 100 enzymes each. As we analyzed in the main text, scattering features are better able to preserve the structure between classes. LEGS-FCN slightly relaxes this structure but improves accuracy from 32% to 39% over LEGS-FIXED.


NCI1, NCI109 Wale et al. (2008): Contains slight variants of 4100 chemical compounds encoded as graphs. Each compound is separated into one of two classes based on its activity against non-small cell lung cancer and ovarian cancer cell lines. Graphs in this dataset have around 30 nodes, with a similar number of edges; this makes for long graphs with high diameter.

PROTEINS Borgwardt et al. (2005): Contains 1178 protein structures with the goal of classifying enzymes vs. non-enzymes. GCN outperforms all other models on this dataset; however, the Baseline model, where no structure is used, also performs very similarly. This suggests that the graph structure within this dataset does not add much information over the structure encoded in the eccentricity and clustering coefficient features.

PTC Toivonen et al. (2003): Contains 344 chemical compound graphs divided into two classes based on whether or not they cause cancer in rats. This dataset is very difficult to classify without features; however, LEGS-RBF and LEGS-FCN are able to capture the long-range connections slightly better than other methods.

COLLAB Yanardag & Vishwanathan (2015): 5000 ego-networks of researchers from high energy physics, condensed matter physics, or astrophysics. The goal is to determine which field the research belongs to. The GraphSAGE model performs best on this dataset, although the LEGS-RBF network performs nearly as well. Ego graphs have a very small average diameter, so shallow networks can perform quite well on them, as is the case here.

IMDB Yanardag & Vishwanathan (2015): For each graph, nodes represent actors/actresses, and there is an edge between them if they appear in the same movie. These graphs are also ego graphs, here around specific actors. IMDB-BINARY classifies between action and romance genres; IMDB-MULTI classifies between 3 classes. Somewhat surprisingly, GS-SVM performs best, with the LEGS networks close behind.
This could be due to oversmoothing on the part of GCN and GraphSAGE when the graphs are so small.

REDDIT Yanardag & Vishwanathan (2015): In the REDDIT-BINARY/MULTI-5K/MULTI-12K datasets, each graph represents a discussion thread where nodes correspond to users and there is an edge between two nodes if one replied to the other's comment. The task is to identify which subreddit a given graph came from. On these datasets GCN outperforms other models.

Table S2 shows that our model outperforms other GNNs on some biomedical benchmarks and performs comparably on social network datasets. Out of the six social network datasets, ignoring the fixed scattering model GS-SVM, which has been hand-tuned with these datasets in mind, our model outperforms both GNN models on three of them and is second best on the other three. This is at least comparable, if not slightly superior, performance. GraphSAGE does a bit better on COLLAB, but much worse on IMDB-BINARY and REDDIT-BINARY. GCN does a bit better on REDDIT-MULTI, but worse on COLLAB, IMDB-BINARY, and REDDIT-BINARY. LEGSNet has significantly fewer parameters and achieves comparable or superior accuracy on common benchmarks. Even when our method shows comparable results, and certainly when it outperforms other GNNs, we believe that its smaller number of parameters could be useful in applications with limited compute or limited training examples.

Table S5: Quantified distance between the empirically observed enzyme class exchange preferences of Cuesta et al. (2015) and the class exchange preferences inferred from LEGS-FIXED, LEGS-FCN, and a GCN. We measure the cosine distance between the graphs represented by the chord diagrams in Figure 2. As before, the self-affinities were discarded. LEGS-FIXED reproduces the exchange preferences best, but LEGS-FCN still reproduces them well and has significantly better classification accuracy.
LEGS-FIXED: 0.132    LEGS-FCN: 0.146    GCN: 0.155

We ensemble the learned features from a learnable scattering network (LEGS-FCN) with those of GCN, and compare this to ensembling fixed scattering features with GCN as in Min et al. (2020), as well as to the solo features. Our setting is slightly different in that we use the GCN features from pretrained networks, training only a small 2-layer ensembling network on the combined graph-level features. This network consists of a batch norm layer, a 128-width fully connected layer, a LeakyReLU activation, and a final classification layer down to the number of classes. In Table S6 we see that combining GCN features with fixed scattering features (LEGS-FIXED) or learned scattering features (LEGS-FCN) always helps classification. Learnable scattering features help more than fixed scattering features overall, and particularly in the biochemical domain.
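A minimal forward-pass sketch of the ensembling head described above (batch norm, then a 128-wide fully connected layer, a LeakyReLU, and a classification layer), written here in plain NumPy with hypothetical feature sizes; the actual network is trained with standard backpropagation, and the learned affine parameters of batch norm are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(x, eps=1e-5):
    # batch normalization across the batch dimension (no learned affine params)
    return (x - x.mean(0)) / np.sqrt(x.var(0) + eps)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def ensemble_head(features, w1, b1, w2, b2):
    """batch norm -> 128-wide fully connected -> LeakyReLU -> class logits."""
    h = batch_norm(features)
    h = leaky_relu(h @ w1 + b1)
    return h @ w2 + b2

# hypothetical sizes: concatenated GCN + LEGS graph-level features -> 6 enzyme classes
d_in, d_hid, n_cls = 96, 128, 6
w1 = rng.standard_normal((d_in, d_hid)) * 0.1
b1 = np.zeros(d_hid)
w2 = rng.standard_normal((d_hid, n_cls)) * 0.1
b2 = np.zeros(n_cls)
logits = ensemble_head(rng.standard_normal((32, d_in)), w1, b1, w2, b2)
```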



Figure 1: LEGSNet learns to select the appropriate scattering scales from the data.

Figure 2: Enzyme class exchange preferences empirically observed in Cuesta et al. (2015), and estimated from LEGS and GCN embeddings.

Figure 3: CASP dataset LEGS-FCN % improvement over GCN in MSE of GDT prediction vs. Average GDT score.

QM9 Gilmer et al. (2017); Wu et al. (2018): Graphs in the QM9 dataset each represent chemicals with ∼18 atoms. Regression targets represent chemical properties of the molecules.

B.1 PERFORMANCE OF LEGSNET ON SOCIAL NETWORK DATASETS

Dataset statistics, diameter, nodes, edges, and clustering coefficient averaged over graphs.

Mean ± standard deviation test set accuracy on biochemical datasets. Time limit expired (TLE) denotes individual models that did not finish in 10 hours.

Train and test set mean squared error on CASP GDT regression task over three seeds.



Dataset statistics, diameter, nodes, edges, clustering coefficient averaged over all graphs. Split into bio-chemical and social network types.

Mean ± std. over 10 test sets on bio-chemical and social datasets.

C TRAINING DETAILS

We train all models for a maximum of 1000 epochs with an initial learning rate of 1e-4 using the Adam optimizer (Kingma & Ba, 2015). We terminate training if validation loss does not improve for 100 epochs, checking every 10 epochs. Our models are implemented with PyTorch (Paszke et al., 2019) and PyTorch Geometric. Models were run on a variety of hardware resources. For all models we use q = 4 normalized statistical moments for the node-to-graph-level feature extraction and m = 16 diffusion scales, in line with the choices in Gao et al. (2019).
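The node-to-graph-level aggregation can be sketched as follows. This minimal version computes raw moments of each scattering channel over the nodes; it is a simplification of our own (the normalization of the moments is omitted for brevity).

```python
import numpy as np

def graph_moments(node_feats, q=4):
    """Aggregate node-level scattering features to a graph-level vector using
    the first q statistical moments over the nodes (one set per channel).

    node_feats: (n_nodes, n_channels) array of node-level features.
    Returns a vector of length q * n_channels.
    """
    return np.concatenate(
        [(node_feats ** k).mean(axis=0) for k in range(1, q + 1)]
    )

# toy example: 2 nodes, 2 channels -> 4 moments per channel = 8 features
x = np.array([[1.0, 2.0], [3.0, 4.0]])
feats = graph_moments(x, q=4)
```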

C.1 CROSS VALIDATION PROCEDURE

For all datasets we use 10-fold cross validation with 80% training data, 10% validation data, and 10% test data for each model. We first split the data into 10 (roughly) equal partitions. For each model we take exactly one of the partitions to be the test set and one of the remaining nine to be the validation set. We then train the model on the remaining eight partitions, using the cross-entropy loss on the validation set for early stopping, checking every ten epochs. For each test set, we use majority voting of the nine models trained with that test set. We then take the mean and standard deviation across these test set scores to average out any variability in the particular split chosen; we thus report mean and standard deviation over 10 ensembled models, each with a separate test set. This results in 900 models trained on every dataset.
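The splitting and majority-voting steps above can be sketched as follows; the helper names and the seed handling are illustrative, not our exact implementation.

```python
import numpy as np

def ten_fold_indices(n, seed=0):
    """Shuffle n example indices and split them into 10 roughly equal partitions."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), 10)

def majority_vote(predictions):
    """Majority vote over per-model class predictions.

    predictions: (n_models, n_examples) integer class labels.
    Returns the most common label for each example.
    """
    predictions = np.asarray(predictions)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, predictions)

folds = ten_fold_indices(100)                       # 10 partitions of 100 examples
votes = majority_vote([[0, 1, 1], [0, 1, 0], [1, 1, 0]])  # 3 models, 3 examples
```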

D ENSEMBLING EVALUATION

Recent work by Min et al. (2020) combines the features from a fixed scattering transform with a GCN network, showing that this has empirical advantages in semi-supervised node classification, and theoretical representation advantages over a standard Kipf & Welling (2016) style GCN.

