EURNET: EFFICIENT MULTI-RANGE RELATIONAL MODELING OF SPATIAL MULTI-RELATIONAL DATA

Abstract

Modeling spatial relationships in data remains critical across many tasks, such as image classification, semantic segmentation and protein structure understanding. Previous works often adopt a unified solution like relative positional encoding. However, there exist different kinds of spatial relations, including short-range, medium-range and long-range relations, and modeling them separately can better capture the focus of different tasks on multi-range relations (e.g., short-range relations can be important in instance segmentation, while long-range relations should be upweighted for semantic segmentation). In this work, we introduce the EurNet for Efficient multi-range relational modeling. EurNet constructs a multi-relational graph, where each type of edge corresponds to short-, medium- or long-range spatial interactions. On the constructed graph, EurNet adopts a novel modeling layer, called gated relational message passing (GRMP), to propagate multi-relational information across the data. GRMP captures multiple relations within the data with little extra computational cost. We study EurNets in two important domains, image and protein structure modeling. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation verify the gains of EurNet over the previous SoTA FocalNet. On the EC and GO protein function prediction benchmarks, EurNet consistently surpasses the previous SoTA GearNet. Our results demonstrate the strength of EurNets in modeling spatial multi-relational data from various domains.

Under review as a conference paper at ICLR 2023

To attain this goal, we propose the EurNet for Efficient multi-range relational modeling. In general, EurNets are a series of relational graph neural networks equipped with graph construction layers, where relational edges are constructed by these layers to capture multi-range spatial interactions.
When instantiated with different domain knowledge (e.g., computer vision or protein science), EurNets can be specialized to tackle important problems like image classification, image segmentation and protein function prediction. Specifically, upon the raw data, EurNet first uses the graph construction layers to build different types of edges that respectively capture the short-, medium- and long-range spatial interactions within the data. For efficient multi-relational modeling over the constructed graph, we then introduce the gated relational message passing (GRMP) layer as the basic modeling module of EurNet. GRMP separately performs (1) relational message aggregation on each individual feature channel and (2) node-wise aggregation of different feature channels. Compared to the classical relational graph convolution (RGConv) (Schlichtkrull et al., 2018), GRMP enjoys a lower computational cost when more relations are modeled, and thus can handle more types of spatial interactions given the same computational budget. EurNet also supports dynamic graph construction and multi-stage modeling, which are used in domains like image modeling.

We demonstrate EurNets in image and protein structure modeling. To model image patches at different granularities, we build EurNets with hierarchical graph construction layers and multiple modeling stages and derive a model series with increasing capacity, i.e., EurNet-T, EurNet-S and EurNet-B. These models achieve comparable or better top-1 accuracy (82.3% vs. 82.3%; 83.6% vs. 83.5%; 84.1% vs. 83.9%) against the previous SoTA FocalNet (LRF) series (Yang et al., 2022) on ImageNet-1K classification (resolution: 224 × 224). Similar performance gains are preserved on COCO object detection and ADE20K semantic segmentation. To model protein alpha carbons, we build EurNet with the same single-stage model architecture as GearNet (Zhang et al., 2022).
Under this fixed-architecture comparison, EurNet consistently outperforms the SoTA GearNet on standard protein function prediction benchmarks in terms of protein-centric maximum F-score (EC: 0.768 vs. 0.730; GO-BP: 0.437 vs. 0.356; GO-MF: 0.563 vs. 0.503; GO-CC: 0.421 vs. 0.414). These performance improvements remain when edge-level message passing is involved. Our results demonstrate that EurNet can be a strong candidate for modeling spatial multi-relational data in various domains.

1. INTRODUCTION

This work studies data that lie in 2D/3D space and incorporate interacting relations over different spatial ranges. A representative example is image data, where an object in an image can interact with adjacent objects via direct touch, and it can also interact with distantly relevant ones via gazing, waving hands or pointing. In protein science, protein 3D structure is another typical example, in which different amino acids can interact at short range via peptide/hydrogen bonds, and they can also interact at medium and long ranges via hydrophobic interactions. We summarize such data as spatial multi-relational data.

In various domains, many previous efforts have been made to model spatial multi-relational data. For image modeling, multi-head self-attention mechanisms (Dosovitskiy et al., 2020; Liu et al., 2021b), convolutional operations with large receptive fields (Ding et al., 2022; Yang et al., 2022) and MLPs for mixing full spatial information (Tolstikhin et al., 2021; Touvron et al., 2021a) have been explored to capture multi-range spatial interactions within an image. For protein structure modeling, Zhang et al. (2022) build multiple groups of edges for different short-range interactions and employ relational graph convolution (Schlichtkrull et al., 2018) for multi-relational modeling. These works either implicitly treat different kinds of spatial relations (i.e., short-range, medium-range and long-range relations) alike (Tolstikhin et al., 2021; Yang et al., 2022) or handle them with a unified scheme like relative positional encoding (Dosovitskiy et al., 2020; Liu et al., 2021b).
However, since the relative importance of these spatial relations can vary across tasks (e.g., the great importance of short-range relations in instance segmentation, and the elevated importance of long-range relations in semantic segmentation), separately modeling each spatial relation is a better way to capture different tasks' focus. Such a separate modeling approach remains underexplored, and, in particular, it is expected to adapt efficiently to large data and model scales.

2. RELATED WORK

Multi-relational data modeling. Multi-relational data are ubiquitous in the real world, e.g., knowledge graphs (Toutanova & Chen, 2015) and customer-product networks (Li et al., 2014). To effectively model multiple types of relations/interactions, existing works have explored embedding-based methods (Bordes et al., 2013; Sun et al., 2019), multi-head attention (Vaswani et al., 2017) and different relational graph neural networks (GNNs) (Schlichtkrull et al., 2018; Vashishth et al., 2019; Busbridge et al., 2019; Zhu et al., 2021). Previous relational GNNs mainly focus on model expressivity and parameter efficiency, and few works (Li et al., 2021) study the computational efficiency of relational modeling at scale. In addition, they can hardly model spatial multi-relational data whose relational linking structures at different spatial ranges are not originally given (e.g., image patches). EurNet is designed to model such data in a computationally efficient way.

Image modeling. After the dominance of convolutional vision backbones (He et al., 2016; Tan & Le, 2019) in the 2010s, researchers have rethought architectures for more effective image modeling in the 2020s. Vision Transformers (Dosovitskiy et al., 2020; Liu et al., 2021b; Wang et al., 2021) replace convolutions with the self-attention mechanism (Vaswani et al., 2017) to better capture non-local interactions and attain SoTA performance. Following these successes, modern convolutional architectures (Liu et al., 2022; Yang et al., 2022), all-MLP architectures (Tolstikhin et al., 2021; Touvron et al., 2021a) and vision GNNs (Han et al., 2022) are designed to aggregate long-range spatial context. Some earlier works (Chen et al., 2019b; Zhang et al., 2019; 2020) realize non-local modeling by graph convolution on fully-connected or dynamic graphs. By comparison, EurNet captures multi-range spatial interactions from a novel graph learning perspective, i.e., multi-relational modeling.

Protein structure modeling.
A variety of protein structure encoders have been developed to acquire informative protein representations at different levels of structural granularity, including residue-level structures (Gligorijević et al., 2021; Zhang et al., 2022), atom-level structures (Jing et al., 2021; Hermosilla et al., 2021) and protein surfaces (Gainza et al., 2020; Sverrisson et al., 2021). This work focuses on residue-level protein structure modeling. GearNet (Zhang et al., 2022) is a closely related work that explores multi-relational modeling of residue-level structures with short-range linking and relational graph convolution (RGConv). By comparison, our EurNet models a broader range of interactions, covering short, medium and long ranges, and it studies gated relational message passing (GRMP) as a more efficient and equally effective alternative to RGConv.

3.1. PROBLEM DEFINITION

This work studies data V = {v_i}_{i=1}^N with N data units (e.g., patches in an image, alpha carbons in a protein, etc.) that have the following structure: (1) spatial interaction on multiple ranges: data units can interact with each other across diverse spatial ranges; (2) multi-relational interaction: multiple interaction types (i.e., relations) exist between different units; (3) no canonical linking structure: the linking structures of multi-range interactions are not specified in the raw data.

To effectively model such spatial multi-relational data, the model is expected to have the following capabilities: (1) dynamic multi-range linking: the model can link relevant data units across different spatial ranges, and the linking structure can change along the whole model if desired; (2) multi-relational linking: the model divides all links into multiple groups based on their interaction types; (3) efficient multi-relational modeling: the model can propagate information among interacting units by taking their interaction types into consideration, and it does not introduce too much extra computation when involving more relations. Keeping all these requirements in mind, we next introduce the high-level designs of EurNet and present its detailed instantiations in Sec. 4.

3.2. MULTI-RANGE RELATIONAL GRAPH CONSTRUCTION

We regard each data unit v ∈ V as a node in the graph. Given the lack of a canonical linking structure among the nodes, we seek to build edges among them, especially considering their interactions on multiple spatial ranges and dynamically adjusting the graph structure if desired.

Multi-range relational edge construction. Given the concepts of spatial and semantic adjacency in a specific domain (e.g., computer vision or protein science), we construct three groups of edges, E_short = {(u, v, r) | r ∈ R_short}, E_medium = {(u, v, r) | r ∈ R_medium} and E_long = {(u, v, r) | r ∈ R_long}, to represent short-, medium- and long-range spatial interactions, where (u, v, r) denotes an edge from node u to node v with relation r, and R_short / R_medium / R_long is the set of relations for short-/medium-/long-range interactions. To capture interactions on different spatial ranges, all these edges are gathered into the edge set E = E_short ∪ E_medium ∪ E_long = {(u, v, r) | r ∈ R} with the integrated relation set R = R_short ∪ R_medium ∪ R_long. Now, the raw data V is structured as a multi-relational graph G = (V, E, R) that is aware of diverse types of interactions within the data.
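As a concrete illustration of this construction, the sketch below builds a typed edge list (u, v, r) from node coordinates and features. The radius threshold, the feature-space k-NN rule for the medium range, and the single virtual node for the long range are simplified assumptions for illustration, not the exact domain-specific rules of Sec. 4:

```python
import numpy as np

def build_multirange_edges(pos, feat, r_short=1.0, k_medium=4):
    """Sketch of multi-range relational edge construction.

    pos (N, d): node coordinates; feat (N, C): node features.
    Returns a list of (u, v, relation_id) triples.
    """
    n = len(pos)
    edges = []
    # Short range (relation 0): spatially adjacent nodes within r_short.
    dist = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
    for u in range(n):
        for v in range(n):
            if u != v and dist[u, v] <= r_short:
                edges.append((u, v, 0))
    # Medium range (relation 1): k-nearest neighbors in feature space,
    # excluding nodes already reachable in the short range.
    fdist = np.linalg.norm(feat[:, None] - feat[None, :], axis=-1)
    for v in range(n):
        order = np.argsort(fdist[:, v])
        picked = [u for u in order if u != v and dist[u, v] > r_short][:k_medium]
        edges.extend((u, v, 1) for u in picked)
    # Long range (relation 2): a virtual node (index n) linked to every node.
    edges.extend((n, v, 2) for v in range(n))
    return edges
```

The three returned edge groups play the roles of E_short, E_medium and E_long above; their union with the relation ids {0, 1, 2} forms the multi-relational graph.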

Dynamic edge construction.

A model can focus on different levels of semantics at different modeling stages. For example, in the image modeling problem we consider, a typical hierarchical image encoder (He et al., 2016; Liu et al., 2021b) is split into multiple stages, and it tends to encode low-level features in shallower stages and high-level semantics in deeper stages. To accommodate such a hierarchical modeling manner, our graph construction scheme is dynamically performed before each modeling stage based on the input features (e.g., node coordinates or representations) of that stage, so that each modeling stage can explore its own neighborhood structures of data units.
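The control flow of this per-stage reconstruction can be sketched as follows; `stages` and `build_graph` are hypothetical callables standing in for EurNet's modeling stages and graph construction layers:

```python
def forward_with_dynamic_graphs(x, stages, build_graph):
    """Dynamic graph construction, as a minimal sketch.

    Before each modeling stage, the graph is rebuilt from that stage's
    input features, so shallow stages link low-level features while
    deep stages link high-level semantics.
    """
    for stage in stages:
        graph = build_graph(x)   # e.g., k-NN edges over current features
        x = stage(x, graph)      # message passing conditioned on the graph
    return x
```

The key design choice is that `build_graph` consumes the *current* representation `x`, not the raw input, so the neighborhood structure adapts stage by stage.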

3.3. GATED RELATIONAL MESSAGE PASSING

To perform multi-relational modeling over the constructed graph G, the typical method, Relational Graph Convolution (RGConv) (Schlichtkrull et al., 2018), employs a distinct convolutional kernel matrix W_r to aggregate the messages of relation r, leading to |R| different kernel matrices in total for message aggregation from neighborhoods. Taking node v as an example, the RGConv layer updates its representation from z_v to z'_v as below:

z_v^aggr = Σ_{r∈R} Σ_{u∈N_r(v)} (1/|N_r(v)|) W_r z_u,    z'_v = W_self z_v + z_v^aggr,    (1)

where z_v^aggr is the aggregated message for node v, N_r(v) = {u | (u, v, r) ∈ E} is the set of v's neighbors with relation r, and W_self is the weight matrix for self-update (we omit all bias terms for brevity). We assume that, when introducing a new relation, the in-degree of each node increases by d on average. Under the efficient implementation of RGConv with sparse matrix multiplication, it can be shown that the floating-point operations (FLOPs) of RGConv with C-dimensional input and output node features take the following form (see Appendix A for proof):

FLOPs(RGConv) = |R| · (2 d |V| C + 2 |V| C²) + 2 |V| C² + |V| C.    (2)

Therefore, the computational cost scales with the relation number |R| by a factor of 2 d |V| C + 2 |V| C². Considering that both the node number |V| and the feature dimension C can be large in many applications, the 2 |V| C² term is the main obstacle to exploring more relations with moderate extra computation, which hurts the model capacity under a strict computational budget. For more efficient multi-relational modeling, we aim at an approach that (1) can effectively model the interactions among relational messages and among feature channels, and (2) has a gentle scaling behavior when modeling an increasing number of relations within the data. To attain this goal, we propose Gated Relational Message Passing (GRMP).
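Eq. (1) can be sketched in dense NumPy as follows; the function name, argument layout and explicit loops are illustrative simplifications of the sparse implementation analyzed in Appendix A:

```python
import numpy as np

def rgconv(z, edges, W_r, W_self):
    """Relational graph convolution update of Eq. (1), dense sketch.

    z: (N, C) node features; edges: list of (u, v, r) triples;
    W_r: (|R|, C, C) per-relation kernel matrices; W_self: (C, C).
    """
    n, c = z.shape
    aggr = np.zeros_like(z)
    for r in range(W_r.shape[0]):
        nbrs = [(u, v) for (u, v, rr) in edges if rr == r]
        deg = np.zeros(n)
        for _, v in nbrs:
            deg[v] += 1                      # |N_r(v)| for normalization
        for u, v in nbrs:
            aggr[v] += (W_r[r] @ z[u]) / deg[v]
    return z @ W_self.T + aggr               # self-update plus aggregation
```

Note that each relation carries a full C × C kernel W_r, which is exactly the source of the 2|V|C² per-relation term in Eq. (2).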
Inspired by light-weight separable graph convolution methods (Balcilar et al., 2020; Li et al., 2021) that aggregate neighborhood features in a channel-wise way, GRMP decomposes the relation-channel entangled aggregation of RGConv into (i) the aggregation of intra- and inter-relation messages on each individual channel and (ii) the aggregation of different feature channels. Specifically, it consecutively performs the following steps: (1) a pre-layer node-wise channel aggregation with the weight matrix W_in; (2) an intra-relation message aggregation through channel-wise graph convolution; (3) an inter-relation message aggregation by node-adaptive weighted summation; (4) a post-layer node-wise channel aggregation with the weight matrix W_out; and (5) the final node representation update that regards the aggregated neighborhood information as a gate. Formally, GRMP updates the representation of node v from z_v to z'_v as below:

z_v^aggr = W_out [ Σ_{r∈R} α_r(v) · Σ_{u∈N_r(v)} (1/|N_r(v)|) w_r ⊙ (W_in z_u) ],    z'_v = (W_self z_v) ⊙ z_v^aggr,    (3)

where steps (1)-(5) correspond to W_in, the channel-wise convolution with w_r, the weighted summation with α_r(v), W_out and the gated update, respectively; α(v) = W_α z_v ∈ R^{|R|} contains the attentive weights assigned to all relations on node v (W_α is the weight matrix for node-adaptive relation weighting); w_r is the channel-wise convolutional kernel vector for relation r (with the same shape as the node feature vector after step (1)); and ⊙ denotes the Hadamard product. The definitions of z_v^aggr, N_r(v) and W_self follow Eq. (1), and all biases are omitted. We analyze the components of GRMP in Appendix H.1 and its expressivity in Appendix B. We also provide a graphical illustration of the GRMP layer in Appendix C.

Under the efficient implementation with sparse matrix multiplication, GRMP consumes the following FLOPs when taking C-dimensional input and output node features (see Appendix A for proof):

FLOPs(GRMP) = |R| · (2 d + 7) |V| C + 6 |V| C².    (4)

Therefore, the relation number |R| scales the computational cost of GRMP with the factor (2 d + 7) |V| C.
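The five steps of Eq. (3) can be sketched densely as below; as with the RGConv sketch, the function name, argument layout and loops are illustrative simplifications of the sparse implementation:

```python
import numpy as np

def grmp(z, edges, n_rel, w_r, W_in, W_out, W_self, W_alpha):
    """Gated relational message passing update of Eq. (3), dense sketch.

    z: (N, C) node features; edges: (u, v, r) triples; w_r: (|R|, C)
    channel-wise kernel vectors; W_in/W_out/W_self: (C, C);
    W_alpha: (C, |R|).
    """
    n, c = z.shape
    z_in = z @ W_in.T                      # step 1: pre channel mixing
    alpha = z @ W_alpha                    # (N, |R|) per-node relation weights
    aggr = np.zeros((n, c))
    for r in range(n_rel):
        nbrs = [(u, v) for (u, v, rr) in edges if rr == r]
        deg = np.zeros(n)
        for _, v in nbrs:
            deg[v] += 1
        msg = np.zeros((n, c))
        for u, v in nbrs:                  # step 2: channel-wise convolution
            msg[v] += w_r[r] * z_in[u] / deg[v]
        aggr += alpha[:, r:r + 1] * msg    # step 3: node-adaptive weighting
    aggr = aggr @ W_out.T                  # step 4: post channel mixing
    return (z @ W_self.T) * aggr           # step 5: gated update
```

In contrast to RGConv, each relation here only carries a C-dimensional vector w_r and a scalar weight per node, which is why the per-relation cost in Eq. (4) is linear rather than quadratic in C.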
Compared to the scaling factor 2 d |V| C + 2 |V| C² of RGConv, this factor gets rid of the quadratic dependence on the feature dimension and thus leads to a gentler scaling behavior when increasing the number of considered relations. In Fig. 1, we compare the FLOPs of RGConv and GRMP when they respectively serve as the building block of EurNet-T for image modeling (image resolution: 224 × 224; "T" denotes the tiny-scale model). In this illustrative comparison, we simply connect each node (i.e., image patch) with its K nearest neighbors in terms of representation similarity, and the connection with the k-th nearest neighbor is regarded as the k-th relation, leading to K relations in total. We can observe that, when increasing the number of neighbors and thus the number of relations, the computational cost of the GRMP-based model increases much more gently than that of the RGConv-based one. This merit enhances the efficiency and effectiveness of GRMP-based models in real-world problems like image and protein structure modeling, as studied in the second paragraph of Sec. 5.3 and in Appendix H.3.
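The two scaling behaviors can also be checked numerically from Eqs. (2) and (4); the concrete values of |V|, C and d below are hypothetical, chosen only to illustrate the per-relation growth:

```python
def flops_rgconv(R, V, C, d):
    # Eq. (2): per-relation cost grows by 2*d*V*C + 2*V*C^2.
    return R * (2 * d * V * C + 2 * V * C**2) + 2 * V * C**2 + V * C

def flops_grmp(R, V, C, d):
    # Eq. (4): per-relation cost grows by only (2*d + 7)*V*C.
    return R * (2 * d + 7) * V * C + 6 * V * C**2

# Hypothetical setting: 3136 patches (a 56x56 grid), C = 96 channels,
# average added in-degree d = 8.
V, C, d = 3136, 96, 8
growth_rg = flops_rgconv(10, V, C, d) - flops_rgconv(9, V, C, d)
growth_gr = flops_grmp(10, V, C, d) - flops_grmp(9, V, C, d)
assert growth_rg == 2 * d * V * C + 2 * V * C**2
assert growth_gr == (2 * d + 7) * V * C
assert growth_gr < growth_rg   # GRMP adds far fewer FLOPs per new relation
```

Under these (assumed) numbers, the RGConv growth term is dominated by 2|V|C², while the GRMP growth term stays linear in C.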

4. INSTANTIATIONS OF EURNET

In the main paper, we focus on two application domains, i.e., computer vision and protein science, where modeling spatial multi-relational data (i.e., images and protein structures) can solve important problems. In Appendix G, we further study the effectiveness of EurNet on modeling an important kind of multi-relational data without spatial information, i.e., knowledge graphs.

• Edges for short-range interactions (|R_short| = 4). We connect each patch with its up, down, left and right patches and regard each direction of adjacency as a relation. These edges capture the one-hop spatial neighbors and thus the shortest-range spatial interactions of each image patch.

• Edges for medium-range interactions (|R_medium| = 1). In the medium range, a patch can interact with other patches sharing similar semantics (e.g., different body parts of the deer in Fig. 2). We thus connect each patch with its K nearest neighbors in terms of representation similarity measured by negative Euclidean distance (we analyze the sensitivity of K in Appendix I), and these edges share the same relation. All edges connecting two patches within the same 2×2 window are removed to avoid short-range linking.

• Edges for long-range interactions (|R_long| = 2). To model long-range interactions, we introduce two kinds of virtual nodes and the associated edges. (1) A virtual node for the whole-image representation is derived by global average pooling over all patch representations, and this virtual node is linked to all patches. (2) Per-patch virtual nodes for surrounding global context are obtained by a stack of depth-wise 2D convolutions (Yang et al., 2022) that aggregate each patch's contextual information with a large receptive field and low computation; an edge links each of these virtual nodes to its corresponding patch, with a long-range relation different from that in (1), since the two kinds of virtual nodes represent different levels of global context.
By gathering all these edges representing 7 different relations, we have the edge set E, the relation (i.e., edge type) set R and the full graph G = (V, E, R) for multi-relational image modeling.
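The short-range part of this construction (four directional relations on a patch grid) can be sketched as follows; the row-major indexing and relation ordering are illustrative assumptions:

```python
def image_short_edges(h, w):
    """Short-range edges on an h x w patch grid.

    Each patch links to its up/down/left/right neighbor, with one
    relation per direction (|R_short| = 4). Patches are indexed
    row-major; returns (u, v, relation_id) triples, where u is the
    neighbor in the given direction relative to patch v.
    """
    idx = lambda i, j: i * w + j
    dirs = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    edges = []
    for i in range(h):
        for j in range(w):
            for r, (di, dj) in enumerate(dirs):
                ni, nj = i + di, j + dj
                if 0 <= ni < h and 0 <= nj < w:   # skip out-of-grid neighbors
                    edges.append((idx(ni, nj), idx(i, j), r))
    return edges
```

Border patches simply have fewer incoming short-range edges, since out-of-grid directions are skipped rather than padded.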

4.1.2. MODEL ARCHITECTURE

General architecture. In general, we follow the hierarchical image modeling architecture proposed by Swin Transformer (Liu et al., 2021b), which has been verified as a superior architecture and applied to many vision backbones (Liu et al., 2022; Yang et al., 2021; 2022). Specifically, the whole model is divided into four stages that (1) reduce the number of patches (i.e., nodes in our graph) to a quarter between consecutive stages, and (2) use an increasing number of feature channels, [C, 2C, 4C, 8C], across the stages. Each stage contains multiple modeling blocks, where each block consists of a GRMP layer (Sec. 3.3) for relational message passing and a feed-forward network (FFN) (Vaswani et al., 2017) for feature transformation. We adjust the number of feature channels and the number of blocks in each stage to get a model series with increasing capacity, i.e., EurNet-T, EurNet-S and EurNet-B. The detailed architectures of these models are displayed in Appendix D.

Graph construction layers. To adapt to the multi-stage modeling manner, we put a graph construction layer before each modeling stage of EurNet-T/S/B. In this way, the multi-relational graph G is reconstructed based on the locations and representations of the patches fed into each stage. In particular, along the modeling stages, the edges for medium-range interactions are expected to capture semantic neighbors on different semantic levels (e.g., from the relevance of low-level features to the relevance of high-level semantics), as studied in Sec. 5.4.

4.2.1. RELATIONAL EDGES FOR SHORT-, MEDIUM-AND LONG-RANGE INTERACTIONS

In this work, we consider the alpha carbon (i.e., Cα) graph as the representation of protein structure, which is an informative and light-weight summary of the overall protein 3D structure and is widely used in the literature (Gligorijević et al., 2021; Baldassarre et al., 2021; Zhang et al., 2022) (see Appendix E for a preliminary introduction to protein structure). Specifically, we extract all Cαs as the node set V of our graph, which, at this point, is merely a set of separate points in 3D space, since there are no chemical bonds among Cαs. To describe the multi-range spatial interactions within a protein, we build the following relational edges (see Fig. 3 for a graphical illustration):

• Edges for short-range interactions (|R_short| = 6). We adopt two kinds of short-range edges proposed by Zhang et al. (2022). (1) Sequential edges connect the Cα nodes within a distance of 2 on the protein sequence, where each of the sequential distances {-2, -1, 0, 1, 2} is regarded as a single relation (i.e., 5 relations in total). (2) Radius edges connect the Cα nodes within a Euclidean distance of 10 angstroms, and all radius edges share the same relation.

• Edges for medium-range interactions (|R_medium| = 2). To capture medium-range interactions exclusively, for each Cα node, we first filter out all its neighbors within a sequential distance of 5 or within a Euclidean distance of 10 angstroms. We then connect it with the remaining nodes that are its 5 nearest and 5th∼10th nearest neighbors (measured by Euclidean distance), and the connections with these two sets of neighbors are regarded as two different relations.

• Edges for long-range interactions (|R_long| = 1). To capture the interactions beyond the short- and medium-range interactions above, we introduce a virtual node representing the whole protein by taking global average pooling over all Cα representations, and this virtual node is linked to all Cα nodes with a single relation.
These edges make each Cα aware of the status of all other Cαs, so the long-range interactions beyond short and medium ranges can be captured. We gather all these edges with 9 different relations into the edge set E and the relation set R, which, together with V, derive the full graph G = (V, E, R) for multi-relational protein structure modeling.
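The short-range portion of this Cα graph (sequential edges with 5 relations plus radius edges as a 6th relation) can be sketched as below; the medium-range neighbors and the virtual node are omitted for brevity, so this is only a partial illustration:

```python
import numpy as np

def ca_graph_edges(coords, seq_cut=2, radius=10.0):
    """Short-range Cα edges, as a sketch.

    coords: (N, 3) Cα coordinates, ordered along the protein sequence.
    Sequential offsets in {-2, ..., 2} map to relations 0..4; radius
    edges within `radius` angstroms get relation id 5.
    Returns (u, v, relation_id) triples.
    """
    n = len(coords)
    edges = []
    for u in range(n):
        for v in range(n):
            off = v - u
            if abs(off) <= seq_cut:
                edges.append((u, v, off + seq_cut))   # relations 0..4
            if u != v and np.linalg.norm(coords[u] - coords[v]) <= radius:
                edges.append((u, v, 5))               # radius relation
    return edges
```

Note that offset 0 yields a self-loop relation, mirroring the inclusion of sequential distance 0 in the relation set above.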

4.2.2. MODEL ARCHITECTURE

This work focuses on comparing the graph construction and message passing schemes of EurNet against the SoTA GearNet (Zhang et al., 2022), and we thus follow its single-stage model architecture for a fair comparison. Specifically, EurNet performs graph construction once before this single modeling stage, and the input node feature is the one-hot encoding of each Cα's corresponding amino acid. Upon these inputs, six GRMP layers (Sec. 3.3) are stacked for relational modeling. After each layer, the sum pooling over all Cα representations serves as a whole-protein representation, and these per-layer protein representations are concatenated to produce the final output. Upon this output, EurNet performs a downstream task by appending a task-specific prediction head. We leave the design of a protein structure encoder with multiple modeling stages as future work. Note that, in EurNet, all graph construction and message passing operations rely only on quantities (e.g., sequential and Euclidean distances) that are invariant to translation, rotation and reflection. Therefore, EurNet satisfies E(3)-invariance (Mumford et al., 1994).
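The per-layer pooling and concatenation readout described above can be sketched as follows, with the GRMP layer internals abstracted as callables:

```python
import numpy as np

def protein_encoder(x, layers):
    """Single-stage readout sketch.

    x: (N, C) Cα node features; layers: a list of callables, each
    mapping (N, C) -> (N, C) (standing in for GRMP layers).
    After each layer, node features are sum-pooled into a protein
    vector; per-layer vectors are concatenated as the final output.
    """
    per_layer = []
    for layer in layers:
        x = layer(x)                      # one message passing layer
        per_layer.append(x.sum(axis=0))   # sum pooling over Cα nodes
    return np.concatenate(per_layer)      # shape: (num_layers * C,)
```

With six GRMP layers of width C, the final protein representation is thus a 6C-dimensional vector fed into the task-specific head.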

5.1.1. BASELINE METHODS

We conduct point-by-point comparisons between our EurNet series, the SoTA ConvNeXt (Liu et al., 2022) and FocalNet (Yang et al., 2022) series, and other standard series including Swin Transformer (Liu et al., 2021b), FocalAtt (Yang et al., 2021) and ViG (Han et al., 2022). For completeness, we also report the results of EffNet (Tan & Le, 2019), EffNetV2 (Tan & Le, 2021), ViT (Dosovitskiy et al., 2020), DeiT (Touvron et al., 2021b), PVT (Wang et al., 2021), Mixer (Tolstikhin et al., 2021), gMLP (Liu et al., 2021a) and ResMLP (Touvron et al., 2021a) in applicable cases.

Throughput analysis. The throughput of EurNet is higher than FocalAtt but lower than FocalNet and ConvNeXt. We point out that 2D convolutions (i.e., the core of FocalNet and ConvNeXt) are well supported by CUDA kernels, while such support is still ongoing for graph operations (Chen et al., 2020; Min et al., 2021). Further speedups of EurNet are expected under more mature CUDA support.

5.1.3. OBJECT DETECTION ON COCO

Setups. This experiment benchmarks the object detection and instance segmentation performance on COCO 2017 (Lin et al., 2014). All models are trained on the 118K training images and evaluated on the 5K validation images.

We compare with the SoTA GearNet (Zhang et al., 2022) under two settings, i.e., with and without edge message passing ("-Edge" in Tab. 4). We also include other baselines, i.e., 3DCNN_MQA (Derevyanko et al., 2018), GCN (Kipf & Welling, 2016), GAT (Veličković et al., 2017), GVP (Jing et al., 2021), GraphQA (Baldassarre et al., 2021) and New IEConv (Hermosilla & Ropinski, 2022), for complete comparisons.

5.2.2. PROTEIN FUNCTION PREDICTION

Setups. This set of experiments compares different protein structure encoders on the EC (Gligorijević et al., 2021) and GO (Gligorijević et al., 2021) protein function prediction benchmarks. We follow GearNet to report the protein-centric maximum F-score F_max, a commonly-used metric in the CAFA challenges (Radivojac et al., 2013). More dataset, model and training details are in Appendix F.4.

Results. In Tab. 4, we can observe that EurNet consistently outperforms GearNet on all four tasks, and the performance gains are preserved after involving edge message passing (details of edge message passing are stated in Appendix F.4). Since EurNet follows the single-stage model architecture of GearNet, we can attribute the gains to the medium- and long-range interaction modeling and the GRMP-based multi-relational modeling, which are the novel modeling mechanisms in EurNet.

5.3. ABLATION STUDY

Effect of multi-range relational edges. In Tab. 5, we evaluate EurNet-T on ImageNet-1K with different ranges of edges. When using a single range, the model with long-range edges achieves the highest accuracy of 81.7%, which verifies the importance of capturing long-range interactions in image classification. By further adding short- or medium-range edges, the performance is promoted to 82.0%, where more fine-grained local interactions are captured. By using all three ranges of edges, the full model of EurNet-T obtains 82.3% accuracy, which demonstrates the complementarity of short-, medium- and long-range edges. The ablation study for protein structure modeling is in Appendix H.2.

Effect of GRMP layer. In Tab. 6, we compare RGConv and GRMP under comparable parameter counts, FLOPs and throughput. (1) GRMP's dimensions are first set to [96, 192, 384, 768] for the four stages. To reach a comparable cost, RGConv can only have dimensions of [84, 168, 336, 672], and it achieves a lower accuracy of 81.5% than GRMP's 82.3%. (2) After increasing RGConv's dimensions to [96, 192, 384, 768], it matches GRMP's performance while introducing more cost (1.3G more FLOPs). Under a comparable cost, GRMP can instead use dimensions of [108, 216, 432, 864], leading to a higher accuracy of 82.7%. These results demonstrate the better efficiency-performance trade-off of GRMP. The ablation study for protein structure modeling is in Appendix H.3.

5.4. VISUALIZATION

Fig. 4 displays some medium-range edges built by the EurNet-T trained on ImageNet-1K. The edges in the 2nd stage connect patches with similar low-level features (e.g., the patches of red dog ears in Fig. 4(a)), while the edges in the 4th stage connect semantically relevant patches (e.g., different body parts of the two dogs in Fig. 4(a)), which demonstrates EurNet-T's hierarchical image modeling ability.

6. CONCLUSIONS AND FUTURE WORK

This work proposes EurNet to model spatial multi-relational data such as image patches and protein alpha carbons. EurNet builds relational edges on multiple spatial ranges to describe the interactions within the data, and it uses the gated relational message passing (GRMP) layer to model the constructed multi-relational graph, which efficiently adapts to large data and model scales. The instantiations of EurNet achieve superior performance on various image and protein structure modeling tasks. In future work, we will adapt EurNet to more tasks in other domains, like 3D point cloud modeling for object and scene understanding, and we will explore a general hierarchical multi-relational modeling method for data from various domains.

REPRODUCIBILITY STATEMENT

For the sake of reproducibility, we use Tab. 7 to provide the detailed architectures of EurNet-T, EurNet-S and EurNet-B for image modeling, state the detailed single-stage architecture of EurNet for protein structure modeling in Sec. 4.2.2, and describe the model configurations for each specific task in Appendix F. We also state the detailed training configurations of all considered tasks in Appendix F. The derivations of the FLOPs of RGConv (Eq. (2)) and GRMP (Eq. (4)) are provided in Appendix A. In the supplementary material, we submit the source code for reproducing the ImageNet classification experiments; the source code for the COCO object detection, ADE20K semantic segmentation, and EC and GO protein function prediction experiments will be released publicly upon acceptance.

A FLOPS OF RGCONV AND GRMP

For FLOPs computation, we consider the multi-relational graph G = (V, E, R) with node set V, edge set E and relation (i.e., edge type) set R, and both input and output node features are with C feature channels. In addition, we assume that, when introducing a new relation, the in-degree of each node will increase by d on average. Proposition 1. To process the assumed multi-relational graph, the Relational Graph Convolution (RGConv) consumes the FLOPs as below under the efficient implementation with sparse matrix multiplication: FLOPs(RGConv) = |R| • (2 d|V|C + 2|V|C 2 ) + 2|V|C 2 + |V|C. Proof. We divide the computation of RGConv into three steps and compute the FLOPs of each step: 1 In the first step, the adjacency of all node pairs on |R| different relations are summarized in the adjacency matrix A ∈ R |V|×|R||V| , where the element A i,(j-1)|R|+k indicates the weight of the edge from the i-th node to the j-th node with the k-th relation: A i,(j-1)|R|+k = 1 |Nr k (vj )| there is an edge from i-th node to j-th node with k-th relation, 0 otherwise, (5) where N r k (v j ) = {u|(u, v j , r k ) ∈ E} is the neighborhood set of node v j with relation r k . Using this adjacency matrix, each node will have |R| different slots to receive the relational messages passed to it. All relational message passing operations can be realized by a sparse matrix multiplication: Z = A ⊤ Z, where Z ∈ R |V|×C denotes input node features, and Z ∈ R |R||V|×C denotes the relational slots of all nodes after message passing. By utilizing the sparsity of the adjacency matrix, this step consumes following FLOPs: FLOPs(RGConv-1 ) = 2|E|C = 2 d|R||V|C. ( ) 2 In the second step, we first integrate the relational slots of each node to get the reshaped Z ∈ R |V|×|R|C . At this time, each node is represented by a |R|C-dimensional vector, i.e., the aggregated messages of all relations. 
Next, we concatenate the convolutional kernel matrices of all relations to produce W_conv ∈ R^{|R|C×C}, and this matrix is applied to Ẑ to combine the messages within each relational slot and aggregate messages across different relations: Z_aggr = Ẑ W_conv, where Z_aggr ∈ R^{|V|×C} denotes the aggregated neighborhood information of each node. This step has the following FLOPs:

FLOPs(RGConv-2) = 2|R||V|C^2.   (7)

(3) In the final step, a self-update with the matrix W_self ∈ R^{C×C} is first performed on the input feature of each node, and the self-updated node feature is then added to the aggregated neighborhood information:

Z' = Z W_self + Z_aggr,   (8)

where Z' ∈ R^{|V|×C} denotes the output node features. This step has the following FLOPs:

FLOPs(RGConv-3) = 2|V|C^2 + |V|C.   (9)

Therefore, by summing up the computational cost of the three steps, RGConv consumes the following FLOPs in total:

FLOPs(RGConv) = |R| · (2d|V|C + 2|V|C^2) + 2|V|C^2 + |V|C.   (10)

Proposition 2. To process the assumed multi-relational graph, the Gated Relational Message Passing (GRMP) consumes the following FLOPs under the efficient implementation with sparse matrix multiplication:

FLOPs(GRMP) = |R| · (2d + 7)|V|C + 6|V|C^2.   (11)

Proof. Following the steps of GRMP stated in Eq. (3), we compute the FLOPs of each step:

(1) In the first step, we conduct a pre-layer node-wise channel aggregation with the weight matrix W_in ∈ R^{C×C}:

Z_in = Z W_in,   (12)

where Z ∈ R^{|V|×C} denotes the input node features, and Z_in ∈ R^{|V|×C} denotes the channel-aggregated node features. This step consumes the following FLOPs:

FLOPs(GRMP-1) = 2|V|C^2.   (13)

(2) In the second step, we first gather the messages within the same relation for each node, realized by the sparse matrix multiplication between Z_in and the adjacency matrix A ∈ R^{|V|×|R||V|} (A is defined identically to step (1) of Proposition 1):

Z̃_in = A^⊤ Z_in,   (14)

where Z̃_in ∈ R^{|R||V|×C} represents the relational slots of all nodes after message passing.
The relational slots of each node are then integrated to obtain the reshaped Z̃_in ∈ R^{|V|×|R|C}. By concatenating the convolutional kernel vectors of all relations, we have w_conv ∈ R^{|R|C×1}, and this vector is broadcast to all nodes to perform channel-wise message aggregation via the Hadamard product:

Z̃_aggr = (1_conv w_conv^⊤) ⊙ Z̃_in,   (15)

where 1_conv ∈ R^{|V|×1} is the all-one vector for broadcasting, and Z̃_aggr ∈ R^{|V|×|R|C} denotes the relational slots of all nodes after intra-relation message aggregation. To conduct the operations in Eqs. (14) and (15), this step consumes the following FLOPs:

FLOPs(GRMP-2) = 2|E|C + 2|R||V|C = 2d|R||V|C + 2|R||V|C.   (16)

(3) In the third step, we first compute the attentive weights assigned to all relations on each node:

M_α = Z W_α,   (17)

where W_α ∈ R^{C×|R|} is the weight matrix for node-adaptive relation weighting, and M_α ∈ R^{|V|×|R|} denotes the relation weights on all nodes. After that, a weighted summation aggregates the messages of different relations in Z̃_aggr (in this operation, we use the reshaped Z̃_aggr ∈ R^{|V|×|R|×C} and the reshaped M_α ∈ R^{|V|×|R|×1}):

Z̄_aggr = Σ_{i=1}^{|R|} (M_α[:, i, :] 1_α^⊤) ⊙ Z̃_aggr[:, i, :],   (18)

where 1_α ∈ R^{C×1} is the all-one vector for broadcasting relation weights to all feature channels, and Z̄_aggr ∈ R^{|V|×C} denotes the per-node neighborhood representations after inter-relation message aggregation. To perform Eqs. (17) and (18), this step has the following FLOPs:

FLOPs(GRMP-3) = 2|R||V|C + |R| · 2|V|C + (|R| − 1)|V|C = 5|R||V|C − |V|C.   (19)

(4) The fourth step conducts a post-layer node-wise channel aggregation with the weight matrix W_out ∈ R^{C×C}:

Z_aggr = Z̄_aggr W_out,   (20)

where Z_aggr ∈ R^{|V|×C} denotes the channel-aggregated neighborhood representations. This step consumes the following FLOPs:

FLOPs(GRMP-4) = 2|V|C^2.   (21)
(5) In the final step, the input feature of each node first undergoes a self-update with the weight matrix W_self ∈ R^{C×C}, and the self-updated node feature is further updated by its neighborhood representation via a gating mechanism:

Z' = Z W_self ⊙ Z_aggr,   (22)

where Z' ∈ R^{|V|×C} denotes the output node features. This step has the following FLOPs:

FLOPs(GRMP-5) = 2|V|C^2 + |V|C.   (23)

Therefore, by summing up the computational cost of the five steps, GRMP has the following total FLOPs:

FLOPs(GRMP) = |R| · (2d + 7)|V|C + 6|V|C^2.   (24)

B ANALYSIS OF MODEL EXPRESSIVITY

In this section, we study the expressivity of the proposed GRMP layer (Sec. 3.3). Specifically, we introduce a variant of the Weisfeiler-Leman (WL) algorithm (Morris et al., 2019) on multi-relational graphs and show that there exists a parameterization of GRMP that is as expressive as this multi-relational WL algorithm.

Following the philosophy of the 1-dimensional Weisfeiler-Leman (1-WL) algorithm (Morris et al., 2019), we define the multi-relational 1-WL (1-RWL) algorithm. This algorithm studies a labeled multi-relational graph G = (V, E_1, ..., E_|R|, l), where V is the node set, E_i denotes the edge set associated with the i-th relation, and l is the label function that assigns initial node features. The 1-RWL computes a node coloring C^(t): V → N for each iteration t ⩾ 0, and the initial coloring C^(0) is consistent with the label function l (i.e., one unique color for the nodes with a specific label). For iteration t > 0, 1-RWL updates the color of each node v ∈ V based on the colors of itself and of its neighbors under the different relations in the last iteration:

C^(t)(v) := HASH( C^(t-1)(v), {{ (C^(t-1)(u), i) | i ∈ [|R|], u ∈ N_i(v) }} ),  ∀v ∈ V,   (25)

where [|R|] = {1, ..., |R|} denotes the indices of all relations, N_i(v) is the neighborhood set of node v under the i-th relation, and {{...}} denotes a multiset.
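A minimal sketch of this refinement step follows, using Python's built-in hash in place of an idealized injective HASH; the helper name and data layout are illustrative:

```python
def rwl_refine(colors, neighbors, num_relations, iters=1):
    """One or more iterations of 1-RWL color refinement.

    colors: dict mapping node -> initial color (the label function l).
    neighbors: dict mapping (node, relation_index) -> list of in-neighbors.
    """
    for _ in range(iters):
        new_colors = {}
        for v in colors:
            # Multiset of (neighbor color, relation) pairs; sorting makes
            # the encoding order-independent before hashing.
            multiset = tuple(sorted(
                (colors[u], i)
                for i in range(num_relations)
                for u in neighbors.get((v, i), [])
            ))
            new_colors[v] = hash((colors[v], multiset))  # the HASH(...) step
        colors = new_colors
    return colors


# Toy check: nodes 0 and 1 each see node 2, but through different relations,
# so 1-RWL separates them, while a relation-blind refinement would not.
colors0 = {0: 0, 1: 0, 2: 0}
nbrs = {(0, 0): [2], (1, 1): [2]}
refined = rwl_refine(colors0, nbrs, num_relations=2)
```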
To test the isomorphism of two multi-relational graphs G and G', the 1-RWL algorithm is run in parallel on both graphs. If, at some iteration, the number of nodes assigned a specific color differs between the two graphs, it is concluded that G and G' are non-isomorphic. The algorithm terminates when the color assignments do not change across two consecutive iterations, which is reached after at most max{|V|, |V'|} iterations (V and V' are the node sets of the two graphs). Just as for the 1-WL test (Cai et al., 1992), identical color assignments along the whole process of 1-RWL cannot guarantee the isomorphism of the two graphs, but it remains a powerful heuristic for (1) distinguishing the nodes with different structural roles in a multi-relational graph and (2) distinguishing non-isomorphic multi-relational graphs (Babai & Kucera, 1979).

We next compare the expressivity of the GRMP layer and the 1-RWL algorithm. For a multi-relational graph G, we denote by F ∈ R^{|V|×d} the node feature matrix and by A_i ∈ R^{|V|×|V|} the adjacency matrix of the i-th relation (i ∈ [|R|]). The node feature update rule of GRMP can be written as:

F' = (F W_self + b_self) ⊙ [ ( Σ_{i∈[|R|]} α_i A_i (F W_in + b_in) ) W_out + b_out ],   (26)

where W_self, W_in and W_out are the parameter matrices (i.e., weights), b_self, b_in and b_out are the parameter vectors (i.e., biases), and α_i (i ∈ [|R|]) are the per-relation scaling factors. Note that, in this analysis, we simplify GRMP's intra- and inter-relation message aggregation (steps (2) and (3) in Eq. (3)) to per-relation scaling; the expressivity of this simplified GRMP is upper bounded by that of the original one. Following Morris et al. (2019), we consider the model with a stack of GRMP layers and denote the sequence of GRMP parameters W^(t)_GRMP up to the t-th layer as:

W^(t)_GRMP = ( W^(t')_self, W^(t')_in, W^(t')_out, b^(t')_self, b^(t')_in, b^(t')_out, α_i )_{t' ⩽ t, i ∈ [|R|]}.
On this basis, we next show that there exist parameters W^(t)_GRMP such that the corresponding model is as expressive as the coloring C^(t) in terms of distinguishing nodes in multi-relational graphs.

Theorem 1. Let G = (V, E_1, ..., E_|R|, l) be a labeled multi-relational graph. For all t ⩾ 0, there exist initial node features and a sequence W^(t)_GRMP of GRMP parameters such that the following holds:

C^(t)(v) = C^(t)(w) ⇐⇒ F^(t)_v = F^(t)_w,  ∀v, w ∈ V.

In other words, the node feature matrix F^(t) is equivalent to the coloring function C^(t) of 1-RWL at all iterations.

Proof. (1) Prerequisites. Following Morris et al. (2019), a matrix is called row-independent modulo equality if the set of all distinct rows of the matrix is linearly independent. For two coloring functions C_1 and C_2 of G, we denote their equivalence by C_1 ≡ C_2 if they define the same partition of the node set V. We prove the result by induction.

(2) Base case. For t = 0, we define the initial node features F^(0) to be row-independent modulo equality and consistent with the label function l (e.g., the one-hot encoding of node labels). Since the initial coloring function satisfies C^(0) = l, we can conclude the equivalence of C^(0) and F^(0), i.e., C^(0)(v) = C^(0)(w) ⇔ F^(0)_v = F^(0)_w, ∀v, w ∈ V.

(3) Induction step. For t ⩾ 0, we assume that C^(t) and F^(t) are equivalent and that F^(t) is row-independent modulo equality. The coloring C^(t+1) of 1-RWL at iteration t+1 is derived by applying a 1-RWL step to the coloring C^(t):

C^(t+1)(v) := HASH( C^(t)(v), {{ (C^(t)(u), i) | i ∈ [|R|], u ∈ N_i(v) }} ),  ∀v ∈ V.

Let q be the number of distinct colors defined by C^(t), and let Q_1, ..., Q_q be the corresponding node subsets partitioned by C^(t).
The updated coloring C^(t+1) can be equivalently represented by the matrix D ∈ R^{|V|×q(|R|+1)} with the following entries:

D_{vk} = |N_i(v) ∩ Q_j| if k = iq + j for i ∈ [|R|], j ∈ [q];  1 if k ∈ [q] and v ∈ Q_k;  0 otherwise,   (29)

in which the row of node v is the concatenation of a one-hot encoding of v's color and, for each i ∈ [|R|], a vector encoding the multiset of the colors in N_i(v). Based on the row encodings in D, we can partition the node set V into subsets (nodes in the same subset share the same row encoding) and assign a unique color to each subset, which defines a coloring function C_D. The equivalence C_D ≡ C^(t+1) holds.

Since C^(t) and F^(t) are assumed to be equivalent, there are q distinct rows in F^(t), each corresponding to one of the q colors defined by C^(t). Let F̃^(t) ∈ R^{q×d} be the matrix composed of these distinct rows, ordered according to Q_1, ..., Q_q. By assumption, the rows of F̃^(t) are linearly independent, so there exists a matrix M ∈ R^{d×q} such that F̃^(t) M ∈ R^{q×q} is the identity matrix. By extension, the matrix F^(t) M ∈ R^{|V|×q} has entries:

(F^(t) M)_{vj} = 1 if v ∈ Q_j, and 0 otherwise.   (30)

Note that the matrix D defined in Eq. (29) can be viewed as a block matrix D = [B_0 B_1 ... B_|R|], where B_0 = F^(t) M ∈ N^{|V|×q} and B_i = A_i F^(t) M ∈ N^{|V|×q} for each i ∈ [|R|]. Since each element of D is upper bounded by |V| − 1, we follow polynomial coding to assign each B_i (i ∈ [|R|]) the polynomial term |V|^i and to assign B_0 the term |V|^{|R|+1}, defining the following matrix E ∈ N^{|V|×q}, which is equivalent to D in terms of partitioning nodes based on the rows of the matrix:

E = ( |V|^{|R|+1} · F^(t) M + 1 ) ⊙ ( Σ_{i∈[|R|]} |V|^i · A_i F^(t) M + 1 ),   (31)

where 1 ∈ N^{|V|×q} is an all-one matrix of matching shape. The matrix E defines a coloring function C_E in the same way as the matrix D, and the equivalence C_E ≡ C_D holds. By aligning the update rule of GRMP (Eq. (26)) with the definition of matrix E (Eq. (31)), we adopt the parameterization W_self = |V|^{|R|+1} M, W_in = M, W_out = I_q (the q × q identity matrix), b_self = 1, b_in = 0_{|V|,q} (the |V| × q zero matrix), b_out = 1, and α_i = |V|^i.
In this way, the node feature matrix F^(t+1) at iteration t+1 is updated from F^(t) as:

F^(t+1) = ( |V|^{|R|+1} · F^(t) M + 1 ) ⊙ [ ( Σ_{i∈[|R|]} |V|^i · A_i (F^(t) M + 0_{|V|,q}) ) I_q + 1 ] = E.

Therefore, the coloring function C_{F^(t+1)} defined by the updated node feature matrix F^(t+1) satisfies C_{F^(t+1)} ≡ C_E ≡ C_D ≡ C^(t+1). In particular, we have:

C^(t+1)(v) = C^(t+1)(w) ⇔ F^(t+1)_v = F^(t+1)_w,  ∀v, w ∈ V.

(4) Conclusion. Since both the base case and the induction step have been proved, we conclude that:

C^(t)(v) = C^(t)(w) ⇐⇒ F^(t)_v = F^(t)_w,  ∀t ⩾ 0, ∀v, w ∈ V.

This result proves the equivalent expressivity of the 1-RWL algorithm and the model constructed from simplified GRMP layers (Eq. (26)). Therefore, a model constructed from standard GRMP layers (Eq. (3)) is at least as expressive as the 1-RWL algorithm.

C GRAPHICAL ILLUSTRATION OF GRMP

In Fig. 5, we graphically illustrate the mechanism of node representation update in the GRMP layer. Specifically, GRMP updates the node representation matrix from Z to Z' with the following steps: (1) a linear layer transforms the input node representations Z ∈ R^{|V|×C} to Z_in ∈ R^{|V|×C}, aggregating the feature channels of each node at the beginning of the layer; (2) for each node, its neighbors are assigned to different groups according to their relations with the node, and the neighbors in each group are aggregated in a channel-wise way.

D DETAILED MODEL ARCHITECTURE FOR IMAGE MODELING

For image modeling, we basically follow the hierarchical architecture proposed by Swin Transformer (Liu et al., 2021b), as summarized in Tab. 7. The architecture begins with a patch embedding module implemented by a non-overlapping 2D convolution. After that, the model is split into 4 modeling stages: (1) the number of patches (i.e., nodes in our graph) is reduced to a quarter between consecutive stages by the "PatchMerging" operation (Liu et al., 2021b); (2) increasing numbers of feature channels [C, 2C, 4C, 8C] are used across the stages. We place a graph construction layer before each modeling stage to update the multi-relational graph structure. For the first stage, we only use short- and long-range edges to reduce the computational cost (computing medium-range edges by representation similarity comparison is expensive in the first stage, which has many patches); the relational edges of all three ranges are adopted in the last three stages. Each stage is composed of multiple modeling blocks, where each block contains a GRMP layer (Sec. 3.3) for relational message passing and a feed-forward network (FFN) (Vaswani et al., 2017) for feature transformation. In the end, a global average pooling layer produces the whole-image representation, and a linear head outputs the final prediction. We adjust the number of feature channels and the number of blocks in each stage to derive EurNet-T, EurNet-S and EurNet-B with standard numbers of parameters and FLOPs. We implement the models based on the PyTorch (Paszke et al., 2017) deep learning library.
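The stage layout described above can be sketched as follows; `base_c = 96` and `blocks = (2, 2, 6, 2)` are illustrative tiny-model-style values, not the exact EurNet configurations (see Tab. 7 for those):

```python
def stage_plan(img_hw=224, patch=4, base_c=96, blocks=(2, 2, 6, 2)):
    """Sketch of the 4-stage hierarchy: patch embedding yields
    (img_hw // patch)^2 patches, PatchMerging quarters the patch count
    between stages, and channels follow [C, 2C, 4C, 8C]."""
    num_patches = (img_hw // patch) ** 2
    plan = []
    for s, num_blocks in enumerate(blocks):
        plan.append({"stage": s + 1,
                     "patches": num_patches,
                     "channels": base_c * (2 ** s),
                     "blocks": num_blocks})
        num_patches //= 4  # PatchMerging before the next stage
    return plan
```

For a 224×224 input with 4×4 patches, this gives 3136 patches in stage 1 down to 49 patches in stage 4.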

E INTRODUCTION TO PROTEIN STRUCTURE

Proteins are macromolecules that perform critical biological functions in living organisms. A protein has multiple levels of structure, as described below:

• Primary structure (Fig. 6(a)). At the chemical level, a protein is composed of one or multiple chains of amino acid residues, forming the protein sequence, i.e., the primary structure. In the protein sequence s = (s_1, s_2, ..., s_L), each element s_l denotes a type of amino acid (there are 20 common amino acids and two rare ones, i.e., Selenocysteine and Pyrrolysine). The primary structure gives the sequential order of amino acids in a protein, but otherwise it does not reveal any information about the 3D folded structure of the protein. This limits its usefulness for the analysis and prediction of protein functions, due to the principle that "protein folded structures largely determine their functions" (Harms & Thornton, 2010).

• Secondary structure (Fig. 6(b)). The secondary structures of a protein are repeatedly occurring local structures like the α-helices shown in Fig. 6(b). These structures are stabilized by hydrogen bonds and, together with the tight turns and flexible loops in between, they constitute the complete protein folded structure.

• Tertiary structure (Fig. 6(c)). The spatial arrangement of the different secondary structure components leads to the formation of the tertiary structure (i.e., the folded structure of a protein). The tertiary structure is jointly held together by short-range interactions like hydrogen bonding and long-range interactions like hydrophobic interactions. Thanks to recent advances in highly accurate deep-learning-based protein structure predictors (Jumper et al., 2021; Baek et al., 2021), we can now efficiently acquire numerous previously unknown protein tertiary structures with reasonable confidence. These advances are expected to promote the understanding of protein functions based on tertiary structures.
In this work, we focus on protein function prediction tasks based on tertiary structures. Specifically, we adopt an informative and lightweight representation format, i.e., all alpha carbons (Cαs) of the tertiary structure (Fig. 6(d)), which is widely used in the literature (Gligorijević et al., 2021; Baldassarre et al., 2021; Zhang et al., 2022). A Cα can be seen as the center of its corresponding amino acid, and thus the overall tertiary structure of a protein is well captured by the collection of all Cαs. By themselves, the Cαs are a set of separate points in 3D space, since there are no chemical bonds among them. To better describe the interactions within a protein, we construct edges among the Cαs, leading to a more informative representation format, i.e., the Cα graph.

Model configurations. The whole model architectures of EurNet-T, EurNet-S and EurNet-B are presented in Tab. 7. For medium-range edges, the 12 nearest semantic neighbors of each patch are linked to it to capture medium-range interactions. For long-range edges, we compute the representations of per-patch global-context virtual nodes by a stack of depth-wise 2D convolutions with an accumulative receptive field of 7, and these virtual nodes are linked to their corresponding patches.
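The medium-range edge construction just mentioned (linking each node to its nearest semantic neighbors) can be sketched as below; the squared-Euclidean similarity and the function name are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def medium_range_edges(feats, k=2):
    """Connect each node to its k nearest semantic neighbors, i.e. the
    nodes with the most similar feature vectors."""
    # Pairwise squared Euclidean distances between feature vectors.
    d2 = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)            # exclude self-matches
    nbr_idx = np.argsort(d2, axis=1)[:, :k]
    # Edges point from each selected neighbor to the query node.
    return [(int(j), i) for i in range(len(feats)) for j in nbr_idx[i]]
```

On a toy feature set with two tight clusters, each node's single nearest semantic neighbor is the other member of its cluster.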



Finally, Z_aggr serves as the gate to update all node representations, deriving the output node representations Z' ∈ R^{|V|×C}.



Figure 1: FLOPs trend of RGConv and GRMP under different relation numbers, evaluated on EurNet-T for image modeling.

Figure 2: Multi-range relational edges for image.

Figure 3: Multi-range relational edges for protein Cαs. Abbr., dist.: distance.

Figure 4: Medium-range edges built by EurNet-T (we use different colors for different selected target nodes).


Figure 5: Graphical illustration for node representation update in the GRMP layer. We specifically show the neighborhood aggregation and representation update procedure of the node denoted in red. Abbr., Multi.: multiply with; Rel.: relation; aggr.: aggregation.


Figure 6: The primary structure, secondary structure, tertiary structure and all alpha carbons of the single-chain insulin protein (ID in PDB (Berman et al., 2000): 2LWZ).

F.1 MORE EXPERIMENTAL SETUPS ON IMAGENET-1K CLASSIFICATION

In the following, we state the detailed model and training configurations of (1) training on ImageNet-1K from scratch and (2) pre-training on ImageNet-22K followed by ImageNet-1K fine-tuning. For training configurations, we mainly follow the standards set by Swin Transformer (Liu et al., 2021b) for fair comparison.

F.1.1 FROM-SCRATCH TRAINING ON IMAGENET-1K

Table 1: ImageNet-1K classification results. Throughput is measured on a V100 GPU. † denotes the model pre-trained on ImageNet-22K. 224^2 and 384^2 denote the image size. "↑384" means fine-tuning on 384×384 images for 30 epochs.

Table 2: COCO object detection and instance segmentation results with Mask R-CNN (He et al., 2017) on the 5K validation images. Two standard training schedules, i.e., the 1× schedule with 12 epochs and the 3× schedule with 36 epochs, are used for benchmarking. Detailed setups are stated in Appendix F.2.

Results. In Tab. 2, EurNet performs comparably to FocalNet (LRF) at the tiny and small model scales, and EurNet-B surpasses FocalNet (LRF)-B at the base model scale (better performance on all 12 metrics). The base-scale EurNet-B has [2, 2, 18, 2] modeling blocks (more than EurNet-T) and [128, 256, 512, 1024] feature channels (more than EurNet-S) for the four modeling stages. Therefore, larger message passing hops (achieved by more modeling blocks) coupled with larger model width favor EurNet's performance on high-resolution dense prediction tasks.

5.1.4 SEMANTIC SEGMENTATION ON ADE20K

Table 3: ADE20K semantic segmentation results with UperNet (Xiao et al., 2018).

Table 4: F_max results on the EC and GO protein function prediction benchmarks.

Table 5: Ablation study of multi-range edges on ImageNet-1K with EurNet-T.

Table 6: Ablation study of the multi-relational modeling layer on ImageNet-1K with EurNet-T.



The aggregated messages of the different relational groups are then scaled by per-relation scalar weights {α_r}.

Table 7: Detailed architectures of EurNet-T/S/B for ImageNet-1K classification (#parameters and FLOPs are computed at resolution 224×224). H × W: input image resolution; C: number of feature channels; γ: FFN's hidden dimension ratio; K: number of K-nearest neighbors for medium-range edges; Y: label set for classification. "T" denotes the tiny model; "S" denotes the small model; "B" denotes the base model.


Training configuration. An AdamW (Loshchilov & Hutter, 2017) optimizer (betas: [0.9, 0.999], weight decay: 0.05) is employed to train each EurNet model for 300 epochs. We set the batch size to 2048, the base learning rate to 0.002 and the gradient clipping norm to 5.0. A cosine learning rate scheduler increases the learning rate from 2.0 × 10^-6 to 0.002 in the first 20 warm-up epochs, and the learning rate is then decayed to 2.0 × 10^-5 in the remaining epochs with a cosine rate. The stochastic depth drop rates are set to 0.15, 0.3 and 0.5 for EurNet-T, EurNet-S and EurNet-B, respectively. We follow the augmentation functions and mixup strategies used in Swin Transformer. All experiments are conducted on 16 Tesla-V100-32GB GPUs.

F.1.2 IMAGENET-22K PRE-TRAINING AND IMAGENET-1K FINE-TUNING

Training configuration. For ImageNet-22K pre-training, we train EurNet-B with an AdamW optimizer (betas: [0.9, 0.999], weight decay: 0.05) for 90 epochs with batch size 4096 and image resolution 224 × 224. A cosine learning rate scheduler linearly increases the learning rate from 0 to 4.0 × 10^-3 in the first 5 warm-up epochs and then decays it to 1.0 × 10^-6 in the remaining epochs with a cosine rate. The stochastic depth drop rate is set to 0.1. All augmentation functions and mixup strategies follow Swin Transformer. Pre-training is performed on 64 Tesla-V100-32GB GPUs.

For ImageNet-1K fine-tuning, the pre-trained model is fine-tuned for 30 epochs with an AdamW optimizer (betas: [0.9, 0.999], weight decay: 1.0 × 10^-8). The cosine learning rate scheduler increases the learning rate from 8.0 × 10^-8 to 8.0 × 10^-5 in the first 5 warm-up epochs, and the learning rate is then decayed to 8.0 × 10^-7 in the remaining epochs with a cosine rate. The stochastic depth drop rate is set to 0.2. Both Mixup (Zhang et al., 2017) and CutMix (Yun et al., 2019) are disabled during fine-tuning, following FocalNet (Yang et al., 2022). Fine-tuning is performed on 16 Tesla-V100-32GB GPUs.
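The warm-up-then-cosine schedule used for from-scratch training can be sketched as a plain function (an illustrative sketch with the hyper-parameters stated above, not the actual training code):

```python
import math

def lr_at(epoch, total=300, warmup=20,
          start_lr=2.0e-6, base_lr=2.0e-3, final_lr=2.0e-5):
    """Warm up linearly from start_lr to base_lr over the first `warmup`
    epochs, then decay to final_lr with a cosine rate."""
    if epoch < warmup:
        return start_lr + (base_lr - start_lr) * epoch / warmup
    t = (epoch - warmup) / max(1, total - warmup)   # progress in [0, 1]
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * t))
```

The schedule starts at 2.0 × 10^-6, peaks at 0.002 at epoch 20, and decays to 2.0 × 10^-5 at epoch 300.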

F.1.3 THROUGHPUT COMPUTATION

We follow Swin Transformer to measure the inference throughput on a Tesla-V100-32GB GPU with batch size 128. We adopt graph checkpoints, i.e., cached graph structures, to speed up inference on images whose graphs have already been constructed. During inference, we add the short-range edges to the list of medium-range edges and merge their corresponding relations to further improve efficiency.

F.2 MORE EXPERIMENTAL SETUPS ON COCO OBJECT DETECTION

Model configurations. We use the EurNet-T, EurNet-S and EurNet-B pre-trained on ImageNet-1K as the backbones of Mask R-CNN (He et al., 2017). Specifically, we take the patch representations output by all four modeling stages as the inputs of the Feature Pyramid Network (FPN) (Lin et al., 2017). For medium-range edge construction on the high-resolution images of COCO, we select the semantic neighbors of each patch from a 112 × 112 dilated window (dilation ratio: 2) to reduce the computational cost. For long-range edge construction, the representations of per-patch global-context virtual nodes are computed by a stack of depth-wise 2D convolutions with an accumulative receptive field of 31, and these virtual nodes are linked to their corresponding patches.

Training configurations. We follow Swin Transformer (Liu et al., 2021b) and adopt a multi-scale training strategy, where the shorter side of an image is resized to within [480, 800] and the longer side is at most 1,333. An AdamW (Loshchilov & Hutter, 2017) optimizer (betas: [0.9, 0.999], weight decay: 0.05) with initial learning rate 1.0 × 10^-4 is employed for model training. In the 1× schedule with 12 total epochs, the learning rate is decayed at the 9th and 11th epochs with decay rate 0.1. In the 3× schedule with 36 total epochs, the learning rate is decayed at the 27th and 33rd epochs with decay rate 0.1. The stochastic depth drop rate is set to 0.1, 0.2, 0.3 in the 1× schedule and 0.25, 0.5, 0.5 in the 3× schedule for EurNet-T/S/B, respectively. All models are trained with batch size 8 on 8 Tesla-V100-32GB GPUs (i.e., one image per GPU). Our implementation is based on the mmdetection (Chen et al., 2019a) framework.

F.3 MORE EXPERIMENTAL SETUPS ON ADE20K SEMANTIC SEGMENTATION

Model configurations. The EurNet-T, EurNet-S and EurNet-B pre-trained on ImageNet-1K serve as the backbones of UperNet (Xiao et al., 2018) for semantic segmentation. The patch representations output by all four modeling stages serve as the inputs of the Feature Pyramid Network (FPN) (Lin et al., 2017). For medium-range edge construction, each patch is connected with its semantic neighbors from a 144 × 144 dilated window (dilation ratio: 2). For long-range edge construction, we use a stack of depth-wise 2D convolutions with an accumulative receptive field of 31 to compute the representations of per-patch global-context virtual nodes, and we connect these virtual nodes with their corresponding patches.

Training configurations. All input images are resized to the resolution 512 × 512. We adopt an AdamW (Loshchilov & Hutter, 2017) optimizer (betas: [0.9, 0.999], weight decay: 0.01) to train each model for 160K iterations with base learning rate 6.0 × 10^-5. All models are trained with batch size 16 on 8 Tesla-V100-32GB GPUs (i.e., two images per GPU). Our implementation is based on the mmsegmentation (Contributors, 2020) framework.

F.4 MORE EXPERIMENTAL SETUPS ON PROTEIN FUNCTION PREDICTION

Edge message passing. Zhang et al. (2022) propose to enhance GearNet with edge-level message passing, which captures the interactions between edges. To compare with the GearNet-Edge model enhanced in this way, we adapt the same edge message passing scheme to our EurNet. Specifically, based on the constructed multi-relational graph G = (V, E, R), we further construct a line graph (Harary & Norman, 1960) G_line. In this graph, each node v ∈ V_line corresponds to an edge in the original graph G. There will be an edge (u, v, r) between nodes u, v ∈ V_line if the corresponding edges of u and v are adjacent in the original graph, and the edge type r(u,v) denotes the angle between the corresponding edges of u and v in the original graph. Based on this multi-relational line graph, we employ the GRMP layer (Sec. 3.3) to propagate information between the nodes in G_line and thus between the edges in the original graph G. Readers are referred to Zhang et al. (2022) for more details. We name the EurNet equipped with such an edge message passing scheme EurNet-Edge.

Dataset details. Two standard protein function prediction benchmarks are used in our experiments:

• Enzyme Commission (EC) number prediction (Gligorijević et al., 2021) requires the model to predict the EC numbers of a protein based on its tertiary structure, where the EC numbers describe the protein's catalysis of biochemical reactions. This task involves the binary prediction of 538 different EC numbers, forming 538 binary classification problems. The dataset contains 15,550 training, 1,729 validation and 1,919 test proteins.

• Gene Ontology (GO) term prediction (Gligorijević et al., 2021) seeks to predict the GO terms owned by a protein based on its tertiary structure. This benchmark is further split into three branches based on three types of ontologies: biological process (BP), molecular function (MF) and cellular component (CC).
Each branch comprises multiple binary classification problems. The GO benchmark dataset contains 29,898 training, 3,322 validation and 3,415 test proteins.

Model configurations. The backbone architecture of EurNet is described in Sec. 4.2.2. On top of this backbone, we append a three-layer MLP with the architecture Linear(C_out, C_out) → ReLU → Linear(C_out, C_out) → ReLU → Linear(C_out, N_task) to predict the binary classification logits of all tasks simultaneously (C_out: the dimension of the output protein representation; N_task: the number of binary classification tasks). We employ the binary cross entropy loss for model optimization.

Training configurations. An AdamW (Loshchilov & Hutter, 2017) optimizer (betas: [0.9, 0.999], weight decay: 0) is utilized to train the model for 200 epochs. We adopt a cosine learning rate scheduler to linearly increase the learning rate from 1.0 × 10^-7 to 1.0 × 10^-4, and the learning rate is then decayed to 1.0 × 10^-6 in the remaining epochs with a cosine rate. All models are trained with batch size 16 on 4 Tesla-V100-32GB GPUs (i.e., four proteins per GPU).

G EXPERIMENTS ON KNOWLEDGE GRAPH COMPLETION

Table 8: Performance comparison on knowledge graph completion benchmarks. "↓" denotes that the metric is the lower the better; "↑" denotes that the metric is the higher the better.

Training and evaluation. For model training, we follow the default setting of the TorchDrug library (Zhu et al., 2022) to sample 32 negative triplets for each positive triplet and perform binary classification with the binary cross entropy loss. On both knowledge graphs, EurNet is trained for 20 epochs by an Adam optimizer with learning rate 5.0 × 10^-3 and batch size 16. Model training is performed on 4 Tesla-V100-32GB GPUs. For evaluation, we follow previous works (Vashishth et al., 2019; Zhu et al., 2021) and report the mean rank (MR), mean reciprocal rank (MRR) and HITS at N (H@N) for knowledge graph completion.

Baselines. We compare the proposed EurNet with four classical knowledge graph embedding methods, i.e., TransE (Bordes et al., 2013), DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016) and RotatE (Sun et al., 2019), and with two typical relational GNNs, i.e., RGCN (Schlichtkrull et al., 2018) and CompGCN (Vashishth et al., 2019).

Results. We present the performance of EurNet and the baselines in Tab. 8. EurNet clearly outperforms both the embedding-based and GNN baselines on all metrics of the two datasets. Although knowledge graphs contain no spatial information, they are representative multi-relational graphs and thus a good test field for evaluating the capacity of relational GNNs. The superior performance of EurNet on these benchmarks demonstrates the effectiveness of the GRMP layer in modeling the complex relational patterns in knowledge graphs.
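The ranking metrics reported above (MR, MRR, H@N) can be computed from the per-triplet ranks of the true entity, as in the following sketch (an illustrative helper, not TorchDrug's implementation):

```python
def ranking_metrics(ranks, hits_at=(1, 3, 10)):
    """Mean rank, mean reciprocal rank and HITS@N from the rank of the
    true entity among all candidates for each test triplet."""
    m = float(len(ranks))
    metrics = {"MR": sum(ranks) / m,
               "MRR": sum(1.0 / r for r in ranks) / m}
    for n in hits_at:
        metrics[f"H@{n}"] = sum(1 for r in ranks if r <= n) / m
    return metrics
```

For example, ranks [1, 2, 10] over three test triplets yield MR = 13/3, MRR = 1.6/3, H@1 = 1/3 and H@10 = 1.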

H MORE ABLATION STUDY

H.1 EFFECT OF GRMP COMPONENTS

In Tab. 9, we analyze the key components of GRMP by substituting or removing each original component. These ablation studies are conducted on ImageNet-1K classification with EurNet-T.

Effect of gating mechanism. In the first row of the second block, we study the importance of the gating mechanism in GRMP by substituting the Hadamard product in step 5 of Eq. (3) with addition. After this change, the top-1 accuracy drops by 0.7%. This performance drop demonstrates that, with the separable graph convolution scheme in GRMP, the gating operation is more suitable than addition for node representation update (in contrast to the additive node representation update of RGConv in Eq. (1)), which shares similar insights with the modulation mechanism in FocalNet (Yang et al., 2022).

Effect of node-adaptive relation weighting. In the second row of the second block, we replace GRMP's node-adaptive relation weighting with a simple mean over all relations. This change leads to a 0.4% drop in accuracy. The relation weighting operation helps the GRMP layer adaptively aggregate the messages of different relations based on each node's status, which benefits model performance.

Effect of pre-layer and post-layer node-wise channel aggregation. In the third and fourth rows of the second block, we respectively evaluate the model variants without W_in and without W_out. Under these two settings, the model accuracy drops by 0.6% and 0.8%, respectively. It is therefore important to perform both pre-layer and post-layer node-wise channel aggregation in the GRMP layer.

(2) We then increase RGConv's hidden dimension to 512. At this dimension, RGConv achieves an F_max score of 0.767, comparable to GRMP's performance at the same dimension, while its throughput decreases by 3.2. Under comparable throughput, GRMP can use a hidden dimension of 592, which leads to a higher F_max score of 0.780. These results demonstrate that GRMP offers a better efficiency-performance trade-off than RGConv for protein structure modeling.

Image modeling sensitivity to semantic neighbor size. In Tab. 12, we report the performance of EurNet-T on ImageNet-1K classification under different semantic neighbor sizes for medium-range edge construction. Although some marginal improvements are observed with a larger neighborhood size (i.e., 18 or more neighbors), the image modeling performance on this task is in general insensitive to the semantic neighbor size. By default, EurNet-T uses 12 semantic neighbors (denoted by the gray cell in Tab. 12), which achieves performance comparable to the configurations using more semantic neighbors.
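The gating-vs-addition ablation above contrasts two node update rules, which can be sketched as follows (illustrative function names; toy one-node example):

```python
import numpy as np

def grmp_update_gated(Z, W_self, Z_aggr):
    """GRMP's step 5: the self-transformed features are modulated by the
    aggregated neighborhood representation via a Hadamard product."""
    return (Z @ W_self) * Z_aggr

def update_additive(Z, W_self, Z_aggr):
    """The ablated variant: RGConv-style additive update."""
    return Z @ W_self + Z_aggr

Z = np.array([[1.0, 2.0]])        # one node, two channels
W_self = np.eye(2)
Z_aggr = np.array([[3.0, 4.0]])   # aggregated neighborhood representation
```

With the identity self-transform, the gated update multiplies the two terms channel-wise, while the additive variant simply sums them.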

