EURNET: EFFICIENT MULTI-RANGE RELATIONAL MODELING OF SPATIAL MULTI-RELATIONAL DATA

Abstract

Modeling spatial relationships in data remains critical across many tasks, such as image classification, semantic segmentation and protein structure understanding. Previous works often use a unified solution like relative positional encoding. However, there exist different kinds of spatial relations, including short-range, medium-range and long-range relations, and modeling them separately can better capture the focus of different tasks on the multi-range relations (e.g., short-range relations can be important in instance segmentation, while long-range relations should be upweighted for semantic segmentation). In this work, we introduce EurNet for Efficient multi-range relational modeling. EurNet constructs a multi-relational graph, where each type of edge corresponds to short-, medium- or long-range spatial interactions. On the constructed graph, EurNet adopts a novel modeling layer, called gated relational message passing (GRMP), to propagate multi-relational information across the data. GRMP captures multiple relations within the data with little extra computational cost. We study EurNets in two important domains, image modeling and protein structure modeling. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation verify the gains of EurNet over the previous SoTA FocalNet. On the EC and GO protein function prediction benchmarks, EurNet consistently surpasses the previous SoTA GearNet. Our results demonstrate the strength of EurNets in modeling spatial multi-relational data from various domains.

1. INTRODUCTION

This work studies data that lie in 2D/3D space and incorporate interacting relations over different spatial ranges. A representative example is image data, where an object in the image can interact with adjacent objects via direct touch, and it can also interact with distantly relevant ones via gazing, waving hands or pointing. In protein science, the protein 3D structure is another typical example, in which different amino acids can interact in short range by peptide/hydrogen bonds, and they can also interact in medium and long ranges by hydrophobic interaction. We refer to this kind of data as spatial multi-relational data.

In various domains, many previous efforts have been made to model spatial multi-relational data. For image modeling, multi-head self-attention mechanisms (Dosovitskiy et al., 2020; Liu et al., 2021b), convolutional operations with large receptive fields (Ding et al., 2022; Yang et al., 2022) and MLPs for mixing full spatial information (Tolstikhin et al., 2021; Touvron et al., 2021a) are explored to capture multi-range spatial interactions within an image. For protein structure modeling, Zhang et al. (2022) build multiple groups of edges for different short-range interactions and employ relational graph convolution (Schlichtkrull et al., 2018) for multi-relational modeling. These works either treat different kinds of spatial relations (i.e., short-range, medium-range and long-range relations) implicitly (Tolstikhin et al., 2021; Yang et al., 2022) or handle them by a unified scheme like relative positional encoding (Dosovitskiy et al., 2020; Liu et al., 2021b). However, since the relative importance of these spatial relations could vary across tasks (e.g., the great importance of short-range relations in instance segmentation, and the increased importance of long-range relations in semantic segmentation), separately modeling each spatial relation is a better way to capture the focus of different tasks.

Such a separate modeling approach remains to be explored and, in particular, is expected to adapt efficiently to large data and model scales. To attain this goal, we propose EurNet for Efficient multi-range relational modeling. In general, EurNets are a series of relational graph neural networks equipped with graph construction layers, where relational edges are constructed by the layers to capture multi-range spatial interactions. When instantiated with different domain knowledge (e.g., computer vision or protein science), EurNets can be specialized to tackle important problems like image classification, image segmentation and protein function prediction.

To be specific, upon the raw data, EurNet first uses the graph construction layers to build different types of edges that respectively capture the short-, medium- and long-range spatial interactions within the data. For efficient multi-relational modeling over the constructed graph, we next introduce the gated relational message passing (GRMP) layer as the basic modeling module of EurNet. GRMP separately performs (1) relational message aggregation on each individual feature channel and (2) node-wise aggregation of different feature channels. Compared to the classical relational graph convolution (RGConv) (Schlichtkrull et al., 2018), GRMP enjoys lower computational cost when more relations are to be modeled, and thus can handle more types of spatial interactions given the same computational budget. EurNet also supports dynamic graph construction and multi-stage modeling, which are used in domains like image modeling.
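As a rough illustration of the graph construction and message passing steps described above, the following sketch types the edges of a small patch grid by spatial range and then applies one gated message passing step. It is a minimal sketch under stated assumptions, not the EurNet implementation: the distance bands (`short_max`, `medium_max`), the per-relation channel-wise weights `rel_scale`, the sigmoid gate and the cost estimate `approx_macs` are all hypothetical choices made for illustration.

```python
import numpy as np


def build_multirange_edges(height, width, short_max=1, medium_max=2):
    """Type every ordered patch pair on a grid by Chebyshev distance.

    The distance bands are hypothetical; they only illustrate the idea of
    one edge type per spatial range.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)            # (N, 2)
    dist = np.abs(coords[:, None] - coords[None, :]).max(axis=-1)  # (N, N)
    masks = [
        (dist >= 1) & (dist <= short_max),          # short range
        (dist > short_max) & (dist <= medium_max),  # medium range
        dist > medium_max,                          # long range
    ]
    # np.nonzero on an (N, N) mask yields a (src, dst) pair of index arrays.
    return [np.nonzero(m) for m in masks]


def grmp_layer(x, edges, rel_scale, w_gate, w_mix):
    """One gated relational message passing step (illustrative sketch).

    x:         (N, C) node features
    edges:     list of R (src, dst) index-array pairs, one per relation
    rel_scale: (R, C) per-relation, per-channel weights (assumed form)
    w_gate:    (C, C) weights of the assumed sigmoid gate
    w_mix:     (C, C) weights of the node-wise channel mixing
    """
    agg = np.zeros_like(x)
    for r, (src, dst) in enumerate(edges):
        # Step 1: relational aggregation on each feature channel separately;
        # no C x C matrix is needed per relation.
        np.add.at(agg, dst, x[src] * rel_scale[r])
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # assumed gating form
    # Step 2: node-wise aggregation across feature channels.
    return (gate * agg) @ w_mix


def approx_macs(n, c, edge_counts, layer):
    """Rough multiply-accumulate counts under the assumptions above."""
    if layer == "rgconv":
        # One C x C projection per relation for every node.
        return len(edge_counts) * n * c * c
    # GRMP sketch: per-edge channel-wise scaling, plus gate and mixing.
    return sum(edge_counts) * c + 2 * n * c * c


rng = np.random.default_rng(0)
channels = 8
edges = build_multirange_edges(4, 4)        # 16 patches, 3 relations
x = rng.normal(size=(16, channels))
out = grmp_layer(
    x,
    edges,
    rel_scale=rng.normal(size=(len(edges), channels)),
    w_gate=rng.normal(size=(channels, channels)),
    w_mix=rng.normal(size=(channels, channels)),
)
print(out.shape)
```

Under these assumed counts, adding a relation to the GRMP sketch only adds per-edge channel-wise work, whereas the RGConv-style count adds a full C × C projection per relation, which is the intuition behind the efficiency claim.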
Under this fixed-architecture comparison, EurNet consistently outperforms the SoTA GearNet on standard protein function prediction benchmarks in terms of protein-centric maximum F-score (EC: 0.768 vs. 0.730; GO-BP: 0.437 vs. 0.356; GO-MF: 0.563 vs. 0.503; GO-CC: 0.421 vs. 0.414). These performance improvements remain when edge-level message passing is involved. Our results demonstrate that EurNet could be a strong candidate for modeling spatial multi-relational data in various domains.
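The protein-centric maximum F-score can be sketched as follows. This assumes the commonly used CAFA-style definition (precision averaged over proteins with at least one prediction above the threshold, recall averaged over all proteins, maximum over thresholds); the function name, the threshold grid and the handling of proteins without labels are illustrative assumptions, not details taken from this paper.

```python
import numpy as np


def protein_centric_fmax(scores, labels, thresholds=None):
    """Maximum F-score over decision thresholds (assumed CAFA-style form).

    scores: (num_proteins, num_terms) predicted probabilities in [0, 1]
    labels: (num_proteins, num_terms) binary ground-truth annotations
    """
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)  # hypothetical grid
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_true = labels.sum(axis=1)
    best = 0.0
    for t in thresholds:
        pred = scores >= t
        n_pred = pred.sum(axis=1)
        tp = (pred & labels).sum(axis=1)
        has_pred = n_pred > 0
        if not has_pred.any():
            continue  # no protein has a prediction at this threshold
        # Precision: only proteins with at least one predicted term.
        precision = (tp[has_pred] / n_pred[has_pred]).mean()
        # Recall: all proteins; proteins without labels contribute 0 here.
        recall = (tp / np.maximum(n_true, 1)).mean()
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


labels = np.array([[1, 0, 1], [0, 1, 0]])
perfect = protein_centric_fmax(labels.astype(float), labels)
noisy = protein_centric_fmax(np.array([[0.9, 0.8, 0.1], [0.2, 0.3, 0.7]]), labels)
```

A perfect predictor reaches the maximum value of 1 under this definition, and any other predictor scores between 0 and 1.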

2. RELATED WORK

Multi-relational data modeling. Multi-relational data are ubiquitous in the real world, e.g., knowledge graphs (Toutanova & Chen, 2015) and customer-product networks (Li et al., 2014). To effectively model multiple types of relations/interactions, existing works have explored embedding-based methods (Bordes et al., 2013; Sun et al., 2019), multi-headed attention (Vaswani et al., 2017) and different relational graph neural networks (GNNs) (Schlichtkrull et al., 2018; Vashishth et al., 2019; Busbridge et al., 2019; Zhu et al., 2021). Previous relational GNNs mainly focus on model expressivity and parameter efficiency, and few works (Li et al., 2021) study the computational efficiency of relational modeling at scale. In addition, they can hardly model spatial multi-relational data whose relational linking structures at different spatial ranges are not originally given (e.g., image patches). EurNet is designed to model this kind of data in a computationally efficient way.

Image modeling. After the dominance of convolutional vision backbones (He et al., 2016; Tan & Le, 2019) in the 2010s, researchers have rethought the architectures for more effective image modeling in the 2020s. Vision Transformers (Dosovitskiy et al., 2020; Liu et al., 2021b; Wang et al., 2021) replace convolutions with the self-attention mechanism (Vaswani et al., 2017) to better capture non-local interactions and gain SoTA performance. Following such successes, modern convolutional architectures (Liu et al., 2022; Yang et al., 2022), all-MLP architectures (Tolstikhin et al., 2021; Touvron et al., 2021a) and vision GNNs (Han et al., 2022) are designed to aggregate long-range spatial context. Some earlier works (Chen et al., 2019b; Zhang et al., 2019; 2020) realize non-local modeling by graph convolution on fully-connected or dynamic graphs. By comparison, EurNet captures multi-range spatial interactions from a novel graph learning perspective, i.e., multi-relational modeling.

Protein structure modeling. A variety of protein structure encoders have been developed to acquire informative protein representations at different structural granularities, including residue-level structures (Gligorijević et al., 2021; Zhang et al., 2022), atom-level structures (Jing et al., 2021; Hermosilla et al., 2021) and protein surfaces (Gainza et al., 2020; Sverrisson et al., 2021). This work focuses on residue-level protein structure modeling. GearNet (Zhang et al., 2022) is a closely



We demonstrate EurNets in image and protein structure modeling. To model image patches at different granularities, we build EurNets with hierarchical graph construction layers and multiple modeling stages and derive a model series with increasing capacity, i.e., EurNet-T, EurNet-S and EurNet-B. These models enjoy comparable or better top-1 accuracy (82.3% vs. 82.3%; 83.6% vs. 83.5%; 84.1% vs. 83.9%) against the previous SoTA FocalNet (LRF) series (Yang et al., 2022) on ImageNet-1K classification (resolution: 224 × 224). Similar performance gains are preserved on COCO object detection and ADE20K semantic segmentation. To model protein alpha carbons, we build EurNet with a single-stage model architecture, as in GearNet (Zhang et al., 2022).

