ANTI-SYMMETRIC DGN: A STABLE ARCHITECTURE FOR DEEP GRAPH NETWORKS

Abstract

Deep Graph Networks (DGNs) currently dominate the research landscape of learning from graphs, due to their efficiency and ability to implement an adaptive message-passing scheme between the nodes. However, DGNs are typically limited in their ability to propagate and preserve long-term dependencies between nodes, i.e., they suffer from the over-squashing phenomenon. This reduces their effectiveness, since predictive problems may require capturing interactions at different, and possibly large, radii in order to be effectively solved. In this work, we present Anti-Symmetric Deep Graph Networks (A-DGNs), a framework for stable and non-dissipative DGN design, conceived through the lens of ordinary differential equations. We give theoretical proof that our method is stable and non-dissipative, leading to two key results: long-range information between nodes is preserved, and no gradient vanishing or explosion occurs in training. We empirically validate the proposed approach on several graph benchmarks, showing that A-DGN leads to improved performance and enables effective learning even when dozens of layers are used.

1. INTRODUCTION

Representation learning for graphs has become one of the most prominent fields in machine learning. Such popularity derives from the ubiquity of graphs. Indeed, graphs are an extremely powerful tool to represent systems of relations and interactions, and are extensively employed in many domains (Battaglia et al., 2016; Gilmer et al., 2017; Zitnik et al., 2018; Monti et al., 2019; Derrow-Pinion et al., 2021). For example, they can model social networks, molecular structures, protein-protein interaction networks, recommender systems, and traffic networks. The primary challenge in this field is how to capture and encode structural information in the learning model. Common methods in representation learning for graphs employ Deep Graph Networks (DGNs) (Bacciu et al., 2020; Wu et al., 2021). DGNs are a family of learning models that learn a mapping which compresses the complex relational information encoded in a graph into an information-rich feature vector reflecting both the topological and the label information of the original graph. As is common for deep neural networks, DGNs consist of multiple layers. Each layer updates the node representations by aggregating each node's previous state with those of its neighbors, following a message-passing paradigm. However, in some problems the exploitation of local interactions between nodes is not enough to learn representative embeddings. In such scenarios, the DGN needs to capture information concerning interactions between nodes that are far away in the graph, which is typically achieved by stacking multiple layers. A specific predictive problem typically needs to consider a specific range of node interactions in order to be effectively solved, hence requiring a specific, possibly large, number of DGN layers. Despite the progress made in recent years in the field, many of the proposed methods suffer from the over-squashing problem (Alon & Yahav, 2021) when the number of layers increases.
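The message-passing layer update described above can be sketched numerically. The following is a minimal, illustrative example (not the paper's architecture): each node's new state is computed from its previous state and a sum aggregation of its neighbors' states; the function and weight names are our own assumptions.

```python
import numpy as np

def message_passing_layer(X, adj, W_self, W_neigh):
    """One generic message-passing step: each node's new state combines
    its previous state with a sum aggregation of its neighbors' states.
    W_self and W_neigh stand in for learnable weight matrices."""
    agg = adj @ X  # row u of `agg` is the sum of the states of u's neighbors
    return np.tanh(X @ W_self + agg @ W_neigh)

# Toy graph: 3 nodes on a path 0-1-2, with 2-dimensional node states.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
X = np.eye(3, 2)  # initial node features
rng = np.random.default_rng(0)
W_self = rng.normal(size=(2, 2))
W_neigh = rng.normal(size=(2, 2))
H = message_passing_layer(X, adj, W_self, W_neigh)  # new node states, shape (3, 2)
```

Stacking L such layers lets information travel up to L hops, which is exactly why long-range tasks push toward deeper networks and, in turn, toward the over-squashing issue discussed next.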
Specifically, when increasing the number of layers to cater for longer-range interactions, one observes an excessive amplification or an annihilation of the information being routed to a node by the message-passing process to update its fixed-length encoding. As such, over-squashing prevents DGNs from learning long-range information. In this work, we present the Anti-Symmetric Deep Graph Network (A-DGN), a framework for effective long-term propagation of information in DGN architectures, designed through the lens of ordinary differential equations (ODEs). Leveraging the connections between ODEs and deep neural architectures, we provide theoretical conditions for realizing a stable and non-dissipative ODE system on graphs through the use of anti-symmetric weight matrices. The formulation of the A-DGN layer then results from the forward Euler discretization of the resulting graph ODE. Thanks to the properties enforced on the ODE, our framework preserves the long-term dependencies between nodes and prevents gradient explosion and vanishing. Interestingly, our analysis also paves the way for rethinking the formulation of standard DGNs as discrete versions of non-dissipative and stable ODEs on graphs. The key contributions of this work can be summarized as follows:
• We introduce A-DGN, a novel design scheme for deep graph networks stemming from an ODE formulation. Stability and non-dissipation are the main properties that characterize our method, allowing the preservation of long-term dependencies in the information flow.
• We theoretically prove that the employed ODE on graphs has stable and non-dissipative behavior. This result leads to the absence of exploding and vanishing gradient problems during training, typical of unstable and lossy systems.
• We conduct extensive experiments to demonstrate the benefits of our method. A-DGN can outperform classical DGNs over several datasets even when dozens of layers are used.
The rest of this paper is organized as follows.
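The stability argument above rests on a standard linear-algebra fact: a real anti-symmetric matrix A = W - W^T has purely imaginary eigenvalues, so the linearized dynamics dx/dt = Ax neither amplify nor dissipate the signal. A quick numerical check (illustrative only, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(42)
W = rng.normal(size=(5, 5))
A = W - W.T  # anti-symmetric by construction: A.T == -A
eigvals = np.linalg.eigvals(A)

# Every eigenvalue lies on the imaginary axis (real part ~ 0 up to
# floating-point error), so solutions of dx/dt = A x oscillate rather
# than exploding or decaying: the regime used to avoid over-squashing
# and exploding/vanishing gradients.
max_real_part = np.max(np.abs(eigvals.real))
```

Any weight matrix W can be made anti-symmetric this way, which is what makes the construction usable as a drop-in parameterization inside a layer.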
We introduce the A-DGN framework in Section 2 by theoretically proving its properties. In Section 3, we give an overview of the related work in the field of representation learning for graphs and continuous dynamic models. Afterwards, we provide the experimental assessment of our method in Section 4. Finally, Section 5 concludes the paper.

2. ANTI-SYMMETRIC DEEP GRAPH NETWORK

Recent advancements in the field of representation learning propose to treat neural network architectures as an ensemble of continuous (rather than discrete) layers, thereby drawing connections between deep neural networks and ordinary differential equations (ODEs) (Haber & Ruthotto, 2017; Chen et al., 2018). This connection can be pushed up to neural processing of graphs, as introduced in (Poli et al., 2019), by making a suitable ODE define the computation on a graph structure. We focus on static graphs, i.e., on structures described by G = (V, E), with V and E respectively denoting the fixed sets of nodes and edges. For each node u ∈ V we consider a state x_u(t) ∈ R^d, which provides a representation of the node u at time t. We can then define a Cauchy problem on graphs in terms of the following node-wise defined ODE:

∂x_u(t)/∂t = f_G(x_u(t)),    (1)

for time t ∈ [0, T], and subject to the initial condition x_u(0) = x_u^0 ∈ R^d. The dynamics of the node representations are described by the function f_G : R^d → R^d, while the initial condition x_u(0) can be interpreted as the initial configuration of the node's information, hence as the input for our computational model. As a consequence, the ODE defined in Equation 1 can be seen as a continuous information processing system over the graph, which, starting from the input configuration x_u(0), computes the final node representation (i.e., embedding) x_u(T). Notice that this process shares similarities with standard DGNs, in that it computes node states that can be used as an embedded representation of the graph and then used to feed a readout layer in a downstream task on graphs. The top of Figure 1 visually summarizes this concept, showing how nodes evolve following a specific graph ODE in the time span between 0 and a terminal time T > 0.
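To make the correspondence between the graph ODE and discrete layers concrete, the following sketch applies a forward Euler scheme, x_u(t + eps) = x_u(t) + eps * f_G(x_u(t)), where each Euler step plays the role of one layer. The particular form of f_G below (anti-symmetric self-transformation plus neighbor aggregation) is our own illustrative assumption, not the paper's exact layer definition.

```python
import numpy as np

def f_G(X, adj, W, V, b):
    """Illustrative node-wise dynamics: each node's state evolves through
    an anti-symmetric matrix (W - W.T) combined with a sum aggregation of
    its neighbors' states. The exact f_G is an assumption here."""
    A = W - W.T  # anti-symmetric, for stable and non-dissipative dynamics
    return np.tanh(X @ A.T + (adj @ X) @ V.T + b)

def euler_integrate(X0, adj, W, V, b, eps=0.1, steps=20):
    """Forward Euler discretization of the graph ODE: every step
    corresponds to one discrete layer of the resulting network."""
    X = X0.copy()
    for _ in range(steps):
        X = X + eps * f_G(X, adj, W, V, b)
    return X

# Toy graph: 4-node cycle, 3-dimensional node states.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [1, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(1)
X0 = rng.normal(size=(4, 3))          # input configuration x_u(0)
W, V = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
b = np.zeros(3)
X_T = euler_integrate(X0, adj, W, V, b)  # final embeddings x_u(T)
```

With `steps` Euler iterations of size `eps`, the terminal time is T = steps * eps; the rows of `X_T` are the node embeddings x_u(T) that would feed a readout layer.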

