DIFFORMER: SCALABLE (GRAPH) TRANSFORMERS INDUCED BY ENERGY CONSTRAINED DIFFUSION

Abstract

Real-world data generation often involves complex inter-dependencies among instances, violating the IID-data hypothesis of standard learning paradigms and posing a challenge for uncovering the geometric structures needed to learn desired instance representations. To this end, we introduce an energy constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states that progressively incorporate other instances' information through their interactions. The diffusion process is constrained by descent criteria w.r.t. a principled energy function that characterizes the global consistency of instance representations over latent structures. We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength among arbitrary instance pairs, which gives rise to a new class of neural encoders, dubbed DIFFORMER (diffusion-based Transformers), with two instantiations: a simple version with linear complexity for prohibitively large instance numbers, and an advanced version for learning complex structures. Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks, such as node classification on large graphs, semi-supervised image/text classification, and spatial-temporal dynamics prediction.

1. INTRODUCTION

Real-world data are generated from a convoluted interactive process whose underlying physical principles are often unknown. Such a nature violates the common hypothesis of standard representation learning paradigms that data are IID sampled. The challenge, however, is that in the absence of prior knowledge about ground-truth data generation, it can be practically prohibitive to build feasible methodology for uncovering data dependencies, despite their acknowledged significance. To address this issue, prior works, e.g., Wang et al. (2019), consider encoding the potential interactions between instance pairs, but this requires sufficient degrees of freedom, which significantly increases learning difficulty with limited labels (Fatemi et al., 2021) and hinders scalability to large systems (Wu et al., 2022b). Turning to a simpler problem setting where putative instance relations are instantiated as an observed graph, remarkable progress has been made in designing expressive architectures such as graph neural networks (GNNs) (Scarselli et al., 2008; Kipf & Welling, 2017; Velickovic et al., 2018; Wu et al., 2019; Chen et al., 2020a; Yang et al., 2021) for harnessing inter-connections between instances as a geometric prior (Bronstein et al., 2017). However, the observed relations can be incomplete or noisy, due to error-prone data collection, or generated by an artificial construction independent of downstream targets. The potential inconsistency between observations and the underlying data geometry would presumably introduce systematic bias between the structured representations of graph-based learning and the true data dependencies. While a plausible remedy is to learn more useful structures from the data, this unfortunately brings the previously mentioned obstacles to the fore.

Figure 1: An illustration of the general idea behind DIFFORMER, which takes a whole dataset (or a batch) of instances as input and encodes them into hidden states through a diffusion process aimed at minimizing a regularized energy. This design allows feature propagation among arbitrary instance pairs at each layer with optimal inter-connecting structures for informed prediction on each instance.

To resolve the dilemma, we propose a novel general-purpose encoder framework that uncovers data dependencies from observations (a dataset of partially labeled instances), proceeding via two-fold inspiration from physics as illustrated in Fig. 1. Our model is defined through feed-forward continuous dynamics (i.e., a PDE) that treat all the instances of a dataset as locations on a Riemannian manifold with latent structures, upon which the features of instances act as heat flowing over the underlying geometry (Hamzi & Owhadi, 2021). Such a diffusion model serves as an important inductive bias for leveraging global information from other instances to obtain more informative representations. Its major advantage lies in the flexibility of the diffusivity function, i.e., a measure of the rate at which information spreads (Rosenberg & Steven, 1997): we allow for feature propagation between arbitrary instance pairs at each layer and adaptively navigate this process by pairwise connectivity weights. Moreover, to guide the instance representations towards ideal constraints of internal consistency, we introduce a principled energy function that enforces layer-wise regularization on the evolutionary directions.
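To make this concrete, below is a minimal sketch (in PyTorch) of one layer of such an all-pair diffusion process: every instance's state is updated towards a diffusivity-weighted average of the states of all instances in the batch. The dot-product-softmax diffusivity, the module name, and the `step_size` parameter here are illustrative assumptions for exposition, not the exact form derived from the energy function in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffusionLayer(nn.Module):
    """One explicit-Euler step of an all-pair diffusion process (illustrative sketch)."""

    def __init__(self, dim, step_size=0.5):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.step_size = step_size  # diffusion step size, an assumed hyper-parameter

    def forward(self, z):
        # z: [N, d] hidden states of all N instances in the dataset/batch
        q, k = self.query(z), self.key(z)
        # pairwise diffusivity: a simple dot-product softmax (rows sum to 1),
        # standing in for the energy-derived weights discussed in the paper
        scores = q @ k.t() / (q.shape[-1] ** 0.5)      # [N, N]
        weights = F.softmax(scores, dim=-1)
        # move each state towards the diffusivity-weighted average of all states
        return (1.0 - self.step_size) * z + self.step_size * (weights @ z)
```

Stacking such layers amounts to running finite-difference iterations of the diffusion process; the energy constraint introduced next restricts how the pairwise weights may be chosen at each step.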
The energy function provides a complementary, macroscopic view of the desired instance representations: the diffusion is steered towards low-energy steady states that give rise to informed predictions on unlabeled data. As a justification for the tractability of this general methodology, our theory reveals an underlying equivalence between finite-difference iterations of the diffusion process and unfolding the minimization dynamics of an associated regularized energy. This result further suggests a closed-form optimal solution for the diffusivity function, which updates each instance's representation using those of all the other instances so as to guarantee a rigorous decrease of the global energy. Based on this, we also show that the energy constrained diffusion model serves as a principled perspective for unifying popular models such as MLP, GCN, and GAT, which can be viewed as special cases of our framework.

On top of the theory, we propose a new class of neural encoders, Diffusion-based Transformers (DIFFORMER), with two practical instantiations: a simple version with O(N) complexity (N being the number of instances) for computing all-pair interactions among instances, and a more expressive version that can learn complex latent structures. We empirically demonstrate the success of DIFFORMER on a diverse set of tasks. It outperforms SOTA approaches on semi-supervised node classification benchmarks and performs competitively on large-scale graphs. It also shows promising power for image/text classification with low label rates and for predicting spatial-temporal dynamics.
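To illustrate why an all-pair propagation step can be computed with complexity linear in N, the sketch below reorders the matrix products so that the N x N weight matrix is never materialized (the standard linear-attention trick). The non-negative feature map, module name, and `step_size` are assumptions made for this sketch rather than DIFFORMER's exact attention form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDiffusionLayer(nn.Module):
    """All-pair propagation without forming the N x N matrix (illustrative sketch)."""

    def __init__(self, dim, step_size=0.5):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.step_size = step_size  # assumed hyper-parameter

    def forward(self, z):
        # z: [N, d] hidden states of all N instances
        q = F.elu(self.query(z)) + 1.0   # non-negative feature maps (an assumed choice,
        k = F.elu(self.key(z)) + 1.0     # not necessarily DIFFORMER's exact attention)
        kv = k.t() @ z                                # [d, d], aggregated once over all instances
        denom = q @ k.sum(dim=0, keepdim=True).t()    # [N, 1] per-instance normalizer
        propagated = (q @ kv) / denom.clamp(min=1e-6) # [N, d] weighted average of all states
        # same explicit-Euler update as before, computed in O(N d^2) instead of O(N^2 d)
        return (1.0 - self.step_size) * z + self.step_size * propagated
```

The key point is that the aggregation `kv` and the normalizer are computed once over all N instances, so each layer costs O(N d^2) time and O(d^2) extra memory, rather than the O(N^2 d) required by explicit pairwise weights.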

2. RELATED WORK

Graph-based Semi-supervised Learning. Graph-based SSL (Kipf & Welling, 2017) aims to learn from partially labeled data, where instances are treated as nodes and their relations are given by a graph. The observed structure can be leveraged as regularization for learning representations (Belkin et al., 2006; Weston et al., 2012; Yang et al., 2016) or as an inductive bias of modern GNN architectures (Scarselli et al., 2008). However, there frequently exist situations where the observed structure is unavailable or unreliable (Franceschi et al., 2019; Jiang et al., 2019; Chen et al., 2020c; Fatemi et al., 2021; Lao et al., 2022), in which case the challenge remains how to uncover the underlying data dependencies.

Funding

Junchi Yan is also affiliated with Shanghai AI Lab. The work was in part supported by National Key Research and Development Program of China (2020AAA0107600), National Natural Science Foundation of China (62222607), and STCSM (22511105100).

Availability

The code is available at https://github.com/qitianwu/DIFFormer.

