DIFFORMER: SCALABLE (GRAPH) TRANSFORMERS INDUCED BY ENERGY-CONSTRAINED DIFFUSION

Abstract

Real-world data generation often involves complex inter-dependencies among instances, violating the IID-data hypothesis of standard learning paradigms and posing a challenge for uncovering the geometric structures needed to learn desired instance representations. To this end, we introduce an energy-constrained diffusion model which encodes a batch of instances from a dataset into evolutionary states that progressively incorporate other instances' information through their interactions. The diffusion process is constrained by descent criteria w.r.t. a principled energy function that characterizes the global consistency of instance representations over latent structures. We provide rigorous theory that implies closed-form optimal estimates for the pairwise diffusion strength between arbitrary instance pairs, which gives rise to a new class of neural encoders, dubbed DIFFORMER (diffusion-based Transformers), with two instantiations: a simple version with linear complexity for prohibitively large instance numbers, and an advanced version for learning complex structures. Experiments highlight the wide applicability of our model as a general-purpose encoder backbone with superior performance in various tasks, such as node classification on large graphs, semi-supervised image/text classification, and spatial-temporal dynamics prediction.
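To make the linear-complexity claim concrete, the following is a minimal NumPy sketch (not the authors' implementation; the function name and all details are assumptions) of how all-pair propagation with nonnegative pairwise weights of the form a_ij = 1 + <q_i, k_j> on L2-normalized queries/keys can be computed in O(n d^2) rather than O(n^2 d), by factoring the key-value summary out of the per-instance sum:

```python
import numpy as np

def l2_normalize(x, eps=1e-6):
    # Normalize each row to unit L2 norm (eps guards against zero rows).
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def simple_diffusion_attention(q, k, v):
    """Hypothetical sketch of linear all-pair propagation.

    Pairwise weight: a_ij = 1 + <q_i_hat, k_j_hat>  (always nonnegative).
    Output:          z_i  = sum_j a_ij * v_j / sum_j a_ij
    Computed without materializing the n x n weight matrix.
    """
    n = q.shape[0]
    qn, kn = l2_normalize(q), l2_normalize(k)
    kv = kn.T @ v                   # (d, d) summary shared by all instances
    numer = v.sum(0) + qn @ kv      # constant term broadcasts over rows
    denom = n + qn @ kn.sum(0)      # per-row normalizer, shape (n,)
    return numer / denom[:, None]
```

The factorization relies on the weights being an affine function of a dot product, so the sum over keys can be precomputed once; this is the standard kernelized-attention trick and is only a sketch of the kind of rearrangement that yields linear complexity.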

1. INTRODUCTION

Real-world data are generated by convoluted interactive processes whose underlying physical principles are often unknown. This violates the common hypothesis of standard representation learning paradigms, which assume that data are IID sampled. The challenge, however, is that in the absence of prior knowledge about ground-truth data generation, it can be practically prohibitive to build feasible methodology for uncovering data dependencies, despite their acknowledged significance. To address this issue, prior works, e.g., Wang et al. (2019), consider encoding the potential interactions between instance pairs, but this requires sufficient degrees of freedom that significantly increase the difficulty of learning from limited labels (Fatemi et al., 2021) and hinder scalability to large systems (Wu et al., 2022b). Turning to a simpler problem setting where putative instance relations are instantiated as an observed graph, remarkable progress has been made in designing expressive architectures such as graph neural networks (GNNs) (Scarselli et al., 2008; Kipf & Welling, 2017; Velickovic et al., 2018; Wu et al., 2019; Chen et al., 2020a; Yang et al., 2021) that harness inter-connections between instances as a geometric prior (Bronstein et al., 2017). However, the observed relations can be incomplete or noisy, due to error-prone data collection, or generated by an artificial construction independent of downstream targets. The potential inconsistency between observation and the underlying data geometry would presumably elicit systematic bias between the structured representations of graph-based learning and the true underlying data geometry.




Funding

Junchi Yan is also affiliated with Shanghai AI Lab. The work was in part supported by the National Key Research and Development Program of China (2020AAA0107600), the National Natural Science Foundation of China (62222607), and STCSM (22511105100).

Availability

The code is available at https://github.com/qitianwu/DIFFormer.

