DIGRESS: DISCRETE DENOISING DIFFUSION FOR GRAPH GENERATION

Abstract

This work introduces DiGress, a discrete denoising diffusion model for generating graphs with categorical node and edge attributes. Our model utilizes a discrete diffusion process that progressively edits graphs with noise, through the process of adding or removing edges and changing the categories. A graph transformer network is trained to revert this process, simplifying the problem of distribution learning over graphs into a sequence of node and edge classification tasks. We further improve sample quality by introducing a Markovian noise model that preserves the marginal distribution of node and edge types during diffusion, and by incorporating auxiliary graph-theoretic features. A procedure for conditioning the generation on graph-level features is also proposed. DiGress achieves state-of-the-art performance on molecular and non-molecular datasets, with up to 3x validity improvement on a planar graph dataset. It is also the first model to scale to the large GuacaMol dataset containing 1.3M drug-like molecules without the use of molecule-specific representations.

1. INTRODUCTION

Denoising diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020) form a powerful class of generative models. At a high level, these models are trained to denoise diffusion trajectories, and produce new samples by sampling noise and recursively denoising it. Diffusion models have been used successfully in a variety of settings, outperforming all other methods on image and video (Dhariwal & Nichol, 2021; Ho et al., 2022). These successes raise hope for building powerful models for graph generation, a task with diverse applications such as molecule design (Liu et al., 2018), traffic modeling (Yu & Gu, 2019), and code completion (Brockschmidt et al., 2019). However, generating graphs remains challenging due to their unordered nature and sparsity properties. Previous diffusion models for graphs proposed to embed the graphs in a continuous space and add Gaussian noise to the node features and graph adjacency matrix (Niu et al., 2020; Jo et al., 2022). This however destroys the graph's sparsity and creates complete noisy graphs for which structural information (such as connectivity or cycle counts) is not defined. As a result, continuous diffusion can make it difficult for the denoising network to capture the structural properties of the data. In this work, we propose DiGress, a discrete denoising diffusion model for generating graphs with categorical node and edge attributes. Our noise model is a Markov process consisting of successive graph edits (edge addition or deletion, node or edge category edit) that can occur independently on each node or edge. To invert this diffusion process, we train a graph transformer network to predict the clean graph from a noisy input. The resulting architecture is permutation equivariant and admits an evidence lower bound for likelihood estimation.
We then propose several algorithmic enhancements to DiGress: a noise model that preserves the marginal distribution of node and edge types during diffusion, a novel guidance procedure for conditioning graph generation on graph-level properties, and the augmentation of the denoising network's input with auxiliary structural and spectral features. These features, derived from the noisy graph, help overcome the limited representation power of graph neural networks (Xu et al., 2019). Their use is made possible by the discrete nature of our noise model, which, in contrast to Gaussian-based models, preserves sparsity in the noisy graphs. These improvements enhance the performance of DiGress on a wide range of graph generation tasks. Our experiments demonstrate that DiGress achieves state-of-the-art performance, generating a high rate of realistic graphs while maintaining a high degree of diversity and novelty. On the large MOSES and GuacaMol molecular datasets, which were previously too large for one-shot models, it notably matches the performance of autoregressive models trained using expert knowledge.

2. DIFFUSION MODELS

In this section, we introduce the key concepts of denoising diffusion models that are agnostic to the data modality. These models consist of two main components: a noise model and a denoising neural network. The noise model q progressively corrupts a data point x to create a sequence of increasingly noisy data points (z^1, ..., z^T). It has a Markovian structure: q(z^1, ..., z^T | x) = q(z^1 | x) ∏_{t=2}^T q(z^t | z^{t-1}). The denoising network ϕ_θ is trained to invert this process by predicting z^{t-1} from z^t. To generate new samples, noise is sampled from a prior distribution and then inverted by iterative application of the denoising network. While early models would directly predict z^{t-1} from z^t (Sohl-Dickstein et al., 2015), these models were difficult to train due to the dependence of z^{t-1} on the sampled diffusion trajectories. Ho et al. (2020) considerably improved performance by establishing a connection with score-based models (Song & Ermon, 2019). They showed that when ∫ q(z^{t-1} | z^t, x) dp_θ(x) is tractable, x can be used as the target of the denoising network, which removes an important source of label noise. For a diffusion model to be efficient, three properties are required:

1. The distribution q(z^t | x) should have a closed-form formula, to allow for parallel training on different time steps.
2. The posterior p_θ(z^{t-1} | z^t) = ∫ q(z^{t-1} | z^t, x) dp_θ(x) should have a closed-form expression, so that x can be used as the target of the neural network.
3. The limit distribution q_∞ = lim_{T→∞} q(z^T | x) should not depend on x, so that it can be used as a prior distribution for inference.

These properties are all satisfied when the noise is Gaussian. When the task requires modeling categorical data, Gaussian noise can still be used by embedding the data in a continuous space with a one-hot encoding of the categories (Niu et al., 2020; Jo et al., 2022).
We develop in Appendix A a graph generation model based on this principle and use it for ablation studies. However, Gaussian noise is a poor noise model for graphs, as it destroys sparsity as well as graph-theoretic notions such as connectivity. Discrete diffusion therefore seems more appropriate for graph generation tasks. For a categorical variable z taking one of d values (encoded as a one-hot row vector), the noise model is represented by transition matrices Q^t such that q(z^t | z^{t-1}) = z^{t-1} Q^t. As the process is Markovian, the transition matrix from x to z^t reads Q̄^t = Q^1 Q^2 ⋯ Q^t. As long as Q̄^t is precomputed or has a closed-form expression, the noisy states z^t can be built from x using q(z^t | x) = x Q̄^t without having to apply noise recursively (Property 1). The posterior distribution q(z^{t-1} | z^t, x) can also be computed in closed form using Bayes rule (Property 2): q(z^{t-1} | z^t, x) ∝ z^t (Q^t)′ ⊙ x Q̄^{t-1}, where ⊙ denotes a pointwise product and Q′ is the transpose of Q (derivation in Appendix D). Finally, the limit distribution of the noise model depends on the transition model. A common choice is the uniform transition Q^t = α^t I + (1 − α^t) 1_d 1_d′ / d, with α^t transitioning from 1 to 0. When lim_{t→∞} α^t = 0, q(z^t | x) converges to a uniform distribution independently of x (Property 3). During inference, the predicted distribution is combined with q(z^{t-1} | x, z^t) in order to compute p_θ(z^{t-1} | z^t) and sample a discrete z^{t-1} from this product of categorical distributions. The above framework satisfies all three properties in a setting that is inherently discrete. However, while it has been applied successfully to several data modalities, graphs pose unique challenges: they have varying sizes, permutation equivariance properties, and to this date no known tractable universal approximator. In the next sections, we therefore propose a new discrete diffusion model that addresses the specific challenges of graph generation.
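To make Properties 1 and 2 concrete, the closed-form marginal and posterior can be checked numerically for a single categorical variable under uniform transitions. This is a minimal sketch with made-up dimensions and a made-up schedule, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 10                      # categories and diffusion steps (hypothetical)

# Uniform transitions Q^t = alpha^t I + (1 - alpha^t) 11'/d
alphas = np.linspace(0.95, 0.05, T)
Qs = [a * np.eye(d) + (1 - a) * np.ones((d, d)) / d for a in alphas]

# Property 1: Qbar^t = Q^1 ... Q^t gives q(z^t | x) = x Qbar^t in one shot
Qbar = np.linalg.multi_dot(Qs)
x = np.eye(d)[1]                  # one-hot clean state
q_zT = x @ Qbar                   # nearly uniform once alpha is small

# Property 2: posterior q(z^{t-1} | z^t, x) ∝ z^t (Q^t)' ⊙ x Qbar^{t-1}
t = T - 1                         # 0-indexed: Qs[t] plays the role of Q^t
Qbar_prev = np.linalg.multi_dot(Qs[:t])
z_t = np.eye(d)[2]                # observed noisy state
unnorm = (z_t @ Qs[t].T) * (x @ Qbar_prev)
posterior = unnorm / unnorm.sum()
```

The same posterior can be obtained by brute-force Bayes rule, q(z^{t-1} = i | z^t, x) ∝ [Q^t]_{i, z^t} · q(z^{t-1} = i | x), which makes a convenient unit test.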

3. DISCRETE DENOISING DIFFUSION FOR GRAPH GENERATION (DIGRESS)

In this section, we present the Discrete Graph Denoising Diffusion model (DiGress) for graph generation. Our model handles graphs with categorical node and edge attributes, represented by the spaces X and E, with cardinalities a and b respectively. We use x_i to denote the attribute of node i and x_i ∈ ℝ^a to denote its one-hot encoding. These encodings are organised in a matrix X ∈ ℝ^{n×a}, where n is the number of nodes. Similarly, a tensor E ∈ ℝ^{n×n×b} groups the one-hot encodings e_ij of each edge, treating the absence of an edge as a particular edge type. We use A′ to denote the matrix transpose of A, while A^T is the value of A at time T.

3.1. DIFFUSION PROCESS AND INVERSE DENOISING ITERATIONS

Similarly to diffusion models for images, which apply noise independently on each pixel, we diffuse separately on each node and edge feature. As a result, the state space that we consider is not that of graphs (which would be too large to build a transition matrix), but only the node types X and edge types E. For any node (resp. edge), the transition probabilities are defined by the matrices [Q_X^t]_{ij} = q(x^t = j | x^{t-1} = i) and [Q_E^t]_{ij} = q(e^t = j | e^{t-1} = i). Adding noise to form G^t = (X^t, E^t) simply means sampling each node and edge type from a categorical distribution defined by q(G^t | G^{t-1}) = (X^{t-1} Q_X^t, E^{t-1} Q_E^t) and q(G^t | G) = (X Q̄_X^t, E Q̄_E^t), for Q̄_X^t = Q_X^1 ⋯ Q_X^t and Q̄_E^t = Q_E^1 ⋯ Q_E^t. When considering undirected graphs, we apply noise only to the upper-triangular part of E and then symmetrize the matrix. The second component of the DiGress model is the denoising neural network ϕ_θ parametrized by θ. It takes a noisy graph G^t = (X^t, E^t) as input and aims to predict the clean graph G, as illustrated in Figure 1. To train ϕ_θ, we optimize the cross-entropy loss l between the predicted probabilities p̂_G = (p̂^X, p̂^E) for each node and edge and the true graph G:

l(p̂_G, G) = Σ_{1≤i≤n} cross-entropy(x_i, p̂_i^X) + λ Σ_{1≤i,j≤n} cross-entropy(e_ij, p̂_ij^E),

where λ ∈ ℝ^+ controls the relative importance of nodes and edges. It is noteworthy that, unlike architectures such as VAEs, which solve complex distribution learning problems that sometimes require graph matching, our diffusion model simply solves classification tasks on each node and edge. Once the network is trained, it can be used to sample new graphs. To do so, we need to estimate the reverse diffusion iterations p_θ(G^{t-1} | G^t) using the network prediction p̂_G.
We model this distribution as a product over nodes and edges:

p_θ(G^{t-1} | G^t) = ∏_{1≤i≤n} p_θ(x_i^{t-1} | G^t) ∏_{1≤i,j≤n} p_θ(e_ij^{t-1} | G^t)

To compute each term, we marginalize over the network predictions:

p_θ(x_i^{t-1} | G^t) = ∫ p_θ(x_i^{t-1} | x_i, G^t) dp_θ(x_i | G^t) = Σ_{x∈X} p_θ(x_i^{t-1} | x_i = x, G^t) p̂_i^X(x),

where we choose p_θ(x_i^{t-1} | x_i = x, G^t) = q(x_i^{t-1} | x_i = x, x_i^t) if q(x_i^t | x_i = x) > 0, and 0 otherwise. Similarly, we have p_θ(e_ij^{t-1} | G^t) = Σ_{e∈E} p_θ(e_ij^{t-1} | e_ij = e, e_ij^t) p̂_ij^E(e). These distributions are used to sample a discrete G^{t-1} that will be the input of the denoising network at the next time step. These equations can also be used to compute an evidence lower bound on the likelihood, which allows for easy comparison between models. The computations are provided in Appendix C.
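In code, one reverse-diffusion term for a single node reduces to a weighted average of single-variable posteriors. A sketch under the same notation (dimensions are hypothetical; in DiGress, p_hat would come from the trained graph transformer):

```python
import numpy as np

def posterior(z_t, x, Q_t, Qbar_prev):
    """q(z^{t-1} | z^t, x) ∝ z^t (Q^t)' ⊙ x Qbar^{t-1}, with one-hot row vectors."""
    unnorm = (z_t @ Q_t.T) * (x @ Qbar_prev)
    return unnorm / unnorm.sum()

def reverse_step(p_hat, z_t, Q_t, Qbar_prev):
    """p_theta(z^{t-1} | G^t) = sum_x q(z^{t-1} | x, z^t) p_hat(x)."""
    d = len(p_hat)
    out = np.zeros(d)
    for x in range(d):
        if p_hat[x] > 0:            # marginalize over predicted clean states
            out += p_hat[x] * posterior(z_t, np.eye(d)[x], Q_t, Qbar_prev)
    return out
```

Sampling the discrete z^{t-1} is then a draw from the categorical distribution returned by reverse_step.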

3.2. DENOISING NETWORK PARAMETRIZATION

The denoising network takes a noisy graph G^t = (X, E) as input and outputs tensors X′ and E′ which represent the predicted distribution over clean graphs. To efficiently store information, our layers also manipulate graph-level features y. We chose to extend the graph transformer network proposed by Dwivedi & Bresson (2021), as attention mechanisms are a natural model for edge prediction. Our model is described in detail in Appendix B.1. At a high level, it first updates node features using self-attention, incorporating edge features and global features using FiLM layers (Perez et al., 2018). The edge features are then updated using the unnormalized attention scores, and the graph-level features using pooled node and edge features. Our transformer layers also feature residual connections and layer normalization. To incorporate time information, we normalize the timestep to [0, 1] and treat it as a global feature inside y. The overall memory and time complexity of our network is Θ(n²) per layer, due to the attention scores and the predictions for each edge.

3.3. EQUIVARIANCE PROPERTIES

Graphs are invariant to reorderings of their nodes, meaning that n! matrices can represent the same graph. To learn efficiently from such data, it is crucial to devise methods that do not require augmenting the data with random permutations. This implies that gradient updates should not change if the training data is permuted. To achieve this property, two components are needed: a permutation equivariant architecture and a permutation invariant loss. DiGress satisfies both properties.

Lemma 3.1. (Equivariance) The DiGress architecture is permutation equivariant: for any permutation π, ϕ_θ(π.G^t) = π.ϕ_θ(G^t).

Lemma 3.2. (Invariant loss) Any loss function l(p̂_G, G) that can be decomposed as Σ_i l_X(p̂_i^X, x_i) + Σ_{i,j} l_E(p̂_ij^E, e_ij) for two functions l_X and l_E computed respectively on each node and each edge is permutation invariant.

Lemma 3.2 shows that our model does not require matching the predicted and target graphs, which would be difficult and costly. This is because the diffusion process keeps track of the identity of the nodes at each step; it can also be interpreted as a physical process where points are distinguishable. Equivariance is however not sufficient for likelihood computation: in general, the likelihood of a graph is the sum of the likelihoods of all its permutations, which is intractable. To avoid this computation, we can make sure that the generated distribution is exchangeable, i.e., that all permutations of generated graphs are equally likely (Köhler et al., 2020).

Lemma 3.3. (Exchangeability) DiGress yields exchangeable distributions, i.e., it generates graphs with node features X and adjacency matrix A that satisfy P(X, A) = P(π′X, π′Aπ) for any permutation π.

4. IMPROVING DIGRESS WITH MARGINAL PROBABILITIES AND STRUCTURAL FEATURES

4.1. CHOICE OF THE NOISE MODEL

The choice of the Markov transition matrices (Q^t)_{t≤T} defining the graph edit probabilities is arbitrary, and it is a priori not clear what noise model will lead to the best performance. A common model is a uniform transition over the classes, Q^t = α^t I + (1 − α^t) 1_d 1_d′ / d, which leads to limit distributions q_X and q_E that are uniform over categories. Graphs are however usually sparse, meaning that the marginal distribution of edge types is far from uniform. Starting from uniform noise, we observe in Figure 2 that it takes many diffusion steps for the model to produce a sparse graph. To improve upon uniform transitions, we propose the following hypothesis: using a prior distribution which is close to the true data distribution makes training easier. This prior distribution cannot be chosen arbitrarily, as it needs to be permutation invariant to satisfy exchangeability (Lemma 3.3). A natural model for this distribution is therefore a product ∏_i u × ∏_{i,j} v of a single distribution u for all nodes and a single distribution v for all edges. We propose the following result (proved in Appendix D) to guide the choice of u and v:

Theorem 4.1. (Optimal prior distribution) Consider the class C = {∏_i u × ∏_{i,j} v, (u, v) ∈ P(X) × P(E)} of distributions over graphs that factorize as the product of a single distribution u over X for the nodes and a single distribution v over E for the edges. Let P be an arbitrary distribution over graphs (seen as a tensor of order n + n²) and m_X, m_E its marginal distributions of node and edge types. Then π_G = ∏_i m_X × ∏_{i,j} m_E is the orthogonal projection of P on C:

π_G ∈ arg min_{(u,v)∈C} ||P − ∏_{1≤i≤n} u × ∏_{1≤i,j≤n} v||₂²

This result means that to get a prior distribution q_X × q_E close to the true data distribution, we should define transition matrices such that every row of Q̄_X^T converges to m_X as T → ∞ (and similarly for edges).
To achieve this property, we propose to use

Q_X^t = α^t I + β^t 1_a m_X′ and Q_E^t = α^t I + β^t 1_b m_E′, with β^t = 1 − α^t.

With this model, the probability of jumping from a state i to a state j is proportional to the marginal probability of category j in the training set. Since (1 m′)² = 1 m′, we still have Q̄^t = ᾱ^t I + β̄^t 1 m′ for ᾱ^t = ∏_{τ=1}^t α^τ and β̄^t = 1 − ᾱ^t. We follow the popular cosine schedule ᾱ^t = cos(0.5π(t/T + s)/(1 + s))² with a small s. Experimentally, these marginal transitions improve over uniform transitions (Appendix F).
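The marginal transition matrices and the product property Q̄^t = ᾱ^t I + (1 − ᾱ^t) 1 m′ can be verified numerically. A sketch with a hypothetical edge-type marginal m:

```python
import numpy as np

def cosine_alpha_bar(t, T, s=0.008):
    """Cosine schedule ᾱ^t = cos(0.5π (t/T + s)/(1 + s))², normalized so ᾱ^0 = 1."""
    f = lambda u: np.cos(0.5 * np.pi * (u / T + s) / (1 + s)) ** 2
    return f(t) / f(0)

def marginal_Q(alpha, m):
    """Q = α I + (1 - α) 1 m': each row is a valid distribution when m sums to 1."""
    d = len(m)
    return alpha * np.eye(d) + (1 - alpha) * np.outer(np.ones(d), m)

T = 10
m = np.array([0.8, 0.15, 0.05])            # sparse edge-type marginals (made up)
abar = np.array([cosine_alpha_bar(t, T) for t in range(T + 1)])

# Compose the per-step matrices Q^t, using alpha^t = abar^t / abar^{t-1}
Qbar = np.linalg.multi_dot([marginal_Q(abar[t] / abar[t - 1], m)
                            for t in range(1, T + 1)])
```

Because (1 m′)² = 1 m′, composing the per-step matrices collapses to a single matrix of the same form, and at t = T the rows of Q̄^T equal m, as the limit-distribution argument requires.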

4.2. STRUCTURAL FEATURES AUGMENTATION

Generative models for graphs inherit the limitations of graph neural networks, in particular their limited representation power (Xu et al., 2019; Morris et al., 2019). One example of this limitation is the difficulty for standard message-passing neural networks (MPNNs) to detect simple substructures such as cycles. DiGress operates on a discrete space and its noisy graphs are not complete, allowing for the computation of various graph descriptors at each diffusion step. These descriptors can be input to the network to aid in the denoising process, resulting in Algorithms 1 and 2 for training DiGress and sampling from it. The inclusion of these additional features experimentally improves performance, but they are not required for building a good model. The choice of which features to include and the computational complexity of their calculation should be considered, especially for larger graphs. The details of the features used in our experiments can be found in Appendix B.2.

Algorithm 1: Training DiGress

Input: A graph G = (X, E)
Sample t ∼ U(1, ..., T)
Sample G^t ∼ X Q̄_X^t × E Q̄_E^t   ▷ Sample a (discrete) noisy graph
z ← f(G^t, t)   ▷ Structural and spectral features
p̂^X, p̂^E ← ϕ_θ(G^t, z)   ▷ Forward pass
optimizer.step(l_CE(p̂^X, X) + λ l_CE(p̂^E, E))   ▷ Cross-entropy loss

Algorithm 2: Sampling from DiGress

Sample n from the training data distribution
Sample G^T ∼ q_X(n) × q_E(n)   ▷ Random graph
for t = T to 1 do
    z ← f(G^t, t)   ▷ Structural and spectral features
    p̂^X, p̂^E ← ϕ_θ(G^t, z)   ▷ Forward pass
    p_θ(x_i^{t-1} | G^t) ← Σ_x q(x_i^{t-1} | x_i = x, x_i^t) p̂_i^X(x) for i ∈ 1, ..., n   ▷ Posterior
    p_θ(e_ij^{t-1} | G^t) ← Σ_e q(e_ij^{t-1} | e_ij = e, e_ij^t) p̂_ij^E(e) for i, j ∈ 1, ..., n
    G^{t-1} ∼ ∏_i p_θ(x_i^{t-1} | G^t) ∏_{i,j} p_θ(e_ij^{t-1} | G^t)   ▷ Categorical distributions
end for
return G^0
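The sampling loop of Algorithm 2 can be sketched end to end for edge types only, replacing the trained network ϕ_θ with an untrained stand-in that always predicts the marginals. Everything here (sizes, schedule, marginals) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, b, T = 6, 2, 50                  # nodes, edge types (0 = "no edge"), steps
m = np.array([0.7, 0.3])            # marginal edge-type distribution

def Q(a):                           # marginal transition: a I + (1 - a) 1 m'
    return a * np.eye(b) + (1 - a) * np.outer(np.ones(b), m)

abar = np.cos(0.5 * np.pi * np.linspace(0, 1, T + 1)) ** 2

def phi_dummy(E):                   # stand-in for phi_theta(G^t, z): always
    return np.tile(m, (n, n, 1))    # predicts the marginals for every edge

E = rng.choice(b, size=(n, n), p=m)                     # G^T ~ prior q_E(n)
for t in range(T, 0, -1):
    Qt, Qb = Q(abar[t] / abar[t - 1]), Q(abar[t - 1])   # Q^t and Qbar^{t-1}
    p_hat = phi_dummy(E)
    for i in range(n):
        for j in range(i + 1, n):   # upper triangle only, then symmetrize
            # p_theta(e^{t-1}) = sum_x p_hat(x) q(e^{t-1} | e = x, e^t)
            post = sum(p_hat[i, j, x] * Qt[:, E[i, j]] * Qb[x, :]
                       for x in range(b))
            E[i, j] = E[j, i] = rng.choice(b, p=post / post.sum())
np.fill_diagonal(E, 0)              # no self-loops
```

With the trained network in place of phi_dummy, and node types denoised jointly with edges, this is exactly the reverse process of Algorithm 2.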

5. CONDITIONAL GENERATION

While good unconditional generation is a prerequisite, the ability to condition generation on graph-level properties is crucial for many applications. For example, in drug design, molecules that are easy to synthesize and have high activity on specific targets are of particular interest. One way to perform conditional generation is to train the denoising network using the target properties (Hoogeboom et al., 2022), but this requires retraining the model whenever the conditioning properties change. To overcome this limitation, we propose a new discrete guidance scheme inspired by the classifier guidance algorithm (Dhariwal & Nichol, 2021). Our method uses a regressor g_η which is trained to predict target properties y_G of a clean graph G from a noisy version of G: g_η(G^t) = ŷ. This regressor guides the unconditional diffusion model ϕ_θ by modulating the predicted distribution at each sampling step and pushing it towards graphs with the desired properties. The equations for the conditional denoising process are given by the following lemma:

Lemma 5.1. (Conditional reverse noising process) (Dhariwal & Nichol, 2021) Denote by q(· | y_G) the noising process conditioned on y_G and by q the unconditional noising process, and assume that q(G^t | G, y_G) = q(G^t | G). Then we have q(G^{t-1} | G^t, y_G) ∝ q(G^{t-1} | G^t) q(y_G | G^{t-1}).

While we would like to estimate q(G^{t-1} | G^t) q(y_G | G^{t-1}) by p_θ(G^{t-1} | G^t) p_η(y_G | G^{t-1}), p_η cannot be evaluated for all possible values of G^{t-1}. To overcome this issue, we view G as a continuous tensor of order n + n² (so that ∇_G can be defined) and use a first-order approximation:

log q(y_G | G^{t-1}) ≈ log q(y_G | G^t) + ⟨∇_G log q(y_G | G^t), G^{t-1} − G^t⟩
= c(G^t) + Σ_{1≤i≤n} ⟨∇_{x_i} log q(y_G | G^t), x_i^{t-1}⟩ + Σ_{1≤i,j≤n} ⟨∇_{e_ij} log q(y_G | G^t), e_ij^{t-1}⟩

for a function c that does not depend on G^{t-1}. We make the additional assumption that q(y_G | G^t) = N(g(G^t), σ_y I), where g is estimated by g_η, so that ∇_{G^t} log q_η(y | G^t) ∝ −∇_{G^t} ||ŷ − y_G||². The resulting procedure is presented in Algorithm 3.

Algorithm 3: Sampling from DiGress with discrete regressor guidance

Input: Unconditional model ϕ_θ, property regressor g_η, target y, guidance scale λ, graph size n
Sample G^T ∼ q_X(n) × q_E(n)   ▷ Random graph
for t = T to 1 do
    z ← f(G^t, t)   ▷ Structural and spectral features
    p̂^X, p̂^E ← ϕ_θ(G^t, z)   ▷ Forward pass
    ŷ ← g_η(G^t)   ▷ Regressor model
    p_η(ŷ | G^{t-1}) ∝ exp(−λ ⟨∇_{G^t} ||ŷ − y||², G^{t-1}⟩)   ▷ Guidance distribution
    Sample G^{t-1} ∼ p_θ(G^{t-1} | G^t) p_η(ŷ | G^{t-1})   ▷ Reverse process
end for
return G^0

In addition to being conditioned on graph-level properties, our model can be used to extend an existing subgraph, a task called molecular scaffold extension in the drug discovery literature (Maziarz et al., 2022). In Appendix E, we explain how to do it and demonstrate it on a simple example.

7.1. GENERAL GRAPH GENERATION

We first evaluate DiGress on two benchmarks of non-molecular graphs: graphs drawn from a stochastic block model (SBM) and planar graphs. We evaluate the ability to correctly model various properties of these graphs, such as whether the generated graphs are statistically distinguishable from the SBM model or whether they are planar and connected. We refer to Appendix F for a description of the metrics. In Table 1, we observe that DiGress is able to capture the data distribution very effectively, with significant improvements over baselines on planar graphs. In contrast, our continuous model, ConGress, performs poorly on these relatively large graphs.

7.2. SMALL MOLECULE GENERATION

We then evaluate our model on the standard QM9 dataset (Wu et al., 2018), which contains molecules with up to 9 heavy atoms. We use a split of 100k molecules for training, 20k for validation and 13k for evaluating likelihood on a test set. We report the negative log-likelihood of our model, validity (measured by RDKit sanitization) and uniqueness over 10k molecules. Novelty results are discussed in Appendix F. 95% confidence intervals are reported based on five runs. Results are presented in Table 2. Since ConGress and DiGress both obtain close to perfect metrics on this dataset, we also perform an ablation study on a more challenging version of QM9 where hydrogens are explicitly modeled in Appendix F. It shows that the discrete framework is beneficial and that marginal transitions and auxiliary features further boost performance.

7.3. CONDITIONAL GENERATION

To measure the ability of DiGress to condition the generation on graph-level properties, we propose a conditional generation setting on QM9. We sample 100 molecules from the test set and retrieve their dipole moment µ and highest occupied molecular orbital (HOMO). The pairs (µ, HOMO) constitute the conditioning vector that we use to generate 10 molecules. To evaluate the ability of a model to condition correctly, we need to estimate the properties of the generated samples. To do so, we use RDKit (Landrum et al., 2006) to produce conformers of the generated graphs, and then Psi4 (Smith et al., 2020) to estimate the values of µ and HOMO. We report the mean absolute error between the targets and the estimated values for the generated molecules (Figure 4).

7.4. MOLECULE GENERATION AT SCALE

We finally evaluate our model on two much more challenging datasets containing more than a million molecules each: MOSES (Polykovskiy et al., 2020), which contains small drug-like molecules, and GuacaMol (Brown et al., 2019), which contains larger molecules. DiGress is, to our knowledge, the first one-shot generative model that is not based on molecular fragments and that scales to datasets of this size. The metrics used, as well as additional experiments, are presented in Appendix F. For MOSES, the reported scores for FCD, SNN, and Scaffold similarity are computed on the dataset made of separate scaffolds, which measures the ability of the networks to predict new ring structures. Results are presented in Tables 3 and 4: they show that DiGress does not yet match the performance of SMILES and fragment-based methods, but performs on par with GraphInvent, an autoregressive model fine-tuned using chemistry software and reinforcement learning. DiGress thus bridges the important gap between one-shot methods and the autoregressive models that previously prevailed.

8. CONCLUSION

We proposed DiGress, a denoising diffusion model for graph generation that operates on a discrete space. DiGress outperforms existing one-shot generation methods and scales to larger molecular datasets, reaching the performance of autoregressive models trained using expert knowledge.

A CONTINUOUS GRAPH DENOISING DIFFUSION MODEL (CONGRESS)

In this section, we present a diffusion model for graphs that uses Gaussian noise rather than a discrete diffusion process. Its denoising network is the same as that of our discrete model. Our goal is to show that the better performance obtained with DiGress is not only due to the neural network design, but also to the discrete process itself.

A.1 DIFFUSION PROCESS

Consider a graph G = (X, E). Similarly to the discrete diffusion model, this diffusion process adds noise independently on each node and each edge, but this time the noise is Gaussian:

q(X^t | X^{t-1}) = N(α^{t|t-1} X^{t-1}, (σ^{t|t-1})² I) and q(E^t | E^{t-1}) = N(α^{t|t-1} E^{t-1}, (σ^{t|t-1})² I)

This process can equivalently be written

q(X^t | X) = N(α^t X, (σ^t)² I) and q(E^t | E) = N(α^t E, (σ^t)² I),

where α^{t|t-1} = α^t / α^{t-1} and (σ^{t|t-1})² = (σ^t)² − (α^{t|t-1})² (σ^{t-1})². The variance is chosen as (σ^t)² = 1 − (α^t)² in order to obtain a variance-preserving process (Kingma et al., 2021). Similarly to DiGress, when we consider undirected graphs, we only apply the noise to the upper-triangular part of E without the main diagonal, and then symmetrize the matrix. The true denoising process can be computed in closed form:

q(X^{t-1} | X, X^t) = N(µ^{t→t-1}(X, X^t), (σ^{t→t-1})² I) (and similarly for E),

with µ^{t→t-1}(X, X^t) = [α^{t|t-1} (σ^{t-1})² / (σ^t)²] X^t + [α^{t-1} (σ^{t|t-1})² / (σ^t)²] X and σ^{t→t-1} = σ^{t|t-1} σ^{t-1} / σ^t. As commonly done for Gaussian diffusion models, we train the denoising network to predict the noise components ε̂_X, ε̂_E instead of X̂ and Ê themselves (Ho et al., 2020). They relate as follows:

α^t X̂ = X^t − σ^t ε̂_X and α^t Ê = E^t − σ^t ε̂_E

To optimize the network, we minimize the mean squared error between the predicted noise and the true noise, which results in Algorithm 4 for training ConGress. Sampling is done similarly to standard Gaussian diffusion models, except for the last step: since continuous-valued features are obtained, they must be mapped back to categorical values in order to obtain a discrete graph. For this purpose, we take the argmax of X^0 and E^0 across node and edge types (Algorithm 5). Overall, ConGress is very close to the GDSS model proposed in Jo et al. (2022), as it is also a Gaussian-based diffusion model for graphs.
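The consistency between the one-step and multi-step formulations (and the variance-preserving choice) can be checked by simulation. A sketch with a hypothetical cosine-style schedule; none of these numbers come from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
abar = np.cos(0.5 * np.pi * np.arange(T + 1) / T) ** 2   # alpha^t (hypothetical)
sbar = np.sqrt(1 - abar ** 2)                            # sigma^t, variance preserving

t, x = 60, 1.7                                           # time step and a scalar "feature"
a_step = abar[t] / abar[t - 1]                           # alpha^{t|t-1}
s_step = np.sqrt(sbar[t] ** 2 - a_step ** 2 * sbar[t - 1] ** 2)  # sigma^{t|t-1}

N = 200_000
eps1, eps2, eps = rng.standard_normal((3, N))
z_prev = abar[t - 1] * x + sbar[t - 1] * eps1            # q(X^{t-1} | X)
z_two = a_step * z_prev + s_step * eps2                  # then q(X^t | X^{t-1})
z_one = abar[t] * x + sbar[t] * eps                      # q(X^t | X) directly
```

Both routes target the same Gaussian N(α^t x, (σ^t)²), so the empirical means and standard deviations of z_two and z_one agree up to Monte Carlo error.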
An important difference is that we define a diffusion process that is independent for each node and edge, while GDSS uses a more complex noise model that does not factorize. We observe empirically that a simple noise model does not hurt performance, since ConGress outperforms GDSS on QM9 (Table 2).

B.1 NETWORK PARAMETRIZATION

The parametrization of our denoising network is presented in Figure 5. It takes as input a noisy graph (X, E) and predicts a distribution over the clean graphs. We compute structural and spectral features in order to improve the network expressivity. Internally, each layer manipulates node features X and edge features E, but also graph-level features y. Each graph transformer layer is made of a graph attention module (presented in Figure 6), as well as fully connected layers and layer normalization. The FiLM and PNA blocks are defined by FiLM(M₁, M₂) = M₁ W₁ + (M₁ W₂) ⊙ M₂ + M₂ for learnable weight matrices W₁ and W₂, and PNA(X) = cat(max(X), min(X), mean(X), std(X)) W.
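The FiLM and PNA operations above can be sketched directly (random weights, made-up dimensions); note that PNA pooling is permutation invariant, as required by the equivariance arguments of Section 3.3:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dx, dy = 5, 8, 6                      # nodes, feature dim, pooled dim (made up)

W1, W2 = rng.standard_normal((2, dx, dx))
def film(M1, M2):
    """FiLM(M1, M2) = M1 W1 + (M1 W2) ⊙ M2 + M2 (feature-wise modulation)."""
    return M1 @ W1 + (M1 @ W2) * M2 + M2

W = rng.standard_normal((4 * dx, dy))
def pna(X):
    """PNA(X) = cat(max, min, mean, std of X over nodes) W: node -> graph pooling."""
    stats = np.concatenate([X.max(0), X.min(0), X.mean(0), X.std(0)])
    return stats @ W

X = rng.standard_normal((n, dx))
y = pna(X)                               # graph-level feature vector
```

Reordering the rows of X changes film(X, X) by the same reordering, while pna(X) is unchanged.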

B.2 AUXILIARY STRUCTURAL AND SPECTRAL FEATURES

The structural features that we use can be divided into two types: graph-theoretic (cycle counts and spectral features) and domain-specific (molecular features).

Cycles. Since message-passing neural networks are unable to detect cycles (Chen et al., 2020), we add cycle counts to our model. Because computing traversals would be impractical on GPUs (all the more as these features are recomputed at every diffusion step), we use closed-form formulas for cycles up to size 6. We build node features (how many k-cycles does this node belong to?) for up to 5-cycles, and graph-level features (how many k-cycles does this graph contain?) for up to k = 6. We use the following formulas, where d denotes the vector of node degrees and ||·||_F is the Frobenius norm:

X₃ = diag(A³) / 2
X₄ = (diag(A⁴) − d ⊙ (d − 1) − A d) / 2
X₅ = (diag(A⁵) − 2 diag(A³) ⊙ d − A diag(A³) + diag(A³)) / 2
y₃ = X₃′ 1_n / 3,  y₄ = X₄′ 1_n / 4,  y₅ = X₅′ 1_n / 5
y₆ = Tr(A⁶) − 3 Tr(A³ ⊙ A³) + 9 ||A(A² ⊙ A²)||_F − 6 ⟨diag(A²), diag(A⁴)⟩ + 6 Tr(A⁴) − 4 Tr(A³) + 4 Tr(A² ⊙ A² ⊙ A²) + 3 ||A³||_F − 12 Tr(A² ⊙ A²) + 4 Tr(A²)

Spectral features. We also add the option to incorporate spectral features into the model. While this requires an O(n³) eigendecomposition, we find that it is not a limiting factor for the graphs used in our experiments (which have up to 200 nodes). We first compute graph-level features that relate to the eigenvalues of the graph Laplacian: the number of connected components (given by the multiplicity of eigenvalue 0), as well as the first 5 nonzero eigenvalues. We then add node-level features derived from the graph eigenvectors: an estimation of the biggest connected component (using the eigenvectors associated with eigenvalue 0), as well as the first two eigenvectors associated with nonzero eigenvalues.

Molecular features. On molecular datasets, we also incorporate the current valency of each atom and the current molecular weight of the full molecule.
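The node-level cycle counts reduce to a few matrix powers. A sketch assuming a 0/1 symmetric adjacency matrix with zero diagonal (the graph-level counts y_k then follow by summing and dividing by k):

```python
import numpy as np

def cycle_features(A):
    """X3, X4, X5: number of k-cycles each node belongs to (k = 3, 4, 5)."""
    d = A.sum(axis=1)
    A2 = A @ A
    A3 = A2 @ A
    A4 = A3 @ A
    A5 = A4 @ A
    x3 = np.diag(A3) / 2
    x4 = (np.diag(A4) - d * (d - 1) - A @ d) / 2
    x5 = (np.diag(A5) - 2 * np.diag(A3) * d
          - A @ np.diag(A3) + np.diag(A3)) / 2
    return x3, x4, x5

def cycle_graph(k):                      # adjacency matrix of the k-cycle C_k
    A = np.zeros((k, k))
    idx = np.arange(k)
    A[idx, (idx + 1) % k] = A[(idx + 1) % k, idx] = 1
    return A
```

On C_k, every node belongs to exactly one k-cycle and to no smaller cycle, which makes a convenient unit test for the formulas.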
C LIKELIHOOD COMPUTATION

The graphical model associated with our problem is presented in Figure 7: the graph size is sampled from the training distribution and kept constant during diffusion. One can notice the similarity between this graphical model and hierarchical variational autoencoders (VAEs): diffusion models can in fact be interpreted as a particular instance of VAE in which the encoder (i.e., the diffusion process) is fixed. The likelihood of a graph G under the model writes:

log p_θ(G) = log Σ_{n∈ℕ} p(n) ∫ p(G^T | n) p_θ(G^{T-1}, ..., G^1 | G^T) p_θ(G | G^1) d(G^1, ..., G^T)
= log p(n_G) + log ∫ p(G^T | n_G) ∏_{t=2}^T p_θ(G^{t-1} | G^t) p_θ(G | G^1) d(G^1, ..., G^T)

As for VAEs, an evidence lower bound (ELBO) for this integral can be computed (Sohl-Dickstein et al., 2015; Kingma et al., 2021). It writes:

log p_θ(G) ≥ log p(n_G) − D_KL[q(G^T | G) || q_X(n_G) × q_E(n_G)]  (prior loss)
            − Σ_{t=2}^T L_t(G)  (diffusion loss)
            + E_{q(G^1|G)}[log p_θ(G | G^1)]  (reconstruction loss)

with L_t(G) = E_{q(G^t|G)} D_KL[q(G^{t-1} | G^t, G) || p_θ(G^{t-1} | G^t)]. All these terms can be estimated: log p(n_G) is computed using the frequencies of the number of nodes across the dataset, the prior loss and the diffusion loss are KL divergences between categorical distributions, and the reconstruction loss is computed from the predicted probabilities for the clean graph given the last noisy graph G^1.
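Each term of the diffusion loss compares two categorical distributions, so it reduces to elementary KL computations. A sketch with made-up probabilities:

```python
import numpy as np

def kl_categorical(p, q):
    """D_KL(p || q) for categorical distributions, summing over the support of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical true posterior q(x^{t-1} | x^t, x) and model p_theta(x^{t-1} | x^t)
q_post = np.array([0.70, 0.20, 0.10])
p_model = np.array([0.60, 0.25, 0.15])
L_t_term = kl_categorical(q_post, p_model)   # one node's contribution to L_t
```

Averaging such terms over nodes, edges and noise draws estimates L_t; the prior loss is computed the same way against q_X(n_G) × q_E(n_G).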

D PROOFS

True posterior distribution

We recall the derivation of the true posterior distribution q(z^{t-1} | z^t, x) ∝ z^t (Q^t)′ ⊙ x Q̄^{t-1}. By Bayes rule, we have:

q(z^{t-1} | z^t, x) ∝ q(z^t | z^{t-1}, x) q(z^{t-1} | x)

Since the noise is Markovian, q(z^t | z^{t-1}, x) = q(z^t | z^{t-1}). Viewed as a function of z^{t-1} (a vector with one entry per possible value of z^{t-1}), q(z^t | z^{t-1}) = z^t (Q^t)′, since [z^t (Q^t)′]_i = Σ_j z^t_j [Q^t]_{ij} = q(z^t | z^{t-1} = i). We also have q(z^{t-1} | x) = x Q̄^{t-1} by definition. Combining the terms, we obtain q(z^{t-1} | z^t, x) ∝ z^t (Q^t)′ ⊙ x Q̄^{t-1}, as desired.

Lemma 3.1: Equivariance

Proof. Consider a graph G with n nodes, and π ∈ S_n a permutation. π acts trivially on y (π.y = y), it acts on X as π.X = π′X, and on E as (π.E)_{ijk} = E_{π^{-1}(i), π^{-1}(j), k}. Let G^t = (X^t, E^t) be a noised graph, and (π.X^t, π.E^t) its permutation. Then:

• Our spectral and structural features are all permutation equivariant (for the node features) or invariant (for the graph-level features): f(π.G^t, t) = π.f(G^t, t).
• The self-attention architecture is permutation equivariant. The FiLM blocks are permutation equivariant, and the PNA pooling function is permutation invariant.
• Layer normalization is permutation equivariant.

DiGress is therefore a composition of permutation equivariant blocks. As a result, it is permutation equivariant: ϕ_θ(π.G^t, f(π.G^t, t)) = π.ϕ_θ(G^t, f(G^t, t)).

Lemma 3.2: Invariant loss

Proof. It is important that the loss function be the same for each node and each edge in order to guarantee invariance:

$$l(\pi.\hat{G}, \pi.G) = \sum_i l_X\big((\pi.\hat{X})_i, x_{\pi^{-1}(i)}\big) + \sum_{i,j} l_E\big((\pi.\hat{E})_{ij}, e_{\pi^{-1}(i),\, \pi^{-1}(j)}\big) = \sum_i l_X(\hat{X}_i, x_i) + \sum_{i,j} l_E(\hat{E}_{ij}, e_{ij}) = l(\hat{G}, G),$$

where the second equality follows by reindexing the sums.

Lemma 3.3: Exchangeability

Proof. The proof relies on a result of Xu et al. (2022): if a distribution $p(G^T)$ is invariant to the action of a group $\mathcal{G}$ and the transition probabilities $p(G^{t-1} \mid G^t)$ are equivariant, then $p(G^0)$ is invariant to the action of $\mathcal{G}$. We apply this result to the special case of permutations:
• The limit noise distribution is a product of i.i.d. distributions on each node and edge. It is therefore permutation invariant.
• The denoising neural network is permutation equivariant.
• The function $\hat{p}_\theta(G) \mapsto p_\theta(G^{t-1} \mid G^t) = \sum_G q(G^{t-1}, G \mid G^t)\, \hat{p}_\theta(G)$ defining the transition probabilities is equivariant to joint permutations of $\hat{p}_\theta(G)$ and $G^t$.
The conditions of Xu et al. (2022) are therefore satisfied, and the model satisfies $P(X, E) = P(\pi.X, \pi.E)$ for any permutation $\pi$.

Proof of Lemma D.1. We define $L(u, v) := \|P - u v'\|_2^2 = \sum_{i,j} (p_{ij} - u_i v_j)^2$ and differentiate this expression to obtain optimality conditions:

$$\frac{\partial L}{\partial u_i} = 0 \iff \sum_j (p_{ij} - u_i v_j)\, v_j = 0 \iff \sum_j p_{ij} v_j = u_i \sum_j v_j^2 \iff u_i = \frac{\sum_j p_{ij} v_j}{\sum_j v_j^2}.$$

Similarly, $\frac{\partial L}{\partial v_j} = 0 \iff v_j = \sum_i p_{ij} u_i / \sum_i u_i^2$. Since $u$ and $v$ are probability distributions, we have $\sum_{i,j} p_{ij} = 1$, $\sum_i u_i = 1$ and $\sum_j v_j = 1$. Writing $a_i = \sum_j p_{ij}$ and $b_j = \sum_i p_{ij}$ for the marginals, combining these equations gives:

$$u_i = \frac{\sum_j p_{ij} v_j}{\sum_j v_j^2} \implies \sum_i u_i = 1 = \frac{\sum_{i,j} p_{ij} v_j}{\sum_j v_j^2} \iff \sum_j v_j^2 = \sum_j \Big(\sum_i p_{ij}\Big) v_j = \sum_j b_j v_j.$$

So that:

$$u_{i_0} = \frac{\sum_j p_{i_0 j}\, v_j}{\sum_j b_j v_j} = \frac{\sum_j p_{i_0 j}\, \frac{\sum_i p_{ij} u_i}{\sum_i a_i u_i}}{\sum_j b_j\, \frac{\sum_i p_{ij} u_i}{\sum_i a_i u_i}} = \sum_j p_{i_0 j} = m^1_{i_0},$$

and similarly $v_{j_0} = m^2_{j_0}$. Conversely, $m^1$ and $m^2$ belong to the set of feasible solutions. We have thus proved that the product distribution closest (in $l_2$ distance) to the joint distribution of two variables is the product of its marginals.
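The last claim can also be checked numerically: for a random joint distribution, the product of its marginals should achieve a lower $l_2$ error than other feasible product distributions. A sanity-check sketch with arbitrarily chosen competitors:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((4, 3))
P /= P.sum()                       # a random joint distribution over two variables

m1 = P.sum(axis=1)                 # marginal of the first variable
m2 = P.sum(axis=0)                 # marginal of the second variable

def loss(u, v):
    """Squared l2 distance between P and the product distribution u v'."""
    return np.sum((P - np.outer(u, v)) ** 2)

base = loss(m1, m2)
# Two arbitrary competing product distributions from the feasible set.
uniform = loss(np.full(4, 0.25), np.full(3, 1 / 3))
other = loss(np.array([0.4, 0.3, 0.2, 0.1]), np.array([0.5, 0.3, 0.2]))
```

According to the lemma, `base` is no larger than the loss of any other feasible pair `(u, v)`.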
We need to extend this result to a product $\bigotimes_{i=1}^n u \otimes \bigotimes_{1 \le i,j \le n} v$ of a distribution for the nodes and a distribution for the edges. We now view $p$ as a tensor of dimension $a^n b^{n^2}$. We denote by $p_X$ the marginalization of this tensor onto the node dimensions ($p_X \in \mathbb{R}^{a^n}$), and by $p_E$ its marginalization onto the edge dimensions ($p_E \in \mathbb{R}^{b^{n \times n}}$). By flattening the first $n$ dimensions and the next $n^2$, $p$ can be viewed as a distribution over two variables (one distribution for the nodes and one for the edges). By application of our lemma, $p_X$ and $p_E$ are the optimal approximation of $p$. However, $p_X$ is a joint distribution over all nodes, and not the product $\bigotimes_{i=1}^n u$ of a single distribution $u$ for all nodes. Writing $p^X_i$ for the marginal distribution of node $i$, we then notice that

$$\Big\|\bigotimes_{i=1}^n u - p_X\Big\|_2^2 = \sum_i \|u\|^2 - 2 \sum_i \langle u, p^X_i \rangle + \sum_i \|p^X_i\|^2 = n \Big( \|u\|^2 - 2 \Big\langle u, \tfrac{1}{n} \textstyle\sum_i p^X_i \Big\rangle + \tfrac{1}{n} \textstyle\sum_i \|p^X_i\|^2 \Big) = n \Big\| u - \tfrac{1}{n} \textstyle\sum_i p^X_i \Big\|_2^2 + f(p_X)$$

for a function $f$ that does not depend on $u$. As $\frac{1}{n} \sum_i p^X_i$ is exactly the empirical distribution of node types, the optimal $u$ is the empirical distribution of node types, as desired. Overall, we have made two orthogonal projections: a projection from the distributions over graphs to the distributions over nodes, and a projection from the distributions over nodes to the product distributions $u \times \dots \times u$. Since the product distributions form a linear space contained in the distributions over nodes, these two projections are equivalent to a single orthogonal projection from the distributions over graphs to the product distributions over nodes. A similar reasoning holds for edges.

E SUBSTRUCTURE CONDITIONED GENERATION

Given a subgraph $S = (X^S, E^S)$ with $n_s$ nodes, we can condition the generation on $S$ by masking the generated node and edge feature tensors at each reverse iteration step (Lugmayr et al., 2022). As our model is permutation equivariant, it does not matter which entries are masked: we therefore choose the first $n_s$ ones. After sampling $G^{t-1}$, we update $X$ and $E$ using $X^{t-1} = M_X \odot X^S + (1 - M_X) \odot X^{t-1}$ and $E^{t-1} = M_E \odot E^S + (1 - M_E) \odot E^{t-1}$, where $M_X \in \mathbb{R}^{n \times a}$ and $M_E \in \mathbb{R}^{n \times n \times b}$ are masks indicating the first $n_s$ nodes. In Figure 8, we showcase an example for molecule generation: we follow the setting proposed by Maziarz et al. (2022) and generate molecules starting from a particular motif called 1,4-Dihydroquinoline.
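The masked update is straightforward to implement. A sketch with toy shapes (the helper name and dimensions are ours, not from the released code):

```python
import numpy as np

def condition_on_subgraph(X_t, E_t, X_s, E_s, n_s):
    """Overwrite the first n_s nodes (and the edges among them) with the fixed
    subgraph S after each reverse sampling step, i.e. the masked update
    X <- M_X ⊙ X_s + (1 - M_X) ⊙ X and E <- M_E ⊙ E_s + (1 - M_E) ⊙ E."""
    X_t = X_t.copy()
    E_t = E_t.copy()
    X_t[:n_s] = X_s               # mask selects the first n_s nodes
    E_t[:n_s, :n_s] = E_s         # and the edges among them
    return X_t, E_t

n, a, b, n_s = 5, 4, 3, 2
X_t = np.zeros((n, a))            # stand-in for the freshly sampled G^{t-1}
E_t = np.zeros((n, n, b))
X_s = np.eye(a)[:n_s]                              # one-hot types of the fixed nodes
E_s = np.eye(b)[np.zeros((n_s, n_s), dtype=int)]   # one-hot edge types among them
X_new, E_new = condition_on_subgraph(X_t, E_t, X_s, E_s, n_s)
```

Because the model is permutation equivariant, pinning the conditioning subgraph to the first $n_s$ indices loses no generality.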

F.1 ABSTRACT GRAPH GENERATION

Metrics The reported metrics measure the discrepancy between the distribution of certain graph statistics on a test set and on the generated graphs. The statistics considered are degree distributions, clustering coefficients, and orbit counts (the distribution of all substructures of size 4). We do not report raw numbers but ratios computed as follows: r = MMD(generated, test)² / MMD(training, test)². The denominator MMD(training, test)² is taken from the results table of SPECTRE (Martinkus et al., 2022). Note that what the authors report as MMD is actually MMD squared. Community-20 In Table 5, we also provide results for the smaller Community-20 dataset, which contains 200 graphs drawn from a stochastic block model with two communities. We observe that DiGress performs very well on this small dataset.
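The ratio can be computed from any MMD estimate. A minimal sketch with a Gaussian kernel on per-graph descriptor vectors (the histograms below are invented, and the benchmark's actual kernels and descriptors follow Martinkus et al. (2022) rather than this simplified version):

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))

def mmd_squared(A, B, sigma=1.0):
    """Biased estimate of MMD^2 between two sets of descriptor vectors."""
    k = lambda S, T: sum(gaussian_kernel(x, y, sigma) for x in S for y in T)
    return (k(A, A) / len(A) ** 2 + k(B, B) / len(B) ** 2
            - 2 * k(A, B) / (len(A) * len(B)))

# Invented per-graph degree histograms for generated, training and test graphs.
gen_set   = [[0.2, 0.5, 0.3], [0.1, 0.6, 0.3]]
test_set  = [[0.2, 0.6, 0.2], [0.3, 0.5, 0.2]]
train_set = [[0.25, 0.55, 0.2], [0.2, 0.6, 0.2]]

# r = MMD(generated, test)^2 / MMD(training, test)^2
ratio = mmd_squared(gen_set, test_set) / mmd_squared(train_set, test_set)
```

A ratio close to 1 means the generated graphs are about as far from the test set as the training graphs are.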

F.2 QM9

Metrics Because it is the metric reported in most papers, the validity metric we report is computed by building a molecule with RDKit and trying to obtain a valid SMILES string from it. As explained by Jo et al. (2022), this method is not perfect, because QM9 contains some charged molecules that would be considered invalid by it. They thus compute validity using a more relaxed definition that allows for some partial charges, which gives them a small advantage. Ablation study We perform an ablation study in order to highlight the role of marginal transitions and auxiliary features. In this setting, we also measure atom stability and molecule stability as defined in Hoogeboom et al. (2022). Results are presented in Table 6. Novelty We follow Vignac & Frossard (2021) and do not report novelty for QM9 in the main table. The reason is that, since QM9 is an exhaustive enumeration of the small molecules that satisfy a given set of constraints, generating molecules outside this set is not necessarily a good sign that the network has correctly captured the data distribution. For the interested reader, DiGress achieves on average a novelty of 33.4% on QM9 with implicit hydrogens, while ConGress obtains 40.0%.

F.3 MOSES AND GUACAMOL

Datasets For both MOSES and GuacaMol, we convert the generated graphs to SMILES using the code of Jo et al. (2022), which allows for some partial charges. We note that GuacaMol contains complex molecules that are difficult to process, for example because they contain formal charges or fused rings. As a result, mapping a training SMILES string to a graph and then back to a SMILES string fails for around 20% of the molecules. Even if our model is able to correctly model these graphs and generate similar ones, those graphs cannot be mapped to SMILES strings to be evaluated by GuacaMol. More efficient tools for processing complex molecules as graphs are therefore needed to truly achieve good performance on this dataset. Metrics Since MOSES and GuacaMol are benchmarking tools, they come with their own sets of metrics, which we use to report the results. We briefly describe these metrics: Validity measures the proportion of molecules that pass basic valency checks. Uniqueness measures the proportion of molecules that have different SMILES strings (which implies that they are non-isomorphic). Novelty measures the proportion of generated molecules that are not in the training set. The filter score measures the proportion of molecules that pass the same filters that were used to build the test set. The Fréchet ChemNet Distance (FCD) measures the similarity between molecules in the training set and in the test set using the embeddings learned by a neural network. SNN is the similarity to the nearest neighbor, as measured by Tanimoto distance. Scaffold similarity compares the frequencies of Bemis-Murcko scaffolds. The KL divergence compares the distributions of various physicochemical descriptors. Likelihood Since other methods did not report likelihoods for GuacaMol and MOSES, we did not include our NLL results in the tables either. We obtain a test NLL of 129.7 on QM9 with explicit hydrogens, 205.2 on MOSES (on the separate scaffold test set) and 308.1 on GuacaMol.
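Uniqueness and novelty reduce to simple set operations on canonical SMILES strings. A sketch with invented strings (string equality only implies molecular identity when the SMILES are canonicalized):

```python
def uniqueness(generated):
    """Fraction of generated SMILES strings that are pairwise distinct."""
    return len(set(generated)) / len(generated)

def novelty(generated, training):
    """Fraction of generated SMILES strings absent from the training set."""
    training = set(training)
    return sum(s not in training for s in generated) / len(generated)

generated = ["CCO", "CCO", "CCN", "c1ccccc1"]   # invented canonical SMILES
training = {"CCO", "CCC"}
uniq = uniqueness(generated)            # 3 distinct strings out of 4
nov = novelty(generated, training)      # "CCN" and "c1ccccc1" are unseen
```

The benchmark suites compute these quantities (along with FCD, SNN, and scaffold similarity) internally; this sketch only illustrates the definitions.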
Size extrapolation While the vast majority of molecules in QM9 have the same number of atoms, molecules in MOSES and GuacaMol have varying sizes. On these datasets, we would like to know whether DiGress can generate larger molecules than it has been trained on. This problem is usually called size extrapolation in the graph neural network literature. To measure the network's ability to extrapolate, we set the number of atoms to generate to $n_{max} + k$, where $n_{max}$ is the maximal graph size in the dataset and $k \in \{5, 10, 20\}$. We generate 24 batches of 256 molecules (6,144 molecules) in each setting and measure the proportion of valid and unique molecules; all of these molecules are novel, since they are larger than any molecule in the training set. The results are presented in Table 7. We observe an important discrepancy between the two datasets: DiGress extrapolates very well on GuacaMol, but completely fails on MOSES. This can be explained by the respective statistics of the datasets: MOSES features molecules that are relatively homogeneous in size, whereas GuacaMol features molecules that are much larger than the dataset average. The network is therefore trained on more diverse examples, which we conjecture is why it learns some size-invariance properties. The major difference in extrapolation ability that we observe clearly highlights the value of large and diverse datasets. We finally note that our denoising network was not designed to be size invariant, as it uses, for example, sum aggregation functions at each layer. Specific techniques such as SizeShiftReg (Buffelli et al., 2022) could be used to improve the size-extrapolation ability of DiGress if needed for downstream applications.

G SAMPLES FROM OUR MODEL

While there are some failure cases (disconnected molecules or invalid molecules), our model is the first non-autoregressive method that scales to these datasets, which are much more complex than the standard QM9.



Code is available at github.com/cvignac/DiGress.
https://pubchem.ncbi.nlm.nih.gov/compound/1_4-Dihydroquinoline



Figure 1: Overview of DiGress. The noise model is defined by Markov transition matrices $Q^t$ whose cumulative product is $\bar{Q}^t$. The denoising network $\phi_\theta$ learns to predict the clean graph from $G^t$. During inference, the predicted distribution is combined with $q(G^{t-1} \mid G, G^t)$ in order to compute $p_\theta(G^{t-1} \mid G^t)$ and sample a discrete $G^{t-1}$ from this product of categorical distributions.

Lemma 3.1 (Equivariance). DiGress is permutation equivariant. Lemma 3.2 (Invariant loss). Any loss function that decomposes into identical terms for each node and each edge (such as the cross-entropy loss used by DiGress) is invariant to joint permutations of the prediction and the target.

Figure 2: Reverse diffusion chains generated from a model trained on uniform transition noise (top) and marginal noise (bottom). When noisy graphs have the right marginals of node and edge types, they are closer to realistic graphs, which makes training easier.

such as cycles (Chen et al., 2020), which raises concerns about their ability to accurately capture the properties of the data distribution. While more powerful networks have been proposed such as (Maron et al., 2019; Vignac et al., 2020; Morris et al., 2022), they are significantly more costly and slower to train. Another strategy to overcome this limitation is to augment standard MPNNs with features that they cannot compute on their own. For example, Bouritsas et al. (2022) proposed adding counts of substructures of interest, and Beaini et al. (2021) proposed adding spectral features, which are known to capture important properties of graphs (Chung & Graham, 1997).

Samples from DiGress trained on SBM and planar graphs. We first evaluate DiGress on the benchmark proposed in Martinkus et al. (2022), which consists of two datasets of 200 graphs each: one drawn from the stochastic block model (with up to 200 nodes per graph), and another dataset of planar graphs (64 nodes per graph).

Figure 5: Architecture of the denoising

Figure 6: The self-attention module of our graph transformer network. It takes as input node features X, edge features E and global features y, and updates their representations. These features are then passed to normalization layers and a fully connected network, similarly to the standard transformer architecture. $\mathrm{FiLM}(M_1, M_2) = M_1 W_1 + (M_1 W_2) \odot M_2 + M_2$ for learnable weight matrices $W_1$ and $W_2$, and $\mathrm{PNA}(X) = \mathrm{cat}(\max(X), \min(X), \mathrm{mean}(X), \mathrm{std}(X))\, W$.
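The two formulas in the caption can be transcribed directly. The sketch below assumes, for simplicity, that $M_1$ and $M_2$ share the same feature dimension (the toy sizes are ours):

```python
import numpy as np

def film(M1, M2, W1, W2):
    """FiLM(M1, M2) = M1 W1 + (M1 W2) ⊙ M2 + M2 (feature-wise modulation)."""
    return M1 @ W1 + (M1 @ W2) * M2 + M2

def pna(X, W):
    """PNA(X) = cat(max(X), min(X), mean(X), std(X)) W — pooling over the node axis."""
    stats = np.concatenate([X.max(0), X.min(0), X.mean(0), X.std(0)])
    return stats @ W

rng = np.random.default_rng(0)
n, d = 5, 4                               # toy numbers of nodes and features
X = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
out_film = film(X, X, W1, W2)             # per-node output, shape (n, d)
out_pna = pna(X, rng.standard_normal((4 * d, d)))  # graph-level output, shape (d,)
```

FiLM lets one feature tensor modulate another multiplicatively and additively, while PNA aggregates node features into a single graph-level vector.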

Figure 7: The graphical model of DiGress and ConGress

Optimal prior distribution We first prove the following result: Lemma D.1. Let $p$ be a discrete distribution over two variables, represented by a matrix $P \in \mathbb{R}^{a \times b}$, and let $m^1$ and $m^2$ be the marginal distributions of $p$. Then

$$(m^1, m^2) \in \underset{u \ge 0,\ \sum_i u_i = 1,\ \ v \ge 0,\ \sum_j v_j = 1}{\arg\min} \|P - u v'\|_2^2.$$

Figure 8: An example of molecular scaffold extension. We sometimes observe long-range consistency issues in the generated samples, which is in line with the observations of (Lugmayr et al., 2022) for image data. A resampling strategy similar to theirs could be used to solve this issue.

Figure 9: Non curated samples generated by DiGress trained on planar graphs (top) and graphs drawn from the stochastic block model (bottom).

Figure 10: Non curated samples generated by DiGress, trained on QM9 with implicit hydrogens (top), and explicit hydrogens (bottom).

Figure 11: Non curated samples generated by DiGress trained on GuacaMol (top) and MOSES (bottom).

These models actually solve a point cloud generation task, as they generate atomic positions rather than graph structures, and thus require conformer data for training. On the contrary, Xu et al. (2022) and Jing et al. (2022) define diffusion models for conformation generation: they take a graph structure as input and output atomic coordinates. Apart from diffusion models, there has recently been a lot of interest in non-autoregressive graph generation using VAEs, GANs or normalizing flows (Zhu et al., 2022); Madhawa et al. (2019), Lippe & Gavves (2021) and Luo et al. (2021) are examples of discrete models using categorical normalizing flows. However, these methods have not yet matched the performance of autoregressive models (Liu et al., 2018; Liao et al., 2019; Mercado et al., 2021) and motif-based models (Jin et al., 2020; Maziarz et al., 2022), which can incorporate much more domain knowledge.

Molecule generation on QM9. Training time is the time needed to reach 99% validity. On small graphs, DiGress achieves similar results to the continuous model but is faster to train. (Mercado et al., 2021). We also build ConGress, a model that has the same denoising network as DiGress but uses Gaussian diffusion (Appendix A). Our results are presented without validity correction.

Molecule generation on MOSES. DiGress is the first one-shot graph model that scales to this dataset. While all graph-based methods except ours have hard-coded rules to ensure high validity, DiGress outperforms GraphInvent on most other metrics.

Molecule generation on GuacaMol. We report scores, so that higher is better for all metrics. While SMILES seem to be the most efficient molecular representation, DiGress is the first general graph generation method that achieves correct performance, as visible on the FCD score.

Results on the small Community-20 dataset.

Ablation study on QM9 with explicit hydrogens. Marginal transitions improve over uniform transitions, and spectral and structural features further boost performance.

Proportion of valid and unique molecules obtained when sampling larger molecules than the maximal size in the training set. Interestingly, DiGress performs very well on GuacaMol and poorly on MOSES. We hypothesize that this is due to GuacaMol being a more diverse dataset, which forces the network to learn to generate good molecules of all sizes. Columns: dataset statistics ($n_{min}$, $n_{average}$, $n_{max}$) and valid and unique (%) at $n_{max} + 5$, $n_{max} + 10$, $n_{max} + 20$.

ACKNOWLEDGMENTS

We thank Nikos Karalias, Éloi Alonso and Karolis Martinkus for their help and useful suggestions. Clément Vignac thanks the Swiss Data Science Center for supporting him through the PhD fellowship program (grant P18-11). Igor Krawczuk and Volkan Cevher acknowledge funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement n°725594 -time-data). This work is licensed under a Creative Commons "Attribution 3.0 Unported" license.

Algorithm 4: Training ConGress

▷ Reverse iterations
end
return argmax(X⁰), argmax(E⁰)

