GRAPH CONVOLUTIONAL NORMALIZING FLOWS FOR SEMI-SUPERVISED CLASSIFICATION & CLUSTERING

Anonymous

Abstract

Graph neural networks (GNNs) are discriminative models that directly model the class posterior p(y|x) for semi-supervised classification of graph data. While effective for prediction, as a representation learning approach, the node representations extracted from a GNN often miss information useful for effective clustering, because such information is not necessary for good classification. In this work, we replace a GNN layer by a combination of graph convolutions and normalizing flows under a Gaussian mixture representation space, which allows us to build a generative model that captures both the class conditional likelihood p(x|y) and the class prior p(y). The resulting neural network, GC-Flow, enjoys two benefits: it not only maintains the predictive power, because of the retention of graph convolutions, but also produces well-separated clusters in the representation space, owing to the structuring of the representation as a mixture of Gaussians. We demonstrate these benefits on a variety of benchmark data sets. Moreover, we show that additional parameterization, such as that of the adjacency matrix used for graph convolutions, yields further improvement in clustering.

1. INTRODUCTION

Semi-supervised learning (Zhu, 2008) refers to learning a classification model from a typically small amount of labeled data together with a possibly large amount of unlabeled data. The presence of the unlabeled data, together with additional assumptions (such as the manifold and smoothness assumptions), may significantly improve the accuracy of the learned classifier even when few labels are available. A typical example of such a model in the recent literature is the graph convolutional network (GCN) of Kipf & Welling (2017), which capitalizes on the graph structure underlying the data (considered as an extension of a discretized manifold) to achieve effective classification. GCN, together with other pioneering parameterized models, has spawned a flourishing literature of graph neural networks (GNNs), which excel at node classification (Zhou et al., 2020; Wu et al., 2021).

However, driven by the classification task, GCN and other GNNs may not produce node representations that carry useful information for goals other than classification. For example, the representations do not cluster well in some cases. Such a phenomenon is of no surprise. For instance, when one treats the penultimate activations as the data representations and uses the last dense layer as a linear classifier, the representations need only be close to linearly separable for accurate classification; they need not form well-separated clusters.

This observation leads to a natural question: can one build a representation model for graphs that not only is effective for classification but also unravels the inherent structure of the data for clustering? The answer is affirmative. One idea is to, rather than construct a discriminative model p(y|x) as all GNNs do, build a generative model p(x|y)p(y) whose class conditional likelihood is defined by explicitly modeling the representation space, for example as a mixture of well-separated unimodal distributions.
Indeed, the recently proposed FlowGMM model (Izmailov et al., 2020) uses a normalizing flow to map the distribution of input features to a Gaussian mixture, resulting in well-structured clusters. This model, however, is not designed for graphs, and it underperforms GNNs that leverage the graph structure for classification.

In this work, we present graph convolutional normalizing flows (GC-Flows), a generative model that not only classifies well but also yields node representations that capture the inherent structure of the data, thereby forming high-quality clusters. We can relate GC-Flows to both GCNs and FlowGMMs. On the one hand, GC-Flows pair each graph convolution with an invertible flow. Such a flow parameterization allows training the model by maximizing the likelihood that the data representations form a Gaussian mixture, mitigating the poor clustering behavior of GCNs. On the other hand, GC-Flows augment a usual normalizing flow model (such as FlowGMM), which is trained on independent data, with graph convolutions as an inductive bias in the parameterization, boosting the classification accuracy. In Figure 1, we visualize the nodes of a graph data set in the representation space using t-SNE. The visualization suggests that GC-Flow inherits the clustering effect of FlowGMM while being similarly accurate to GCN for classification.

A few key characteristics of GC-Flows are as follows:
1. A GC-Flow is a GNN, because, applied to graph data, it computes node representations by using the graph structure. In contrast, a FlowGMM is not a GNN.
2. A GC-Flow is a generative model, admitting FlowGMMs as a special case when the graph is absent.
3. As a generative model, the training loss of GC-Flows involves both labeled and unlabeled data, similar to FlowGMMs, while that of GNNs involves only the labeled data.

Significance. While classification is the dominant node-level task in the current GNN literature, the importance of clustering in capturing the inherent structure of data is undeniable. This work addresses a weakness of the current GNN literature: the separation of clusters. A Gaussian mixture representation space properly reflects this goal, and the normalizing flow is the vehicle that parameterizes the feature transformation so as to encourage the formation of separated Gaussians. It is a generative model that can return data densities. The likelihood-based training is organically tied to a generative model, whereas existing methods based on clustering or contrastive losses externally encourage the GNN to produce clustered representations, without a notion of densities.

2. RELATED WORK

Graph neural networks (GNNs) are machinery to produce node-level and graph-level representations, given graph-structured data as input (Zhou et al., 2020; Wu et al., 2021). A popular class of GNNs are message passing neural networks (MPNNs) (Gilmer et al., 2017), which treat information from the neighborhood of a node as messages and recursively update the node representation by aggregating the neighborhood messages and combining the result with the past node representation. Many popular GNNs can be considered a form of MPNNs, such as GG-NN (Li et al., 2016), GCN (Kipf & Welling, 2017), GraphSAGE (Hamilton et al., 2017), GAT (Veličković et al., 2018), and GIN (Xu et al., 2019).

Normalizing flows are invertible neural networks that transform a data distribution to a typically simple one, such as the normal distribution (Rezende & Mohamed, 2015; Kobyzev et al., 2021; Papamakarios et al., 2021). Because of invertibility, one may navigate between the input and output distributions for purposes such as estimating densities and sampling new data. The densities of the two distributions are related by the change-of-variable formula, which involves the Jacobian determinant of the flow. Computing the Jacobian determinant is costly in general; thus, many proposed networks exploit constrained structures, such as a triangular Jacobian pattern, to reduce the computational cost. Notable examples include NICE (Dinh et al., 2015), IAF (Kingma et al., 2016), MAF (Papamakarios et al., 2017), RealNVP (Dinh et al., 2017), Glow (Kingma & Dhariwal, 2018), and NSF (Durkan et al., 2019). While these network mappings are composed of discrete steps, another class of normalizing flows with continuous mappings has also been developed, based on parameterized differential equations (Chen et al., 2018b; Grathwohl et al., 2019).

Normalizing flows can be used for processing or creating graph-structured data in different ways.
For example, GraphNVP (Madhawa et al., 2019) and GraphAF (Shi et al., 2020) are graph generative models that use normalizing flows to generate a graph and its node features in a one-shot and a sequential manner, respectively. GANF (Dai & Chen, 2022) uses a directed acyclic graph to factorize the otherwise intractable joint distribution of time series data and uses the estimated data density to detect anomalies. GNF (Liu et al., 2019) is both a graph generative model and a graph neural network. In the latter functionality, GNF is relevant to our model, but its purpose is to classify rather than to cluster, and it thus lacks representation-space modeling and a training objective suitable for clustering. Furthermore, the architecture of GNF differs from ours in the role the graph plays. In our method, the graph adjacency matrix is part of the flow mapping, hence provoking a determinant calculation with respect to this matrix; in GNF, the graph is used in the parameterization of an affine coupling layer and incurs no determinant calculation. CGF (Deng et al., 2019)

3. PRELIMINARIES

In this section, we review a few key concepts and familiarize the reader with notations to be used throughout the paper.

3.1. NORMALIZING FLOW

Let x ∈ R^D be a D-dimensional random variable. A normalizing flow is a vector-valued invertible mapping f : R^D → R^D that normalizes the distribution of x to some base distribution, whose density is easy to evaluate. Let such a base distribution have density π(z), where z = f(x). With the change-of-variable formula, the density of x, p(x), can be computed as

p(x) = π(f(x)) |det ∇f(x)|,   (1)

where ∇f denotes the Jacobian of f. In general, such a flow f may be the composition of T constituent flows, all of which are invertible. In notation, we write f = f_T ∘ f_{T−1} ∘ ⋯ ∘ f_1, where f_i(x^(i−1)) = x^(i) for all i, with x^(0) ≡ x and x^(T) ≡ z. Then, the chain rule expresses the Jacobian determinant as a product of the Jacobian determinants of the constituent flows:

det ∇f(x) = ∏_{i=1}^{T} det ∇f_i(x^(i−1)).

In practical uses, the Jacobian determinant of each constituent flow needs to be easy to compute, so that the density p(x) in (1) can be evaluated. One example that serves such a purpose is the affine coupling layer of Dinh et al. (2017). For notational simplicity, we denote such a coupling layer by g(x) = y, which in effect computes

y_{1:d} = x_{1:d},   y_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d}),

where d = ⌊D/2⌋ and s, t : R^d → R^{D−d} are any neural networks. It is simple to see that the Jacobian is a triangular matrix, whose diagonal has value 1 in the first d entries and exp(s) in the remaining D − d entries. Hence, the Jacobian determinant is simply the product of the exponentials of the outputs of the s-network; that is, det ∇g(x) = ∏_{i=1}^{D−d} exp(s_i).
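To make the coupling layer concrete, here is a minimal NumPy sketch (ours, for illustration; the s- and t-"networks" are toy closed-form stand-ins rather than learned MLPs):

```python
import numpy as np

def coupling_forward(x, s, t):
    # Affine coupling (Dinh et al., 2017): y_{1:d} = x_{1:d},
    # y_{d+1:D} = x_{d+1:D} * exp(s(x_{1:d})) + t(x_{1:d}), with d = D // 2.
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    y2 = x2 * np.exp(s(x1)) + t(x1)
    # log|det ∇g| is the sum of the s-network outputs.
    log_det = np.sum(s(x1), axis=-1)
    return np.concatenate([x1, y2], axis=-1), log_det

def coupling_inverse(y, s, t):
    # The first d coordinates pass through unchanged, so s and t are
    # re-evaluated on y_{1:d} = x_{1:d}; no inversion of s or t is needed.
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    x2 = (y2 - t(y1)) * np.exp(-s(y1))
    return np.concatenate([y1, x2], axis=-1)

# Toy stand-ins for the s and t networks.
s = lambda x1: np.tanh(x1)
t = lambda x1: 0.5 * x1

x = np.array([0.3, -1.2, 0.7, 2.0])
y, log_det = coupling_forward(x, s, t)
x_rec = coupling_inverse(y, s, t)
```

Because y_{1:d} = x_{1:d}, the inverse recovers the remaining coordinates without inverting s or t, which is what makes the coupling design cheap in both directions.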

3.2. GAUSSIAN MIXTURE AND FLOWGMM

Different from a majority of work that takes the base distribution in a normalizing flow to be a single Gaussian, we consider it to be a Gaussian mixture, because this is a natural probabilistic model for clustering. Using k to index the mixture components (K in total), we express the base density π(z) as

π(z) = Σ_{k=1}^{K} φ_k N(z; µ_k, Σ_k), with N(z; µ_k, Σ_k) = exp(−½ (z − µ_k)^T Σ_k^{−1} (z − µ_k)) / ((2π)^{D/2} (det Σ_k)^{1/2}),   (2)

where φ_k ≥ 0 are mixture weights that sum to unity and µ_k and Σ_k are the mean vector and the covariance matrix of the k-th component, respectively.

A broad class of semi-supervised learning models specifies a generative process for each data point x by defining p(x|y)p(y), where p(y) is the prior class distribution and p(x|y) is the class conditional likelihood of the data. Then, by Bayes' Theorem, the class prediction model p(y|x) is proportional to p(x|y)p(y). Among them, FlowGMM (Izmailov et al., 2020) makes use of the flow transform z = f(x) and defines

p(x | y = k) = N(f(x); µ_k, Σ_k) |det ∇f(x)| with p(y = k) = φ_k.

This definition is valid because, marginalizing over the class variable y, one may verify that p(x) = Σ_y p(x|y)p(y) is consistent with the density formula (1) when the base distribution follows (2).
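To make the roles of the mixture density (2) and Bayes' Theorem concrete, the following NumPy sketch (our own illustration, not the authors' code) evaluates the base density and the resulting class posterior. Note that the Jacobian factor |det ∇f(x)| multiplies every class-conditional term and therefore cancels in p(y|x):

```python
import numpy as np

def log_gaussian(z, mu, cov):
    # log N(z; mu, cov) with a full covariance matrix.
    D = z.shape[0]
    diff = z - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (quad + logdet + D * np.log(2 * np.pi))

def gmm_log_density(z, phis, mus, covs):
    # log pi(z) = logsumexp_k [ log phi_k + log N(z; mu_k, cov_k) ]
    logs = np.array([np.log(p) + log_gaussian(z, m, c)
                     for p, m, c in zip(phis, mus, covs)])
    m = logs.max()
    return m + np.log(np.exp(logs - m).sum())

def class_posterior(z, phis, mus, covs):
    # p(y = k | x) ∝ phi_k N(f(x); mu_k, cov_k); the Jacobian cancels.
    logs = np.array([np.log(p) + log_gaussian(z, m, c)
                     for p, m, c in zip(phis, mus, covs)])
    w = np.exp(logs - logs.max())
    return w / w.sum()
```

A quick check: for z equidistant from two unit-covariance components, the posterior reduces to the mixture weights themselves.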

3.3. GRAPH CONVOLUTIONAL NETWORK

The GCNs (Kipf & Welling, 2017) are a class of parameterized neural network models that specify the probability of class y of a node x, p(y|x), collectively for all nodes x in a graph, without defining the data generation process as in FlowGMM. To this end, we let A ∈ R^{n×n} be the adjacency matrix of the graph, which has n nodes, and let X = [x_1, …, x_n]^T ∈ R^{n×D} be the input feature matrix, with x_i being the feature vector of the i-th node. We further let P ∈ R^{n×K} be the output probability matrix, where K is the number of classes and P_ik ≡ p(y = k | x_i). An L-layer GCN is written as

X^(i) = σ_i(Ā X^(i−1) W^(i−1)), i = 1, …, L,   (3)

where X ≡ X^(0) and P ≡ X^(L). Here, σ_i is an element-wise activation function, such as ReLU, for the intermediate layers i < L, while σ_L is the row-wise softmax activation function for the final layer. The matrices W^(i), i = 0, …, L − 1, are learnable parameters, and Ā denotes a certain normalized version of the adjacency matrix A. The standard definition of Ā for an undirected graph is Ā = D̃^{−1/2} Ã D̃^{−1/2}, where Ã = A + I and D̃ = diag(Σ_j Ã_ij), but we note that many other variants of Ā are used in practice as well (such as Ā = D̃^{−1} Ã).
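The standard normalization of the adjacency matrix takes only a few lines. Below is our own NumPy sketch, computing Ā = D̃^{−1/2} Ã D̃^{−1/2} for a toy three-node path graph:

```python
import numpy as np

def gcn_normalized_adjacency(A):
    # A_bar = D~^{-1/2} (A + I) D~^{-1/2}, where D~ is the degree
    # matrix of A + I (self-loops included).
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# 3-node path graph: 0 - 1 - 2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
A_bar = gcn_normalized_adjacency(A)
```

For this graph, the end nodes have degree 2 after adding self-loops and the middle node degree 3, so for instance Ā[0,0] = 1/2 and Ā[1,1] = 1/3.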

4. METHOD

The proposed graph convolutional normalizing flow (GC-Flow) extends a usual normalizing flow acting on data points separately to one that acts on all graph nodes collectively. Following the notation of Section 3.3, starting with X^(0) ≡ X, where X is the n × D input feature matrix for all n nodes in the graph, we define a GC-Flow F : R^{n×D} → R^{n×D} as a composition of T constituent flows, F = F_T ∘ F_{T−1} ∘ ⋯ ∘ F_1, where each constituent flow F_i computes

X^(i) = F_i(Ā X^(i−1)), i = 1, …, T.   (4)

The final representation of the nodes is the matrix Z ≡ X^(T).

GC-Flow is a normalizing flow. Similar to other normalizing flows, each constituent flow preserves the feature dimension; that is, each F_i is an R^{n×D} → R^{n×D} function. Furthermore, we let F_i act on each row of its input argument X̃^(i) ≡ Ā X^(i−1) separately and identically. In other words, from the functionality perspective, F_i can be equivalently replaced by some function f_i : R^{1×D} → R^{1×D} that computes x^(i)_j = f_i(x̃^(i)_j) for a node j. The main difference between GC-Flow and a usual flow is that the input argument of f_i contains not only the information of node j but also that of its neighbors. One may consider a usual flow to be the special case of GC-Flows with Ā = I (e.g., when the graph contains no edges).

Moreover, GC-Flow is a GNN. In particular, a constituent flow F_i of (4) resembles a GCN layer of (3) in its use of graph convolutions: multiplying the flow/layer input X^(i−1) by Ā. When Ā results from the normalization defined by GCN, this graph convolution approximates a low-pass filter (Kipf & Welling, 2017). In a sense, the GC-Flow architecture is more general than a GCN architecture, because one may interpret the dense layer (represented by the parameter matrix W^(i−1)) followed by the nonlinear activation σ_i in (3) as an example of the constituent flow F_i in (4).
However, such a conceptual connection needs a few adjustments to make a GC-Flow and a GCN mathematically equivalent, because W^(i−1) in GCN is not required to preserve the feature dimension and σ_i of GCN has a zero derivative on the negative axis, compromising invertibility. The nearest adjustment can be made by using the Sylvester flow (van den Berg et al., 2018), which adds a residual connection and uses an additional parameter matrix U^(i−1) to preserve the feature dimension:

X^(i) = X^(i−1) + σ_i(Ā X^(i−1) W^(i−1)) U^(i−1).

However, the Sylvester flow generally has limited capacity (Kobyzev et al., 2021), and a more sophisticated flow is instead used as F_i, such as the affine coupling layer introduced in Section 3.1.

A major distinction between GC-Flow and a usual GNN lies in the training objective. To encourage a good clustering structure in the representation Z, we use a maximum-likelihood objective over all graph nodes, because it is equivalent to maximizing the likelihood that Z forms a Gaussian mixture:

max L := (1 − λ)/|D_l| Σ_{(x, y=k) ∈ D_l} log p(x, y = k) + λ/|D_u| Σ_{x ∈ D_u} log p(x),   (5)

where D_l denotes the set of labeled nodes, D_u denotes the set of unlabeled nodes, and λ ∈ (0, 1) is a tunable hyperparameter balancing the labeled and unlabeled information.

It is useful to compare L with the usual (negative) cross-entropy loss for training GNNs. First, for training a usual GNN, no loss is incurred on the unlabeled nodes, because their likelihoods are not modeled. Second, for a labeled node x with true label k, the negative cross-entropy is log p(y = k | x), while the likelihood term over labeled data in (5) is a joint probability of x and y: log p(x, y = k) = log p(y = k | x) + log p(x). Fundamentally, GC-Flow belongs to the class of generative classification models, while GNNs belong to the class of discriminative models.
Under the Bayesian paradigm, the former models the class prior and the class conditional likelihood, while the latter models only the posterior. In what follows, we define the proposed probability model p(x|y)p(y) for a node x, so that the loss L can be computed and the label y can be predicted via argmax_k p(y = k | x). We first need an important lemma on the Jacobian determinant when a graph convolution is involved in the flow.
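The objective (5) is simple to assemble once the per-node log-likelihood terms are available. Below is a minimal NumPy sketch of the negated objective as a training loss (our own illustration; the function name and the assumption that log p(x_i, y_i = k) has already been computed for every node and class are ours, not the paper's):

```python
import numpy as np

def semi_supervised_loss(log_joint, log_marginal, labels, lam=0.5):
    """Negative of objective (5).

    log_joint:    (n, K) array, log p(x_i, y_i = k) for each node and class.
    log_marginal: (n,) array, log p(x_i) (logsumexp of log_joint over k).
    labels:       (n,) int array; class index for labeled nodes, -1 otherwise.
    lam:          the hyperparameter lambda in (5).
    """
    labeled = labels >= 0
    # Labeled part: joint log-likelihood at the true class.
    ll_labeled = log_joint[labeled, labels[labeled]].mean()
    # Unlabeled part: marginal log-likelihood.
    ll_unlabeled = log_marginal[~labeled].mean()
    return -((1 - lam) * ll_labeled + lam * ll_unlabeled)
```

Here λ plays exactly its role in (5): λ → 0 recovers a purely supervised likelihood, λ → 1 a purely unsupervised one.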

4.1. DETERMINANT LEMMA

The Jacobian determinant of each constituent flow F_i defined in (4) is needed for training a GC-Flow. The Jacobian is an nD × nD matrix, but it admits a special block structure that allows the determinant to be computed, after rearranging the QR factorization factors of the Jacobians of f_i, as a product of determinants of D matrices of size n × n. The following lemma summarizes this finding; the proof is given in the appendix. For notational convenience, we drop the flow index and use G to denote a generic constituent flow.

Lemma 1. Let X ∈ R^{n×D} and Ā ∈ R^{n×n}. Let Y = G(X̃), where X̃ ≡ Ā X and G : R^{n×D} → R^{n×D} acts on each row of the input matrix independently and identically. Let g : R^D → R^D be functionally equivalent to G; that is, y_i = g(x̃_i), where y_i and x̃_i are the i-th rows of Y and X̃, respectively. Then,

|det(dY/dX)| = |det Ā|^D ∏_{i=1}^{n} |det ∇g(x̃_i)|.

Restoring the flow index, the above lemma together with the chain rule gives the Jacobian determinant of the entire GC-Flow F:

|det ∇F(X)| = |det Ā|^{TD} ∏_{j=1}^{T} ∏_{i=1}^{n} |det ∇f_j(x̃^(j)_i)|,   (6)

where x̃^(j)_i denotes the i-th row of X̃^(j) ≡ Ā X^(j−1). Note that to maintain invertibility of the flow, the matrix Ā must be nonsingular. We will define the probability model for GC-Flow based on equality (6).
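Lemma 1 can be checked numerically. The sketch below is our own illustration (the generic nonsingular matrix and the toy coupling-style map g are stand-ins, not the paper's components); it compares a finite-difference estimate of the full nD × nD Jacobian determinant against the factored formula of the lemma:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D = 3, 2
A_bar = rng.normal(size=(n, n)) + 2 * np.eye(n)  # generic nonsingular stand-in

def g(x):
    # Toy invertible row map (coupling-style): y1 = x1, y2 = x2 * exp(x1).
    return np.array([x[0], x[1] * np.exp(x[0])])

def g_jac_det(x):
    # The Jacobian of g is triangular with diagonal (1, exp(x1)).
    return np.exp(x[0])

def flow(X_flat):
    # One constituent flow: graph convolution followed by the row-wise map.
    X = X_flat.reshape(n, D)
    X_conv = A_bar @ X
    return np.stack([g(row) for row in X_conv]).ravel()

# Finite-difference Jacobian of the full map.
X0 = rng.normal(size=n * D)
eps = 1e-6
J = np.zeros((n * D, n * D))
for col in range(n * D):
    e = np.zeros(n * D)
    e[col] = eps
    J[:, col] = (flow(X0 + e) - flow(X0 - e)) / (2 * eps)

X_conv = A_bar @ X0.reshape(n, D)
lemma_det = abs(np.linalg.det(A_bar)) ** D * np.prod(
    [g_jac_det(row) for row in X_conv])
full_det = abs(np.linalg.det(J))
```

The two quantities agree to finite-difference accuracy, matching |det Ā|^D ∏_i |det ∇g(x̃_i)|.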

4.2. PROBABILITY MODEL

Different from a usual normalizing flow, where the representation z_i of the i-th data point depends only on its input feature vector x_i, in a GC-Flow z_i depends on (a possibly substantial portion of) the entire node set X, because of the Ā-multiplication. We therefore use p(X) and π(Z) to denote the joint distribution of the node feature vectors and that of the representations, respectively. We still have, by the change-of-variable formula,

p(X) = π(Z) |det ∇F(X)|,   (7)

where the Jacobian determinant has been derived in (6). Exercising the freedom of modeling, and for convenience, we let π(Z) factor as π(Z) = π(z_1) π(z_2) ⋯ π(z_n), where each π(z_i) is an independent and identically distributed Gaussian mixture (2). Similarly, we assume the nodes to be independent to start with; that is, p(X) = p(x_1) p(x_2) ⋯ p(x_n).

For generative modeling, the task is to model the class prior p(y) and the class conditional likelihood p(x|y), such that the posterior prediction model p(y|x) can be easily obtained as proportional to p(x|y)p(y), by Bayes' Theorem. To this end, we define

p(x_i | y_i = k) := N(z_i; µ_k, Σ_k) |det Ā|^{TD/n} ∏_{j=1}^{T} |det ∇f_j(x̃^(j)_i)| and p(y_i = k) = φ_k.   (8)

Such a definition is self-consistent. First, marginalizing over the label y_i and using the Gaussian mixture definition (2) for π(z_i), we obtain the marginal likelihood

p(x_i) = π(z_i) |det Ā|^{TD/n} ∏_{j=1}^{T} |det ∇f_j(x̃^(j)_i)|.   (9)

Then, by the modeling of π(Z) and p(X), taking the product over all nodes and using the Jacobian determinant formula derived in (6), we exactly recover the density formula (7). We will use (8) and (9) to compute the labeled part and the unlabeled part of the loss (5), respectively.

The modeling of π(Z) as a product of the π(z_i)'s reflects independence, which may seem conceptually at odds with graph convolutions, where a node's representation depends on the information of nodes in its T-hop neighborhood.
However, nothing prevents the convolution results from being independent, just as a usual normalizing flow can decorrelate the input features and make each transformed feature independent when a standard normal output distribution is postulated. It is precisely the goal of making the z_i's independent that drives the search for the most probable GC-Flow.

4.3. TRAINING AND COSTS

Despite inheriting the generative characteristics of FlowGMMs (including the training loss), GC-Flows are by nature GNNs, because the graph convolution operation (the Ā-multiplication) involves a node's neighbor set when computing the output of a constituent flow for that node. Due to space limitations, we discuss the complications of training and inference owing to neighborhood explosion in Appendix C; these discussions share great similarities with the GNN case. Additionally, we compare the full-batch training costs of GC-Flow and GCN in Appendix D, which suggests that they are comparable and admit the same scaling behavior.

4.4. IMPROVING PERFORMANCE THROUGH PARAMETERIZING A

So far, we have treated Ā as the normalization of the graph adjacency matrix A defined by GCN (see Section 3.3). One convenience of doing so is that det Ā is a constant and can be safely omitted in the loss calculation. One may improve the quality of GC-Flow by introducing parameterizations to Ā. One approach, which we call GC-Flow-p, parameterizes the edge weights; this approach is similar to GAT (Veličković et al., 2018), which uses attention weights to redefine Ā. Another approach, which we call GC-Flow-l, learns Ā in its entirety without resorting to the (possibly unknown) graph structure; several approaches have been developed for this purpose. In a later experiment, we give examples of GC-Flow-p and GC-Flow-l (see Appendix B for details) and investigate their performance improvement over GC-Flow.

Note that the parameterization may lead to a different Ā for each constituent flow. Hence, we add the flow index (j) to Ā and rewrite the Jacobian determinant (6) as

|det ∇F(X)| = ∏_{j=1}^{T} ( |det Ā^(j)|^D ∏_{i=1}^{n} |det ∇f_j(x̃^(j)_i)| ).

The probability models (8) and (9) are correspondingly rewritten, respectively, as

p(x_i | y_i = k) := N(z_i; µ_k, Σ_k) ∏_{j=1}^{T} |det Ā^(j)|^{D/n} |det ∇f_j(x̃^(j)_i)| and p(y_i = k) = φ_k,

p(x_i) = π(z_i) ∏_{j=1}^{T} |det Ā^(j)|^{D/n} |det ∇f_j(x̃^(j)_i)|.

These two formulas substitute for the labeled and unlabeled parts of the loss (5), respectively.

5. EXPERIMENTS

In this section, we conduct a comprehensive set of experiments to evaluate the performance of GC-Flow on graph data and demonstrate that it is competitive with GNNs for classification, while being advantageous in learning representations that extract the clustering structure of the data.

Data sets. We use six benchmark node-classification data sets. Cora, Citeseer, and Pubmed are citation graphs, where each node is a document and each edge represents the citation relation between two documents. We follow the predefined splits in Kipf & Welling (2017). Computers and Photo are subgraphs of the Amazon co-purchase graph (McAuley et al., 2015). They do not have a predefined split; we randomly sample 200/1300/1000 nodes for training/validation/testing for Computers and 80/620/1000 for Photo. Wiki-CS is a web graph where nodes are Wikipedia articles and edges are hyperlinks (Mernyei & Cangea, 2020); we use one of its predefined splits. For statistics of the data sets, see Table 4 in Appendix E.

Baselines. We compare GC-Flow with both discriminative and generative models. For discriminative models, we use three widely used GNNs: GCN, GraphSAGE, and GAT. For generative models, besides FlowGMM, we use the basic Gaussian mixture model (GMM). GMM is not parameterized; it takes either the node features X or the graph-transformed features X̃ = ĀX as input.

Metrics. For measuring classification quality, we use the standard micro-averaged F1 score. For evaluating clustering, we mainly use the silhouette coefficient, which does not require ground-truth cluster labels and measures the separation of clusters; better separation indicates a more interpretable structure. We additionally use NMI (normalized mutual information) and ARI (adjusted Rand index) to measure clustering quality against known ground truths. Implementation details and hyperparameter information may be found in Appendix E.

Classification and clustering performance.
Table 1 lists the F1 scores and the silhouette coefficients for all data sets and all compared models. GNNs are always better than GMMs for classification, while the flow version of GMM, FlowGMM, beats all GNNs on cluster separation. Our model, GC-Flow, is competitive with the better of the two and is always the best or the second best. When it is the best, some of the improvements are rather substantial, such as the F1 score for Computers and the silhouette coefficient for Wiki-CS. It is interesting to note that the basic GMMs perform rather poorly. This phenomenon is not surprising: without any neural network parameterization, they cannot compete with models that allow feature transformations to encourage class or cluster separation.

To further illustrate the clustering quality of GC-Flow, we compare it with several contrastive methods that produce competitive clusterings: DGI (Veličković et al., 2019), GRACE (Zhu et al., 2020), GCA (Zhu et al., 2021), GraphCL (You et al., 2020), and MVGRL (Hassani & Khasahmadi, 2020). Table 2 lists the results for Cora, and Table 5 in Appendix F covers more data sets. For Cora, GC-Flow delivers the best performance on all metrics, with a silhouette score more than double that of the second best. In contrast to NMI and ARI, silhouette takes no knowledge of the ground truth and measures solely the cluster separation in space. This result suggests that the clusters obtained from GC-Flow are more structurally separated, albeit with smaller gains in cluster agreement.

Training behavior. Figure 5 (see Appendix F) plots the convergence behavior of the training loss for FlowGMM, GCN, and GC-Flow. All methods converge favorably, with GCN reaching the plateau earlier, while FlowGMM and GC-Flow converge at a rather similar speed.

Visualization of the representation space.
To complement the numerical metrics, we visualize the representation spaces of FlowGMM, GCN, and GC-Flow using t-SNE plots (Van der Maaten & Hinton, 2008) for qualitative evaluation. The representations for FlowGMM and GC-Flow are the z_i's, while those for GCN are extracted from the penultimate activations. The results for Cora are given earlier in Figure 1; we additionally give the results for Pubmed in Figure 2. From both figures, one sees that, similar to FlowGMM, GC-Flow exhibits a better clustering structure than GCN, which produces little separation of the data. More visualizations are provided in Appendix F.

Analysis on depth. Figure 3 plots the performance of FlowGMM, GCN, and GC-Flow as the number of layers/flows increases. One sees that the classification performance of GCN deteriorates with more layers, in agreement with the well-known oversmoothing phenomenon (Li et al., 2018), while its clustering performance is generally stable. On the other hand, the classification performance of FlowGMM and GC-Flow does not show a unique pattern: for Cora, it degrades, while for Pubmed, it stabilizes. The clustering performance of FlowGMM and GC-Flow generally degrades, except for the curious case of GC-Flow on Cora, where the silhouette coefficient shows a V-shape. Nevertheless, a smaller depth is generally preferred for all models.

Table 3: Silhouette coefficient and Micro-F1 of GC-Flow variants on three selected data sets.

Model       Silhouette      Micro-F1        Silhouette      Micro-F1        Silhouette      Micro-F1
GC-Flow     0.669 ± 0.021   0.791 ± 0.009   0.487 ± 0.012   0.847 ± 0.007   0.655 ± 0.013   0.917 ± 0.004
GC-Flow-p   0.804 ± 0.010   0.790 ± 0.007   0.706 ± 0.019   0.841 ± 0.006   0.874 ± 0.011   0.914 ± 0.008
GC-Flow-l   0.856 ± 0.029   0.783 ± 0.009   0.582 ± 0.020   0.851 ± 0.009   0.842 ± 0.006   0.911 ± 0.005

Analysis on labeling rate. Figure 4 plots the performance of FlowGMM, GCN, and GC-Flow as the number of training labels per class increases. One sees that, for all models, the performance generally improves with more labeled data.
The improvement is more steady and noticeable for classification, and less significant for clustering. Additionally, GC-Flow classifies significantly better than GCN in the low-labeling-rate regime, achieving a 10.04% relative improvement in the F1 score when there are only two labeled nodes per class.

Improving performance with additional parameterization. We experiment with two variants of GC-Flow that introduce parameterizations to Ā. The variant GC-Flow-p uses an idea similar to GAT: it embeds the flow inputs and computes an additive attention on the graph edges to redefine their weights. The variant GC-Flow-l also computes weights, but rather than using them to define Ā directly, it treats each weight as a probability of edge presence and samples the corresponding Bernoulli distribution to obtain a binary sample Ā. The details are given in Appendix B. Table 3 lists the performance of GC-Flow-p and GC-Flow-l on three selected data sets, where the improvement over GC-Flow is notable. The improvement predominantly appears in clustering, with the most striking increase from 0.487 to 0.706. The increase in silhouette coefficients generally comes with a marginal decrease in the F1 score, but the decrease is within the standard deviation. On one occasion (Computers), the F1 score even increases, though also marginally.

6. CONCLUSIONS

We have developed a generative GNN model which, rather than directly computing the class posterior p(y|x), computes the class conditional likelihood p(x|y) and applies the Bayes rule, together with the class prior p(y), for prediction. A benefit of such a model is that one may control the representation of the data (e.g., its clustering structure) by modeling the representation distribution (e.g., optimizing it toward a mixture of well-separated unimodal distributions). We achieve this by designing the GNN as a normalizing flow that additionally incorporates graph convolutions. Interestingly, the graph adjacency matrix appears in the density computation of the normalizing flow as a stand-alone term, which can be ignored when it is a constant, or easily optimized when it is parameterized. We demonstrate that the proposed model not only maintains the predictive power of past GNNs, but also produces high-quality clusters in the representation space.

A PROOF OF LEMMA 1

The Jacobian dY/dX is an nD × nD matrix. By the chain rule, its entries are dY_ij/dX_pq = Ā_ip J^i_jq, where J^i := ∇g(x̃_i) ∈ R^{D×D}. Hence, dY/dX can be expressed as the following block matrix (up to a permutation and a transpose, which do not affect the absolute value of the determinant):

[ Ā_11 J^1   Ā_12 J^1   ⋯   Ā_1n J^1 ]
[ Ā_21 J^2   Ā_22 J^2   ⋯   Ā_2n J^2 ]
[    ⋮           ⋮       ⋱      ⋮    ]
[ Ā_n1 J^n   Ā_n2 J^n   ⋯   Ā_nn J^n ]

Since any matrix admits a QR factorization, let J^i = Q^i R^i for all i, where Q^i is unitary and R^i is upper-triangular. Then, the above block matrix equals

diag(Q^1, Q^2, …, Q^n) ·
[ Ā_11 R^1   Ā_12 R^1   ⋯   Ā_1n R^1 ]
[ Ā_21 R^2   Ā_22 R^2   ⋯   Ā_2n R^2 ]
[    ⋮           ⋮       ⋱      ⋮    ]
[ Ā_n1 R^n   Ā_n2 R^n   ⋯   Ā_nn R^n ]

Because the left factor is block-diagonal with unitary blocks, it does not change the absolute value of the determinant. Hence, we only need to compute the determinant of the right block matrix. We may rearrange that matrix into the following D × D array of n × n blocks, while maintaining the absolute value of the determinant:

[ Ā ∘ S_11   Ā ∘ S_12   ⋯   Ā ∘ S_1D ]
[ Ā ∘ S_21   Ā ∘ S_22   ⋯   Ā ∘ S_2D ]
[    ⋮           ⋮       ⋱      ⋮    ]
[ Ā ∘ S_D1   Ā ∘ S_D2   ⋯   Ā ∘ S_DD ]

where ∘ denotes the Hadamard (entrywise) product and S_ij ∈ R^{n×n} is the matrix whose (k, l)-entry equals R^k_ij for every l; that is, row k of S_ij holds the constant value R^k_ij. Because each R^k is upper-triangular, S_ij = 0 whenever i > j. Therefore, the block matrix above is block upper-triangular and its absolute determinant equals ∏_{i=1}^{D} |det(Ā ∘ S_ii)|. Note that every entry of row k of S_ii equals R^k_ii. Then, by the definition of the determinant (the Leibniz formula),

det(Ā ∘ S_ii) = det(Ā) · R^1_ii R^2_ii ⋯ R^n_ii.

Therefore,

∏_{i=1}^{D} |det(Ā ∘ S_ii)| = |det Ā|^D · |R^1_11 R^2_11 ⋯ R^n_11 · R^1_22 R^2_22 ⋯ R^n_22 ⋯ R^1_DD R^2_DD ⋯ R^n_DD| = |det Ā|^D · |det J^1| · |det J^2| ⋯ |det J^n|,

which concludes the proof.

B PARAMETERIZATIONS OF A

With parameterization, A may differ in the constituent flows. Hence, we use the flow index j to distinguish them; i.e., A (j) . Let a graph be denoted by G = (V, E), where V is the node set and E is the edge set.

B.1 GC-FLOW-P VARIANT

The GC-Flow-p variant parameterizes $A^{(j)}$ by using an idea similar to GAT (Veličković et al., 2018), where an existing edge $(i, k) \in E$ is reweighted by using attention scores. Let $E_1, E_2 : \mathbb{R}^D \to \mathbb{R}^d$ be two embedding networks that map a $D$-dimensional flow input to a $d$-dimensional vector, and let $M : \mathbb{R}^{2d} \to \mathbb{R}$ be a feed-forward network. We compute two sets of vectors
$$h^{(j-1)}_i = \mathrm{ReLU}\big(E_1(x^{(j-1)}_i)\big), \qquad g^{(j-1)}_k = \mathrm{ReLU}\big(E_2(x^{(j-1)}_k)\big)$$
and concatenate them to compute a pre-attention coefficient $\alpha^{(j)}_{ik}$ for all $(i, k) \in E$:
$$\alpha^{(j)}_{ik} = M\big([h^{(j-1)}_i \,\|\, g^{(j-1)}_k]\big).$$
Then, constructing a matrix $S^{(j)}$ where
$$S^{(j)}_{ik} = \begin{cases} \mathrm{LeakyReLU}(\alpha^{(j)}_{ik}) & \text{if } (i, k) \in E, \\ -\infty & \text{otherwise,} \end{cases}$$
we define $A^{(j)}$ through a row-wise softmax: $A^{(j)} = \mathrm{softmax}(S^{(j)})$. This parameterization differs from GAT mainly in using more complex embedding networks $E_1$ and $E_2$, rather than a single feed-forward layer, to compute the vectors $h^{(j-1)}_i$ and $g^{(j-1)}_k$. Moreover, we do not use multiple heads.
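The masking and row-wise softmax mechanics can be sketched as follows. For brevity the embedding networks E1, E2 are single random linear layers and M is a random linear scorer (the paper uses deeper networks), so the weights here are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, d = 4, 5, 3
X = rng.normal(size=(n, D))
# edge set including self-loops
E = {(0, 1), (1, 0), (1, 2), (2, 3), (0, 0), (1, 1), (2, 2), (3, 3)}

# Stand-ins for E1, E2, M: random linear maps (illustrative only)
W1, W2 = rng.normal(size=(D, d)), rng.normal(size=(D, d))
m = rng.normal(size=2 * d)

h = np.maximum(X @ W1, 0)          # h_i = ReLU(E1(x_i))
g = np.maximum(X @ W2, 0)          # g_k = ReLU(E2(x_k))

S = np.full((n, n), -np.inf)       # -inf on non-edges
for i, k in E:                     # pre-attention only on existing edges
    alpha = m @ np.concatenate([h[i], g[k]])
    S[i, k] = np.where(alpha > 0, alpha, 0.2 * alpha)  # LeakyReLU(0.2)

# Row-wise softmax; -inf entries (non-edges) get weight exactly 0
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)
print(np.allclose(A.sum(axis=1), 1.0))  # -> True
```

Each row of the resulting matrix is a probability distribution over the node's neighbors, and entries outside E stay exactly zero, so the graph's sparsity pattern is preserved.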

B.2 GC-FLOW-L VARIANT

The GC-Flow-l variant learns a new graph structure. For computational efficiency, the learning is based on the given edge set $E$; that is, edges may only be removed from $E$, and no edges outside $E$ are inserted. The method follows Luo et al. (2021), which hypothesizes that the existing edge set is noisy and aims at removing the noisy edges. The basic idea uses prior work on differentiable sampling (Maddison et al., 2016; Jang et al., 2016), which states that the random variable
$$e = \sigma\big( (\log \epsilon - \log(1 - \epsilon) + \omega) / \tau \big), \quad \text{where } \epsilon \sim \mathrm{Uniform}(0, 1), \tag{10}$$
follows a distribution that converges to a Bernoulli distribution with success probability $p = (1 + e^{-\omega})^{-1}$ as $\tau > 0$ tends to zero. Hence, if we parameterize $\omega$ and specify that the presence of an edge between a pair of nodes has probability $p$, then using $e$ computed from (10) to fill the corresponding entry of $A$ will produce a matrix $A$ that is close to binary. We can use this matrix in GC-Flow, with the hope of improving classification/clustering performance thanks to the ability to denoise edges. Moreover, because (10) is differentiable with respect to $\omega$, we can train the parameters of $\omega$ through usual gradient-based training. To this end, we let $E_1, E_2 : \mathbb{R}^D \to \mathbb{R}^d$ be two embedding networks that embed the pairwise flow inputs as
$$a^{(j)}_{ik} = \tanh\big(E_1(x^{(j)}_i)\big) \odot \tanh\big(E_2(x^{(j)}_k)\big), \quad \forall (i, k) \in E,$$
$$b^{(j)}_{ik} = \tanh\big(E_2(x^{(j)}_i)\big) \odot \tanh\big(E_1(x^{(j)}_k)\big), \quad \forall (i, k) \in E,$$
where $a^{(j)}_{ik}, b^{(j)}_{ik} \in \mathbb{R}^d$ and $\odot$ is the Hadamard product. Then, we take their difference and compute $\omega^{(j)}_{ik} = \tanh\big( \mathbf{1}^T (a^{(j)}_{ik} - b^{(j)}_{ik}) \big)$, followed by
$$e^{(j)}_{ik} = \sigma\big( (\log \epsilon - \log(1 - \epsilon) + \omega^{(j)}_{ik}) / \tau \big), \quad \epsilon \sim \mathrm{Uniform}(0, 1),$$
which returns an approximate Bernoulli sample for the edge $(i, k)$. When $\tau$ is not sufficiently close to zero, this sample may not be close enough to binary, and in particular, it is strictly nonzero. To explicitly zero out an edge, we follow Louizos et al. (2017) and introduce two parameters, $\gamma < 0$ and $\xi > 1$, to remove small values of $e^{(j)}_{ik}$:
$$\bar{e}^{(j)}_{ik} = \min\big(1, \max(0, \, e^{(j)}_{ik} (\xi - \gamma) + \gamma)\big).$$
This definition of $A^{(j)}$ does not insert new edges into the graph (i.e., when $(i, k) \notin E$, $A^{(j)}_{ik} = 0$), but only removes (denoises) some edges $(i, k)$ originally in $E$.
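The edge sampler can be sketched as follows; the function names are hypothetical, and the stretch constants γ = -0.1, ξ = 1.1 are illustrative values (the paper does not specify them). As τ → 0, the soft sample approaches a Bernoulli draw with success probability σ(ω), and the stretch-and-clip step of Louizos et al. (2017) maps small samples to exactly zero:

```python
import numpy as np

rng = np.random.default_rng(2)

def concrete_edge(omega, tau, size=None):
    """Binary-concrete sample: sigma((log u - log(1-u) + omega) / tau)."""
    u = rng.uniform(size=size)
    with np.errstate(over='ignore'):  # extreme logits saturate to 0/1
        return 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1 - u) + omega) / tau))

def stretch_and_clip(e, gamma=-0.1, xi=1.1):
    """Hard-concrete trick: stretch to [gamma, xi], then clip to [0, 1]
    so that small samples become exactly zero (edge removed)."""
    return np.clip(e * (xi - gamma) + gamma, 0.0, 1.0)

omega, tau = 1.5, 0.05
e = concrete_edge(omega, tau, size=100_000)
hard = stretch_and_clip(e)
# With small tau, the fraction of "kept" edges approaches
# p = sigmoid(omega) = (1 + exp(-omega))^(-1)
p = 1.0 / (1.0 + np.exp(-omega))
print(abs((hard > 0.5).mean() - p) < 0.01)  # -> True
```

Because the sample is a differentiable function of ω (for fixed noise ε), gradients flow through the edge weights during training, which is exactly what enables the denoising objective.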

C TRAINING AND INFERENCE

Despite inheriting the generative characteristics of FlowGMMs (including the training loss), GC-Flows are by nature GNNs, because the graph convolution operation (the Ā-multiplication) involves a node's neighbor set when computing the output of a constituent flow for this node. Across all constituent flows, evaluating the loss on a single node requires the information of its entire T-hop neighborhood, causing scalability challenges for large graphs. One may perform full-batch training (the deterministic gradient descent method), which avoids redundant evaluations of a node in any constituent flow. Such an approach is the most convenient to implement in current deep learning frameworks; typical GPU memory can afford a medium-scale graph and CPU memory can afford even larger graphs. If one opts to perform mini-batch training (the stochastic gradient descent method), neighborhood sampling for GNNs (e.g., node-wise (Hamilton et al., 2017; Ying et al., 2018), layer-wise (Chen et al., 2018a; Zou et al., 2019), or subgraph sampling (Chiang et al., 2019; Zeng et al., 2020)) is a popular approach to reducing the computation within the T-hop neighborhood. Inference faces the same challenge as training, but since it requires only a single pass over the test set, doing so in full batch or with large mini-batches may suffice. If one were to use normal mini-batches with neighborhood sampling (for reasons such as consistency with training), empirical evidence of success has been demonstrated in the GNN literature (Kaler et al., 2022).
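Node-wise neighborhood sampling, one of the mini-batch strategies cited above, can be sketched as follows. This is a GraphSAGE-style illustration with a hypothetical helper, not the paper's implementation:

```python
import random

def sample_t_hop(adj, seeds, T, fanout, seed=0):
    """Node-wise neighbor sampling: starting from the mini-batch `seeds`,
    keep at most `fanout` neighbors per node per hop, which bounds the
    size of the T-hop computation graph."""
    rng = random.Random(seed)
    needed, frontier = set(seeds), set(seeds)
    for _ in range(T):
        nxt = set()
        for v in frontier:
            nbrs = adj.get(v, [])
            nxt.update(rng.sample(nbrs, min(fanout, len(nbrs))))
        needed |= nxt
        frontier = nxt
    return needed

# A toy path graph 0-1-2-3-4; a 2-hop batch around node 0
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(sorted(sample_t_hop(adj, [0], T=2, fanout=2)))  # -> [0, 1, 2]
```

Without sampling, the T-hop neighborhood of a mini-batch can grow exponentially with T; capping the fanout per hop makes the per-batch cost predictable at the price of a stochastic loss estimate.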

D COMPLEXITY ANALYSIS

Let us analyze the cost of computing the likelihood loss (5) for GC-Flows. Based on (8) and (9), the cost consists of three parts: that of computing $X^{(j)} = \bar{A} X^{(j-1)}$ for each constituent flow indexed by $j$; that of computing the Jacobian determinant $\det \nabla f_j(\tilde{x}^{(j)}_i)$ for each node $i$; and that of computing the graph-related determinant $\det \bar{A}$. For a fixed graph, the third part is a constant and is omitted in training. The cost of the second part varies according to the type of the flow. If we use an affine-coupling flow as exemplified in Section 3.1, let the cost of the $s$ and $t$ networks be $C_{st}$. Then, the cost of computing the overall loss can be summarized as
$$O\big( \mathrm{nz}(\bar{A}) D T + n T C_{st} \big), \tag{11}$$
where $\mathrm{nz}(\bar{A})$ denotes the number of nonzeros of $\bar{A}$, and recall that $D$ and $T$ are the feature dimension and the number of flows, respectively. An affine-coupling flow can be implemented with varying architectures. For example, for a usual MLP, $C_{st} = O\big( \sum_{i=0}^{L-1} h_i h_{i+1} \big)$, where $h_0 = D/2$; $h_1, \ldots, h_{L-1}$ are hidden dimensions; and $h_L = 2(D/2)$. Note that $O(C_{st})$ dominates the cost of computing a matrix determinant, which is only $O(D)$, because the Jacobian matrix is triangular in an affine-coupling flow. It is useful to compare the above cost with that of the cross-entropy loss for a usual GCN:
$$O\Big( \mathrm{nz}(\bar{A}) \sum_{j=0}^{T-1} d_j + n \sum_{j=0}^{T-1} d_j d_{j+1} \Big), \tag{12}$$
where $d_0 = D$; $d_1, \ldots, d_{T-1}$ are hidden dimensions; and $d_T = K$, the number of classes. Here, we assume that the GCN has $T$ layers, comparable to GC-Flow. The two costs (11) and (12) are comparable, part by part. For the first part, $DT$ is comparable to $\sum_{j=0}^{T-1} d_j$ if all the $d_j$'s are similar. In some data sets, the input dimension of GCN is much higher than the hidden and output dimensions, but we correspondingly perform a dimension reduction on the input features when running GC-Flows (as is the practice in the experiments), reducing $DT$ to $D'T$ for some $D' \ll D$.
For the second part, the number $L$ of hidden layers in each flow is typically small (say, 5), and the hidden dimensions $h_j$ are comparable to the input dimension $h_0$, rendering comparable terms $T C_{st}$ versus $\sum_{j=0}^{T-1} d_j d_{j+1}$. Overall, the computational costs of GC-Flow and GCN are similar and their scaling behaviors are the same. It is worth noting that when one parameterizes $\bar{A}$, the cost of computing $A^{(j)}$ for each flow/layer $j$ needs to be added to the loss computation, for both GC-Flows and GCNs. This cost depends on the specific parameterization and can be either cheap or expensive. Additionally, GC-Flows require the computation of $\det A^{(j)}$, whose cost depends on the structure of the matrix, which in turn is determined by the parameterization.

For GCN and GraphSAGE, we set the hidden size to 128, the dropout rate to 0.5, and the learning rate to 0.01. For GAT, we follow Veličković et al. (2018) to set the hidden size to 8, the number of attention heads to 8, the dropout rate to 0.6, and the learning rate to 0.005. The results in Table 1 for GC-Flow are obtained by using different numbers of flows and layers per flow for different data sets. For Computers, Photo, and Wiki-CS, we use 4, 2, and 2 flows, respectively, with 6 dense layers in each flow. For Cora, Citeseer, and Pubmed, we use 4, 10, and 10 flows, respectively, with 10 dense layers in each flow. The results in Table 3 also use different numbers of flows and layers. For GC-Flow-p, on Cora, Citeseer, Pubmed, Photo, and Wiki-CS, we use 10 flows, while on Computers, we use 6 flows. For GC-Flow-l, we use 2 flows for Wiki-CS and 4 flows for the rest of the data sets. For both GC-Flow-p and GC-Flow-l, there are 6 dense layers per flow on Wiki-CS and 10 per flow on the rest of the data sets.

F ADDITIONAL EXPERIMENT RESULTS

Clustering performance. Table 5 compares the clustering performance of GC-Flow and various contrastive GNN methods on Citeseer and Pubmed. GC-Flow attains the best silhouette score, which measures cluster separation. For the cluster-assignment metrics, GC-Flow remains competitive on ARI, which measures cluster similarity, while falling behind on NMI, which measures cluster agreement. The loss for GCN is the cross-entropy, while the loss for FlowGMM and GC-Flow is the likelihood. All the curves are smooth and exhibit reasonable decay, suggesting convergence as expected. GC-Flow converges similarly to FlowGMM, while GCN converges faster.
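For reference, the silhouette coefficient used throughout these comparisons can be computed directly from its definition; the sketch below mirrors scikit-learn's `silhouette_score` on a toy example (the helper name is ours):

```python
import numpy as np

def silhouette(Z, labels):
    """Mean silhouette coefficient: s(i) = (b - a) / max(a, b), where a is
    the mean intra-cluster distance of point i and b is the smallest mean
    distance from i to another cluster (scikit-learn's definition)."""
    Z, labels = np.asarray(Z, float), np.asarray(labels)
    dist = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
    scores = []
    for i in range(len(Z)):
        same = labels == labels[i]
        a = dist[i, same].sum() / max(same.sum() - 1, 1)
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated clusters -> silhouette close to 1
Z = [[0.0], [0.1], [10.0], [10.1]]
labels = [0, 0, 1, 1]
print(round(silhouette(Z, labels), 3))  # -> 0.99
```

Values near 1 indicate tight, well-separated clusters; values near 0 indicate overlapping clusters, which is why the metric is a natural probe of the representation space.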

Visualization of the representation space.

Figure 6 shows the t-SNE plots for data sets Citeseer, Computers, Photo, and Wiki-CS, one per row.



For notational convenience and consistency with the GNN literature, here we omit the often-used bias term.



Figure 1: Representation space of the data set Cora under different models, visualized by t-SNE. Coloring indicates ground-truth labeling. Silhouette coefficients measure cluster separation. Micro-F1 scores measure classification accuracy.

e.g., Franceschi et al. (2019); Wu et al. (2020); Shang et al. (2021); Fatemi et al. (2021); Dai & Chen (2022).

Figure 2: Representation space of the data set Pubmed under different models.

Figure 3: Performance variation with respect to the network depth/number of flows.


Figure 5: Convergence of the training loss (Cora).

CGF extends the continuous version of normalizing flows to graphs, where the dynamics of the differential equation are parameterized as a message-passing layer. The difference between our model and CGF inherits the general difference between discrete and continuous flows in how different parameterizations transform distributions.

Comparison of GMM-based generative models, GNN-based discriminative models, and GC-Flow for semi-supervised classification and clustering. Standard deviations are obtained by repeating model training ten times. For each data set and metric, the two best cases are boldfaced.

Clustering performance of various GNN methods. The two best cases are boldfaced. Data set: Cora.

Effect of modeling A. Boldfaced numbers indicate improvement over GC-Flow.

Clustering performance of various GNN methods. The two best cases are boldfaced. Data sets: Citeseer (top) and Pubmed (bottom).

E EXPERIMENT DETAILS

Data sets. Table 4 summarizes the statistics of the benchmark data sets used in this paper.

Computing environment. We implemented all models using PyTorch (Paszke et al., 2019), PyTorch Geometric (Fey & Lenssen, 2019), and Scikit-learn (Pedregosa et al., 2011). All data sets used in the experiments are obtained from PyTorch Geometric. We conduct the experiments on a server with four NVIDIA RTX A6000 GPUs (48GB memory each).

Implementation details. For fair comparison, we run all models on the entire data set under the transductive semi-supervised setting. All models are initialized with Glorot initialization (Glorot & Bengio, 2010) and are trained using the Adam optimizer (Kingma & Ba, 2015). For reporting the silhouette coefficient, k-means is run for 1000 epochs. For all models on all data sets, the ℓ2 weight decay factor is set to 5 × 10^-4 and the number of training epochs is set to 400. For all models, we use the early stopping strategy based on the F1 score on the validation set. In all experiments, we use Ā = (D + I)^{-1}(A + I), where D = diag(Σ_j A_ij). For FlowGMM, GC-Flow, and its variants, we clip the gradients to the range [-50, 50]. For GC-Flow-l and GC-Flow-p, since A^{(j)} in each flow may not have full rank, we add to A^{(j)} a diagonal matrix with damping value 10^-3. Moreover, the slope in LeakyReLU is set to 0.2. In all models involving normalizing flows, we use RealNVP (Dinh et al., 2017) with coupling layers implemented using MLPs. Following Izmailov et al. (2020), the mean vectors are parameterized as a scalar multiple of the vector of all ones, and the covariance matrices are parameterized as a scalar multiple of the identity matrix. On the other hand, the GMM models are implemented using Scikit-learn with full covariance matrices. The number of training epochs for GMMs is set to 200.

Too large a feature dimension renders a challenge to the training of a normalizing flow. Hence, we perform dimension reduction in such a case.
For Cora, Pubmed, Computers, and Photo, we use PCA to reduce the feature dimension to 50; for Citeseer, to 100. We keep the dimension 300 on Wiki-CS without feature reduction.

Hyperparameters. We use grid search to tune the hyperparameters of FlowGMM, GC-Flow, and its variants. The search spaces are listed in the following:

• Number of flow layers: 2, 4, 6, 8, 10, 12, 14, 16, 18, 20;
• Number of dense layers in each flow: 6, 10, 14;
• Hidden size of flow layers: 128, 256, 512, 1024;
• Weighting parameter λ: 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5;
• Gaussian mean and covariance scale: [0.5, 10];
• Initial learning rate: 0.001, 0.002, 0.003, 0.005;
• Dropout rate: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6.

Additionally, for GC-Flow-p and GC-Flow-l:

• Number of dense layers in E_1 and E_2: 4, 6;
• Hidden size of dense layers in E_1 and E_2: 128, 256;
• Embedding dimension d: 8, 16.
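The adjacency normalization Ā = (D + I)^{-1}(A + I) stated in the implementation details can be sketched as follows (a minimal NumPy illustration; the function name is ours):

```python
import numpy as np

def normalize_adj(A):
    """Random-walk normalization with self-loops:
    A_bar = (D + I)^{-1} (A + I), with D = diag(sum_j A_ij).
    Since D + I is diagonal, the inverse is a row-wise division."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    return (A + np.eye(n)) / (deg + 1.0)[:, None]

A = np.array([[0., 1., 1.],
              [1., 0., 0.],
              [1., 0., 0.]])
A_bar = normalize_adj(A)
print(A_bar.sum(axis=1))  # -> [1. 1. 1.] (each row is a probability vector)
```

Each row of Ā sums to one by construction, so the graph convolution averages a node's features with those of its neighbors.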

