ON SINGLE-ENVIRONMENT EXTRAPOLATIONS IN GRAPH CLASSIFICATION AND REGRESSION TASKS

Anonymous

Abstract

Extrapolation in graph classification/regression remains an underexplored area of an otherwise rapidly developing field. Our work contributes to a growing literature by providing the first systematic counterfactual modeling framework for extrapolations in graph classification/regression tasks. To show that extrapolation from a single training environment is possible, we develop a connection between certain extrapolation tasks on graph sizes and Lovász's characterization of graph limits. For these extrapolations, standard graph neural networks (GNNs) fail, while classifiers built on induced homomorphism densities succeed, though mostly on unattributed graphs. Generalizing these density features through a GNN subgraph decomposition allows them to also succeed in more complex attributed graph extrapolation tasks. Finally, our experiments validate our theoretical results and showcase some shortcomings of common (interpolation) methods in the literature.
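To make the density features mentioned above concrete: the induced homomorphism density of a small pattern $F$ in a graph $G$ is the fraction of $|V(F)|$-sized vertex subsets of $G$ whose induced subgraph is a copy of $F$. The following minimal sketch computes this density for the triangle pattern by exact enumeration; the adjacency-dictionary representation and function name are our own illustration, not the paper's implementation.

```python
import itertools

def induced_triangle_density(adj):
    """Fraction of 3-vertex subsets that induce a triangle (K3), i.e. the
    induced density of the triangle pattern, computed by exact enumeration.
    `adj` maps each vertex to the set of its neighbours (simple graph)."""
    triples = list(itertools.combinations(adj, 3))
    hits = sum(1 for a, b, c in triples
               if b in adj[a] and c in adj[a] and c in adj[b])
    return hits / len(triples)

# A 5-cycle contains no triangles ...
c5 = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
# ... while in the complete graph K5 every triple induces a triangle.
k5 = {i: {j for j in range(5) if j != i} for i in range(5)}
print(induced_triangle_density(c5), induced_triangle_density(k5))  # 0.0 1.0
```

Features of this kind are permutation-invariant by construction, which is what allows them to behave stably as the number of vertices grows.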

1. INTRODUCTION

In some graph classification and regression applications, the graphs themselves are representations of a natural process rather than the true state of the process. Molecular graphs are built from a pairwise atom-distance matrix by keeping edges whose distance falls below a chosen threshold, and this choice affects the distinguishability of molecules (Klicpera et al., 2020). Functional brain connectomes are derived from time series, but researchers must choose a frequency range for the signals, which affects the resulting graph structure (De Domenico et al., 2016).

In this work, we use the term graph-processing environment (or simply environment) for the collection of heuristics and other data curation processes that produce the observed graph from the true state of the process under consideration. The true state alone defines the target variable. Our work is interested in what we refer to as the graph extrapolation task: predict a target variable from a graph regardless of its environment. In this context, even graph sizes can be determined by the environment. Unsurprisingly, graph extrapolation tasks, a type of out-of-distribution prediction, are only feasible when we make assumptions about these environments. We define the graph extrapolation task as a counterfactual inference task that requires learning environment-invariant (E-invariant) representations. Unfortunately, graph datasets largely contain a single environment, while common E-invariant representation methods require training data from multiple environments. These include independence of causal mechanisms (ICM) methods (Bengio et al., 2019; Besserve et al., 2018; Johansson et al., 2016; Louizos et al., 2017; Raj et al., 2020; Schölkopf, 2019; Arjovsky et al., 2019), Causal Discovery from Change (CDC) methods (Tian & Pearl, 2001), and representation disentanglement methods (Bengio et al., 2019; Goudet et al., 2017; Locatello et al., 2019).

Contributions.
Our work contributes to a growing literature by providing, to the best of our knowledge, the first systematic counterfactual modeling framework for extrapolations in graph classification/regression tasks. Existing work, e.g., the parallel work of Xu et al. (2020), defines extrapolations geometrically and thus has a different scope. Our work connects Lovász's graph limit theory with graph-size extrapolation in a family of graph classification and regression tasks. Moreover, our experiments show that in these tasks, traditional graph classification/regression methods, including graph neural networks and graph kernels, are unable to extrapolate.

2. A FAMILY OF GRAPH EXTRAPOLATION TASKS

Geometrically, extrapolation can be thought of as reasoning beyond the convex hull of a set of training points (Hastie et al., 2012; Haffner, 2002; King & Zeng, 2006; Xu et al., 2020). However, for neural networks, with their arbitrary representation mappings, this geometric interpretation is insufficient to describe a truly broad range of tasks. Rather, extrapolations are better described through counterfactual reasoning (Neyman, 1923; Rubin, 1974; Pearl, 2009; Schölkopf, 2019). Specifically, we want to ask: after seeing training data from environment A, how can we extrapolate and predict what the model predictions would have been for a test example from an unknown environment B, had the training data also come from B? For instance, what would the model predictions have been for a large test graph if our training data had also consisted of large graphs rather than small ones?

A structural causal model (SCM) for graph classification and regression tasks. In many applications, graphs are simply representations of a natural process rather than the true state of the process. In what follows we assume all graphs are simple: every pair of vertices has at most one edge and there are no self-loops. Our work defines an $n$-vertex attributed graph as a sample of a random variable $G_n := (X^{(\mathrm{obs})}_{1,1}, \ldots, X^{(\mathrm{obs})}_{n,n})$, where $X^{(\mathrm{obs})}_{i,j} \in \Omega^{(e)}$, $i \neq j$, encodes edges and edge attributes, and $X^{(\mathrm{obs})}_{i,i} \in \Omega^{(v)}$ encodes vertex attributes; we will assume $\Omega = \Omega^{(v)} = \Omega^{(e)}$ for simplicity. Consider a supervised task over a graph input $G_n$, $n \geq 2$, and its corresponding output $Y$. We describe the graph and target generation process with the help of a structural causal model (SCM) (Pearl, 2009, Definition 7.1.1). We first consider a hidden random variable $E$ with support in $\mathbb{Z}_+$ that describes the graph-processing environment (see Introduction).
We also consider an independent hidden random variable $W \in D_W$ that defines the true state of the data and is independent of the environment variable $E$, with an appropriately defined space $D_W$. In the SCM, these two variables are inputs to a deterministic graph-generation function $g : \mathbb{Z}_+ \times D_W \times D_Z \to \Omega^{n \times n}$, for some appropriately defined space $D_Z$, that outputs
$$G^{(\mathrm{hid})}_{N^{(\mathrm{obs})}} := (X^{(\mathrm{hid})}_{1,1}, \ldots, X^{(\mathrm{hid})}_{N^{(\mathrm{obs})},N^{(\mathrm{obs})}}) = g(E, W, Z_X), \quad \text{with } N^{(\mathrm{obs})} := \eta(E, W), \qquad (1)$$
where $Z_X$ is another independent random variable that models external noise (such as the measurement noise of a device). Equation (1) gives the edge and vertex attributes of the graph $G^{(\mathrm{hid})}_{N^{(\mathrm{obs})}}$ in some arbitrary canonical form (Immerman & Lander, 1990), where $\eta$ is a function of both $E$ and $W$ that gives the number of vertices in the graph.

To understand our definitions, consider the following simple example (divided into two parts). Erdős–Rényi example (part 1): For a single environment $e$, let $n = \eta(e)$ be the (fixed) number of vertices of the graphs in our training data, and let $p = W$ be the probability that any two vertices of the graph are joined by an edge. Finally, the variable $Z_X$ can be thought of as the seed of a random number generator that is drawn $n(n-1)/2$ times to determine whether each pair of distinct vertices is connected by an edge. The above defines our training data as a set of Erdős–Rényi random graphs of size $n$ with edge probability $p = W$.

The data generation process in Equation (1) could leak information about $W$ through the vertex ids (the order of the vertices). Rather than restricting how $W$ acts on $(X^{(\mathrm{hid})}_{1,1}, \ldots, X^{(\mathrm{hid})}_{N^{(\mathrm{obs})},N^{(\mathrm{obs})}})$, we remedy this by applying a random permutation $\pi$ to the vertex indices:
$$G^{(\mathrm{obs})}_{N^{(\mathrm{obs})}} := (X^{(\mathrm{obs})}_{1,1}, \ldots, X^{(\mathrm{obs})}_{N^{(\mathrm{obs})},N^{(\mathrm{obs})}}) = (X^{(\mathrm{hid})}_{\pi(1),\pi(1)}, \ldots, X^{(\mathrm{hid})}_{\pi(N^{(\mathrm{obs})}),\pi(N^{(\mathrm{obs})})}),$$
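As a sanity check of the generative story above, the Erdős–Rényi example together with the vertex-permutation step can be sketched as follows. The function names, the use of a NumPy generator as the exogenous noise $Z_X$, and the choice $\eta(e, w) = e$ are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def eta(e, w):
    # Illustrative choice: in the Erdos-Renyi example the number of
    # vertices is fixed by the environment alone.
    return e

def generate_observed_graph(e, w, seed):
    """Sketch of the SCM's generative process: the environment e and the
    true state w (here, the edge probability p = w) produce a hidden graph
    via g(E, W, Z_X); the seed plays the role of the exogenous noise Z_X."""
    rng = np.random.default_rng(seed)
    n = eta(e, w)
    # Hidden graph: draw each of the n(n-1)/2 vertex pairs independently.
    hid = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            hid[i, j] = hid[j, i] = int(rng.random() < w)
    # Observed graph: a uniformly random vertex permutation hides any
    # information about W carried by the vertex ids.
    pi = rng.permutation(n)
    return hid[np.ix_(pi, pi)]

G_obs = generate_observed_graph(e=6, w=0.5, seed=0)
```

Note that the permutation step leaves every permutation-invariant statistic of the hidden graph (degree distribution, subgraph densities, and so on) unchanged, which is exactly why such statistics remain informative about $W$.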



Recent work (e.g., Knyazev et al. (2019); Bouritsas et al. (2020); Xu et al. (2020)) explores extrapolations in real-world tasks, showcasing a growing interest in the underexplored topic of graph extrapolation tasks.

Figure 1: (a) The DAG of the structural causal model (SCM) of our graph extrapolation tasks, where hashed (resp. white) vertices represent observed (resp. hidden) variables; (b) illustration of the relationship between expressive model families and most-expressive extrapolation families.

