ON REPRESENTING LINEAR PROGRAMS BY GRAPH NEURAL NETWORKS

Abstract

Learning to optimize is a rapidly growing area that aims to solve optimization problems or improve existing optimization algorithms using machine learning (ML). In particular, the graph neural network (GNN) is considered a suitable ML model for optimization problems whose variables and constraints are permutation-invariant, for example, the linear program (LP). While the literature has reported encouraging numerical results, this paper establishes the theoretical foundation for applying GNNs to solving LPs. Given any size limit of LPs, we construct a GNN that maps different LPs to different outputs. We show that properly built GNNs can reliably predict feasibility, boundedness, and an optimal solution for each LP in a broad class. Our proofs are based upon the recently discovered connections between the Weisfeiler-Lehman isomorphism test and the GNN. To validate our results, we train a simple GNN and present its accuracy in mapping LPs to their feasibility and solutions.

1. INTRODUCTION

Applying machine learning (ML) techniques to accelerate optimization, also known as Learning to Optimize (L2O), is attracting increasing attention. It has been reported that L2O shows great potential on both continuous optimization (Monga et al., 2021; Chen et al., 2021; Amos, 2022) and combinatorial optimization (Bengio et al., 2021; Mazyavkina et al., 2021). Many L2O works train a parameterized model that takes an optimization problem as input and outputs information useful to classic algorithms, such as a good initial solution or branching decisions (Nair et al., 2020), and some even directly generate an approximate optimal solution (Gregor & LeCun, 2010). In these works, one builds an ML model to approximate the mapping from an explicit optimization instance either to its key properties or directly to its solution. The ability to achieve accurate approximation is called the representation power or expressive power of the model. When the approximation is accurate, the model can solve the problem or provide useful information to guide an optimization algorithm. This paper addresses a fundamental but open theoretical problem for linear programming (LP):

Which neural networks can represent an LP and predict its key properties and solution? (P0)

To clarify, by solution we mean an optimal solution. Let us also remark that this question is not only of theoretical interest. Although current neural network models may not be powerful enough to replace mathematically grounded LP solvers and obtain an exact LP solution, they can still help LP solvers in several ways, including warm-starting and configuration. This requires neural networks to have sufficient power to recognize the key characteristics of LPs.
Some very recent papers (Deka & Misra, 2019; Pan et al., 2020; Chen et al., 2022) on DC optimal power flow (DC-OPF), an important type of LP, experimentally show the possibility of quickly approximating LP solutions with deep neural networks. Practitioners may initialize an LP solver with those approximated solutions. We hope the answer to (P0) paves the way toward answering this question for other optimization types.

Linear Programming (LP). LP is an important type of optimization problem with a wide range of applications, such as scheduling (Hanssmann & Hess, 1960), signal processing (Candes & Tao, 2005), machine learning (Dedieu et al., 2022), etc. A general LP problem is defined as:

min_{x ∈ R^n} c^⊤ x,  s.t. Ax ∘ b, l ≤ x ≤ u,   (1.1)

where A ∈ R^{m×n}, c ∈ R^n, b ∈ R^m, l ∈ (R ∪ {−∞})^n, u ∈ (R ∪ {+∞})^n, and ∘ ∈ {≤, =, ≥}^m. Every LP problem falls into exactly one of the following three cases (Bertsimas & Tsitsiklis):

• Infeasible. The feasible set X_F := {x ∈ R^n : Ax ∘ b, l ≤ x ≤ u} is empty. In other words, no point in R^n satisfies the constraints in LP (1.1).
• Unbounded. The feasible set is non-empty, but the objective value can be arbitrarily good, i.e., unbounded from below: for any R > 0, there exists x ∈ X_F such that c^⊤ x < −R.
• Feasible and bounded. There exists x* ∈ X_F such that c^⊤ x* ≤ c^⊤ x for all x ∈ X_F. Such an x* is called an optimal solution, and c^⊤ x* is the optimal objective value.

Thus, considering (P0), an ideal ML model is expected to predict the three key characteristics of an LP: feasibility, boundedness, and one of its optimal solutions (if one exists), by taking the LP features (A, b, c, l, u, ∘) as input. Such input has a strong mathematical structure: if we swap the i-th and j-th variables in (1.1), the entries of c, l, u and the columns of A are reordered accordingly; similarly, swapping two constraints reorders b, ∘ and the rows of A.
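This reordering leaves the problem itself unchanged, which can be checked numerically. The sketch below (random data, instance sizes chosen only for illustration) solves a random box-constrained LP with scipy and verifies that permuting variables and constraints preserves the optimal value and permutes the solution:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 4, 6
A = rng.normal(size=(m, n))
b = A @ rng.uniform(size=n)        # choose b so that the LP is feasible
c = rng.normal(size=n)
bounds = [(0.0, 1.0)] * n          # box constraints keep the LP bounded

perm_v = rng.permutation(n)        # reorder variables: columns of A, entries of c
perm_c = rng.permutation(m)        # reorder constraints: rows of A, entries of b

res = linprog(c, A_ub=A, b_ub=b, bounds=bounds, method="highs")
res_p = linprog(c[perm_v], A_ub=A[perm_c][:, perm_v], b_ub=b[perm_c],
                bounds=bounds, method="highs")

assert res.status == 0 and res_p.status == 0
# Same problem, same optimal value; the solution is permuted accordingly
# (random continuous data makes the optimum unique with probability one).
assert np.isclose(res.fun, res_p.fun)
assert np.allclose(res.x[perm_v], res_p.x, atol=1e-6)
```
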
The reordered features (Â, b̂, ĉ, l̂, û, ∘̂) represent an LP problem exactly equivalent to the original one (A, b, c, l, u, ∘). This property is known as permutation invariance. If we do not explicitly restrict ML models to a permutation-invariant structure, the models may overfit to the variable/constraint orders of the instances in the training set. Motivated by this point, we adopt Graph Neural Networks (GNNs), which are naturally permutation invariant.

GNN in L2O. GNNs are neural networks defined on graphs and widely applied in many areas, for example, recommender systems, traffic, chemistry, etc. (Wu et al., 2020; Zhou et al., 2020). Accelerating optimization solvers with GNNs has attracted rising interest recently (Peng et al., 2021; Cappart et al., 2021). Many graph-related optimization problems, like minimum vertex cover, traveling salesman, and vehicle routing, can be represented and solved approximately with GNNs due to their problem structures (Khalil et al., 2017; Kool et al., 2019; Joshi et al., 2019; Drori et al., 2020). Beyond that, one may solve a general LP or mixed-integer linear program (MILP) with the help of GNNs. Gasse et al. (2019) proposed to represent an MILP with a bipartite graph and apply a GNN on this graph to guide an MILP solver. Ding et al. (2020) proposed a tripartite graph to represent MILP. Since then, many approaches have been proposed to guide MILP or LP solvers with GNNs (Nair et al., 2020; Gupta et al., 2020; 2022; Shen et al., 2021; Khalil et al., 2022; Liu et al., 2022; Paulus et al., 2022; Qu et al., 2022; Li et al., 2022). Although encouraging empirical results have been observed, theoretical foundations are still lacking for this approach. Specifying (P0), we ask:

Are there GNNs that can predict the feasibility, boundedness and an optimal solution of LP? (P1)

Related works and contributions. To answer (P1), one needs the theory of separation power and representation power.
Separation power of a neural network (NN) refers to its ability to distinguish two different inputs. In our setting, an NN with strong separation power can output different results when applied to any two different LPs. Representation power of an NN refers to its ability to approximate functions of interest. The theory of representation power is built upon the separation power: only model classes with strong enough separation power may possess strong representation power. The power of GNNs has been studied in the literature (see Sato (2020); Jegelka (2022); Li & Leskovec (2022) for comprehensive surveys), and some theoretical efforts have been made to represent some graph-related optimization problems with GNNs (Sato et al., 2019; Loukas, 2020). However, gaps remain in answering question (P1), since the relationships between the characteristics of LPs and the properties of graphs are not well established. Our contributions are listed below:

• (Separation Power). It has been shown in the literature that the separation power of GNNs equals that of the WL test (Xu et al., 2019; Azizian & Lelarge, 2021; Geerts & Reutter, 2022). However, there exist many pairs of LPs that cannot be distinguished by the WL test. We show that those puzzling LP pairs share the same feasibility, boundedness, and even an optimal solution if one exists. Thus, GNNs have strong enough separation power.
• (Representation Power). To the best of our knowledge, we establish the first complete proof that GNNs can universally represent a broad class of LPs. More precisely, we prove that there exist GNNs that can be arbitrarily close to the following three mappings: LP → feasibility, LP → optimal objective value (−∞ if unbounded and ∞ if infeasible), and LP → an optimal solution (if one exists), although these mappings are not continuous functions and hence are not covered by the literature (Keriven & Peyré, 2019; Chen et al., 2019; Maron et al., 2019a;b; Keriven et al., 2021).
• (Experimental Validation).
We design and conduct experiments that demonstrate the power of GNNs in representing LPs. The rest of this paper is organized as follows. In Section 2, we provide preliminaries, including related notions, definitions and concepts. In Section 3, we present our main theoretical results. Sketches of the proofs are provided in Section 4. We validate our results with numerical experiments in Section 5, and we conclude the paper in Section 6.

2. PRELIMINARIES

In this section, we present concepts and definitions that will be used throughout this paper. We first describe how to represent an LP with a weighted bipartite graph, then we define GNNs on those LP-induced graphs, and finally we further clarify question (P1) with precise mathematical definitions.

2.1. LP REPRESENTED AS WEIGHTED BIPARTITE GRAPH

Before representing LPs with graphs, we first define the graph that we will adopt in this paper: the weighted bipartite graph. A weighted bipartite graph G = (V ∪ W, E) consists of a vertex set V ∪ W, divided into two groups V and W with V ∩ W = ∅, and a collection E of weighted edges, where each edge connects exactly one vertex in V and one vertex in W. Note that there is no edge connecting vertices in the same vertex group. E can also be viewed as a function E : V × W → R. We use G_{m,n} to denote the collection of all weighted bipartite graphs G = (V ∪ W, E) with |V| = m and |W| = n. We always write V = {v_1, v_2, ..., v_m}, W = {w_1, w_2, ..., w_n}, and E_{i,j} = E(v_i, w_j) for i ∈ {1, 2, ..., m}, j ∈ {1, 2, ..., n}.

One can equip each vertex with a feature vector. Throughout this paper, we denote by h^V_i ∈ H^V the feature vector of vertex v_i ∈ V and by h^W_j ∈ H^W the feature vector of vertex w_j ∈ W, where H^V, H^W are feature spaces. We then define H^V_m := (H^V)^m, H^W_n := (H^W)^n and concatenate all the vertex features as H = (h^V_1, h^V_2, ..., h^V_m, h^W_1, h^W_2, ..., h^W_n) ∈ H^V_m × H^W_n. Finally, a weighted bipartite graph with vertex features is defined as a tuple (G, H) ∈ G_{m,n} × H^V_m × H^W_n.

With the concepts described above, one can represent an LP (1.1) as a bipartite graph (Gasse et al., 2019): each vertex in W represents a variable in the LP and each vertex in V represents a constraint. The graph topology is defined by the matrix A in the linear constraints. More specifically, we set:

• Vertex v_i represents the i-th constraint in Ax ∘ b, and vertex w_j represents the j-th variable x_j.
• Information of the constraints is encoded in the feature of v_i: h^V_i = (b_i, ∘_i).
• The space of constraint features is defined as H^V := R × {≤, =, ≥}.
• Information of the variables is encoded in the feature of w_j: h^W_j = (c_j, l_j, u_j).
• The space of variable features is defined as H^W := R × (R ∪ {−∞}) × (R ∪ {+∞}).
• The edge connecting v_i and w_j has weight E_{i,j} = A_{i,j}.

Then an LP is represented as a graph (G, H) ∈ G_{m,n} × H^V_m × H^W_n. In the rest of this paper, we refer to such graphs as LP-induced graphs, or LP-graphs for simplicity. We present an LP instance and its corresponding LP-graph in Figure 1:

min_{x ∈ R^2} x_1 + 2x_2,  s.t. x_1 + 2x_2 ≥ 1, 2x_1 + x_2 = 2, x_1 ≥ 0, x_2 ≥ −1.

Figure 1: An example of an LP-graph. The constraint vertices carry features h^V_1 = (1, ≥), h^V_2 = (2, =); the variable vertices carry h^W_1 = (1, 0, +∞), h^W_2 = (2, −1, +∞); the edge weights are E_{1,1} = 1, E_{1,2} = 2, E_{2,1} = 2, E_{2,2} = 1.

2.2. GRAPH NEURAL NETWORKS FOR LP

The GNNs in this paper always take an LP-graph as input, and the output has two cases:

• The output is a single real number. In this case, the GNN is a function G_{m,n} × H^V_m × H^W_n → R, usually used to predict properties of the whole graph.
• Each vertex in W has an output. In this case, the GNN is a function G_{m,n} × H^V_m × H^W_n → R^n. Since W represents the variables of the LP, the GNN predicts properties of each variable.

Now we define the GNN structure precisely. First we encode the input features into the embedding space with learnable functions f^V_in : H^V → R^{d_0} and f^W_in : H^W → R^{d_0}:

h^{0,V}_i = f^V_in(h^V_i), h^{0,W}_j = f^W_in(h^W_j), i = 1, 2, ..., m, j = 1, 2, ..., n, (2.1)

where h^{0,V}_i, h^{0,W}_j ∈ R^{d_0} are the initial embedded vertex features and d_0 is their dimension. Then we choose learnable functions f^V_l, f^W_l : R^{d_{l−1}} → R^{d_l} and g^V_l, g^W_l : R^{d_{l−1}} × R^{d_l} → R^{d_l} and update the hidden states with

h^{l,V}_i = g^V_l( h^{l−1,V}_i, Σ_{j=1}^n E_{i,j} f^W_l(h^{l−1,W}_j) ), i = 1, 2, ..., m, (2.2)
h^{l,W}_j = g^W_l( h^{l−1,W}_j, Σ_{i=1}^m E_{i,j} f^V_l(h^{l−1,V}_i) ), j = 1, 2, ..., n, (2.3)

where h^{l,V}_i, h^{l,W}_j ∈ R^{d_l} are the vertex features at layer l (1 ≤ l ≤ L), with dimensions d_1, ..., d_L respectively.
The output layer of the single-output GNN is defined with a learnable function f_out : R^{d_L} × R^{d_L} → R:

y_out = f_out( Σ_{i=1}^m h^{L,V}_i, Σ_{j=1}^n h^{L,W}_j ). (2.4)

The output of the vertex-output GNN is defined with f^W_out : R^{d_L} × R^{d_L} × R^{d_L} → R:

y_out(w_j) = f^W_out( Σ_{i=1}^m h^{L,V}_i, Σ_{j′=1}^n h^{L,W}_{j′}, h^{L,W}_j ), j = 1, 2, ..., n. (2.5)

We denote the collections of single-output GNNs and vertex-output GNNs by F_GNN and F^W_GNN, respectively:

F_GNN = {F : G_{m,n} × H^V_m × H^W_n → R | F is given by (2.1), (2.2), (2.3), (2.4)},
F^W_GNN = {F : G_{m,n} × H^V_m × H^W_n → R^n | F is given by (2.1), (2.2), (2.3), (2.5)}. (2.6)

In practice, all the learnable functions in a GNN, f^V_in, f^W_in, f_out, f^W_out, {f^V_l, f^W_l, g^V_l, g^W_l}^L_{l=1}, are usually parameterized with multi-layer perceptrons (MLPs). In our theoretical analysis, we assume for simplicity that these functions may be any continuous functions on the given domains, following the settings in Azizian & Lelarge (2021, Section C.1). Thanks to the universal approximation property of MLPs (Hornik et al., 1989; Cybenko, 1989), one can extend our theoretical results by taking the learnable functions as sufficiently large MLPs.
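As a concrete illustration of Sections 2.1 and 2.2, the sketch below first encodes the Figure 1 instance as an LP-graph (the numeric code for the senses {≤, =, ≥} is our own assumption; the paper keeps H^V abstract) and then runs an untrained instance of (2.1)-(2.4), with random one-hidden-layer networks standing in for the learnable functions. The final assertion checks the built-in permutation invariance of the single-output form (2.4):

```python
import numpy as np

rng = np.random.default_rng(1)

def lp_to_graph(A, b, c, l, u, senses):
    """Encode an LP (1.1) as an LP-graph: edge weights E and features h^V, h^W."""
    sense_code = {"<=": -1.0, "=": 0.0, ">=": 1.0}   # assumed numeric encoding
    h_V = np.array([[bi, sense_code[s]] for bi, s in zip(b, senses)])
    h_W = np.array([[cj, lj, uj] for cj, lj, uj in zip(c, l, u)])
    return np.asarray(A, dtype=float), h_V, h_W

def mlp(d_in, d_out):
    """Random one-hidden-layer network standing in for a learnable function."""
    W1 = rng.normal(size=(d_in, 16)) * 0.1
    W2 = rng.normal(size=(16, d_out)) * 0.1
    return lambda x: np.tanh(x @ W1) @ W2

d = 8
f_in_V, f_in_W = mlp(2, d), mlp(3, d)                 # input embeddings (2.1)
layers = [(mlp(d, d), mlp(d, d), mlp(2 * d, d), mlp(2 * d, d)) for _ in range(2)]
f_out = mlp(2 * d, 1)                                 # graph-level readout (2.4)

def gnn_forward(E, hV, hW):
    hV, hW = f_in_V(hV), f_in_W(hW)
    for f_V, f_W, g_V, g_W in layers:
        # (2.2)-(2.3): each side aggregates edge-weighted messages from the other
        msg_V = E @ f_W(hW)          # row i: sum_j E_ij * f_W(h^{l-1,W}_j)
        msg_W = E.T @ f_V(hV)        # row j: sum_i E_ij * f_V(h^{l-1,V}_i)
        hV = g_V(np.concatenate([hV, msg_V], axis=1))
        hW = g_W(np.concatenate([hW, msg_W], axis=1))
    return f_out(np.concatenate([hV.sum(0), hW.sum(0)]))[0]

# The Figure 1 instance, with inf standing for missing upper bounds.
E, hV0, hW0 = lp_to_graph(A=[[1, 2], [2, 1]], b=[1, 2], c=[1, 2],
                          l=[0, -1], u=[np.inf, np.inf], senses=[">=", "="])
y = gnn_forward(E, hV0, hW0)

# Reordering constraints and variables consistently leaves the output unchanged.
p, q = rng.permutation(2), rng.permutation(2)
assert np.isclose(y, gnn_forward(E[p][:, q], hV0[p], hW0[q]))
```

Here g^V_l and g^W_l act on the concatenation of the old state and the aggregated message, one simple way to realize a two-argument learnable function.
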

2.3. REVISITING QUESTION (P1)

With the definitions of LP-Graph and GNN above, we revisit the question (P1) and provide its precise mathematical description here. First we define three mappings that respectively describe the feasibility, optimal objective value and an optimal solution of an LP (if exists).

Feasibility mapping

The feasibility mapping is a classification function

Φ_feas : G_{m,n} × H^V_m × H^W_n → {0, 1}, (2.7)

where Φ_feas(G, H) = 1 if the LP associated with (G, H) is feasible and Φ_feas(G, H) = 0 otherwise.
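For intuition, the outcomes that Φ_feas (and, below, Φ_obj) distinguish can be produced with an off-the-shelf solver. A minimal sketch using scipy.optimize.linprog (which expects A_ub x ≤ b_ub, so a ≥-row is negated, and reports feasibility/boundedness through its status code):

```python
import numpy as np
from scipy.optimize import linprog

# Feasible and bounded: min x1 + 2*x2  s.t.  x1 + x2 >= 1, x >= 0.
# linprog expects A_ub @ x <= b_ub, so the ">=" row is negated.
res = linprog(c=[1, 2], A_ub=[[-1, -1]], b_ub=[-1],
              bounds=[(0, None)] * 2, method="highs")
assert res.status == 0                      # optimal solution found
assert np.allclose(res.x, [1, 0]) and np.isclose(res.fun, 1.0)

# Infeasible: x1 >= 0 contradicts x1 <= -1, so the feasible set is empty.
res = linprog(c=[1], A_ub=[[1]], b_ub=[-1], bounds=[(0, None)], method="highs")
assert res.status == 2                      # infeasible

# Unbounded: min x1 with x1 free and nothing bounding it from below.
res = linprog(c=[1], bounds=[(None, None)], method="highs")
assert res.status == 3                      # unbounded from below
```
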

Optimal objective value mapping

Denote by

Φ_obj : G_{m,n} × H^V_m × H^W_n → R ∪ {∞, −∞} (2.8)

the optimal objective value mapping, i.e., for any (G, H) ∈ G_{m,n} × H^V_m × H^W_n, Φ_obj(G, H) is the optimal objective value of the LP problem associated with (G, H).

Remark 2.1. The optimal objective value of an LP problem can be a real number or ∞/−∞. The "∞" case corresponds to infeasible problems, while the "−∞" case consists of LP problems whose objective function is unbounded from below on the feasible region. The preimage of the finite real numbers under Φ_obj, Φ_obj^{−1}(R), thus describes all LPs with a finite optimal objective value.

Remark 2.2. In the case that an LP problem has a finite optimal objective value, the problem may admit multiple optimal solutions. However, the optimal solution with the smallest ℓ2-norm must be unique. In fact, suppose x ≠ x′ are two optimal solutions with ∥x∥ = ∥x′∥, where ∥·∥ denotes the ℓ2-norm throughout this paper. Then (x + x′)/2 is also an optimal solution due to the convexity of LPs, and it holds that

∥(x + x′)/2∥² < (1/2)∥x∥² + (1/2)∥x′∥² = ∥x∥² = ∥x′∥², i.e., ∥(x + x′)/2∥ < ∥x∥ = ∥x′∥,

where the first inequality is strict since x ≠ x′. Therefore, x and x′ cannot both be optimal solutions with the smallest ℓ2-norm.

Optimal solution mapping

For any (G, H) ∈ Φ_obj^{−1}(R), we have remarked that the LP problem associated with (G, H) has a unique optimal solution with the smallest ℓ2-norm. Let Φ_solu : Φ_obj^{−1}(R) → R^n be the mapping that maps (G, H) ∈ Φ_obj^{−1}(R) to this optimal solution.

Invariance and Equivariance

We denote by S_m, S_n the groups consisting of all permutations of the vertex groups V, W, respectively. In other words, S_m contains all permutations of the constraints of the LP and S_n contains all permutations of the variables.
In this paper, we say a function F : G_{m,n} × H^V_m × H^W_n → R is invariant if it satisfies

F(G, H) = F((σ_V, σ_W) ∗ (G, H)), ∀ σ_V ∈ S_m, σ_W ∈ S_n,

and a function F^W : G_{m,n} × H^V_m × H^W_n → R^n is equivariant if it satisfies

σ_W(F^W(G, H)) = F^W((σ_V, σ_W) ∗ (G, H)), ∀ σ_V ∈ S_m, σ_W ∈ S_n,

where (σ_V, σ_W) ∗ (G, H) is the permuted graph obtained by reordering the indices in (G, H) using (σ_V, σ_W), i.e., the group action of S_m × S_n on G_{m,n} × H^V_m × H^W_n. One can check that Φ_feas, Φ_obj, and any F ∈ F_GNN are invariant, and that Φ_solu and any F^W ∈ F^W_GNN are equivariant. Question (P1) thus asks: Does there exist F ∈ F_GNN that well approximates Φ_feas or Φ_obj? And does there exist F^W ∈ F^W_GNN that well approximates Φ_solu?
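The strict-convexity argument behind the uniqueness claim in Remark 2.2 can be checked numerically on an LP whose optimal face is a whole segment:

```python
import numpy as np

# min x1 + x2  s.t.  x1 + x2 = 1, x >= 0: every point of the segment between
# (1, 0) and (0, 1) is optimal, so the optimal solution is not unique.
x, xp = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = 0.5 * (x + xp)

# The midpoint is still optimal (a convex combination of optima) ...
assert np.isclose(mid.sum(), 1.0) and np.all(mid >= 0)
# ... but has strictly smaller l2-norm, exactly as in Remark 2.2, so two
# distinct optima of equal norm can never both be norm-minimal.
assert np.linalg.norm(x) == np.linalg.norm(xp)
assert np.linalg.norm(mid) < np.linalg.norm(x)
```
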

3. MAIN RESULTS

This section presents our main theorems that answer question (P1). As stated in the introduction, representation power is built upon separation power in our paper. We first show, in the following theorem, that GNNs have strong enough separation power to represent LPs.

Theorem 3.1. Given any two LP instances (G, H), (Ĝ, Ĥ) ∈ G_{m,n} × H^V_m × H^W_n, if F(G, H) = F(Ĝ, Ĥ) for all F ∈ F_GNN, then they share some common characteristics: (i) Both LP problems are feasible or both are infeasible, i.e., Φ_feas(G, H) = Φ_feas(Ĝ, Ĥ). (ii) The two LP problems have the same optimal objective value, i.e., Φ_obj(G, H) = Φ_obj(Ĝ, Ĥ). (iii) If both problems are feasible and bounded, they have the same optimal solution with the smallest ℓ2-norm up to a permutation, i.e., Φ_solu(G, H) = σ_W(Φ_solu(Ĝ, Ĥ)) for some σ_W ∈ S_n. Furthermore, if F^W(G, H) = F^W(Ĝ, Ĥ) for all F^W ∈ F^W_GNN, then (iii) holds without taking permutations, i.e., Φ_solu(G, H) = Φ_solu(Ĝ, Ĥ).

This theorem demonstrates that the function spaces F_GNN and F^W_GNN are rich enough to distinguish the characteristics of LPs. Given two LP instances (G, H), (Ĝ, Ĥ), as long as their feasibility or boundedness differ, there must exist F ∈ F_GNN that distinguishes them: F(G, H) ≠ F(Ĝ, Ĥ). Moreover, as long as their optimal solutions with the smallest ℓ2-norm differ, there must exist F^W ∈ F^W_GNN that distinguishes them: F^W(G, H) ≠ F^W(Ĝ, Ĥ). With Theorem 3.1 serving as a foundation, we can prove that GNNs can approximate the three mappings Φ_feas, Φ_obj and Φ_solu to arbitrary precision. Before presenting those results, we first define some structure on the space G_{m,n} × H^V_m × H^W_n.

Topology and measure

Throughout this paper, we consider G_{m,n} × H^V_m × H^W_n, where H^V = R × {≤, =, ≥} and H^W = R × (R ∪ {−∞}) × (R ∪ {+∞}), as a topological space with the product topology and a measurable space with the product measure.
It suffices to define the topology and measure of each part separately. Since each graph in this paper (without vertex features) can be represented by a matrix A ∈ R^{m×n}, the graph space G_{m,n} is isomorphic to the Euclidean space R^{m×n}: G_{m,n} ≅ R^{m×n}, and we equip G_{m,n} with the standard Euclidean topology and the standard Lebesgue measure. The real spaces R in H^W and H^V are also equipped with the standard Euclidean topology and Lebesgue measure. All the discrete spaces {≤, =, ≥}, {−∞}, and {+∞} have the discrete topology, and all unions are disjoint unions. We equip those spaces with the discrete measure µ(S) = |S|, where |S| is the number of elements in a finite set S. This completes the definition, and we denote by Meas(·) the measure on G_{m,n} × H^V_m × H^W_n.

Theorem 3.2. Given any measurable X ⊂ G_{m,n} × H^V_m × H^W_n with finite measure, for any ϵ > 0, there exists some F ∈ F_GNN such that

Meas({(G, H) ∈ X : I_{F(G,H)>1/2} ≠ Φ_feas(G, H)}) < ϵ,

where I_· is the indicator function, i.e., I_{F(G,H)>1/2} = 1 if F(G, H) > 1/2 and I_{F(G,H)>1/2} = 0 otherwise.

This theorem shows that a GNN is a good classifier for LP instances in X as long as X has finite measure. If we use F(G, H) > 1/2 as the criterion to predict feasibility, the classification error rate is controlled by ϵ/Meas(X), where ϵ can be arbitrarily small. Furthermore, we show that GNNs can perfectly fit any dataset with finitely many samples, as stated in the following corollary.

Corollary 3.3. For any D ⊂ G_{m,n} × H^V_m × H^W_n with finitely many instances, there exists F ∈ F_GNN such that I_{F(G,H)>1/2} = Φ_feas(G, H) for all (G, H) ∈ D.

Besides feasibility, GNNs can also approximate Φ_obj and Φ_solu.

Theorem 3.4.
Given any measurable X ⊂ G_{m,n} × H^V_m × H^W_n with finite measure, for any ϵ > 0, there exists F_1 ∈ F_GNN such that

Meas({(G, H) ∈ X : I_{F_1(G,H)>1/2} ≠ I_{Φ_obj(G,H)∈R}}) < ϵ, (3.1)

and for any ϵ, δ > 0, there exists F_2 ∈ F_GNN such that

Meas({(G, H) ∈ X ∩ Φ_obj^{−1}(R) : |F_2(G, H) − Φ_obj(G, H)| > δ}) < ϵ. (3.2)

Recall from the definition of Φ_obj in (2.8) that it can take values in {±∞}. Thus, Φ_obj(G, H) ∈ R means the LP corresponding to (G, H) is feasible and bounded with a finite optimal objective value, and inequality (3.1) shows that GNNs can identify those feasible and bounded LPs within the whole set X, up to a given precision ϵ. Inequality (3.2) shows that GNNs can also approximate the optimal objective value: the measure of the set of LP instances whose optimal value cannot be approximated to δ-precision is controlled by ϵ. The following corollary gives the corresponding results on datasets with finitely many instances.

Corollary 3.5. For any D ⊂ G_{m,n} × H^V_m × H^W_n with finitely many instances, there exists F_1 ∈ F_GNN such that I_{F_1(G,H)>1/2} = I_{Φ_obj(G,H)∈R} for all (G, H) ∈ D, and for any δ > 0, there exists F_2 ∈ F_GNN such that |F_2(G, H) − Φ_obj(G, H)| < δ for all (G, H) ∈ D ∩ Φ_obj^{−1}(R).

Finally, we show that GNNs are able to represent the optimal solution mapping Φ_solu.

Theorem 3.6. Given any measurable X ⊂ Φ_obj^{−1}(R) ⊂ G_{m,n} × H^V_m × H^W_n with finite measure, for any ϵ, δ > 0, there exists some F^W ∈ F^W_GNN such that

Meas({(G, H) ∈ X : ∥F^W(G, H) − Φ_solu(G, H)∥ > δ}) < ϵ.

Corollary 3.7. Given any D ⊂ Φ_obj^{−1}(R) ⊂ G_{m,n} × H^V_m × H^W_n with finitely many instances, for any δ > 0, there exists F^W ∈ F^W_GNN such that ∥F^W(G, H) − Φ_solu(G, H)∥ < δ for all (G, H) ∈ D.

4. SKETCH OF PROOF

In this section, we present a sketch of our proofs and provide examples to convey the intuition. Full proofs are given in the appendix.

Separation power

The separation power of a neural network measures whether it generates different outcomes given different inputs, and it serves as a foundation of the representation power. The separation power of GNNs is closely related to the Weisfeiler-Lehman (WL) test (Weisfeiler & Leman, 1968), a classical algorithm for testing whether two given graphs are isomorphic. To apply the WL test to LP-graphs, we describe a modified WL test in Algorithm 1, which differs slightly from the standard WL test.

Algorithm 1 The WL test for LP-graphs (denoted by WL_LP)
Require: A graph instance (G, H) ∈ G_{m,n} × H^V_m × H^W_n and iteration limit L > 0.
1: Initialize with C^{0,V}_i = HASH_{0,V}(h^V_i), C^{0,W}_j = HASH_{0,W}(h^W_j).
2: for l = 1, 2, ..., L do
3:   C^{l,V}_i = HASH_{l,V}(C^{l−1,V}_i, Σ_{j=1}^n E_{i,j} HASH′_{l,W}(C^{l−1,W}_j)).
4:   C^{l,W}_j = HASH_{l,W}(C^{l−1,W}_j, Σ_{i=1}^m E_{i,j} HASH′_{l,V}(C^{l−1,V}_i)).
5: end for
6: return The multisets containing all colors {{C^{L,V}_i}}_{i=1}^m, {{C^{L,W}_j}}_{j=1}^n.

We denote Algorithm 1 by WL_LP(·), and we say that two LP-graphs (G, H), (Ĝ, Ĥ) can be distinguished by Algorithm 1 if and only if there exist a positive integer L and injective hash functions {HASH_{l,V}, HASH_{l,W}}^L_{l=0} ∪ {HASH′_{l,V}, HASH′_{l,W}}^L_{l=1} such that WL_LP((G, H), L) ≠ WL_LP((Ĝ, Ĥ), L). Unfortunately, there exist infinitely many pairs of non-isomorphic LP-graphs that cannot be distinguished by Algorithm 1; Figure 2 provides such examples. Since the separation power of GNNs is equal to that of the WL test (Xu et al., 2019), one might expect this limitation of the WL test to prevent GNNs from universally representing LPs. However, any two LP-graphs that cannot be distinguished by the WL test must share some common characteristics even if they are not isomorphic. For example, consider the six LP instances in Figure 2. In each of the three columns, the two non-isomorphic LP instances cannot be distinguished by the WL test.
It can be checked that the two instances in the same column share common characteristics. More specifically, both instances in the first column are infeasible; both instances in the second column are feasible but unbounded; both instances in the third column are feasible and bounded with (1/2, 1/2, 1/2, 1/2) being the optimal solution with the smallest ℓ2-norm. This phenomenon is not specific to the instances in Figure 2; it is a universal principle for all LP instances. We summarize the result in the following theorem:

Theorem 4.1. If (G, H), (Ĝ, Ĥ) ∈ G_{m,n} × H^V_m × H^W_n are not distinguishable by Algorithm 1, then Φ_feas(G, H) = Φ_feas(Ĝ, Ĥ) and Φ_obj(G, H) = Φ_obj(Ĝ, Ĥ). Furthermore, if (G, H), (Ĝ, Ĥ) ∈ Φ_obj^{−1}(R), then it holds that Φ_solu(G, H) = σ_W(Φ_solu(Ĝ, Ĥ)) for some σ_W ∈ S_n.

Figure 2: LP-graphs that cannot be distinguished by the WL test. The first graph connects v_1, ..., v_4 and w_1, ..., w_4 in a single 8-cycle; its three LPs all minimize x_1 + x_2 + x_3 + x_4 subject to
(column 1) x_1 + x_2 = 1, x_2 + x_3 = 1, x_3 + x_4 = 1, x_4 + x_1 = 1, x_j ≥ 1, 1 ≤ j ≤ 4 (infeasible);
(column 2) x_1 + x_2 ≤ 1, x_2 + x_3 ≤ 1, x_3 + x_4 ≤ 1, x_4 + x_1 ≤ 1, x_j ≤ 1, 1 ≤ j ≤ 4 (unbounded);
(column 3) x_1 + x_2 = 1, x_2 + x_3 = 1, x_3 + x_4 = 1, x_4 + x_1 = 1, x_j ≤ 1, 1 ≤ j ≤ 4 (feasible and bounded).
The second graph consists of two disjoint 4-cycles; its three LPs replace the constraints by x_1 + x_2 ∘ 1, x_2 + x_1 ∘ 1, x_3 + x_4 ∘ 1, x_4 + x_3 ∘ 1 with the same senses and bounds as the corresponding column. Since the features and neighbor information of {v_i} and {w_j} in the two graphs coincide, it holds for both graphs that C^{l,V}_1 = ··· = C^{l,V}_4 and C^{l,W}_1 = ··· = C^{l,W}_4 for all l ≥ 0, whatever hash functions are chosen. Based on this graph pair, we construct three pairs of LPs that are respectively both infeasible, both unbounded, and both feasible and bounded with the same optimal solution.
In other words, the above theorem guarantees that the WL test has sufficient power to separate LP problems with different characteristics, including feasibility, optimal objective value, and the optimal solution with the smallest ℓ2-norm (up to permutation). Combined with the following theorem, which states the equivalence of the separation powers of the WL test and GNNs, this yields that GNNs also have sufficient separation power for LP-graphs in the above sense.

Theorem 4.2. For any (G, H), (Ĝ, Ĥ) ∈ G_{m,n} × H^V_m × H^W_n, the following are equivalent:
(i) (G, H) and (Ĝ, Ĥ) are not distinguishable by Algorithm 1.
(ii) F(G, H) = F(Ĝ, Ĥ) for all F ∈ F_GNN.
(iii) For any F^W ∈ F^W_GNN, there exists σ_W ∈ S_n such that F^W(G, H) = σ_W(F^W(Ĝ, Ĥ)).

Representation power

Based on the separation power of GNNs, one can investigate their representation/approximation power. To prove Theorems 3.2, 3.4, and 3.6, we first determine the closure of F_GNN or F^W_GNN in the space of invariant/equivariant continuous functions with respect to the sup-norm, which is also known as universal approximation. The result for F_GNN is stated below, where C(X, R) is the collection/algebra of all real-valued continuous functions on X; the result for F^W_GNN can be found in the appendix.

Theorem 4.3. Let X ⊂ G_{m,n} × H^V_m × H^W_n be a compact set. For any Φ ∈ C(X, R) that satisfies Φ(G, H) = Φ(Ĝ, Ĥ) for all (G, H), (Ĝ, Ĥ) ∈ X that are not distinguishable by Algorithm 1, and any ϵ > 0, there exists F ∈ F_GNN such that sup_{(G,H)∈X} |Φ(G, H) − F(G, H)| < ϵ.

Theorem 4.3 can be viewed as an LP-graph version of results in Azizian & Lelarge (2021); Geerts & Reutter (2022). Roughly speaking, GNNs can approximate, on a compact domain and with arbitrarily small error, any invariant continuous function whose separation power is upper bounded by that of the WL test.
Although our target mappings Φ_feas, Φ_obj, Φ_solu are not continuous, we prove (in the appendix) that they are measurable. Applying Lusin's theorem (Evans & Gariepy, 2018, Theorem 1.14), we show that GNNs can be arbitrarily close to the target mappings outside a domain of arbitrarily small measure.
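The refinement of Algorithm 1 can be sketched in a few lines. Nested tuples stand in for injective hash functions, and, for readability, the weighted sums are replaced by multisets of (weight, color) pairs, an aggregation that is at least as strong; on the graph pair underlying Figure 2 the refinement still cannot tell the two non-isomorphic graphs apart:

```python
import numpy as np

def wl_lp(E, h_V, h_W, L=3):
    """Color refinement in the spirit of Algorithm 1 (WL_LP).

    Nested tuples play the role of injective hash functions, so vertex
    features must be hashable tuples of a uniform shape.
    """
    m, n = E.shape
    cV = [("V",) + h for h in h_V]                 # initial colors from features
    cW = [("W",) + h for h in h_W]
    for _ in range(L):
        # Each side refines its colors by the multiset of (edge weight, color)
        # pairs from the other side; sorting makes the multiset canonical.
        newV = [(cV[i], tuple(sorted((float(E[i, j]), cW[j]) for j in range(n))))
                for i in range(m)]
        newW = [(cW[j], tuple(sorted((float(E[i, j]), cV[i]) for i in range(m))))
                for j in range(n)]
        cV, cW = newV, newW
    return sorted(cV), sorted(cW)                  # multisets of final colors

# The two non-isomorphic graphs of Figure 2: one 8-cycle vs. two 4-cycles.
A1 = np.array([[1., 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 1]])
A2 = np.array([[1., 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]])
hV = [(1.0, "=")] * 4                # h^V_i = (b_i, sense)
hW = [(1.0, 1.0, np.inf)] * 4        # h^W_j = (c_j, l_j, u_j) for x_j >= 1
assert wl_lp(A1, hV, hW) == wl_lp(A2, hV, hW)      # indistinguishable
# Changing a single vertex feature makes the instances distinguishable again.
hW2 = [(2.0, 1.0, np.inf)] + [(1.0, 1.0, np.inf)] * 3
assert wl_lp(A1, hV, hW) != wl_lp(A1, hV, hW2)
```
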

5. NUMERICAL EXPERIMENTS

We present numerical results that validate our theoretical results in this section. We generate LP instances with m = 10 and n = 50 that are possibly infeasible or feasible and bounded. To check whether GNNs can predict feasibility, we generate three data sets with 100, 500, and 2500 independent LP instances respectively, and call the solver wrapped in scipy.optimize.linprog to obtain the feasibility, optimal objective value, and an optimal solution for each generated LP. To obtain enough feasible and bounded LPs for checking whether GNNs can approximate the optimal objective value and optimal solution, we follow the same approach to generate LPs randomly and discard the infeasible ones until the number of LPs reaches our requirement. We train GNNs to fit the three LP characteristics by minimizing the distance between the GNN outputs and the solver-generated labels. The GNNs are built and trained using TensorFlow. The code is modified from Gasse et al. (2019) and can be found at https://github.com/liujl11git/GNN-LP.git. We set L = 2 for all GNNs, and the learnable functions f^V_in, f^W_in, f_out, f^W_out, {f^V_l, f^W_l, g^V_l, g^W_l}^L_{l=1} are all parameterized with MLPs. Details can be found in the appendix.

Our results are reported in Figure 3. All the errors reported in Figure 3 are training errors, since generalization is out of the scope of this paper. In Figure 3a, the "rate of errors" means the proportion of instances with I_{F(G,H)>1/2} ≠ Φ_feas(G, H). This metric is exactly zero as long as the number of parameters in the GNN is large enough, which directly validates Corollary 3.3: there exist GNNs that accurately predict the feasibility of LP instances. Combining the three curves in Figure 3a, we conclude that this principle continues to hold as the number of samples increases, which is consistent with Theorem 3.2.
The mean squared errors in Figures 3b and 3c are respectively defined as E_{(G,H)} |F(G, H) − Φ_obj(G, H)|² and E_{(G,H)} ∥F^W(G, H) − Φ_solu(G, H)∥². Therefore, Figures 3b and 3c validate Theorems 3.4 and 3.6, respectively. Note that all the instances used in Figure 3b are feasible and bounded; thus, Figure 3b only validates (3.2) in Theorem 3.4. However, since the feasibility of an LP is equivalent to the boundedness of its dual problem, one may dualize each LP and use the conclusion of Figure 3a to validate (3.1) in Theorem 3.4. Some extra experimental results on generalization, i.e., the performance of the trained models on a test set, are presented in Appendix G.
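The labeling pipeline of this section can be sketched as follows; the sparsity level and distributions below are our own illustrative assumptions (the paper's exact generation scheme is in its appendix):

```python
import numpy as np
from scipy.optimize import linprog

def random_lp_dataset(num, m=10, n=50, seed=0):
    """Generate random LPs and label them with scipy.optimize.linprog."""
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(num):
        A = rng.normal(size=(m, n)) * (rng.random(size=(m, n)) < 0.1)  # sparse A
        b = rng.normal(size=m)
        c = rng.normal(size=n)
        # Box bounds keep every feasible instance bounded, so each LP is
        # either infeasible or feasible-and-bounded, as in our experiments.
        res = linprog(c, A_ub=A, b_ub=b, bounds=[(0.0, 1.0)] * n, method="highs")
        feasible = res.status == 0
        label = (feasible,
                 res.fun if feasible else None,   # optimal objective value
                 res.x if feasible else None)     # an optimal solution
        data.append(((A, b, c), label))
    return data

data = random_lp_dataset(20)
assert len(data) == 20
assert all(lab[0] in (True, False) for _, lab in data)
```

A GNN trained on `data` would then regress the second and third label entries on the feasible instances and classify the first entry on all instances.
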

6. CONCLUSIONS

In this work, we show that graph neural networks, as well as the WL test, have sufficient separation power to distinguish linear programming problems with different characteristics. In addition, GNNs can approximate LP feasibility, the optimal objective value, and an optimal solution with arbitrarily small error on compact domains or finite datasets. These results guarantee that GNNs are a proper class of machine learning models for representing linear programs, and hence contribute to the theoretical foundation of the learning-to-optimize community. Future directions include the size/complexity of GNNs and their generalization, which are not covered by our current theory but are of great importance. Another future topic is investigating the representation power of graph neural networks for mixed-integer linear programming (MILP), for which promising experimental results have been observed in the literature.

A WEISFEILER-LEHMAN (WL) TEST AND COLOR REFINEMENT

The WL test can be viewed as a color refinement procedure if there are no collisions of hash functions and their weighted averages. More specifically, each vertex is initially colored according to the group it belongs to and its feature: two vertices have the same color if and only if they are in the same vertex group and have the same feature. The initial colors are denoted by $C^{0,V}_1, C^{0,V}_2, \dots, C^{0,V}_m, C^{0,W}_1, C^{0,W}_2, \dots, C^{0,W}_n$. Then at iteration $l$, each set of vertices sharing the same color at iteration $l-1$ is further partitioned into subsets according to the colors of its neighbors: two vertices $v_i$ and $v_{i'}$ are in the same subset if and only if $C^{l-1,V}_i = C^{l-1,V}_{i'}$ and, for any $C \in \{C^{l-1,W}_j : 1 \le j \le n\}$,
$$\sum_{j:\, C^{l-1,W}_j = C} E_{i,j} = \sum_{j:\, C^{l-1,W}_j = C} E_{i',j},$$
and similarly for vertices $w_j$ and $w_{j'}$. After such a partition/refinement, vertices receive the same color if and only if they are in the same subset, which defines the coloring at iteration $l$. The procedure terminates when the refinement is trivial, i.e., no color class is split into two or more subsets, so that the coloring is stable. For more information about color refinement, we refer to Berkholz et al. (2017); Arvind et al. (2015; 2017). We then discuss the stable coloring that Algorithm 1 converges to, for which we make the following definition, where $\mathcal{S} = \{S_1, S_2, \dots, S_s\}$ is called a partition of a set $S$ if $S_1 \cup S_2 \cup \cdots \cup S_s = S$ and $S_i \cap S_{i'} = \emptyset$ for all $1 \le i < i' \le s$. Definition A.1 (Stable Partition Pair of Vertices). Let $G = (V \cup W, E)$ be a weighted bipartite graph with $V = \{v_1, v_2, \dots, v_m\}$, $W = \{w_1, w_2, \dots, w_n\}$, and vertex features $H = (h^V_1, h^V_2, \dots, h^V_m, h^W_1, h^W_2, \dots, h^W_n)$, and let $\mathcal{I} = \{I_1, I_2, \dots, I_s\}$ and $\mathcal{J} = \{J_1, J_2, \dots, J_t\}$ be partitions of $\{1, 2, \dots, m\}$ and $\{1, 2, \dots, n\}$, respectively.
We say that $(\mathcal{I}, \mathcal{J})$ is a stable partition pair of vertices for the graph $G$ if the following are satisfied: (i) $h^V_i = h^V_{i'}$ holds if $i, i' \in I_p$ for some $p \in \{1, 2, \dots, s\}$. (ii) $h^W_j = h^W_{j'}$ holds if $j, j' \in J_q$ for some $q \in \{1, 2, \dots, t\}$. (iii) For any $p \in \{1, 2, \dots, s\}$, $q \in \{1, 2, \dots, t\}$, and $i, i' \in I_p$, $\sum_{j \in J_q} E_{i,j} = \sum_{j \in J_q} E_{i',j}$. (iv) For any $p \in \{1, 2, \dots, s\}$, $q \in \{1, 2, \dots, t\}$, and $j, j' \in J_q$, $\sum_{i \in I_p} E_{i,j} = \sum_{i \in I_p} E_{i,j'}$. We denote by $(\mathcal{I}_l, \mathcal{J}_l)$ the partition pair corresponding to the coloring at iteration $l$ of Algorithm 1. Suppose that there are no collisions. Then it is clear that $(\mathcal{I}_{l+1}, \mathcal{J}_{l+1})$ is finer than $(\mathcal{I}_l, \mathcal{J}_l)$, denoted $(\mathcal{I}_{l+1}, \mathcal{J}_{l+1}) \preceq (\mathcal{I}_l, \mathcal{J}_l)$, which means that for any $I \in \mathcal{I}_{l+1}$ and any $J \in \mathcal{J}_{l+1}$, there exist $I' \in \mathcal{I}_l$ and $J' \in \mathcal{J}_l$ such that $I \subset I'$ and $J \subset J'$. In addition, $(\mathcal{I}_{l+1}, \mathcal{J}_{l+1}) = (\mathcal{I}_l, \mathcal{J}_l)$ if and only if $(\mathcal{I}_l, \mathcal{J}_l)$ is a stable partition pair of vertices. Note that there are at most $O(|V| + |W|)$ iterations leading to a strictly finer partition pair, i.e., $(\mathcal{I}_{l+1}, \mathcal{J}_{l+1}) \preceq (\mathcal{I}_l, \mathcal{J}_l)$ but $(\mathcal{I}_{l+1}, \mathcal{J}_{l+1}) \neq (\mathcal{I}_l, \mathcal{J}_l)$. We immediately obtain the following result: Theorem A.2. If there are no collisions of hash functions and their weighted averages, then Algorithm 1 terminates at a stable partition pair of vertices in $O(|V| + |W|)$ iterations. Furthermore, for every $(G, H) \in \mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n$, the coarsest stable partition pair of vertices exists and is unique, which can be proved using techniques similar to the proof of Berkholz et al. (2017, Proposition 3). Algorithm 1 terminates at the unique coarsest stable partition pair, because the coloring in each iteration of Algorithm 1 is always coarser than the unique coarsest stable partition pair (see Berkholz et al. (2017, Proposition 2)).
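The refinement loop above can be sketched in a few lines. This is a hypothetical implementation, not code from the paper: tuples of (previous color, per-neighbor-color weighted sums) stand in for collision-free hash inputs, colors are plain integers, and the encoding (`E` as a dense weight matrix, disjoint initial colors `cv` and `cw` for the two vertex groups) is an assumption of the sketch.

```python
def wl_refine(E, cv, cw):
    """Color refinement on a weighted bipartite graph.

    E: m x n nested list of edge weights; cv, cw: initial integer colors
    for the V- and W-side vertices (the two groups must use disjoint colors).
    Returns the stable colorings of both sides.
    """
    m, n = len(E), len(E[0])
    while True:
        # Signature = (group tag, old color, weighted sums grouped by neighbor
        # color); distinct signatures get distinct new colors, which plays the
        # role of a collision-free hash.
        sig_v = [("V", cv[i], tuple(sorted((c, sum(E[i][j] for j in range(n) if cw[j] == c))
                                           for c in set(cw)))) for i in range(m)]
        sig_w = [("W", cw[j], tuple(sorted((c, sum(E[i][j] for i in range(m) if cv[i] == c))
                                           for c in set(cv)))) for j in range(n)]
        relabel = {s: k for k, s in enumerate(dict.fromkeys(sig_v + sig_w))}
        nv, nw = [relabel[s] for s in sig_v], [relabel[s] for s in sig_w]
        # The new partition always refines the old one, so equal class counts
        # mean the refinement was trivial and the coloring is stable.
        if len(set(nv)) == len(set(cv)) and len(set(nw)) == len(set(cw)):
            return nv, nw
        cv, cw = nv, nw
```

For instance, on `E = [[1.0, 2.0], [2.0, 1.0]]` with uniform initial colors, the two rows (and the two columns) stay indistinguishable, matching the fact that swapping them is a graph automorphism.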

B SEPARATION POWER OF THE WL TEST

This section gives the proof and some corollaries of Theorem 4.1. We first present some definitions and lemmas, from which the proof of Theorem 4.1 follows immediately.

Published as a conference paper at ICLR 2023

Definition B.1. Given $(G, H), (\hat G, \hat H) \in \mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n$, we say that $(G, H)$ and $(\hat G, \hat H)$ can be distinguished by the WL test if there exist some $L \in \mathbb{N}$ and some choices of hash functions, $\mathrm{HASH}_{0,V}$, $\mathrm{HASH}_{0,W}$, $\mathrm{HASH}_{l,V}$, $\mathrm{HASH}_{l,W}$, $\mathrm{HASH}'_{l,V}$, and $\mathrm{HASH}'_{l,W}$, for $l = 1, 2, \dots, L$, such that the multisets of colors at the $L$-th iteration of the WL test are different for $(G, H)$ and $(\hat G, \hat H)$. Let $\sim$ be the equivalence relation on $\mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n$ defined via: $(G, H) \sim (\hat G, \hat H)$ if and only if they cannot be distinguished by the WL test. It is clear that $(G, H) \sim (\hat G, \hat H)$ if they are isomorphic, i.e., there exist two permutations, $\sigma_V : \{1, 2, \dots, m\} \to \{1, 2, \dots, m\}$ and $\sigma_W : \{1, 2, \dots, n\} \to \{1, 2, \dots, n\}$, such that $E_{\sigma_V(i), \sigma_W(j)} = \hat E_{i,j}$, $h^V_{\sigma_V(i)} = \hat h^V_i$, and $h^W_{\sigma_W(j)} = \hat h^W_j$, for any $i \in \{1, 2, \dots, m\}$ and $j \in \{1, 2, \dots, n\}$. However, not every pair of WL-indistinguishable graphs is isomorphic; see Figure 2 for an example. Nevertheless, LP problems that cannot be distinguished by the WL test share some common properties, even if their associated graphs are not isomorphic. Lemma B.2. If two weighted bipartite graphs with vertex features corresponding to two LP problems are indistinguishable by the WL test, then either both problems are feasible or both are infeasible. In other words, the WL test can distinguish two LP problems if one of them is feasible while the other one is infeasible. Proof of Lemma B.2. Let us consider two LP problems:
$$\min_{x \in \mathbb{R}^n} c^\top x, \quad \text{s.t.}\ Ax \circ b,\ l \le x \le u, \tag{B.1}$$
and
$$\min_{x \in \mathbb{R}^n} \hat c^\top x, \quad \text{s.t.}\ \hat A x \,\hat\circ\, \hat b,\ \hat l \le x \le \hat u.$$
(B.2) Let $(G, H)$ and $(\hat G, \hat H)$, where $G = (V \cup W, E)$ and $\hat G = (V \cup W, \hat E)$, be the weighted bipartite graphs with vertex features corresponding to (B.1) and (B.2), respectively. Since $(G, H) \sim (\hat G, \hat H)$, for some choice of hash functions with no collision during Algorithm 1 on $(G, H)$ and $(\hat G, \hat H)$, Theorem A.2 guarantees that Algorithm 1 outputs the same stable coloring for $(G, H)$ and $(\hat G, \hat H)$ up to permutation. More specifically, after applying some permutation, there exist $\mathcal{I} = \{I_1, I_2, \dots, I_s\}$ and $\mathcal{J} = \{J_1, J_2, \dots, J_t\}$, partitions of $\{1, 2, \dots, m\}$ and $\{1, 2, \dots, n\}$ respectively, such that the following hold:
• $(b_i, \circ_i) = (\hat b_i, \hat\circ_i)$ and is independent of $i \in I_p$, for any $p \in \{1, 2, \dots, s\}$.
• $(l_j, u_j) = (\hat l_j, \hat u_j)$ and is independent of $j \in J_q$, for any $q \in \{1, 2, \dots, t\}$.
• For any $p \in \{1, 2, \dots, s\}$ and $q \in \{1, 2, \dots, t\}$, $\sum_{j \in J_q} A_{i,j} = \sum_{j \in J_q} \hat A_{i,j}$ and is independent of $i \in I_p$.
• For any $p \in \{1, 2, \dots, s\}$ and $q \in \{1, 2, \dots, t\}$, $\sum_{i \in I_p} A_{i,j} = \sum_{i \in I_p} \hat A_{i,j}$ and is independent of $j \in J_q$.
Suppose that problem (B.1) is feasible and let $x \in \mathbb{R}^n$ be some point in its feasible region. Define $y \in \mathbb{R}^t$ via $y_q = \frac{1}{|J_q|} \sum_{j \in J_q} x_j$ and $\bar x \in \mathbb{R}^n$ via $\bar x_j = y_q$ for $j \in J_q$. Fix any $p \in \{1, 2, \dots, s\}$ and some $i_0 \in I_p$. It holds for any $i \in I_p$ that $\sum_{j=1}^n A_{i,j} x_j \circ_i b_i$, i.e., $\sum_{q=1}^t \sum_{j \in J_q} A_{i,j} x_j \circ_{i_0} b_{i_0}$, which implies that
$$\frac{1}{|I_p|} \sum_{i \in I_p} \sum_{q=1}^t \sum_{j \in J_q} A_{i,j} x_j = \frac{1}{|I_p|} \sum_{q=1}^t \sum_{j \in J_q} \Big( \sum_{i \in I_p} A_{i,j} \Big) x_j \ \circ_{i_0}\ b_{i_0}.$$
Notice that $\sum_{i \in I_p} A_{i,j}$ is constant over $j \in J_q$, for each $q \in \{1, 2, \dots, t\}$. Let us denote $\alpha_q = \sum_{i \in I_p} A_{i,j} = \sum_{i \in I_p} \hat A_{i,j}$ for any $j \in J_q$. Then it holds that
$$\frac{1}{|I_p|} \sum_{q=1}^t \sum_{j \in J_q} \alpha_q x_j = \frac{1}{|I_p|} \sum_{q=1}^t \sum_{j \in J_q} \alpha_q y_q = \frac{1}{|I_p|} \sum_{q=1}^t \sum_{j \in J_q} \Big( \sum_{i \in I_p} \hat A_{i,j} \Big) y_q \ \circ_{i_0}\ b_{i_0}.$$
Note that
$$\frac{1}{|I_p|} \sum_{q=1}^t \sum_{j \in J_q} \Big( \sum_{i \in I_p} \hat A_{i,j} \Big) y_q = \frac{1}{|I_p|} \sum_{i \in I_p} \sum_{q=1}^t \Big( \sum_{j \in J_q} \hat A_{i,j} \Big) y_q,$$
and that $\sum_{j \in J_q} \hat A_{i,j}$ is constant over $i \in I_p$.
So one can conclude that
$$\sum_{j=1}^n \hat A_{i,j} \bar x_j = \sum_{q=1}^t \sum_{j \in J_q} \hat A_{i,j} \bar x_j = \sum_{q=1}^t \Big( \sum_{j \in J_q} \hat A_{i,j} \Big) y_q \ \circ_i\ b_i, \quad \forall\, i \in I_p,$$
which leads to $\hat A \bar x \,\hat\circ\, \hat b$. It can also be seen that $\hat l = l \le \bar x \le u = \hat u$. Therefore, $\bar x$ is feasible for (B.2). We have shown above that the feasibility of (B.1) implies the feasibility of (B.2). The converse holds by the same reasoning. Hence, we complete the proof. Lemma B.3. If two weighted bipartite graphs with vertex features corresponding to two LP problems are indistinguishable by the WL test, then the two problems share the same optimal objective value (which could be $\infty$ or $-\infty$). Proof of Lemma B.3. If both problems are infeasible, then their optimal objective values are both $\infty$. We then consider the case that both problems are feasible. We use the same setting and notation as in Lemma B.2; in addition, $c_j = \hat c_j$, which is part of $h^W_j = \hat h^W_j$, is independent of $j \in J_q$ for any $q \in \{1, 2, \dots, t\}$. Suppose that $x$ is a feasible solution to problem (B.1) and let $\bar x \in \mathbb{R}^n$ be defined via $\bar x_j = \frac{1}{|J_q|} \sum_{j' \in J_q} x_{j'}$ for $j \in J_q$. The proof of Lemma B.2 guarantees that $\bar x$ is a feasible solution to (B.2). One can also see that $c^\top x = \hat c^\top \bar x$. Since this holds for any feasible solution $x$ to (B.1), the optimal objective value of (B.2) is smaller than or equal to that of (B.1). The converse is also true, and the proof is complete. Lemma B.4. Suppose that two weighted bipartite graphs with vertex features corresponding to two LP problems are indistinguishable by the WL test and that their optimal objective values are both finite. Then the two problems have the same optimal solution with the smallest $\ell_2$-norm, up to permutation. Proof of Lemma B.4. We work with the same setting as in Lemma B.3, where permutations have already been applied. Let $x$ and $x'$ be the optimal solutions to (B.1) and (B.2) with the smallest $\ell_2$-norm, respectively.
(Recall that the optimal solution to an LP problem with the smallest $\ell_2$-norm is unique; see Remark 2.2.) Let $\bar x \in \mathbb{R}^n$ be defined via $\bar x_j = \frac{1}{|J_q|} \sum_{j' \in J_q} x_{j'}$ for $j \in J_q$, $q = 1, 2, \dots, t$. According to the arguments in the proofs of Lemma B.2 and Lemma B.3, $\bar x$ is an optimal solution to (B.2). The minimality of $x'$ yields that
$$\|x'\|_2^2 \le \|\bar x\|_2^2 = \sum_{q=1}^t |J_q| \Big( \frac{1}{|J_q|} \sum_{j \in J_q} x_j \Big)^2 = \sum_{q=1}^t \frac{1}{|J_q|} \Big( \sum_{j \in J_q} x_j \Big)^2 \le \sum_{q=1}^t \sum_{j \in J_q} x_j^2 = \|x\|_2^2, \tag{B.3}$$
which implies $\|x'\| \le \|x\|$. The converse $\|x\| \le \|x'\|$ is also true. Therefore, we must have $\|x\| = \|x'\|$, and hence the inequalities in (B.3) must hold as equalities. One can then conclude that $x_j = x_{j'}$ for any $j, j' \in J_q$ and $q = 1, 2, \dots, t$, which leads to $x = \bar x$. Furthermore, it follows from $\|x'\| = \|\bar x\|$ and the uniqueness of $x'$ (see Remark 2.2) that $x' = \bar x = x$, which completes the proof. One corollary of the proof of Lemma B.4 is that two components of the optimal solution with the smallest $\ell_2$-norm must coincide if the two corresponding vertices have the same color in the WL test. Corollary B.5. Let $(G, H)$ be a weighted bipartite graph with vertex features and let $x$ be the optimal solution to the corresponding LP problem with the smallest $\ell_2$-norm. Suppose that for some $j, j' \in \{1, 2, \dots, n\}$, one has $C^{l,W}_j = C^{l,W}_{j'}$ for any $l \in \mathbb{N}$ and any choice of hash functions; then $x_j = x_{j'}$. Let us also define another equivalence relation on $\mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n$ in which the colors of $w_1, w_2, \dots, w_n$ are considered with their ordering (not just as multisets). Definition B.6. Given $(G, H), (\hat G, \hat H) \in \mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n$, $(G, H)$ and $(\hat G, \hat H)$ are not in the same equivalence class of $\overset{W}{\sim}$ if and only if there exist some $L \in \mathbb{N}$ and some hash functions $\mathrm{HASH}_{0,V}$, $\mathrm{HASH}_{0,W}$, $\mathrm{HASH}_{l,V}$, $\mathrm{HASH}_{l,W}$, $\mathrm{HASH}'_{l,V}$, and $\mathrm{HASH}'_{l,W}$, $l = 1, 2, \dots, L$, such that $\{\{C^{L,V}_1, C^{L,V}_2, \dots, C^{L,V}_m\}\} \neq \{\{\hat C^{L,V}_1, \hat C^{L,V}_2, \dots, \hat C^{L,V}_m\}\}$ or $C^{L,W}_j \neq \hat C^{L,W}_j$ for some $j \in \{1, 2, \dots, n\}$. It is clear that $(G, H) \overset{W}{\sim} (\hat G, \hat H)$ implies $(G, H) \sim (\hat G, \hat H)$.
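The averaging argument in the proofs of Lemmas B.2-B.4 can be checked numerically on a toy instance. Everything below (the matrix `A`, the partition `J`, the feasible point `x`) is a hypothetical example, constructed so that $\mathcal{I} = \{\{1, 2\}\}$ and $\mathcal{J} = \{\{1, 2\}, \{3\}\}$ form a stable partition pair; the check confirms that averaging a feasible point within each class $J_q$ preserves feasibility and the objective value.

```python
import numpy as np

# A 2x3 system A x <= b: row sums over each column class agree across the
# two rows ([1+2, 1] vs. [2+1, 1]), and column sums over the row class agree
# within each column class ([3, 3] and [2]), so the partition pair is stable.
A = np.array([[1.0, 2.0, 1.0],
              [2.0, 1.0, 1.0]])
b = np.array([4.0, 4.0])            # b must be constant on each row class
c = np.array([1.0, 1.0, 3.0])       # c must be constant on each column class

x = np.array([0.5, 1.0, 0.25])      # a feasible point: A x <= b
assert np.all(A @ x <= b)

# Average x within each column class J_q (the map x -> x_bar in Lemma B.2).
J = [[0, 1], [2]]
x_bar = x.copy()
for Jq in J:
    x_bar[Jq] = x[Jq].mean()

# The averaged point stays feasible and has the same objective value.
assert np.all(A @ x_bar <= b + 1e-12)
assert abs(c @ x - c @ x_bar) < 1e-12
```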

C SEPARATION POWER OF GRAPH NEURAL NETWORKS

This section aims to prove Theorem 4.2, i.e., that the separation power of GNNs is equivalent to that of the WL test. Similar results can be found in previous literature; see, e.g., Xu et al. (2019); Azizian & Lelarge (2021); Geerts & Reutter (2022). We first introduce some lemmas that directly imply Theorem 4.2. The lemma below, similar to Xu et al. (2019, Lemma 2), states that the separation power of GNNs is at most that of the WL test. Lemma C.1. Let $(G, H), (\hat G, \hat H) \in \mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n$. If $(G, H) \sim (\hat G, \hat H)$, then for any $F^W \in \mathcal{F}^W_{\mathrm{GNN}}$, there exists a permutation $\sigma_W \in S_n$ such that $F^W(G, H) = \sigma_W(F^W(\hat G, \hat H))$. Proof of Lemma C.1. We first sketch the proof. The assumption $(G, H) \sim (\hat G, \hat H)$ implies that, if we apply the WL test to $(G, H)$ and $(\hat G, \hat H)$, the test results are exactly the same regardless of which hash functions we choose. In the first step, we define a set of hash functions that are injective on all possible inputs. Second, we show that, if we apply an arbitrary GNN $F^W \in \mathcal{F}^W_{\mathrm{GNN}}$ to $(G, H)$ and $(\hat G, \hat H)$, the vertex features of the two graphs are exactly the same up to permutation, given that the WL test results are the same. Finally, we conclude that $F^W(G, H)$ equals $F^W(\hat G, \hat H)$ up to permutation. Let us first define the hash functions. We choose $\mathrm{HASH}_{0,V}$ and $\mathrm{HASH}_{0,W}$ that are injective on the following sets (not multisets), respectively: $\{h^V_1, \dots, h^V_m, \hat h^V_1, \dots, \hat h^V_m\}$ and $\{h^W_1, \dots, h^W_n, \hat h^W_1, \dots, \hat h^W_n\}$. Let $\{C^{l-1,V}_i\}_{i=1}^m, \{C^{l-1,W}_j\}_{j=1}^n$ and $\{\hat C^{l-1,V}_i\}_{i=1}^m, \{\hat C^{l-1,W}_j\}_{j=1}^n$ be the vertex colors at the $(l-1)$-th iteration ($1 \le l \le L$) of the WL test for $(G, H)$ and $(\hat G, \hat H)$, respectively. Define two sets (not multisets) that collect the distinct colors:
$$\mathcal{C}^V_{l-1} = \{C^{l-1,V}_1, \dots, C^{l-1,V}_m, \hat C^{l-1,V}_1, \dots, \hat C^{l-1,V}_m\}, \quad \text{and} \quad \mathcal{C}^W_{l-1} = \{C^{l-1,W}_1, \dots, C^{l-1,W}_n, \hat C^{l-1,W}_1, \dots, \hat C^{l-1,W}_n\}.$$
The hash functions $\mathrm{HASH}'_{l,V}$ and $\mathrm{HASH}'_{l,W}$ are chosen such that their outputs lie in some linear spaces and that $\{\mathrm{HASH}'_{l,V}(C) : C \in \mathcal{C}^V_{l-1}\}$ and $\{\mathrm{HASH}'_{l,W}(C) : C \in \mathcal{C}^W_{l-1}\}$ are both linearly independent. Finally, we choose hash functions $\mathrm{HASH}_{l,V}$ and $\mathrm{HASH}_{l,W}$ such that $\mathrm{HASH}_{l,V}$ is injective on the set (not multiset)
$$\Big\{ \Big( C^{l-1,V}_i, \sum_{j=1}^n E_{i,j}\, \mathrm{HASH}'_{l,W}\big(C^{l-1,W}_j\big) \Big) : 1 \le i \le m \Big\} \cup \Big\{ \Big( \hat C^{l-1,V}_i, \sum_{j=1}^n \hat E_{i,j}\, \mathrm{HASH}'_{l,W}\big(\hat C^{l-1,W}_j\big) \Big) : 1 \le i \le m \Big\},$$
and that $\mathrm{HASH}_{l,W}$ is injective on the set (not multiset)
$$\Big\{ \Big( C^{l-1,W}_j, \sum_{i=1}^m E_{i,j}\, \mathrm{HASH}'_{l,V}\big(C^{l-1,V}_i\big) \Big) : 1 \le j \le n \Big\} \cup \Big\{ \Big( \hat C^{l-1,W}_j, \sum_{i=1}^m \hat E_{i,j}\, \mathrm{HASH}'_{l,V}\big(\hat C^{l-1,V}_i\big) \Big) : 1 \le j \le n \Big\}.$$
These hash functions give the vertex colors at the next iteration ($l$-th layer): $\{C^{l,V}_i\}_{i=1}^m, \{C^{l,W}_j\}_{j=1}^n$ and $\{\hat C^{l,V}_i\}_{i=1}^m, \{\hat C^{l,W}_j\}_{j=1}^n$. Consider any $F^W \in \mathcal{F}^W_{\mathrm{GNN}}$ and let $\{h^{l,V}_i\}_{i=1}^m, \{h^{l,W}_j\}_{j=1}^n$ and $\{\hat h^{l,V}_i\}_{i=1}^m, \{\hat h^{l,W}_j\}_{j=1}^n$ be the vertex features at the $l$-th layer ($0 \le l \le L$) of the graph neural network $F^W$ (the update rule refers to equations (2.1), (2.2), (2.3), (2.5)). We aim to prove by induction that for any $l \in \{0, 1, \dots, L\}$, the following hold: (i) $C^{l,V}_i = C^{l,V}_{i'}$ implies $h^{l,V}_i = h^{l,V}_{i'}$, for $1 \le i, i' \le m$; (ii) $\hat C^{l,V}_i = \hat C^{l,V}_{i'}$ implies $\hat h^{l,V}_i = \hat h^{l,V}_{i'}$, for $1 \le i, i' \le m$; (iii) $C^{l,V}_i = \hat C^{l,V}_{i'}$ implies $h^{l,V}_i = \hat h^{l,V}_{i'}$, for $1 \le i, i' \le m$; (iv) $C^{l,W}_j = C^{l,W}_{j'}$ implies $h^{l,W}_j = h^{l,W}_{j'}$, for $1 \le j, j' \le n$; (v) $\hat C^{l,W}_j = \hat C^{l,W}_{j'}$ implies $\hat h^{l,W}_j = \hat h^{l,W}_{j'}$, for $1 \le j, j' \le n$; (vi) $C^{l,W}_j = \hat C^{l,W}_{j'}$ implies $h^{l,W}_j = \hat h^{l,W}_{j'}$, for $1 \le j, j' \le n$. The above claims (i)-(vi) are clearly true for $l = 0$ due to the injectivity of $\mathrm{HASH}_{0,V}$ and $\mathrm{HASH}_{0,W}$. Now we assume that (i)-(vi) are true for some $l-1 \in \{0, 1, \dots, L-1\}$. Suppose that $C^{l,V}_i = C^{l,V}_{i'}$, i.e.,
$$\mathrm{HASH}_{l,V}\Big( C^{l-1,V}_i, \sum_{j=1}^n E_{i,j}\, \mathrm{HASH}'_{l,W}\big(C^{l-1,W}_j\big) \Big) = \mathrm{HASH}_{l,V}\Big( C^{l-1,V}_{i'}, \sum_{j=1}^n E_{i',j}\, \mathrm{HASH}'_{l,W}\big(C^{l-1,W}_j\big) \Big),$$
for some $1 \le i, i' \le m$.
It follows from the injectivity of $\mathrm{HASH}_{l,V}$ that
$$C^{l-1,V}_i = C^{l-1,V}_{i'}, \tag{C.1}$$
and $\sum_{j=1}^n E_{i,j}\, \mathrm{HASH}'_{l,W}(C^{l-1,W}_j) = \sum_{j=1}^n E_{i',j}\, \mathrm{HASH}'_{l,W}(C^{l-1,W}_j)$. By the linear independence property of $\mathrm{HASH}'_{l,W}$, the above equation implies that
$$\sum_{j:\, C^{l-1,W}_j = C} E_{i,j} = \sum_{j:\, C^{l-1,W}_j = C} E_{i',j}, \quad \forall\, C \in \mathcal{C}^W_{l-1}. \tag{C.2}$$
Note that the induction hypothesis guarantees that $h^{l-1,W}_j = h^{l-1,W}_{j'}$ whenever $C^{l-1,W}_j = C^{l-1,W}_{j'}$. So one can assign to each $C \in \mathcal{C}^W_{l-1}$ some $h(C) \in \mathbb{R}^{d_{l-1}}$ such that $h^{l-1,W}_j = h(C)$ whenever $C^{l-1,W}_j = C$, for any $1 \le j \le n$. Therefore, it follows from (C.2) that
$$\sum_{j=1}^n E_{i,j} f^W_l(h^{l-1,W}_j) = \sum_{C \in \mathcal{C}^W_{l-1}} \sum_{j:\, C^{l-1,W}_j = C} E_{i,j} f^W_l(h(C)) = \sum_{C \in \mathcal{C}^W_{l-1}} \sum_{j:\, C^{l-1,W}_j = C} E_{i',j} f^W_l(h(C)) = \sum_{j=1}^n E_{i',j} f^W_l(h^{l-1,W}_j).$$
Note also that (C.1) and the induction hypothesis lead to $h^{l-1,V}_i = h^{l-1,V}_{i'}$. Then one can conclude that
$$h^{l,V}_i = g^V_l\Big( h^{l-1,V}_i, \sum_{j=1}^n E_{i,j} f^W_l(h^{l-1,W}_j) \Big) = g^V_l\Big( h^{l-1,V}_{i'}, \sum_{j=1}^n E_{i',j} f^W_l(h^{l-1,W}_j) \Big) = h^{l,V}_{i'}.$$
This proves claim (i) for $l$. The other five claims can be proved using similar arguments. Therefore, we obtain from $(G, H) \sim (\hat G, \hat H)$ that
$$\{\{ h^{L,V}_1, h^{L,V}_2, \dots, h^{L,V}_m \}\} = \{\{ \hat h^{L,V}_1, \hat h^{L,V}_2, \dots, \hat h^{L,V}_m \}\}, \quad \text{and} \quad \{\{ h^{L,W}_1, h^{L,W}_2, \dots, h^{L,W}_n \}\} = \{\{ \hat h^{L,W}_1, \hat h^{L,W}_2, \dots, \hat h^{L,W}_n \}\}.$$
By the definition of the output layer, the above conclusion guarantees that $F^W(G, H) = \sigma_W(F^W(\hat G, \hat H))$ for some $\sigma_W \in S_n$. Lemma C.2. Let $(G, H), (\hat G, \hat H) \in \mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n$. Suppose that for any $F^W \in \mathcal{F}^W_{\mathrm{GNN}}$, there exists a permutation $\sigma_W \in S_n$ such that $F^W(G, H) = \sigma_W(F^W(\hat G, \hat H))$. Then $F(G, H) = F(\hat G, \hat H)$ holds for any $F \in \mathcal{F}_{\mathrm{GNN}}$. Proof of Lemma C.2. Pick an arbitrary $F \in \mathcal{F}_{\mathrm{GNN}}$. We choose $F^W \in \mathcal{F}^W_{\mathrm{GNN}}$ such that $F^W(G', H') = (F(G', H'), \dots, F(G', H'))^\top \in \mathbb{R}^n$ for all $(G', H') \in \mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n$. Note that every entry of the output of $F^W$ equals the output of $F$.
Thus, it follows from $F^W(G, H) = \sigma_W(F^W(\hat G, \hat H))$ that $F(G, H) = F(\hat G, \hat H)$. The next lemma is similar to Xu et al. (2019, Theorem 3) and states that the separation power of GNNs is at least that of the WL test. Lemma C.3. Let $(G, H), (\hat G, \hat H) \in \mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n$. If $F(G, H) = F(\hat G, \hat H)$ holds for any $F \in \mathcal{F}_{\mathrm{GNN}}$, then $(G, H) \sim (\hat G, \hat H)$. Proof of Lemma C.3. It suffices to prove that, if $(G, H)$ can be distinguished from $(\hat G, \hat H)$ by the WL test, then there exists $F \in \mathcal{F}_{\mathrm{GNN}}$ such that $F(G, H) \neq F(\hat G, \hat H)$. The distinguishability by the WL test implies that there exist $L \in \mathbb{N}$ and hash functions, $\mathrm{HASH}_{0,V}$, $\mathrm{HASH}_{0,W}$, $\mathrm{HASH}_{l,V}$, $\mathrm{HASH}_{l,W}$, $\mathrm{HASH}'_{l,V}$, and $\mathrm{HASH}'_{l,W}$, for $l = 1, 2, \dots, L$, such that
$$\{\{ C^{L,V}_1, C^{L,V}_2, \dots, C^{L,V}_m \}\} \neq \{\{ \hat C^{L,V}_1, \hat C^{L,V}_2, \dots, \hat C^{L,V}_m \}\}, \tag{C.3}$$
or
$$\{\{ C^{L,W}_1, C^{L,W}_2, \dots, C^{L,W}_n \}\} \neq \{\{ \hat C^{L,W}_1, \hat C^{L,W}_2, \dots, \hat C^{L,W}_n \}\}. \tag{C.4}$$
We aim to construct some GNNs such that the following hold for any $l = 0, 1, \dots, L$: (i) $h^{l,V}_i = h^{l,V}_{i'}$ implies $C^{l,V}_i = C^{l,V}_{i'}$, for $1 \le i, i' \le m$; (ii) $\hat h^{l,V}_i = \hat h^{l,V}_{i'}$ implies $\hat C^{l,V}_i = \hat C^{l,V}_{i'}$, for $1 \le i, i' \le m$; (iii) $h^{l,V}_i = \hat h^{l,V}_{i'}$ implies $C^{l,V}_i = \hat C^{l,V}_{i'}$, for $1 \le i, i' \le m$; (iv) $h^{l,W}_j = h^{l,W}_{j'}$ implies $C^{l,W}_j = C^{l,W}_{j'}$, for $1 \le j, j' \le n$; (v) $\hat h^{l,W}_j = \hat h^{l,W}_{j'}$ implies $\hat C^{l,W}_j = \hat C^{l,W}_{j'}$, for $1 \le j, j' \le n$; (vi) $h^{l,W}_j = \hat h^{l,W}_{j'}$ implies $C^{l,W}_j = \hat C^{l,W}_{j'}$, for $1 \le j, j' \le n$. The above conditions (i)-(vi) hold for $l = 0$ as long as we choose $f^V_{\mathrm{in}}$ and $f^W_{\mathrm{in}}$ that are injective on the following two sets (not multisets), respectively: $\{h^V_1, \dots, h^V_m, \hat h^V_1, \dots, \hat h^V_m\}$ and $\{h^W_1, \dots, h^W_n, \hat h^W_1, \dots, \hat h^W_n\}$. We then assume that (i)-(vi) hold for some $0 \le l-1 < L$, and show that these conditions are also satisfied at $l$ if we choose $f^V_l, f^W_l, g^V_l, g^W_l$ properly. Let us consider the set (not multiset) $\{\alpha_1, \alpha_2, \dots, \alpha_s\} \subset \mathbb{R}^{d_{l-1}}$ that collects all distinct values among $h^{l-1,W}_1, h^{l-1,W}_2, \dots$
$, h^{l-1,W}_n, \hat h^{l-1,W}_1, \hat h^{l-1,W}_2, \dots, \hat h^{l-1,W}_n$. Let $d_l \ge s$ and let $e^{d_l}_p = (0, \dots, 0, 1, 0, \dots, 0)$ be the vector in $\mathbb{R}^{d_l}$ with the $p$-th entry being 1 and all other entries being 0, for $1 \le p \le s$. Choose $f^W_l : \mathbb{R}^{d_{l-1}} \to \mathbb{R}^{d_l}$ as a continuous function satisfying $f^W_l(\alpha_p) = e^{d_l}_p$, $p = 1, 2, \dots, s$, and choose $g^V_l : \mathbb{R}^{d_{l-1}} \times \mathbb{R}^{d_l} \to \mathbb{R}^{d_l}$ that is continuous and injective when restricted to the set (not multiset)
$$\Big\{ \Big( h^{l-1,V}_i, \sum_{j=1}^n E_{i,j} f^W_l(h^{l-1,W}_j) \Big) : 1 \le i \le m \Big\} \cup \Big\{ \Big( \hat h^{l-1,V}_i, \sum_{j=1}^n \hat E_{i,j} f^W_l(\hat h^{l-1,W}_j) \Big) : 1 \le i \le m \Big\}.$$
Noticing that
$$\sum_{j=1}^n E_{i,j} f^W_l(h^{l-1,W}_j) = \sum_{p=1}^s \Big( \sum_{j:\, h^{l-1,W}_j = \alpha_p} E_{i,j} \Big) e^{d_l}_p,$$
and that $\{e^{d_l}_1, e^{d_l}_2, \dots, e^{d_l}_s\}$ is linearly independent, one can conclude that $h^{l,V}_i = h^{l,V}_{i'}$ if and only if $h^{l-1,V}_i = h^{l-1,V}_{i'}$ and $\sum_{j=1}^n E_{i,j} f^W_l(h^{l-1,W}_j) = \sum_{j=1}^n E_{i',j} f^W_l(h^{l-1,W}_j)$, where the second condition is equivalent to
$$\sum_{j:\, h^{l-1,W}_j = \alpha_p} E_{i,j} = \sum_{j:\, h^{l-1,W}_j = \alpha_p} E_{i',j}, \quad \forall\, p \in \{1, 2, \dots, s\}.$$
This, together with condition (iv) for $l-1$, implies that $\sum_{j=1}^n E_{i,j}\, \mathrm{HASH}'_{l,W}(C^{l-1,W}_j) = \sum_{j=1}^n E_{i',j}\, \mathrm{HASH}'_{l,W}(C^{l-1,W}_j)$, and hence that $C^{l,V}_i = C^{l,V}_{i'}$ by using $h^{l-1,V}_i = h^{l-1,V}_{i'}$ and condition (i) for $l-1$. Therefore, we know that (i) is satisfied at $l$, and one can show (ii) and (iii) at $l$ using similar arguments by taking $d_l$ large enough. In addition, $f^V_l$ and $g^W_l$ can also be chosen in a similar way such that (iv)-(vi) are satisfied at $l$. Combining (C.3), (C.4), and conditions (i)-(vi) at $L$, we obtain that
$$\{\{ h^{L,V}_1, h^{L,V}_2, \dots, h^{L,V}_m \}\} \neq \{\{ \hat h^{L,V}_1, \hat h^{L,V}_2, \dots, \hat h^{L,V}_m \}\}, \tag{C.5}$$
or
$$\{\{ h^{L,W}_1, h^{L,W}_2, \dots, h^{L,W}_n \}\} \neq \{\{ \hat h^{L,W}_1, \hat h^{L,W}_2, \dots, \hat h^{L,W}_n \}\}.$$
Without loss of generality, we can assume that (C.5) holds. Consider the set (not multiset) $\{\beta_1, \beta_2, \dots, \beta_t\} \subset \mathbb{R}^{d_L}$ that collects all distinct values among $h^{L,V}_1, h^{L,V}_2, \dots, h^{L,V}_m, \hat h^{L,V}_1, \hat h^{L,V}_2, \dots, \hat h^{L,V}_m$.
Let $k > 1$ be a positive integer greater than the maximal multiplicity of an element in the multisets $\{\{h^{L,V}_1, h^{L,V}_2, \dots, h^{L,V}_m\}\}$ and $\{\{\hat h^{L,V}_1, \hat h^{L,V}_2, \dots, \hat h^{L,V}_m\}\}$. There exists a continuous function $\varphi : \mathbb{R}^{d_L} \to \mathbb{R}$ such that $\varphi(\beta_q) = k^q$ for $q = 1, 2, \dots, t$, and due to (C.5) and the fact that the base-$k$ expression of an integer is unique, it holds that $\sum_{i=1}^m \varphi(h^{L,V}_i) \neq \sum_{i=1}^m \varphi(\hat h^{L,V}_i)$. Set the dimension of the $(L+1)$-th layer as $d_{L+1} = 1$, and set $f^V_{L+1} = 0$, $f^W_{L+1} = 0$, $g^V_{L+1}(h, 0) = \varphi(h)$, and $g^W_{L+1} = 0$. Then we have $h^{L+1,V}_i = \varphi(h^{L,V}_i)$, $\hat h^{L+1,V}_i = \varphi(\hat h^{L,V}_i)$, and
$$f_{\mathrm{out}}\Big( \sum_{i=1}^m h^{L+1,V}_i, \sum_{j=1}^n h^{L+1,W}_j \Big) = \sum_{i=1}^m \varphi(h^{L,V}_i) \neq \sum_{i=1}^m \varphi(\hat h^{L,V}_i) = f_{\mathrm{out}}\Big( \sum_{i=1}^m \hat h^{L+1,V}_i, \sum_{j=1}^n \hat h^{L+1,W}_j \Big),$$
which guarantees the existence of $F \in \mathcal{F}_{\mathrm{GNN}}$ that has $L+1$ layers and satisfies $F(G, H) \neq F(\hat G, \hat H)$. Corollary C.4. For any $(G, H), (\hat G, \hat H) \in \mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n$, the following are equivalent: (i) $(G, H) \overset{W}{\sim} (\hat G, \hat H)$. (ii) For any $F^W \in \mathcal{F}^W_{\mathrm{GNN}}$, it holds that $F^W(G, H) = F^W(\hat G, \hat H)$. Proof of Corollary C.4. The proof follows similar lines as the proof of Theorem 4.2, with the difference that there is no permutation on $\{w_1, w_2, \dots, w_n\}$. In addition to the separation power of GNNs for two weighted bipartite graphs with vertex features, one can also obtain results on separating different vertices within one weighted bipartite graph with vertex features. Corollary C.5. For any weighted bipartite graph with vertex features $(G, H)$ and any $j, j' \in \{1, 2, \dots, n\}$, the following are equivalent: (i) $C^{l,W}_j = C^{l,W}_{j'}$ holds for any $l \in \mathbb{N}$ and any choice of hash functions. (ii) $F^W(G, H)_j = F^W(G, H)_{j'}$ for all $F^W \in \mathcal{F}^W_{\mathrm{GNN}}$. Proof of Corollary C.5. "(i) $\Rightarrow$ (ii)" and "(ii) $\Rightarrow$ (i)" can be proved using arguments similar to those in the proofs of Lemma C.1 and Lemma C.3, respectively.
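The base-$k$ encoding used at the end of the proof of Lemma C.3 can be tested directly. The helper below is a hypothetical stand-in for $\varphi$ restricted to the finite set $\{\beta_1, \dots, \beta_t\}$: it is injective on multisets whose maximal multiplicity is below $k$, because the base-$k$ digits of an integer are unique.

```python
def multiset_code(values, universe, k):
    """Encode a multiset as the sum of k**q over its elements (phi(beta_q) = k**q)."""
    idx = {v: q + 1 for q, v in enumerate(universe)}  # beta_q -> its index q
    return sum(k ** idx[v] for v in values)

universe = ["a", "b", "c"]
k = 5  # strictly larger than any multiplicity appearing below

# Different multisets over the universe receive different codes...
assert multiset_code(["a", "a", "b"], universe, k) != multiset_code(["a", "b", "b"], universe, k)
# ...while the code is invariant under reordering, as a multiset invariant must be.
assert multiset_code(["a", "a", "b"], universe, k) == multiset_code(["b", "a", "a"], universe, k)
```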

D UNIVERSAL APPROXIMATION OF F GNN

This section provides the proof of Theorem 4.3. The main mathematical tool used in the proof is the Stone-Weierstrass theorem: Theorem D.1 (Stone-Weierstrass theorem (Rudin, 1991, Section 5.7)). Let $X$ be a compact Hausdorff space and let $\mathcal{F} \subset C(X, \mathbb{R})$ be a subalgebra. If $\mathcal{F}$ separates points of $X$, i.e., for any $x, x' \in X$ with $x \neq x'$ there exists $F \in \mathcal{F}$ such that $F(x) \neq F(x')$, and $1 \in \mathcal{F}$, then $\mathcal{F}$ is dense in $C(X, \mathbb{R})$ with the topology of uniform convergence. Proof of Theorem 4.3. Let $\pi : X \to X/\!\sim$ be the quotient map, where $\pi(X) = X/\!\sim$ is equipped with the quotient topology. For any $F \in \mathcal{F}_{\mathrm{GNN}}$, since $F : X \to \mathbb{R}$ is continuous and, by Theorem 4.2, $F(G, H) = F(\hat G, \hat H)$ for all $(G, H) \sim (\hat G, \hat H)$, there exists a unique continuous $\bar F : \pi(X) \to \mathbb{R}$ such that $F = \bar F \circ \pi$. Set $\bar{\mathcal{F}}_{\mathrm{GNN}} = \{\bar F : F \in \mathcal{F}_{\mathrm{GNN}}\} \subset C(\pi(X), \mathbb{R})$. In addition, the assumption on $\Phi$, i.e., $\Phi(G, H) = \Phi(\hat G, \hat H)$ for all $(G, H) \sim (\hat G, \hat H)$, leads to the existence of a unique $\bar\Phi \in C(\pi(X), \mathbb{R})$ with $\Phi = \bar\Phi \circ \pi$. Since $X$ is compact, $\pi(X)$ is also compact by the continuity of $\pi$. According to Lemma D.2 below, $\bar{\mathcal{F}}_{\mathrm{GNN}}$ is a subalgebra of $C(\pi(X), \mathbb{R})$. By Theorem 4.2, $\bar{\mathcal{F}}_{\mathrm{GNN}}$ separates points of $\pi(X)$. This further implies that $\pi(X)$ is Hausdorff. In fact, for any distinct $x, x' \in \pi(X)$, there exists $\bar F \in \bar{\mathcal{F}}_{\mathrm{GNN}}$ with $\bar F(x) \neq \bar F(x')$. Without loss of generality, we assume that $\bar F(x) < \bar F(x')$ and choose some $c \in \mathbb{R}$ with $\bar F(x) < c < \bar F(x')$. By continuity of $\bar F$, we know that $\bar F^{-1}((-\infty, c)) \cap \pi(X)$ and $\bar F^{-1}((c, +\infty)) \cap \pi(X)$ are disjoint open subsets of $\pi(X)$ with $x \in \bar F^{-1}((-\infty, c)) \cap \pi(X)$ and $x' \in \bar F^{-1}((c, +\infty)) \cap \pi(X)$, which yields the Hausdorff property of $\pi(X)$. Note also that $1 \in \bar{\mathcal{F}}_{\mathrm{GNN}}$. Using Theorem D.1, we conclude that $\bar{\mathcal{F}}_{\mathrm{GNN}}$ is dense in $C(\pi(X), \mathbb{R})$. Therefore, for any $\epsilon > 0$, there exists $F \in \mathcal{F}_{\mathrm{GNN}}$ such that
$$\sup_{(G,H) \in X} |\Phi(G, H) - F(G, H)| = \sup_{x \in \pi(X)} |\bar\Phi(x) - \bar F(x)| < \epsilon,$$
which completes the proof. Lemma D.2. $\mathcal{F}_{\mathrm{GNN}}$ is a subalgebra of $C(\mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n, \mathbb{R})$, and as a corollary, $\bar{\mathcal{F}}_{\mathrm{GNN}}$ is a subalgebra of $C(\mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n/\!\sim, \mathbb{R})$.
Proof of Lemma D.2. It suffices to show that $\mathcal{F}_{\mathrm{GNN}}$ is closed under addition and multiplication. Consider any $F, \hat F \in \mathcal{F}_{\mathrm{GNN}}$. Thanks to Lemma D.3, we can assume that both $F$ and $\hat F$ have $L$ layers. Suppose that $F$ is constructed from $f^V_{\mathrm{in}} : \mathcal{H}^V \to \mathbb{R}^{d_0}$, $f^W_{\mathrm{in}} : \mathcal{H}^W \to \mathbb{R}^{d_0}$, $f^V_l, f^W_l : \mathbb{R}^{d_{l-1}} \to \mathbb{R}^{d_l}$, $g^V_l, g^W_l : \mathbb{R}^{d_{l-1}} \times \mathbb{R}^{d_l} \to \mathbb{R}^{d_l}$, $1 \le l \le L$, $f_{\mathrm{out}} : \mathbb{R}^{d_L} \times \mathbb{R}^{d_L} \to \mathbb{R}$, and that $\hat F$ is constructed from $\hat f^V_{\mathrm{in}} : \mathcal{H}^V \to \mathbb{R}^{\hat d_0}$, $\hat f^W_{\mathrm{in}} : \mathcal{H}^W \to \mathbb{R}^{\hat d_0}$, $\hat f^V_l, \hat f^W_l : \mathbb{R}^{\hat d_{l-1}} \to \mathbb{R}^{\hat d_l}$, $\hat g^V_l, \hat g^W_l : \mathbb{R}^{\hat d_{l-1}} \times \mathbb{R}^{\hat d_l} \to \mathbb{R}^{\hat d_l}$, $1 \le l \le L$, $\hat f_{\mathrm{out}} : \mathbb{R}^{\hat d_L} \times \mathbb{R}^{\hat d_L} \to \mathbb{R}$. One can then construct two new GNNs computing $F + \hat F$ and $F \cdot \hat F$ as follows. The input layer. The update rule of the input layer is defined by
$$\tilde f^V_{\mathrm{in}} : \mathcal{H}^V \to \mathbb{R}^{d_0} \times \mathbb{R}^{\hat d_0}, \quad h \mapsto \big( f^V_{\mathrm{in}}(h), \hat f^V_{\mathrm{in}}(h) \big), \qquad \tilde f^W_{\mathrm{in}} : \mathcal{H}^W \to \mathbb{R}^{d_0} \times \mathbb{R}^{\hat d_0}, \quad h \mapsto \big( f^W_{\mathrm{in}}(h), \hat f^W_{\mathrm{in}}(h) \big).$$
Then the vertex features after the computation of the input layer, $\tilde h^{0,V}_i$ and $\tilde h^{0,W}_j$, are given by
$$\tilde h^{0,V}_i = \tilde f^V_{\mathrm{in}}(h^V_i) = \big( f^V_{\mathrm{in}}(h^V_i), \hat f^V_{\mathrm{in}}(h^V_i) \big) = (h^{0,V}_i, \hat h^{0,V}_i) \in \mathbb{R}^{d_0} \times \mathbb{R}^{\hat d_0},$$
$$\tilde h^{0,W}_j = \tilde f^W_{\mathrm{in}}(h^W_j) = \big( f^W_{\mathrm{in}}(h^W_j), \hat f^W_{\mathrm{in}}(h^W_j) \big) = (h^{0,W}_j, \hat h^{0,W}_j) \in \mathbb{R}^{d_0} \times \mathbb{R}^{\hat d_0},$$
for $i = 1, 2, \dots, m$ and $j = 1, 2, \dots, n$. The $l$-th layer ($1 \le l \le L$). We set
$$\tilde f^V_l : \mathbb{R}^{d_{l-1}} \times \mathbb{R}^{\hat d_{l-1}} \to \mathbb{R}^{d_l} \times \mathbb{R}^{\hat d_l}, \quad (h, \hat h) \mapsto \big( f^V_l(h), \hat f^V_l(\hat h) \big),$$
$$\tilde f^W_l : \mathbb{R}^{d_{l-1}} \times \mathbb{R}^{\hat d_{l-1}} \to \mathbb{R}^{d_l} \times \mathbb{R}^{\hat d_l}, \quad (h, \hat h) \mapsto \big( f^W_l(h), \hat f^W_l(\hat h) \big),$$
$$\tilde g^V_l : \mathbb{R}^{d_{l-1}} \times \mathbb{R}^{\hat d_{l-1}} \times \mathbb{R}^{d_l} \times \mathbb{R}^{\hat d_l} \to \mathbb{R}^{d_l} \times \mathbb{R}^{\hat d_l}, \quad (h, \hat h, h', \hat h') \mapsto \big( g^V_l(h, h'), \hat g^V_l(\hat h, \hat h') \big),$$
and
$$\tilde g^W_l : \mathbb{R}^{d_{l-1}} \times \mathbb{R}^{\hat d_{l-1}} \times \mathbb{R}^{d_l} \times \mathbb{R}^{\hat d_l} \to \mathbb{R}^{d_l} \times \mathbb{R}^{\hat d_l}, \quad (h, \hat h, h', \hat h') \mapsto \big( g^W_l(h, h'), \hat g^W_l(\hat h, \hat h') \big).$$
Then the vertex features after the computation of the $l$-th layer, $\tilde h^{l,V}_i$ and $\tilde h^{l,W}_j$, are given by
$$\tilde h^{l,V}_i = \tilde g^V_l\Big( \tilde h^{l-1,V}_i, \sum_{j=1}^n E_{i,j} \tilde f^W_l(\tilde h^{l-1,W}_j) \Big) = \Big( g^V_l\Big( h^{l-1,V}_i, \sum_{j=1}^n E_{i,j} f^W_l(h^{l-1,W}_j) \Big),\ \hat g^V_l\Big( \hat h^{l-1,V}_i, \sum_{j=1}^n E_{i,j} \hat f^W_l(\hat h^{l-1,W}_j) \Big) \Big) = (h^{l,V}_i, \hat h^{l,V}_i) \in \mathbb{R}^{d_l} \times \mathbb{R}^{\hat d_l},$$
$$\tilde h^{l,W}_j = \tilde g^W_l\Big( \tilde h^{l-1,W}_j, \sum_{i=1}^m E_{i,j} \tilde f^V_l(\tilde h^{l-1,V}_i) \Big) = \Big( g^W_l\Big( h^{l-1,W}_j, \sum_{i=1}^m E_{i,j} f^V_l(h^{l-1,V}_i) \Big),\ \hat g^W_l\Big( \hat h^{l-1,W}_j, \sum_{i=1}^m E_{i,j} \hat f^V_l(\hat h^{l-1,V}_i) \Big) \Big) = (h^{l,W}_j, \hat h^{l,W}_j) \in \mathbb{R}^{d_l} \times \mathbb{R}^{\hat d_l},$$
for $i = 1, 2, \dots, m$ and $j = 1, 2, \dots, n$. The output layer. To obtain $F + \hat F$, we set
$$f^{\mathrm{add}}_{\mathrm{out}} : \mathbb{R}^{d_L} \times \mathbb{R}^{\hat d_L} \times \mathbb{R}^{d_L} \times \mathbb{R}^{\hat d_L} \to \mathbb{R}, \quad \big( (h, \hat h), (h', \hat h') \big) \mapsto f_{\mathrm{out}}(h, h') + \hat f_{\mathrm{out}}(\hat h, \hat h').$$
Then it holds that
$$f^{\mathrm{add}}_{\mathrm{out}}\Big( \sum_{i=1}^m \tilde h^{L,V}_i, \sum_{j=1}^n \tilde h^{L,W}_j \Big) = f_{\mathrm{out}}\Big( \sum_{i=1}^m h^{L,V}_i, \sum_{j=1}^n h^{L,W}_j \Big) + \hat f_{\mathrm{out}}\Big( \sum_{i=1}^m \hat h^{L,V}_i, \sum_{j=1}^n \hat h^{L,W}_j \Big) = F(G, H) + \hat F(G, H).$$
To obtain $F \cdot \hat F$, we set
$$f^{\mathrm{multiply}}_{\mathrm{out}} : \mathbb{R}^{d_L} \times \mathbb{R}^{\hat d_L} \times \mathbb{R}^{d_L} \times \mathbb{R}^{\hat d_L} \to \mathbb{R}, \quad \big( (h, \hat h), (h', \hat h') \big) \mapsto f_{\mathrm{out}}(h, h') \cdot \hat f_{\mathrm{out}}(\hat h, \hat h').$$
Then it holds that
$$f^{\mathrm{multiply}}_{\mathrm{out}}\Big( \sum_{i=1}^m \tilde h^{L,V}_i, \sum_{j=1}^n \tilde h^{L,W}_j \Big) = f_{\mathrm{out}}\Big( \sum_{i=1}^m h^{L,V}_i, \sum_{j=1}^n h^{L,W}_j \Big) \cdot \hat f_{\mathrm{out}}\Big( \sum_{i=1}^m \hat h^{L,V}_i, \sum_{j=1}^n \hat h^{L,W}_j \Big) = F(G, H) \cdot \hat F(G, H).$$
The constructed GNNs certify $F + \hat F \in \mathcal{F}_{\mathrm{GNN}}$ and $F \cdot \hat F \in \mathcal{F}_{\mathrm{GNN}}$, which finishes the proof. Lemma D.3. If $F \in \mathcal{F}_{\mathrm{GNN}}$ has $L$ layers, then there exists $\tilde F \in \mathcal{F}_{\mathrm{GNN}}$ with $L+1$ layers such that $\tilde F = F$. Proof of Lemma D.3. Suppose that $F$ is constructed from $f^V_{\mathrm{in}}, f^W_{\mathrm{in}}, f_{\mathrm{out}}, \{f^V_l, f^W_l, g^V_l, g^W_l\}_{l=0}^L$. We choose $f^V_{L+1} \equiv 0$, $f^W_{L+1} \equiv 0$, $g^V_{L+1}(h, h') = h$, $g^W_{L+1}(h, h') = h$. Let $\tilde F$ be constructed from $f^V_{\mathrm{in}}, f^W_{\mathrm{in}}, f_{\mathrm{out}}, \{f^V_l, f^W_l, g^V_l, g^W_l\}_{l=0}^{L+1}$. Then $\tilde F$ has $L+1$ layers and $\tilde F = F$.

E UNIVERSAL APPROXIMATION OF $\mathcal{F}^W_{\mathrm{GNN}}$

This section provides a universal approximation result for $\mathcal{F}^W_{\mathrm{GNN}}$. Theorem E.1. Let $X \subset \mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n$ be a compact subset that is closed under the action of $S_m \times S_n$. Suppose that $\Phi \in C(X, \mathbb{R}^n)$ satisfies the following: (i) For any $\sigma_V \in S_m$, $\sigma_W \in S_n$, and $(G, H) \in X$, it holds that
$$\Phi\big( (\sigma_V, \sigma_W) * (G, H) \big) = \sigma_W(\Phi(G, H)). \tag{E.1}$$
(ii) $\Phi(G, H) = \Phi(\hat G, \hat H)$ holds for all $(G, H), (\hat G, \hat H) \in X$ with $(G, H) \overset{W}{\sim} (\hat G, \hat H)$. (iii) Given any $(G, H) \in X$ and any $j, j' \in \{1, 2, \dots, n\}$, if $C^{l,W}_j = C^{l,W}_{j'}$ (the vertex colors obtained at the $l$-th iteration of the WL test) holds for any $l \in \mathbb{N}$ and any choice of hash functions, then $\Phi(G, H)_j = \Phi(G, H)_{j'}$. Then for any $\epsilon > 0$, there exists $F \in \mathcal{F}^W_{\mathrm{GNN}}$ such that $\sup_{(G,H) \in X} \|\Phi(G, H) - F(G, H)\| < \epsilon$. Theorem E.1 is an LP-graph version of results on the closure of the equivariant GNN class in Azizian & Lelarge (2021); Geerts & Reutter (2022). The main tool in the proof of Theorem E.1 is the following generalized Stone-Weierstrass theorem for equivariant functions, established in Azizian & Lelarge (2021). Theorem E.2 (Generalized Stone-Weierstrass theorem (Azizian & Lelarge, 2021, Theorem 22)). Let $X$ be a compact topological space and let $\mathcal{G}$ be a finite group that acts continuously on $X$ and on $\mathbb{R}^n$. Define the collection of all equivariant continuous functions from $X$ to $\mathbb{R}^n$ as
$$C_E(X, \mathbb{R}^n) = \{F \in C(X, \mathbb{R}^n) : F(g * x) = g * F(x),\ \forall\, x \in X,\ g \in \mathcal{G}\}.$$
Consider any $\mathcal{F} \subset C_E(X, \mathbb{R}^n)$ and any $\Phi \in C_E(X, \mathbb{R}^n)$. Suppose the following conditions hold: (i) $\mathcal{F}$ is a subalgebra of $C(X, \mathbb{R}^n)$ and $\mathbf{1} \in \mathcal{F}$. (ii) For any $x, x' \in X$, if $f(x) = f(x')$ holds for any $f \in C(X, \mathbb{R})$ with $f\mathbf{1} \in \mathcal{F}$, then for any $F \in \mathcal{F}$, there exists $g \in \mathcal{G}$ such that $F(x) = g * F(x')$. (iii) For any $x, x' \in X$, if $F(x) = F(x')$ holds for any $F \in \mathcal{F}$, then $\Phi(x) = \Phi(x')$. (iv) For any $x \in X$, it holds that $\Phi(x)_j = \Phi(x)_{j'}$ for all $(j, j') \in J(x)$, where $J(x) = \{(j, j') \in \{1, 2, \dots, n\}^2 : F(x)_j = F(x)_{j'},\ \forall\, F \in \mathcal{F}\}$.
Then for any $\epsilon > 0$, there exists $F \in \mathcal{F}$ such that $\sup_{x \in X} \|\Phi(x) - F(x)\| < \epsilon$. We refer to Timofte (2005) for different versions of the Stone-Weierstrass theorem, which is also used in Azizian & Lelarge (2021). In the proof of Theorem E.1, we also need the following lemma, whose proof is almost the same as that of Lemma D.2 and is hence omitted. Lemma E.3. $\mathcal{F}^W_{\mathrm{GNN}}$ is a subalgebra of $C(\mathcal{G}_{m,n} \times \mathcal{H}^V_m \times \mathcal{H}^W_n, \mathbb{R}^n)$. Proof of Theorem E.1. Let $S_m \times S_n$ act on $\mathbb{R}^n$ via $(\sigma_V, \sigma_W) * y = \sigma_W(y)$ for all $\sigma_V \in S_m$, $\sigma_W \in S_n$, $y \in \mathbb{R}^n$. Then it follows from (E.1) that $\Phi$ is equivariant. In addition, the definition of graph neural networks directly guarantees the equivariance of functions in $\mathcal{F}^W_{\mathrm{GNN}}$. Therefore, one only needs to verify the conditions of Theorem E.2 with $\mathcal{F}$ taken as $\mathcal{F}^W_{\mathrm{GNN}}$ and $\mathcal{G}$ as $S_n$:
• Condition (i) in Theorem E.2 follows from Lemma E.3 and the definition of $\mathcal{F}^W_{\mathrm{GNN}}$.
• Condition (ii) in Theorem E.2 follows from Theorem 4.2 and $\mathcal{F}_{\mathrm{GNN}}\mathbf{1} \subset \mathcal{F}^W_{\mathrm{GNN}}$.
• Condition (iii) in Theorem E.2 follows from Corollary C.4 and Condition (ii) in Theorem E.1.
• Condition (iv) in Theorem E.2 follows from Corollary C.5 and Condition (iii) in Theorem E.1.
This finishes the proof.
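The equivariance property (E.1) can be illustrated with a toy message-passing layer. The function `gnn_layer` below is a hypothetical stand-in, with `tanh` replacing the learnable maps, rather than the architecture defined in the paper; the check confirms that permuting the $W$-side inputs permutes the per-node outputs in the same way.

```python
import numpy as np

def gnn_layer(E, hV, hW):
    """One toy message-passing step: each w_j aggregates E-weighted sums of
    transformed V-features and combines them with its own feature."""
    return np.tanh(hW + E.T @ np.tanh(hV))

rng = np.random.default_rng(1)
m, n = 3, 4
E = rng.standard_normal((m, n))
hV = rng.standard_normal((m, 2))
hW = rng.standard_normal((n, 2))

out = gnn_layer(E, hV, hW)

# Permute the W-side of the input: columns of E and rows of hW move together,
# and the per-node output moves with them, which is exactly (E.1) for sigma_W.
perm = rng.permutation(n)
out_perm = gnn_layer(E[:, perm], hV, hW[perm])
assert np.allclose(out_perm, out[perm])
```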

F PROOF OF MAIN THEOREMS

We collect the proofs of the main theorems stated in Section 3 in this section.

Proof of Theorem 3.1. If F(G, H) = F(Ĝ, Ĥ) holds for any F ∈ F_GNN, then Theorem 4.2 guarantees that (G, H) ∼ (Ĝ, Ĥ). Thus, Conditions (i), (ii), and (iii) follow directly from Lemmas B.2, B.3, and B.4, respectively. Furthermore, if F^W(G, H) = F^W(Ĝ, Ĥ), ∀ F^W ∈ F^W_GNN, then it follows from Corollary C.4 and Corollary B.7 that the two LP problems associated with (G, H) and (Ĝ, Ĥ) share the same optimal solution with the smallest ℓ2-norm.

We now turn to the three main approximation theorems, namely Theorems 3.2, 3.4, and 3.6, which state that GNNs can approximate the feasibility mapping Φ_feas, the optimal objective value mapping Φ_obj, and the optimal solution mapping Φ_solu, respectively, with arbitrarily small error. We have established in Sections D and E several theorems on approximating continuous mappings by graph neural networks. Therefore, the proofs of Theorems 3.2, 3.4, and 3.6 basically consist of two steps:

(i) Show that the mappings Φ_feas, Φ_obj, and Φ_solu are measurable.

(ii) Approximate the target measurable mappings by continuous mappings, and then apply the universal approximation results established in Sections D and E.

Let us first prove the measurability of Φ_feas, Φ_obj, and Φ_solu in the following three lemmas.

Lemma F.1. The feasibility mapping Φ_feas defined in (2.7) is measurable, i.e., the preimages Φ_feas^{-1}(0) and Φ_feas^{-1}(1) are both measurable subsets of G_{m,n} × H^V_m × H^W_n.

Proof of Lemma F.1. It suffices to prove that for any ∘ ∈ {≤, =, ≥}^m and any N_l, N_u ⊂ {1, 2, . . . , n}, the set

X_feas := {(A, b, l, u) ∈ R^{m×n} × R^m × R^{|N_l|} × R^{|N_u|} : ∃ x ∈ R^n, s.t. Ax ∘ b, x_j ≥ l_j, ∀ j ∈ N_l, x_j ≤ u_j, ∀ j ∈ N_u}

is a measurable subset of R^{m×n} × R^m × R^{|N_l|} × R^{|N_u|}. Without loss of generality, we assume that ∘ = (≤, . . . , ≤, =, . . . , =, ≥, . . . , ≥), where "≤", "=", and "≥" appear k_1, k_2 − k_1, and m − k_2 times, respectively, with 0 ≤ k_1 ≤ k_2 ≤ m. Let us define a function V_feas : R^{m×n} × R^m × R^{|N_l|} × R^{|N_u|} × R^n → R_{≥0} that measures to what extent a point in R^n violates the constraints:

V_feas(A, b, l, u, x) = Σ_{i=1}^{k_1} (a_i^⊤ x − b_i)_+ + Σ_{i=k_1+1}^{k_2} |a_i^⊤ x − b_i| + Σ_{i=k_2+1}^{m} (b_i − a_i^⊤ x)_+ + Σ_{j∈N_l} (l_j − x_j)_+ + Σ_{j∈N_u} (x_j − u_j)_+,

where y_+ = max{y, 0} and a_i^⊤ is the i-th row of A. It can be seen that V_feas is continuous and that V_feas(A, b, l, u, x) = 0 if and only if Ax ∘ b, x_j ≥ l_j, ∀ j ∈ N_l, and x_j ≤ u_j, ∀ j ∈ N_u. Therefore, for any (A, b, l, u) ∈ R^{m×n} × R^m × R^{|N_l|} × R^{|N_u|}, the following statements are equivalent:

• (A, b, l, u) ∈ X_feas.
• There exist R ∈ N_+ and x ∈ B_R := {x′ ∈ R^n : ∥x′∥ ≤ R} such that V_feas(A, b, l, u, x) = 0.
• There exists R ∈ N_+ such that for any r ∈ N_+, V_feas(A, b, l, u, x) ≤ 1/r holds for some x ∈ B_R ∩ Q^n.

This implies that X_feas can be described via

X_feas = ⋃_{R∈N_+} ⋂_{r∈N_+} ⋃_{x∈B_R∩Q^n} {(A, b, l, u) ∈ R^{m×n} × R^m × R^{|N_l|} × R^{|N_u|} : V_feas(A, b, l, u, x) ≤ 1/r}.

Since Q^n is countable and V_feas is continuous, we immediately obtain from the above expression that X_feas is measurable.

Lemma F.2. The optimal objective value mapping Φ_obj defined in (2.8) is measurable.

Proof of Lemma F.2. It suffices to prove that for any ∘ ∈ {≤, =, ≥}^m, any N_l, N_u ⊂ {1, 2, . . . , n}, and any ϕ ∈ R, the set

X_obj := {(A, b, c, l, u) ∈ R^{m×n} × R^m × R^n × R^{|N_l|} × R^{|N_u|} : ∃ x ∈ R^n, s.t. c^⊤ x ≤ ϕ, Ax ∘ b, x_j ≥ l_j, ∀ j ∈ N_l, x_j ≤ u_j, ∀ j ∈ N_u}

is a measurable subset of R^{m×n} × R^m × R^n × R^{|N_l|} × R^{|N_u|}. The proof then follows the same lines as the proof of Lemma F.1, with a different violation function V_obj(A, b, c, l, u, x) = max{(c^⊤ x − ϕ)_+, V_feas(A, b, l, u, x)}.

Lemma F.3. The optimal solution mapping Φ_solu defined in (2.9) is measurable.

Proof of Lemma F.3. It suffices to show that for every j_0 ∈ {1, 2, . . . , n}, the mapping π_{j_0} ∘ Φ_solu : Φ_obj^{-1}(R) → R is measurable, where π_{j_0} : R^n → R maps a vector x ∈ R^n to its j_0-th component. Similarly to before, one can consider any ∘ ∈ {≤, =, ≥}^m, any N_l, N_u ⊂ {1, 2, . . . , n}, and any ϕ ∈ R, and prove that the set

X_solu := {(A, b, c, l, u) ∈ R^{m×n} × R^m × R^n × R^{|N_l|} × R^{|N_u|} : the LP problem min_{x∈R^n} c^⊤ x, s.t. Ax ∘ b, x_j ≥ l_j, ∀ j ∈ N_l, x_j ≤ u_j, ∀ j ∈ N_u, has a finite optimal objective value, and its optimal solution with the smallest ℓ2-norm, x_opt, satisfies (x_opt)_{j_0} < ϕ}

is measurable. Note that we have fixed ∘ ∈ {≤, =, ≥}^m and N_l, N_u ⊂ {1, 2, . . . , n}.
Let ι : R^{m×n} × R^m × R^n × R^{|N_l|} × R^{|N_u|} → G_{m,n} × H^V_m × H^W_n be the embedding map. Define another violation function V_solu : (Φ_obj ∘ ι)^{-1}(R) × R^n → R via

V_solu(A, b, c, l, u, x) = max{(c^⊤ x − Φ_obj(ι(A, b, c, l, u)))_+, V_feas(A, b, l, u, x)},

which is measurable with respect to (A, b, c, l, u) for any fixed x ∈ R^n, due to the measurability of Φ_obj and the continuity of V_feas. Moreover, V_solu is continuous with respect to x. Therefore, the following statements are equivalent for (A, b, c, l, u) ∈ (Φ_obj ∘ ι)^{-1}(R):

• (A, b, c, l, u) ∈ X_solu.
• There exists x ∈ R^n with x_{j_0} < ϕ, such that V_solu(A, b, c, l, u, x) = 0 and V_solu(A, b, c, l, u, x′) > 0, ∀ x′ ∈ B_{∥x∥} with x′_{j_0} ≥ ϕ.
• There exist R ∈ Q_+, r ∈ N_+, and x ∈ B_R with x_{j_0} ≤ ϕ − 1/r, such that V_solu(A, b, c, l, u, x) = 0 and V_solu(A, b, c, l, u, x′) > 0, ∀ x′ ∈ B_R with x′_{j_0} ≥ ϕ.
• There exist R ∈ Q_+ and r ∈ N_+ such that for all r′ ∈ N_+, ∃ x ∈ B_R ∩ Q^n with x_{j_0} ≤ ϕ − 1/r, s.t. V_solu(A, b, c, l, u, x) < 1/r′, and ∃ r′′ ∈ N_+, s.t. V_solu(A, b, c, l, u, x′) ≥ 1/r′′, ∀ x′ ∈ B_R ∩ Q^n with x′_{j_0} ≥ ϕ.
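To make the violation-function device used in these measurability proofs concrete, here is a minimal Python sketch. It assumes a dense matrix A and finite bounds on every variable; the function name and signature are illustrative, not from the paper, and the paper's exact formula may weight the terms differently. The key property is the one the proofs rely on: the value is zero if and only if x is feasible.

```python
def v_feas(A, b, circ, l, u, x):
    """Total constraint violation of x for the LP data (A, b, circ, l, u).

    A    : dense matrix as a list of rows
    circ : list of constraint senses, each "<=", "==", or ">="
    l, u : finite lower/upper bounds on every variable (a simplifying assumption)
    Returns 0.0 exactly when x satisfies all constraints.
    """
    pos = lambda y: max(y, 0.0)  # y_+ = max{y, 0}
    total = 0.0
    for row, bi, op in zip(A, b, circ):
        ax = sum(a * xj for a, xj in zip(row, x))
        # one nonnegative penalty term per constraint sense
        total += {"<=": pos(ax - bi), "==": abs(ax - bi), ">=": pos(bi - ax)}[op]
    # bound violations
    total += sum(pos(lj - xj) + pos(xj - uj) for lj, uj, xj in zip(l, u, x))
    return total
```

For example, with A = [[1, 0], [0, 1]], b = [1, 1], both senses "<=", and bounds [0, 2] on each variable, the point (0.5, 0.5) is feasible (violation 0), while (2.5, 0.5) violates the first constraint by 1.5 and its upper bound by 0.5.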



Note that the update rules in (2.2) and (2.3) follow a message-passing scheme, where each vertex only collects information from its neighbors. Since E_{i,j} = 0 if there is no edge between vertices v_i and w_j, the sum operator in (2.2) can be rewritten as Σ_{j∈N(v_i)}, where N(v_i) denotes the set of neighbors of vertex v_i. In Algorithm 1, multisets, denoted by {{·}}, are collections of elements in which the same element may appear multiple times. The hash functions {HASH_{l,V}, HASH_{l,W}}_{l=0}^L injectively map vertex information to vertex colors, while the others, {HASH′_{l,V}, HASH′_{l,W}}_{l=1}^L, injectively map vertex colors to a linear space so that sums and scalar multiplications can be defined on their outputs. In addition, this algorithm is usually called the 1-WL test in the literature, since it only considers the distance-1 neighborhood of each vertex. In this paper, we abbreviate Algorithm 1, or the 1-WL test, as the WL test for simplicity.
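The color-refinement idea behind the WL test can be sketched in a few lines of Python. This is a generic 1-WL refinement on a weighted bipartite graph, not the paper's Algorithm 1 verbatim: the function name, the choice of iterating the two sides alternately, and the use of an integer relabeling table as the injective "hash" are all assumptions made for the sketch.

```python
from collections import defaultdict

def wl_refine(edges, hV, hW, iters=3):
    """1-WL color refinement on a weighted bipartite graph (sketch).

    edges : dict (i, j) -> nonzero weight E_ij (only actual edges)
    hV/hW : hashable initial features of the v- and w-side vertices
    Returns integer colors (cV, cW); two vertices end with the same color
    only if this refinement cannot distinguish them.
    """
    table = {}
    def relabel(colors):
        # injective map from distinct color descriptions to fresh integers
        out = []
        for col in colors:
            if col not in table:
                table[col] = len(table)
            out.append(table[col])
        return out

    nbrV, nbrW = defaultdict(list), defaultdict(list)
    for (i, j), e in edges.items():
        nbrV[i].append((j, e))
        nbrW[j].append((i, e))
    cV = relabel([("V0", h) for h in hV])
    cW = relabel([("W0", h) for h in hW])
    for _ in range(iters):
        # multisets {{(E_ij, color)}} realized as sorted tuples
        cV = relabel([("V", cV[i], tuple(sorted((e, cW[j]) for j, e in nbrV[i])))
                      for i in range(len(hV))])
        cW = relabel([("W", cW[j], tuple(sorted((e, cV[i]) for i, e in nbrW[j])))
                      for j in range(len(hW))])
    return cV, cW
```

For instance, a single constraint vertex connected to two variable vertices by equal weights leaves the two variables indistinguishable, while unequal weights separate their colors after one round.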



Theorem 4.2 extends the results in Xu et al. (2019); Azizian & Lelarge (2021); Geerts & Reutter (2022) to the case with the modified WL test (Algorithm 1) and LP-graphs.

Figure 3: GNNs can approximate Φ_feas, Φ_obj, and Φ_solu.

One can actually obtain a stronger version of Lemma B.4 given (G, H) ~_W (Ĝ, Ĥ).

Corollary B.7. Suppose that two weighted bipartite graphs with vertex features corresponding to two LP problems, (G, H) and (Ĝ, Ĥ), satisfy (G, H) ~_W (Ĝ, Ĥ). Then the two problems have the same optimal solution with the smallest ℓ2-norm.

Proof of Corollary B.7. The proof of Lemma B.4 still applies, with the difference that there is no permutation on {w_1, w_2, . . . , w_n}.

Proof of Theorem 4.1. Theorem 4.1 follows immediately from Lemmas B.2, B.3, and B.4.

i = 1, 2, . . . , m and j = 1, 2, . . . , n. Define f_out : R × R → R via f_out(h, h′) = h. Then it follows that

Proof of Theorem 4.2. The equivalence of the three conditions follows immediately from Lemmas C.1, C.2, and C.3.

Corollary C.4. For any two weighted bipartite graphs with vertex features (G, H), (Ĝ, Ĥ) ∈ G_{m,n} × H^V_m × H^W_n, the following statements are equivalent: (i) (G, H) ~_W (Ĝ, Ĥ).


Therefore, X_solu can be written as a countable combination of unions and intersections of measurable sets, and is hence measurable.

With the measurability of Φ_feas, Φ_obj, and Φ_solu established, the next step is to approximate Φ_feas, Φ_obj, and Φ_solu by continuous mappings, and hence by graph neural networks. Before proceeding, let us mention that G_{m,n} × H^V_m × H^W_n is essentially the disjoint union of finitely many product spaces of Euclidean spaces and discrete spaces that have finitely many points and are equipped with discrete measures. Therefore, many results in real analysis for Euclidean spaces still apply to G_{m,n} × H^V_m × H^W_n and Meas(·), including the following Lusin's theorem.

Theorem F.4 (Lusin's theorem (Evans & Gariepy, 2018, Theorem 1.14)). Let µ be a Borel regular measure on R^n and let f : R^n → R^m be µ-measurable. Then for any µ-measurable X ⊂ R^n with µ(X) < ∞ and any ϵ > 0, there exists a compact set E ⊂ X with µ(X\E) < ϵ such that f|_E is continuous.

Proof of Theorem 3.2. Since X ⊂ G_{m,n} × H^V_m × H^W_n is measurable with finite measure, according to Lusin's theorem, there is a compact set E ⊂ X with Meas(X\E) < ϵ such that Φ_feas|_E is continuous. By Lemma B.2 and Theorem 4.3, there exists F ∈ F_GNN approximating Φ_feas|_E to the desired accuracy, and the proof is completed.

Proof of Corollary 3.3. As a finite set, D is compact and Φ_feas|_D is continuous. The rest of the proof is similar to that of Theorem 3.2, using Lemma B.2 and Theorem 4.3.

Proof of Theorem 3.4. (i) The proof follows the same lines as the proof of Theorem 3.2, with the difference that we approximate Φ_obj instead of Φ_feas.

Proof of Corollary 3.5. The results can be proved using techniques similar to those in Theorems 3.2 and 3.4, noticing that any finite dataset is compact and that any real-valued function on it is continuous.

Proof of Theorem 3.6. Without loss of generality, we can assume that the set on which we approximate Φ_solu is closed under the action of S_m × S_n and satisfies the assumptions of Theorem E.1. Note that the three conditions in Theorem E.1 are satisfied by the definition of Φ_solu, Corollary B.7, and Corollary B.6, respectively.
Using Theorem E.1, there exists F ∈ F^W_GNN with the desired approximation property, which completes the proof.

Proof of Corollary 3.7. One can assume that D is closed under the action of S_m × S_n; otherwise, the larger but still finite dataset ⋃_{(σ_V,σ_W)∈S_m×S_n} (σ_V, σ_W) * D can be considered instead of D. The rest of the proof is similar to that of Theorem 3.6, since D is compact and Φ_solu|_D is continuous.

G DETAILS OF THE NUMERICAL EXPERIMENTS AND EXTRA EXPERIMENTS

LP instance generation We generate each LP instance in the following way. We set m = 10 and n = 50. Each matrix A is sparse with 100 nonzero entries, whose positions are sampled uniformly and whose values are sampled from the standard normal distribution. Each entry of b and c is sampled i.i.d. uniformly from [-1, 1]; additionally, each entry of c is scaled by 0.01. The variable bounds l, u are sampled from N(0, 10). If l_j > u_j, then we swap l_j and u_j, for all 1 ≤ j ≤ n. Furthermore, we sample each ∘_i i.i.d. with P(∘_i = "≤") = 0.7 and P(∘_i = "=") = 0.3. With this generation approach, the probability that an LP is feasible is around 0.53.
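The generation procedure above can be sketched as follows. This is a plain-Python illustration with assumed details: the function name is ours, N(0, 10) is read as mean 0 and standard deviation 10 (the text is ambiguous between variance and standard deviation), and the sparse A is returned as a dict of nonzero entries rather than whatever format the experiments used.

```python
import random

def generate_lp(m=10, n=50, nnz=100, seed=None):
    """Sample one random LP instance following the recipe in the text (sketch)."""
    rng = random.Random(seed)
    # Sparse A: nnz positions chosen uniformly without replacement, values ~ N(0, 1).
    positions = rng.sample([(i, j) for i in range(m) for j in range(n)], nnz)
    A = {pos: rng.gauss(0.0, 1.0) for pos in positions}
    b = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    c = [0.01 * rng.uniform(-1.0, 1.0) for _ in range(n)]  # scaled by 0.01
    # Bounds from N(0, 10); "10" interpreted as the standard deviation (assumption).
    l = [rng.gauss(0.0, 10.0) for _ in range(n)]
    u = [rng.gauss(0.0, 10.0) for _ in range(n)]
    for j in range(n):
        if l[j] > u[j]:          # swap so that l_j <= u_j always holds
            l[j], u[j] = u[j], l[j]
    # Constraint senses: "<=" with probability 0.7, "==" with probability 0.3.
    circ = ["<=" if rng.random() < 0.7 else "==" for _ in range(m)]
    return A, b, c, circ, l, u
```

A call such as `generate_lp(seed=0)` yields one instance with 100 nonzeros in A, consistent bounds l ≤ u, and only "≤"/"=" constraint senses.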

MLP architectures

As mentioned in the main text, all the learnable functions in the GNN are taken as MLPs. The input functions f^V_in and f^W_in have one hidden layer, and the sizes of the other functions range over 4, 8, 16, 32, 64, 128, 256. One can observe that, for a GNN of fixed size, its generalization performance, i.e., its performance on the testing set, improves as it is trained with more training samples. Given these numerical results, we believe that a quantitative and theoretical understanding of this generalization behavior deserves future research.
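To show where these learnable MLPs sit in the architecture, here is a minimal sketch of a single message-passing update on the bipartite LP graph, in the spirit of (2.2)-(2.3). Everything here is a stand-in: scalar embeddings instead of vectors, plain callables `fV`, `fW`, `gV`, `gW` in place of the trained MLPs, and one particular update ordering (w-side updated from the already-updated v-side), which may differ from the paper's exact rule.

```python
def mp_layer(edges, sV, tW, fV, fW, gV, gW):
    """One message-passing update on a bipartite graph (illustrative sketch).

    edges  : dict (i, j) -> edge weight E_ij
    sV, tW : current scalar embeddings of the v- and w-side vertices
    fV, fW : combine a vertex's own embedding with its aggregated messages
    gV, gW : transform neighbor embeddings before aggregation
    """
    m, n = len(sV), len(tW)
    # v_i aggregates only over its neighbors, weighted by E_ij.
    aggV = [0.0] * m
    for (i, j), e in edges.items():
        aggV[i] += e * gW(tW[j])
    sV_new = [fV(sV[i], aggV[i]) for i in range(m)]
    # w_j then aggregates from the updated v-side embeddings.
    aggW = [0.0] * n
    for (i, j), e in edges.items():
        aggW[j] += e * gV(sV_new[i])
    tW_new = [fW(tW[j], aggW[j]) for j in range(n)]
    return sV_new, tW_new
```

With identity transforms and additive combines, a single edge of weight 2 between v_0 (embedding 1) and w_0 (embedding 3) sends 6 to v_0 (new embedding 7) and then 14 back to w_0 (new embedding 17), which makes the information flow easy to trace by hand.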

