ON REPRESENTING MIXED-INTEGER LINEAR PROGRAMS BY GRAPH NEURAL NETWORKS

Abstract

While mixed-integer linear programming (MILP) is NP-hard in general, practical MILP solving has seen a roughly 100-fold speedup over the past twenty years. Still, many classes of MILPs quickly become unsolvable as their sizes increase, motivating researchers to seek new acceleration techniques for MILPs. With deep learning, strong empirical results have been obtained, many of them by applying graph neural networks (GNNs) to make decisions at various stages of the MILP solution process. This work discovers a fundamental limitation: there exist feasible and infeasible MILPs that all GNNs treat identically, indicating that GNNs lack the power to express general MILPs. We then show that, by restricting the MILPs to unfoldable ones or by adding random features, there exist GNNs that can reliably predict MILP feasibility, optimal objective values, and optimal solutions up to prescribed precision. We conducted small-scale numerical experiments to validate our theoretical findings.

1. INTRODUCTION

Mixed-integer linear programming (MILP) is a class of optimization problems that minimize a linear objective function subject to linear constraints, where some or all variables must take integer values. MILP has a wide range of applications, such as transportation (Schouwenaars et al., 2001), control (Richards & How, 2005), and scheduling (Floudas & Lin, 2005). Branch and Bound (B&B) (Land & Doig, 1960), an algorithm widely adopted in modern solvers that exactly solves general MILPs to global optimality, unfortunately has exponential time complexity in the worst-case sense. To make MILP more practical, researchers have to analyze the features of each instance class of interest based on their domain knowledge, and use such features to adaptively warm-start B&B or to design the heuristics in B&B. To automate this laborious process, researchers have turned to machine learning (ML) techniques in recent years (Bengio et al., 2021). The literature reports encouraging findings that a properly chosen ML model is able to learn useful knowledge of MILP from data and generalize well to similar but unseen instances. For example, one can learn fast approximations of Strong Branching, an effective but time-consuming branching strategy usually used in B&B (Alvarez et al., 2014; Khalil et al., 2016; Zarpellon et al., 2021; Lin et al., 2022). One may also learn cutting strategies (Tang et al., 2020; Berthold et al., 2022; Huang et al., 2022), node selection/pruning strategies (He et al., 2014; Yilmaz & Yorke-Smith, 2020), or decomposition strategies (Song et al., 2020) with ML models. The role of ML models in these approaches can be summarized as approximating useful mappings or parameterizing key strategies in MILP solvers; these mappings/strategies usually take an MILP instance as input and output its key properties.
The graph neural network (GNN), thanks to properties such as permutation invariance, is considered a suitable model to represent such mappings/strategies for MILP. More specifically, permuting the variables or constraints of an MILP does not essentially change the problem itself, so reliable ML models such as GNNs should satisfy the same property; otherwise, the model may overfit to the variable/constraint orders in the training data. Gasse et al. (2019) proposed that an MILP can be encoded into a bipartite graph on which one can use a GNN to approximate Strong Branching. Ding et al. (2020) proposed to represent an MILP with a tripartite graph. Since then, GNNs have been adopted to represent mappings/strategies for MILP, for example, approximating Strong Branching (Gupta et al., 2020; Nair et al., 2020; Shen et al., 2021; Gupta et al., 2022), approximating optimal solutions (Nair et al., 2020; Khalil et al., 2022), parameterizing cutting strategies (Paulus et al., 2022), and parameterizing branching strategies (Qu et al., 2022; Scavuzzo et al., 2022). However, the theoretical foundations of this direction remain unclear. A key problem is the ability of GNNs to approximate important mappings related to MILP. We ask the following questions: Is a GNN able to predict whether an MILP is feasible? (Q1) Is a GNN able to approximate the optimal objective value of an MILP? (Q2) Is a GNN able to approximate an optimal solution of an MILP? (Q3) To answer questions (Q1 - Q3), one needs theories of the separation power and the representation power of GNNs. The separation power of a GNN is measured by whether it can distinguish two non-isomorphic graphs: given two graphs G_1, G_2, we say a mapping F (e.g., a GNN) has strong separation power if F(G_1) ≠ F(G_2) whenever G_1 and G_2 are not isomorphic. In our setting, since MILPs are represented by graphs, the separation power indicates the ability of GNNs to distinguish two different MILP instances.
The representation power of GNNs refers to how well they can approximate mappings with permutation-invariance properties. In our setting, we study whether a GNN can map an MILP to its feasibility, optimal objective value, and an optimal solution. The separation power and representation power of GNNs are closely related to the Weisfeiler-Lehman (WL) test (Weisfeiler & Leman, 1968), a classical algorithm for testing whether two graphs are isomorphic. It has been shown that GNNs have the same separation power as the WL test (Xu et al., 2019) and, based on this result, that GNNs can universally approximate continuous graph-input mappings whose separation power is no stronger than the WL test (Azizian & Lelarge, 2021; Geerts & Reutter, 2022).

Our contributions  With the above tools in hand, one still cannot directly answer questions (Q1 - Q3), since the relationship between characteristics of general MILPs and properties of graphs is not yet clear. Although some works study the representation power of GNNs on graph-related optimization problems (Sato et al., 2019; Loukas, 2020) and on linear programming (Chen et al., 2022), representing general MILPs with GNNs has, to the best of our knowledge, not been theoretically studied. Our contributions are listed below: • (Limitation of GNNs for MILP) We show by example that GNNs do not have strong enough separation power to distinguish any two different MILP instances. There exist two MILPs such that one is feasible while the other is not, yet all GNNs treat them identically, failing to detect the essential difference between them. In fact, there are infinitely many pairs of MILP instances that can puzzle GNNs in this way. • (Foldable and unfoldable MILP) We provide a precise mathematical description of the type of MILPs on which GNNs fail. These hard MILP instances are named foldable MILPs.
We prove that, for unfoldable MILPs, GNNs have strong enough separation power and representation power to approximate the feasibility, the optimal objective value, and an optimal solution. • (MILP with random features) To handle foldable MILPs, we propose to append random features to the MILP-induced graphs. We prove that, with this random-feature technique, the answers to questions (Q1 - Q3) are affirmative.

The answers to (Q1 - Q3) serve as foundations for a more practical question: are GNNs able to predict branching strategies or primal heuristics for MILP? Although the answers to (Q1) and (Q2) do not directly answer that practical question, they illustrate the possibility that GNNs can capture key information of an MILP instance and have the capacity to suggest an adaptive branching strategy for each instance. To obtain a GNN-based strategy, practitioners should consider more factors: feature spaces, action spaces, training methods, generalization performance, etc., and some recent empirical studies show encouraging results on learning such strategies (Gasse et al., 2019; Gupta et al., 2020; 2022; Nair et al., 2020; Shen et al., 2021; Qu et al., 2022; Scavuzzo et al., 2022; Khalil et al., 2022). The answer to (Q3) directly shows the possibility of learning primal heuristics; with proper model design and training methods, one could obtain competitive GNN-based primal heuristics (Nair et al., 2020; Ding et al., 2020).

The rest of this paper is organized as follows. We state some preliminaries in Section 2. The limitation of GNNs for MILP is presented in Section 3, and we describe foldable and unfoldable MILPs in Section 4. The random-feature technique is introduced in Section 5. Section 6 contains numerical results, and the whole paper is concluded in Section 7.

2. PRELIMINARIES

In this section, we introduce some preliminaries that will be used throughout this paper. Our notations, definitions, and examples follow Chen et al. (2022). Consider a general MILP problem defined as

  min_{x∈R^n} c^⊤x,  s.t.  Ax ∘ b,  l ≤ x ≤ u,  x_j ∈ Z, ∀ j ∈ I,   (2.1)

where A ∈ R^{m×n}, c ∈ R^n, b ∈ R^m, l ∈ (R ∪ {-∞})^n, u ∈ (R ∪ {+∞})^n, and ∘ ∈ {≤, =, ≥}^m. The index set I ⊆ {1, 2, ..., n} collects those indices j for which x_j is constrained to be an integer. The feasible set is defined as X_feasible := {x ∈ R^n | Ax ∘ b, l ≤ x ≤ u, x_j ∈ Z, ∀ j ∈ I}, and we say an MILP is infeasible if X_feasible = ∅ and feasible otherwise. For feasible MILPs, inf{c^⊤x : x ∈ X_feasible} is called the optimal objective value. If there exists x* ∈ X_feasible with c^⊤x* ≤ c^⊤x for all x ∈ X_feasible, we say that x* is an optimal solution. It is possible that the objective value is arbitrarily good, i.e., for any R > 0, c^⊤x < -R holds for some x ∈ X_feasible. In this case, we say the MILP is unbounded, or that its optimal objective value is -∞.
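As a concrete sketch of the data in (2.1), the tuple (A, b, ∘, c, l, u, I) can be stored and a candidate point checked for membership in X_feasible as follows. This is our own illustrative container, not code from any solver; all names are ours.

```python
import numpy as np

# Illustrative container for the MILP data in (2.1); names are ours, not the
# paper's code. `circ` holds one of '<=', '=', '>=' per constraint.
class MILP:
    def __init__(self, A, b, circ, c, l, u, I):
        self.A = np.asarray(A, dtype=float)
        self.b = np.asarray(b, dtype=float)
        self.circ = list(circ)
        self.c = np.asarray(c, dtype=float)
        self.l = np.asarray(l, dtype=float)
        self.u = np.asarray(u, dtype=float)
        self.I = set(I)  # indices j with x_j constrained to be an integer

    def is_feasible_point(self, x, tol=1e-9):
        """Check membership of x in X_feasible."""
        x = np.asarray(x, dtype=float)
        lhs = self.A @ x
        for v, op, rhs in zip(lhs, self.circ, self.b):
            if op == '<=' and v > rhs + tol: return False
            if op == '>=' and v < rhs - tol: return False
            if op == '=' and abs(v - rhs) > tol: return False
        if np.any(x < self.l - tol) or np.any(x > self.u + tol):
            return False
        return all(abs(x[j] - round(x[j])) <= tol for j in self.I)
```

For instance, with A = [[1, 3], [1, 1]], b = (1, 1), both constraints of type "≥", u = (3, 5), and x_2 integral, the point (1, 0) is feasible while (3, 0.4) violates integrality.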

2.1. MILP AS A WEIGHTED BIPARTITE GRAPH WITH VERTEX FEATURES

Inspired by Gasse et al. (2019) and following Chen et al. (2022), we represent an MILP by a weighted bipartite graph. The vertex set of such a graph is V ∪ W, where V = {v_1, v_2, ..., v_m} with v_i representing the i-th constraint, and W = {w_1, w_2, ..., w_n} with w_j representing the j-th variable. To fully represent all information in (2.1), we associate each vertex with features: the vertex v_i ∈ V is equipped with a feature vector h^V_i = (b_i, ∘_i) chosen from H_V = R × {≤, =, ≥}; the vertex w_j ∈ W is equipped with a feature vector h^W_j = (c_j, l_j, u_j, τ_j), where τ_j = 1 if j ∈ I and τ_j = 0 otherwise. The feature h^W_j lies in the space H_W = R × (R ∪ {-∞}) × (R ∪ {+∞}) × {0, 1}. The edge E_{i,j} ∈ R connects v_i ∈ V and w_j ∈ W, and its value is defined as E_{i,j} = A_{i,j}. Note that there is no edge connecting vertices within the same vertex group (V or W), and E_{i,j} = 0 if there is no connection between v_i and w_j. The whole graph is denoted by G = (V ∪ W, E), and we write G_{m,n} for the collection of all such weighted bipartite graphs whose two vertex groups have sizes m and n, respectively. Finally, we define H^m_V := (H_V)^m and H^n_W := (H_W)^n, and stack all vertex features together as H = (h^V_1, h^V_2, ..., h^V_m, h^W_1, h^W_2, ..., h^W_n) ∈ H^m_V × H^n_W. A weighted bipartite graph with vertex features (G, H) ∈ G_{m,n} × H^m_V × H^n_W then contains all information of the MILP problem (2.1), and we name such a graph an MILP-induced graph or MILP-graph. (If I = ∅, the feature τ_j can be dropped and the graphs reduce to the LP-graphs of Chen et al. (2022).) We provide an example of an MILP-graph in Figure 1.
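The encoding above can be sketched in a few lines; the function and variable names below are our own illustrations, not the paper's released code.

```python
import numpy as np

# Sketch of the MILP-graph encoding of Section 2.1 (names are illustrative):
# constraint vertex v_i carries h^V_i = (b_i, circ_i), variable vertex w_j
# carries h^W_j = (c_j, l_j, u_j, tau_j), and the edge-weight matrix is E = A.
def milp_to_graph(A, b, circ, c, l, u, I):
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    hV = [(float(b[i]), circ[i]) for i in range(m)]
    hW = [(float(c[j]), float(l[j]), float(u[j]), 1 if j in I else 0)
          for j in range(n)]
    return A, hV, hW  # (E, constraint features, variable features)
```

Applied to the example in Figure 1, this yields h^W_2 = (1, -∞, 5, 1), matching the feature shown on vertex w_2.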

2.2. GRAPH NEURAL NETWORKS WITH MESSAGE PASSING FOR MILP-GRAPHS

To represent properties of the whole graph, one needs to build a GNN that maps (G, H) to a real number: G_{m,n} × H^m_V × H^n_W → R; to represent properties of each vertex in W (or properties of each variable), one needs to build a GNN that maps (G, H) to a vector: G_{m,n} × H^m_V × H^n_W → R^n. The GNNs used in this paper can be constructed with the following three steps:

(1) Initial mapping at level l = 0. Let p_0 and q_0 be learnable embedding mappings; the embedded features are s^0_i = p_0(h^V_i) and t^0_j = q_0(h^W_j) for i = 1, 2, ..., m and j = 1, 2, ..., n.

(2) Message-passing layers for l = 1, 2, ..., L. Given learnable mappings f_l, g_l, p_l, q_l, one updates the vertex features by the following formulas for all i = 1, 2, ..., m and j = 1, 2, ..., n:

  s^l_i = p_l(s^{l-1}_i, Σ^n_{j=1} E_{i,j} f_l(t^{l-1}_j)),   t^l_j = q_l(t^{l-1}_j, Σ^m_{i=1} E_{i,j} g_l(s^{l-1}_i)).

(3a) Last layer for graph-level output. Define a learnable mapping r_G; the output is a scalar y_G = r_G(s^L, t^L), where s^L = Σ^m_{i=1} s^L_i and t^L = Σ^n_{j=1} t^L_j.

(3b) Last layer for node-level output. Define a learnable mapping r_W; the output is a vector y ∈ R^n with entries y_j = r_W(s^L, t^L, t^L_j) for j = 1, 2, ..., n.

Figure 1: An example of an MILP-graph. It encodes the MILP min_{x∈R^2} x_1 + x_2, s.t. x_1 + 3x_2 ≥ 1, x_1 + x_2 ≥ 1, x_1 ≤ 3, x_2 ≤ 5, x_2 ∈ Z, with constraint vertices v_1, v_2 carrying h^V_1 = (1, ≥), h^V_2 = (1, ≥), variable vertices w_1, w_2 carrying h^W_1 = (1, -∞, 3, 0), h^W_2 = (1, -∞, 5, 1), and edge weights E_{1,1} = 1, E_{1,2} = 3, E_{2,1} = 1, E_{2,2} = 1.

Throughout this paper, we require all learnable mappings to be continuous, following the same settings as Chen et al. (2022); Azizian & Lelarge (2021). In practice, one may parameterize those mappings with multi-layer perceptrons (MLPs). Define F_GNN as the set of all GNN mappings connecting layers (1), (2), and (3a); F^W_GNN is defined similarly by replacing (3a) with (3b).

2.3. MAPPINGS TO REPRESENT MILP CHARACTERISTICS

Now we introduce the mappings that we aim to approximate by GNNs.
With these definitions, we will revisit questions (Q1 - Q3) and describe them in a mathematically precise way.
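Before turning to these mappings, the architecture of Section 2.2 can be illustrated concretely. The sketch below is our own; the learnable mappings p, q, f, g, r are stood in by fixed random-weight linear maps with ReLU (a real implementation would use trainable MLPs), and the forward pass produces both a graph-level scalar (3a) and a node-level vector (3b).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gnn_forward(E, HV, HW, d=8, L=2, seed=0):
    """Sketch of steps (1)-(3): E is (m, n), HV is (m, dV), HW is (n, dW).
    All 'learnable' maps are fixed random linear maps + ReLU (illustrative)."""
    rng = np.random.default_rng(seed)  # same seed = same (fixed) weights

    def dense(d_in, d_out):
        W = rng.standard_normal((d_in, d_out)) / np.sqrt(d_in)
        return lambda x: relu(x @ W)

    S = dense(HV.shape[1], d)(HV)            # step (1): embed h^V_i
    T = dense(HW.shape[1], d)(HW)            # step (1): embed h^W_j
    for _ in range(L):                       # step (2): message passing
        f, g = dense(d, d), dense(d, d)
        p, q = dense(2 * d, d), dense(2 * d, d)
        S, T = (p(np.concatenate([S, E @ f(T)], axis=1)),
                q(np.concatenate([T, E.T @ g(S)], axis=1)))
    sL, tL = S.sum(axis=0), T.sum(axis=0)    # pooled features s^L, t^L
    y_graph = float(relu(np.concatenate([sL, tL])).sum())          # (3a)
    graph_feat = np.tile(np.concatenate([sL, tL]), (T.shape[0], 1))
    y_nodes = relu(np.concatenate([graph_feat, T], axis=1)).sum(axis=1)  # (3b)
    return y_graph, y_nodes
```

One can check numerically that y_graph is invariant, and y_nodes equivariant, under simultaneous permutations of the constraints and variables, matching the invariance/equivariance discussion below.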

Feasibility mapping

We first define the mapping that indicates the feasibility of an MILP: Φ_feas : G_{m,n} × H^m_V × H^n_W → {0, 1}, where Φ_feas(G, H) = 1 if (G, H) corresponds to a feasible MILP and Φ_feas(G, H) = 0 otherwise.

Optimal objective value mapping  We then define the mapping that takes an MILP to its optimal objective value: Φ_obj : G_{m,n} × H^m_V × H^n_W → R ∪ {∞, -∞}, where Φ_obj(G, H) = ∞ indicates infeasibility and Φ_obj(G, H) = -∞ indicates unboundedness. Note that the optimal objective value of an MILP may be an infimum that is never attained. An example is min_{x∈Z^2} x_1 + πx_2, s.t. x_1 + πx_2 ≥ √2. The optimal objective value is √2, since for any ϵ > 0 there exists a feasible x with x_1 + πx_2 < √2 + ϵ. However, there is no x ∈ Z^2 such that x_1 + πx_2 = √2. Thus, the preimage Φ^{-1}_obj(R) cannot precisely describe all MILP instances admitting an optimal solution; it describes MILP problems with a finite optimal objective value.

Optimal solution mapping  Giving a well-defined optimal solution mapping is much more complicated, since the optimal objective value, as discussed above, may never be attained. To handle this issue, we only consider the case where every component of l and u is finite, which implies that an optimal solution exists as long as the MILP problem is feasible. More specifically, the vertex feature space is limited to Ĥ^n_W = (R × R × R × {0, 1})^n ⊂ H^n_W, and we consider MILP problems taken from D_solu = (G_{m,n} × H^m_V × Ĥ^n_W) ∩ Φ^{-1}_feas(1). Note that Φ^{-1}_feas(1) is the set of all feasible MILPs; consequently, any MILP instance in D_solu admits at least one optimal solution. We can then define the mapping that takes an MILP to exactly one of its optimal solutions: Φ_solu : D_solu\D_foldable → R^n, where D_foldable is a subset of G_{m,n} × H^m_V × H^n_W that will be introduced in Section 4. The full definition of Φ_solu is deferred to Appendix C due to its tediousness.
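For intuition, Φ_feas and Φ_obj can be evaluated by brute force on tiny pure-integer instances with finite integer bounds. The helper below is purely an illustration of the definitions (our own names); real instances of course require a solver such as SCIP.

```python
import itertools
import math
import numpy as np

# Brute-force evaluation of Phi_feas and Phi_obj for a tiny pure-integer MILP
# with finite integer bounds l <= x <= u; purely illustrative of the
# definitions, not a practical method.
def phi_feas_obj(A, b, circ, c, l, u, tol=1e-9):
    A, b, c = (np.asarray(M, dtype=float) for M in (A, b, c))
    best = math.inf
    for x in itertools.product(*(range(int(l[j]), int(u[j]) + 1)
                                 for j in range(len(c)))):
        lhs = A @ np.asarray(x, dtype=float)
        ok = all((op == '<=' and v <= rhs + tol) or
                 (op == '>=' and v >= rhs - tol) or
                 (op == '=' and abs(v - rhs) <= tol)
                 for v, op, rhs in zip(lhs, circ, b))
        if ok:
            best = min(best, float(c @ np.asarray(x, dtype=float)))
    phi_feas = 1 if best < math.inf else 0
    phi_obj = best if phi_feas else math.inf  # Phi_obj = infinity if infeasible
    return phi_feas, phi_obj
```

For example, min x_1 + x_2 s.t. x_1 + 3x_2 ≥ 1, x_1 + x_2 ≥ 1 over integers 0 ≤ x_1 ≤ 3, 0 ≤ x_2 ≤ 5 has optimal value 1, while x_1 + x_2 = 0.5 over binaries is infeasible, so Φ_obj returns ∞.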
Invariance and equivariance  Now we discuss some properties of the three mappings defined above. The mappings Φ_feas and Φ_obj are permutation invariant, because the feasibility and the optimal objective value of an MILP do not change if the variables or constraints are reordered. The mapping Φ_solu is permutation equivariant, because an optimal solution of an MILP should be reordered consistently with the permutation of the variables. Let S_m denote the group of all permutations on the constraints of an MILP and S_n the group of all permutations on the variables. For any σ_V ∈ S_m and σ_W ∈ S_n, (σ_V, σ_W) ∗ (G, H) denotes the MILP-graph reordered by the permutations σ_V, σ_W. It is clear that both Φ_feas and Φ_obj are permutation invariant in the following sense:

  Φ_feas((σ_V, σ_W) ∗ (G, H)) = Φ_feas(G, H),   Φ_obj((σ_V, σ_W) ∗ (G, H)) = Φ_obj(G, H),

for all σ_V ∈ S_m, σ_W ∈ S_n, and (G, H) ∈ G_{m,n} × H^m_V × H^n_W. In addition, Φ_solu is permutation equivariant:

  Φ_solu((σ_V, σ_W) ∗ (G, H)) = σ_W(Φ_solu(G, H)),

for all σ_V ∈ S_m, σ_W ∈ S_n, and (G, H) ∈ D_solu\D_foldable. This is discussed further in Appendix C. Furthermore, one may check that any F ∈ F_GNN is invariant and any F^W ∈ F^W_GNN is equivariant.

Revisiting questions (Q1), (Q2) and (Q3)  With the definitions above, questions (Q1), (Q2) and (Q3) actually ask: Given any finite dataset D ⊂ G_{m,n} × H^m_V × H^n_W, is there F ∈ F_GNN such that F well approximates Φ_feas or Φ_obj on D? Given any finite dataset D ⊂ D_solu\D_foldable, is there F^W ∈ F^W_GNN such that F^W is close to Φ_solu on D?

3. DIRECTLY APPLYING GNNS MAY FAIL ON GENERAL DATASETS

In this section, we show a limitation of GNNs in representing MILPs. To approximate the mapping Φ_feas well, F_GNN should have separation power at least as strong as that of Φ_feas: for any two MILP instances (G, H), (Ĝ, Ĥ) ∈ G_{m,n} × H^m_V × H^n_W, Φ_feas(G, H) ≠ Φ_feas(Ĝ, Ĥ) should imply F(G, H) ≠ F(Ĝ, Ĥ) for some F ∈ F_GNN. In other words, as long as two MILP instances have different feasibility, some GNN should detect this and give different outputs. Otherwise, we say that the whole GNN family F_GNN cannot distinguish two MILPs with different feasibility, and hence GNNs cannot approximate Φ_feas well. This motivates us to study the separation power of GNNs for MILP. In the literature, the separation power of GNNs is usually measured by the so-called Weisfeiler-Lehman (WL) test (Weisfeiler & Leman, 1968). We present a variant of the WL test specially modified for MILP in Algorithm 1, following the same lines as Chen et al. (2022, Algorithm 1), where each vertex is labeled with a color. For example, v_i ∈ V is initially labeled with C^{0,V}_i based on its feature h^V_i by the hash function HASH_{0,V}, which is assumed powerful enough to label vertices carrying distinct information with distinct colors. After that, each vertex iteratively updates its color based on its own color and information from its neighbors. Roughly speaking, as long as two vertices in a graph are essentially different, they will eventually receive distinct colors. The output of Algorithm 1 contains all vertex colors {{C^{L,V}_i}}^m_{i=1}, {{C^{L,W}_j}}^n_{j=1}, where {{·}} denotes a multiset, in which the multiplicity of an element can be greater than one. This approach is also known as color refinement (Berkholz et al., 2017; Arvind et al., 2015; 2017).

Algorithm 1 WL test for MILP-graphs (denoted by WL_MILP)
Require: A graph instance (G, H) ∈ G_{m,n} × H^m_V × H^n_W and an iteration limit L > 0.
1: Initialize with C^{0,V}_i = HASH_{0,V}(h^V_i), C^{0,W}_j = HASH_{0,W}(h^W_j).
2: for l = 1, 2, ..., L do
3:   C^{l,V}_i = HASH_{l,V}(C^{l-1,V}_i, Σ^n_{j=1} E_{i,j} HASH'_{l,W}(C^{l-1,W}_j)).
4:   C^{l,W}_j = HASH_{l,W}(C^{l-1,W}_j, Σ^m_{i=1} E_{i,j} HASH'_{l,V}(C^{l-1,V}_i)).
5: end for
6: return The multisets containing all colors {{C^{L,V}_i}}^m_{i=1}, {{C^{L,W}_j}}^n_{j=1}.

In Algorithm 1, the hash functions {HASH_{l,V}, HASH_{l,W}}^L_{l=0} can be any injective mappings defined on the given domains; their output spaces consist of all possible vertex colors. The other hash functions {HASH'_{l,V}, HASH'_{l,W}}^L_{l=0} are required to injectively map vertex colors into a linear space, because we need to define sums and scalar multiples of their outputs. Actually, Lemma 3.2 also applies to the WL test with the more general update scheme C^{l,V}_i = HASH_{l,V}(C^{l-1,V}_i, {{HASH'_{l,W}(C^{l-1,W}_j, E_{i,j})}}), C^{l,W}_j = HASH_{l,W}(C^{l-1,W}_j, {{HASH'_{l,V}(C^{l-1,V}_i, E_{i,j})}}).

Unfortunately, there exist non-isomorphic graph pairs that the WL test fails to distinguish (Douglas, 2011). Throughout this paper, we use (G, H) ∼ (Ĝ, Ĥ) to denote that (G, H) and (Ĝ, Ĥ) cannot be distinguished by the WL test, i.e., WL_MILP((G, H), L) = WL_MILP((Ĝ, Ĥ), L) holds for any L ∈ N and any hash functions. The following theorem indicates that F_GNN has exactly the same separation power as the WL test.

Theorem 3.1 (Theorem 4.2 in Chen et al. (2022)). For any (G, H), (Ĝ, Ĥ) ∈ G_{m,n} × H^m_V × H^n_W, it holds that (G, H) ∼ (Ĝ, Ĥ) if and only if F(G, H) = F(Ĝ, Ĥ) for all F ∈ F_GNN.

Theorem 3.1 is stated and proved in Chen et al. (2022) for LP-graphs, but it also applies to MILP-graphs. The intuition behind Theorem 3.1 is straightforward, since the color refinement in the WL test and the message-passing layers in GNNs have similar structure. The proof of Chen et al. (2022, Theorem 4.2) essentially uses the fact that, given finitely many inputs, there always exist hash functions or continuous functions that are injective and generate linearly independent outputs, i.e., without collision.
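Algorithm 1 can be emulated in a few lines by replacing the hash functions with canonical relabeling: each new color is simply the pair (old color, sorted multiset of (edge weight, neighbor color)). The function below is our own sketch, not the paper's implementation.

```python
# Sketch of Algorithm 1 (color refinement for MILP-graphs): hash functions
# are emulated by representing every new color as the string of the old color
# together with the sorted multiset of (edge weight, neighbor color) pairs.
def wl_milp(E, hV, hW, L=3):
    m, n = len(hV), len(hW)
    CV = [repr(("V", h)) for h in hV]  # initial colors from vertex features
    CW = [repr(("W", h)) for h in hW]
    for _ in range(L):
        newCV = [repr((CV[i], sorted((E[i][j], CW[j]) for j in range(n) if E[i][j])))
                 for i in range(m)]
        newCW = [repr((CW[j], sorted((E[i][j], CV[i]) for i in range(m) if E[i][j])))
                 for j in range(n)]
        CV, CW = newCV, newCW
    return sorted(CV), sorted(CW)      # color multisets, order-independent
```

For the instance in Figure 1, one round of refinement already separates all vertices; for the pair of instances constructed in Lemma 3.2 below, the returned color multisets coincide.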
We also remark that the equivalence between the separation powers of GNNs and the WL test has been investigated in earlier literature (Xu et al., 2019; Azizian & Lelarge, 2021; Geerts & Reutter, 2022). Unfortunately, the following lemma reveals that the separation power of the WL test is weaker than that of Φ_feas, and consequently, GNNs have weaker separation power than Φ_feas on some specific MILP datasets.

Lemma 3.2. There exist two MILP problems (G, H) and (Ĝ, Ĥ), one feasible and the other infeasible, such that (G, H) ∼ (Ĝ, Ĥ).

Proof of Lemma 3.2. Consider the following two MILP problems and their induced graphs:

  min_{x∈R^6} x_1 + x_2 + x_3 + x_4 + x_5 + x_6,
  s.t. x_1 + x_2 = 1, x_2 + x_3 = 1, x_3 + x_4 = 1, x_4 + x_5 = 1, x_5 + x_6 = 1, x_6 + x_1 = 1,
       0 ≤ x_j ≤ 1, x_j ∈ Z, ∀ j ∈ {1, 2, ..., 6},

and

  min_{x∈R^6} x_1 + x_2 + x_3 + x_4 + x_5 + x_6,
  s.t. x_1 + x_2 = 1, x_2 + x_3 = 1, x_3 + x_1 = 1, x_4 + x_5 = 1, x_5 + x_6 = 1, x_6 + x_4 = 1,
       0 ≤ x_j ≤ 1, x_j ∈ Z, ∀ j ∈ {1, 2, ..., 6}.

Figure 2: Two non-isomorphic MILP-graphs on the vertices v_1, ..., v_6 and w_1, ..., w_6 (the first a single 12-cycle, the second two disjoint 6-cycles) that cannot be distinguished by the WL test.

The two MILP-graphs in Figure 2 cannot be distinguished by the WL test, which can be proved by induction. First, consider the initial step of Algorithm 1. Based on the definitions in Section 2.2, we can explicitly write down the vertex features: h^V_i = (1, =) for 1 ≤ i ≤ 6 and h^W_j = (1, 0, 1, 1) for 1 ≤ j ≤ 6. Since all vertices in V share the same information, they are labeled with a uniform color C^{0,V}_1 = C^{0,V}_2 = ... = C^{0,V}_6 (we use blue in Figure 2), whatever hash functions we choose. By the same argument, one obtains C^{0,W}_1 = C^{0,W}_2 = ... = C^{0,W}_6 and labels all vertices in W with red. Both graphs are initialized in this way.
Assuming {C^{l,V}_i}^6_{i=1} are all blue and {C^{l,W}_j}^6_{j=1} are all red, one obtains, for both graphs in Figure 2, that C^{l+1,V}_1 = C^{l+1,V}_2 = ... = C^{l+1,V}_6 and C^{l+1,W}_1 = C^{l+1,W}_2 = ... = C^{l+1,W}_6 by the update rule in Algorithm 1, because each blue vertex has two red neighbors, each red vertex has two blue neighbors, and every edge connecting a blue vertex to a red one has weight 1. We conclude that one cannot distinguish the two graphs by checking the outputs of the WL test. However, the first MILP problem is feasible, since x = (0, 1, 0, 1, 0, 1) is a feasible solution, while the second MILP problem is infeasible, since its constraints imply 3 = 2(x_1 + x_2 + x_3) ∈ 2Z, a contradiction.

For the instances in Figure 2, GNNs may struggle to predict a meaningful neural diving or branching strategy. In neural diving (Nair et al., 2020), one usually trains a GNN to predict an estimated solution; in neural branching (Gasse et al., 2019), one usually trains a GNN to predict a ranking score for each variable and uses that score to decide which variable to branch on first. However, the analysis above shows that the WL test and GNNs cannot distinguish the six variables, so one cannot obtain a meaningful ranking score or estimated solution.
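The feasibility gap in the proof can be verified directly by enumerating all 2^6 binary assignments; this is just a sanity check of the lemma, with our own helper names.

```python
import itertools
import numpy as np

# Enumerate x in {0,1}^6 and test A x = 1 componentwise, verifying that the
# 6-cycle instance of Lemma 3.2 is feasible while the two-triangle one is not.
def feasible_binary(A):
    A = np.asarray(A, dtype=float)
    return any(np.allclose(A @ np.asarray(x, dtype=float), 1.0)
               for x in itertools.product((0, 1), repeat=A.shape[1]))

# Constraint matrices of the two instances in Figure 2.
cycle = [[1 if j in (i, (i + 1) % 6) else 0 for j in range(6)] for i in range(6)]
triangles = [[1 if j in pair else 0 for j in range(6)]
             for pair in [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]]
```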

4. UNFOLDABLE MILP PROBLEMS

We proved by example in Section 3 that one cannot expect good performance of GNNs in approximating Φ_feas on a general dataset of MILP problems. It is worth asking: is it possible to describe the common characteristics of those hard examples? If so, one may restrict the dataset by removing such instances, and establish strong separation/representation power of GNNs on the restricted dataset. The following definition provides a rigorous description of such MILP instances.

Definition 4.1 (Foldable MILP). Given any MILP instance, one obtains vertex colors {C^{l,V}_i, C^{l,W}_j}_{l,i,j} by running Algorithm 1. We say that an MILP instance can be folded, or is foldable, if there exist 1 ≤ i, i' ≤ m or 1 ≤ j, j' ≤ n such that C^{l,V}_i = C^{l,V}_{i'} with i ≠ i', or C^{l,W}_j = C^{l,W}_{j'} with j ≠ j', for any l ∈ N and any hash functions. In other words, at least one color in the multisets generated by the WL test always has multiplicity greater than one. Furthermore, we denote by D_foldable ⊂ G_{m,n} × H^m_V × H^n_W the collection of all (G, H) ∈ G_{m,n} × H^m_V × H^n_W that can be folded.

For example, the MILP in Figure 1 is not foldable, while the two MILP instances in Figure 2 are both foldable. The foldable examples in Figure 2 were analyzed in the proof of Lemma 3.2; here we analyze the example in Figure 1. Since the vertex features are distinct, h^W_1 ≠ h^W_2, one obtains C^{0,W}_1 ≠ C^{0,W}_2 as long as the hash function HASH_{0,W} is injective. Although h^V_1 = h^V_2 and hence C^{0,V}_1 = C^{0,V}_2, the neighborhood information of v_1 and v_2 differs because of the different edge weights. One obtains C^{1,V}_1 ≠ C^{1,V}_2 by properly choosing HASH_{0,W}, HASH'_{1,W}, HASH_{1,V}, which establishes unfoldability. We prove that, as long as foldable MILPs are removed, GNNs are able to accurately predict the feasibility of all MILP instances in a dataset with finitely many samples.
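Definition 4.1 suggests a direct computational check, sketched below with our own names: run color refinement until the color partition stops refining, then test whether any final color still repeats.

```python
# Sketch of a foldability test in the sense of Definition 4.1: refine colors
# (as in Algorithm 1, with hashes emulated by string relabeling) until the
# numbers of distinct colors stop growing, then check whether some color
# still has multiplicity greater than one.
def is_foldable(E, hV, hW, max_rounds=100):
    m, n = len(hV), len(hW)
    CV = [repr(h) for h in hV]
    CW = [repr(h) for h in hW]
    for _ in range(max_rounds):
        newCV = [repr((CV[i], sorted((E[i][j], CW[j]) for j in range(n) if E[i][j])))
                 for i in range(m)]
        newCW = [repr((CW[j], sorted((E[i][j], CV[i]) for i in range(m) if E[i][j])))
                 for j in range(n)]
        stable = (len(set(newCV)) == len(set(CV)) and
                  len(set(newCW)) == len(set(CW)))
        CV, CW = newCV, newCW
        if stable:
            break
    return len(set(CV)) < m or len(set(CW)) < n
```

On the example of Figure 1 this returns False (unfoldable), and on the 6-cycle instance of Figure 2 it returns True, matching the discussion above.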
We use F(G, H) > 1/2 as the decision criterion, with indicator I_{F(G,H)>1/2} = 1 if F(G, H) > 1/2 and I_{F(G,H)>1/2} = 0 otherwise.

Theorem 4.2. For any finite dataset D ⊂ (G_{m,n} × H^m_V × H^n_W)\D_foldable, there exists F ∈ F_GNN such that I_{F(G,H)>1/2} = Φ_feas(G, H) for all (G, H) ∈ D.

Similar results hold for the optimal objective value mapping Φ_obj and the optimal solution mapping Φ_solu; we list them below. All proofs of the theorems are deferred to the appendix.

Theorem 4.3. Let D ⊂ (G_{m,n} × H^m_V × H^n_W)\D_foldable be a finite dataset. For any δ > 0, there exists F_1 ∈ F_GNN such that I_{F_1(G,H)>1/2} = I_{Φ_obj(G,H)∈R} for all (G, H) ∈ D, and there exists F_2 ∈ F_GNN such that |F_2(G, H) - Φ_obj(G, H)| < δ for all (G, H) ∈ D ∩ Φ^{-1}_obj(R).

Theorem 4.4. Let D ⊂ D_solu\D_foldable be a finite dataset. For any δ > 0, there exists F^W ∈ F^W_GNN such that ∥F^W(G, H) - Φ_solu(G, H)∥ < δ for all (G, H) ∈ D.

The conclusions of Theorems 4.2, 4.3, and 4.4 can actually be strengthened: the dataset D in the three theorems can be replaced by a measurable set of finite measure, which may contain infinitely many instances. The strengthened theorems and their proofs can be found in Appendix A. Let us also mention that the foldability of an MILP instance may depend on the feature spaces H_V and H_W. If more features are appended (see Gasse et al. (2019)), i.e., h^V_i and h^W_j have more entries, then a foldable MILP problem may become unfoldable. In addition, all the analysis here works for any topological linear spaces H_V and H_W, as long as Φ_feas, Φ_obj, and Φ_solu can be verified to be measurable and permutation-invariant/equivariant mappings.

5. SYMMETRY BREAKING TECHNIQUES VIA RANDOM FEATURES

Although we proved that GNNs are able to approximate Φ_feas, Φ_obj, and Φ_solu to any given precision for unfoldable MILP instances, practitioners cannot benefit from this if their sets of interest contain foldable MILPs. For example, for around 1/4 of the problems in MIPLIB 2017 (Gleixner et al., 2021), the number of colors generated by the WL test is smaller than half the number of vertices, with respect to the feature spaces H_V and H_W used in this paper. To resolve this issue, we introduce a technique inspired by Abboud et al. (2020); Sato et al. (2021): we append an additional random feature ω to the vertex features and define a class of random GNNs as follows. Let Ω = [0, 1]^m × [0, 1]^n and let (Ω, F_Ω, P) be the probability space corresponding to the uniform distribution U(Ω). The class of random graph neural networks F^R_GNN with scalar output is the collection of functions

  F^R : G_{m,n} × H^m_V × H^n_W × Ω → R, (G, H, ω) ↦ F^R(G, H, ω),

defined in the same way as F_GNN, with input space G_{m,n} × H^m_V × H^n_W × Ω ≅ G_{m,n} × (H_V × [0, 1])^m × (H_W × [0, 1])^n and ω sampled from U(Ω). The class of random graph neural networks F^{W,R}_GNN with vector output is the collection of functions

  F^{W,R} : G_{m,n} × H^m_V × H^n_W × Ω → R^n, (G, H, ω) ↦ F^{W,R}(G, H, ω),

defined in the same way as F^W_GNN with the same enlarged input space and ω sampled from U(Ω). By the definition of graph neural networks, F^R is permutation invariant and F^{W,R} is permutation equivariant, where the permutations σ_V and σ_W act on (H_V × [0, 1])^m and (H_W × [0, 1])^n. We also write F^R(G, H) = F^R(G, H, ω) and F^{W,R}(G, H) = F^{W,R}(G, H, ω) as random variables. The theorem below states that, by adding random features, GNNs have sufficient power to represent MILP feasibility, even for foldable MILPs.
The intuition is that, by appending random features, each vertex has distinct features with probability one, so the resulting MILP-graph is unfoldable even if the original one is foldable.

Theorem 5.1. Let D ⊂ G_{m,n} × H^m_V × H^n_W be a finite dataset. For any ϵ > 0, there exists F^R ∈ F^R_GNN such that P(I_{F^R(G,H)>1/2} ≠ Φ_feas(G, H)) < ϵ for all (G, H) ∈ D.

Similar results hold for the optimal objective value and the optimal solution.

Theorem 5.2. Let D ⊂ G_{m,n} × H^m_V × H^n_W be a finite dataset. For any ϵ, δ > 0, there exists F^{R,1} ∈ F^R_GNN such that P(I_{F^{R,1}(G,H)>1/2} ≠ I_{Φ_obj(G,H)∈R}) < ϵ for all (G, H) ∈ D, and there exists F^{R,2} ∈ F^R_GNN such that P(|F^{R,2}(G, H) - Φ_obj(G, H)| > δ) < ϵ for all (G, H) ∈ D ∩ Φ^{-1}_obj(R).

Theorem 5.3. Let D ⊂ Φ^{-1}_obj(R) ⊂ G_{m,n} × H^m_V × H^n_W be a finite dataset. For any ϵ, δ > 0, there exists F^{W,R} ∈ F^{W,R}_GNN such that P(∥F^{W,R}(G, H) - Φ_solu(G, H)∥ > δ) < ϵ for all (G, H) ∈ D.

The idea behind these theorems is that GNNs can distinguish (G, H, ω) and (Ĝ, Ĥ, ω) with high probability, even if they cannot distinguish (G, H) and (Ĝ, Ĥ). How the random features are chosen during training matters. In practice, one may generate a single random feature vector ω ∈ [0, 1]^m × [0, 1]^n shared by all MILP instances. This setting makes training GNN models efficient, but the trained GNNs cannot be applied to datasets with MILP problems of different sizes. Another practice is to sample several independent random features for each MILP instance, but one may then suffer from difficulty in training. Such a trade-off also occurs in other GNN tasks (Loukas, 2019; Balcilar et al., 2021). In this paper, we generate one random vector for all instances (both training and testing instances) to directly validate our theorems; how to balance the trade-off in practice will be an interesting future topic.
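The augmentation itself is simple, as the sketch below shows (our own helper, with the feature layout of Section 2.1): sample ω ∼ U([0,1]^m × [0,1]^n) and append one entry to every vertex feature. Almost surely all augmented features are distinct, so the augmented graph is unfoldable.

```python
import numpy as np

# Sketch of the random-feature augmentation of Section 5: append one uniform
# random scalar to each constraint feature h^V_i and variable feature h^W_j.
def append_random_features(hV, hW, rng):
    omega_V = rng.uniform(size=len(hV))
    omega_W = rng.uniform(size=len(hW))
    hV_aug = [h + (float(w),) for h, w in zip(hV, omega_V)]
    hW_aug = [h + (float(w),) for h, w in zip(hW, omega_W)]
    return hV_aug, hW_aug
```

Applied to the foldable instances of Figure 2, whose vertex features are all identical, the augmented features become pairwise distinct with probability one, while the original feature entries are preserved.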

6. NUMERICAL EXPERIMENTS

In this section, we experimentally validate our theory on small-scale examples with m = 6 and n = 20. We first randomly generate two datasets D_1 and D_2. D_1 consists of 1000 randomly generated MILPs that are all unfoldable; 460 of them are feasible with attainable optimal solutions, while the others are infeasible. D_2 consists of 1000 randomly generated MILPs that are all foldable and similar to the example in Figure 2; 500 of them are feasible with attainable optimal solutions, while the others are infeasible. We call SCIP (Bestuzheva et al., 2021a;b), a state-of-the-art non-commercial MILP solver, to obtain the feasibility and an optimal solution for each instance. In our GNNs, we set the number of message-passing layers to L = 2 and parameterize all the learnable functions as multilayer perceptrons (MLPs). Our code is modified from Gasse et al. (2019) and released at https://github.com/liujl11git/GNN-MILP.git. All results reported in this section are obtained on the training sets, not a separate testing set; generalization tests and details of the numerical experiments can be found in the appendix.

Feasibility  We first test whether GNNs can represent the feasibility of an MILP and report our results in Figure 3. The orange curve with tag "Foldable MILPs" presents the training result of GNNs on D_2: GNNs fail to distinguish the feasible and infeasible MILP pairs that are foldable, whatever GNN size we take. However, if we train GNNs on the unfoldable MILPs in D_1, the error rate goes to zero as long as the GNN is large enough (i.e., the number of GNN parameters is large enough).
This result validates Theorem 4.2 and the first conclusion in Theorem 4.3: there exist GNNs that can accurately predict whether an MILP is feasible (or whether an MILP has a finite optimal objective value). Finally, we append additional random features to the vertex features. As shown by the green curve tagged "Foldable + Rand Feat.", GNNs can perfectly fit the foldable data $D_2$, which validates Theorem 5.1 and the first conclusion in Theorem 5.2.

Optimal objective value and optimal solution In Figure 4, we validate that GNNs are able to approximate the optimal objective value and an optimal solution. Figure 4a shows that, by restricting datasets to unfoldable instances, or by appending random features to the graph, one can train a GNN with arbitrarily small approximation error for the optimal objective value. These observations validate Theorems 4.3 and 5.2. Figure 4b shows that GNNs can even approximate an optimal solution of an MILP, though this requires a much larger GNN than approximating the optimal objective value. Theorems 4.4 and 5.3 are thereby validated.

7. CONCLUSIONS

This work investigates the expressive power of graph neural networks for representing mixed-integer linear programming problems. We find that the separation power of GNNs, which is bounded by that of the WL test, is not sufficient for foldable MILP problems; in contrast, we show that GNNs can approximate characteristics of unfoldable MILP problems with arbitrarily small error. To remove the unfoldability requirement, which may not hold in practice, we discuss a technique of appending random features, with theoretical guarantees. We conduct numerical experiments validating all of our theory. We hope this paper contributes to the recently active field of applying GNNs in MILP solvers.

A PROOFS FOR SECTION 4

In this section, we prove the results in Section 4, i.e., that graph neural networks can approximate $\Phi_{\mathrm{feas}}$, $\Phi_{\mathrm{obj}}$, and $\Phi_{\mathrm{solu}}$ on finite datasets of unfoldable MILPs with arbitrarily small error. We will actually prove more general results on finite-measure subsets of $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n$, which may contain infinitely many elements. In our setting, foldability is a concept depending only on the MILP instance. In the main text, we used the WL test to define foldability for readability. Here we present a more involved but equivalent definition purely in the language of MILP.

Definition A.1 (Foldable MILP). We say that the MILP problem (2.1) can be folded, or is foldable, if there exist partitions $I = \{I_1, I_2, \ldots, I_s\}$ of $\{1, 2, \ldots, m\}$ and $J = \{J_1, J_2, \ldots, J_t\}$ of $\{1, 2, \ldots, n\}$ such that at least one of $I$ and $J$ is nontrivial, i.e., $s < m$ or $t < n$, and the following hold for any $p \in \{1, 2, \ldots, s\}$ and $q \in \{1, 2, \ldots, t\}$: the vertices in $\{v_i : i \in I_p\}$ share the same feature in $\mathcal{H}_V$, the vertices in $\{w_j : j \in J_q\}$ share the same feature in $\mathcal{H}_W$, and all column sums (and all row sums) of the submatrix $(A_{i,j})_{i \in I_p, j \in J_q}$ are the same.

According to Chen et al. (2022, Appendix A), the color refinement procedure of the WL test converges to the coarsest partitions $I$ and $J$ satisfying the conditions in Definition A.1, assuming no collision happens. Therefore, Definition 4.1 and Definition A.1 are equivalent. Before proceeding to establish the approximation theorems for GNNs on $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \backslash D_{\mathrm{foldable}}$, let us define the topology and the measure on $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n$.

Topology and measure The space $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n$ can be equipped with a natural topology and measure.
In the construction of $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n$, let all Euclidean spaces be equipped with the standard topology and the Lebesgue measure, and let all discrete spaces be equipped with the discrete topology and the discrete measure in which each element has measure 1. Then let all unions be disjoint unions and let all products carry the product topology and the product measure. This defines the topology and the measure on $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n$. We write $\mathrm{Meas}(\cdot)$ for the measure on $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n$. Now we can state the following three theorems, which can be viewed as generalized versions of Theorems 4.2, 4.3, and 4.4.

Theorem A.2. Let $X \subset \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \backslash D_{\mathrm{foldable}}$ be measurable with finite measure. For any $\epsilon > 0$, there exists some $F \in \mathcal{F}_{\mathrm{GNN}}$ such that
$$\mathrm{Meas}\left(\left\{(G,H) \in X : \mathbb{I}_{F(G,H)>1/2} \neq \Phi_{\mathrm{feas}}(G,H)\right\}\right) < \epsilon,$$
where $\mathbb{I}_\bullet$ is the indicator function, i.e., $\mathbb{I}_{F(G,H)>1/2} = 1$ if $F(G,H) > 1/2$ and $\mathbb{I}_{F(G,H)>1/2} = 0$ otherwise.

Theorem A.3. Let $X \subset \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \backslash D_{\mathrm{foldable}}$ be measurable with finite measure. For any $\epsilon, \delta > 0$, the following hold:
(i) There exists some $F_1 \in \mathcal{F}_{\mathrm{GNN}}$ such that $\mathrm{Meas}\left(\left\{(G,H) \in X : \mathbb{I}_{F_1(G,H)>1/2} \neq \mathbb{I}_{\Phi_{\mathrm{obj}}(G,H)\in\mathbb{R}}\right\}\right) < \epsilon.$
(ii) There exists some $F_2 \in \mathcal{F}_{\mathrm{GNN}}$ such that $\mathrm{Meas}\left(\left\{(G,H) \in X \cap \Phi_{\mathrm{obj}}^{-1}(\mathbb{R}) : |F_2(G,H) - \Phi_{\mathrm{obj}}(G,H)| > \delta\right\}\right) < \epsilon.$

Theorem A.4. Let $X \subset D_{\mathrm{solu}} \backslash D_{\mathrm{foldable}} \subset \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \backslash D_{\mathrm{foldable}}$ be measurable with finite measure. For any $\epsilon, \delta > 0$, there exists $F_W \in \mathcal{F}_{\mathrm{GNN}}^W$ such that
$$\mathrm{Meas}\left(\left\{(G,H) \in X : \|F_W(G,H) - \Phi_{\mathrm{solu}}(G,H)\| > \delta\right\}\right) < \epsilon.$$

Published as a conference paper at ICLR 2023

The proof framework is similar to those in Chen et al. (2022) and consists of two steps: i) show the measurability of the target mapping and apply Lusin's theorem to obtain a continuous mapping on a compact domain; ii) use a Stone-Weierstrass-type theorem to show the uniform approximation result of graph neural networks. We first prove that $\Phi_{\mathrm{feas}}$ and $\Phi_{\mathrm{obj}}$ are both measurable in the following two lemmas.
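Definition A.1 is directly checkable for a given pair of candidate partitions. The following sketch (illustrative; the function and argument names are ours) verifies the equal-feature and constant row/column-sum conditions for each submatrix block:

```python
import numpy as np

def certifies_foldable(A, h_v, h_w, I_parts, J_parts):
    """Check whether row partition I_parts and column partition J_parts
    satisfy the conditions of Definition A.1: vertices in each class share
    the same feature, every block (A_{i,j})_{i in I_p, j in J_q} has constant
    row sums and constant column sums, and the partitions are nontrivial.
    (Illustrative sketch; names are ours, not from the paper.)"""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    nontrivial = len(I_parts) < m or len(J_parts) < n
    for I in I_parts:                       # shared constraint features
        if len({tuple(h_v[i]) for i in I}) > 1:
            return False
    for J in J_parts:                       # shared variable features
        if len({tuple(h_w[j]) for j in J}) > 1:
            return False
    for I in I_parts:                       # constant block row/column sums
        for J in J_parts:
            block = A[np.ix_(sorted(I), sorted(J))]
            rs, cs = block.sum(axis=1), block.sum(axis=0)
            if not (np.allclose(rs, rs[0]) and np.allclose(cs, cs[0])):
                return False
    return nontrivial
```

For example, an identity constraint matrix with identical vertex features is folded by the trivial one-class partitions, while unequal block row sums rule a partition out.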
The optimal solution mapping $\Phi_{\mathrm{solu}}$ will be defined rigorously and proved measurable in Section C.

Lemma A.5. The feasibility mapping $\Phi_{\mathrm{feas}} : \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \to \{0,1\}$ is measurable.

Proof. It suffices to prove that $\Phi_{\mathrm{feas}}^{-1}(1)$ is measurable, and the proof is almost the same as that of Chen et al. (2022, Lemma F.1). The difference is that we should consider any fixed $\tau \in \{0,1\}^n$. Assuming without loss of generality that $\tau = (0, \ldots, 0, 1, \ldots, 1)$, where 0 and 1 appear $k$ and $n-k$ times respectively, we can replace $\mathbb{R}^n$ and $\mathbb{Q}^n$ in the proof of Chen et al. (2022, Lemma F.1) by $\mathbb{R}^k \times \mathbb{Z}^{n-k}$ and $\mathbb{Q}^k \times \mathbb{Z}^{n-k}$ to obtain the proof in the MILP setting.

Lemma A.6. The optimal objective value mapping $\Phi_{\mathrm{obj}} : \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \to \mathbb{R} \cup \{\infty, -\infty\}$ is measurable.

Proof. For any $\phi \in \mathbb{R}$, $\Phi_{\mathrm{obj}}(G,H) < \phi$ if and only if, for the MILP problem associated with $(G,H)$, there exist a feasible solution $x$ and some $r \in \mathbb{N}_+$ such that $c^\top x \leq \phi - 1/r$. The rest of the proof can be done using the same techniques as in the proofs of Chen et al. (2022, Lemma F.1 and Lemma F.2), with the difference pointed out in the proof of Lemma A.5.

Lusin's theorem, stated as follows, guarantees that any measurable function can be restricted to a compact set, excluding only a small portion of the domain, such that the restricted function is continuous. Lusin's theorem also applies to mappings defined on domains in $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n$, because $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n$ is isomorphic to a disjoint union of finitely many Euclidean spaces.

Theorem A.7 (Lusin's theorem (Evans & Gariepy, 2018, Theorem 1.14)). Let $\mu$ be a Borel regular measure on $\mathbb{R}^n$ and let $f : \mathbb{R}^n \to \mathbb{R}^m$ be $\mu$-measurable. Then for any $\mu$-measurable $X \subset \mathbb{R}^n$ with $\mu(X) < \infty$ and any $\epsilon > 0$, there exists a compact set $E \subset X$ with $\mu(X \backslash E) < \epsilon$ such that $f|_E$ is continuous.

The main tool in the proofs of Theorem A.2 and Theorem A.3 is the universal approximation property of $\mathcal{F}_{\mathrm{GNN}}$ stated below.
Notice that the separation power of GNNs is the same as that of the WL test. Theorem A.8 guarantees that GNNs can approximate any continuous function, on a compact set, whose separation power is less than or equal to that of the WL test; it can be proved using the Stone-Weierstrass theorem (Rudin, 1991, Section 5.7).

Theorem A.8 (Theorem 4.3 in Chen et al. (2022)). Let $X \subset \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n$ be a compact set. For any $\Phi \in C(X, \mathbb{R})$ satisfying
$$\Phi(G,H) = \Phi(\hat{G},\hat{H}), \quad \forall\, (G,H) \sim (\hat{G},\hat{H}), \tag{A.1}$$
and any $\epsilon > 0$, there exists $F \in \mathcal{F}_{\mathrm{GNN}}$ such that
$$\sup_{(G,H) \in X} |\Phi(G,H) - F(G,H)| < \epsilon. \tag{A.2}$$

Let us remark that the proof of Chen et al. (2022, Theorem 4.3) does not depend on the specific choice of the feature spaces $\mathcal{H}_V$ and $\mathcal{H}_W$; it only requires that $\mathcal{H}_V$ and $\mathcal{H}_W$ are topological linear spaces. Therefore, Theorem A.8 works for MILP-graphs with additional vertex features. For the sake of completeness, we outline the proof of Chen et al. (2022, Theorem 4.3): Let $\pi : \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \to (\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n)/\sim$ be the quotient map. Then for any continuous function $F : \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \to \mathbb{R}$, there exists a unique $\bar{F} : (\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n)/\sim\ \to \mathbb{R}$ such that $F = \bar{F} \circ \pi$. Note that $\bar{\mathcal{F}}_{\mathrm{GNN}} := \{\bar{F} : F \in \mathcal{F}_{\mathrm{GNN}}\}$ separates points in $(\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n)/\sim$. Similar universal approximation results for GNNs and invariant target mappings can also be found in earlier literature; see e.g. Azizian & Lelarge (2021); Geerts & Reutter (2022). With all the preparation above, we can proceed to the proof of Theorem A.2.

To apply Theorem A.8, one needs to verify that the separation power of $\Phi_{\mathrm{feas}}$ and $\Phi_{\mathrm{obj}}$ is bounded by that of the WL test for unfoldable MILP instances, i.e., condition (A.1). For any $(G,H), (\hat{G},\hat{H}) \in \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \backslash D_{\mathrm{foldable}}$, since they are unfoldable, the WL test generates a discrete coloring for each of them if there is no collision.
If we further assume that they cannot be distinguished by the WL test, then their stable discrete colorings must be identical (up to some permutation), which implies that they must be isomorphic. Therefore, we have the following lemma:

Lemma A.9. For any $(G,H), (\hat{G},\hat{H}) \in \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \backslash D_{\mathrm{foldable}}$, it holds that $(G,H) \sim (\hat{G},\hat{H})$ if and only if $(G,H)$ and $(\hat{G},\hat{H})$ are isomorphic, i.e., $(\sigma_V, \sigma_W) * (G,H) = (\hat{G},\hat{H})$ for some $\sigma_V \in S_m$ and $\sigma_W \in S_n$.

Let us remark that Lemma A.9 is true for any vertex feature spaces, because the following proof does not assume any structure on the feature spaces.

Proof. It is trivial that the isomorphism of $(G,H)$ and $(\hat{G},\hat{H})$ implies $(G,H) \sim (\hat{G},\hat{H})$, so it suffices to show that $(G,H) \sim (\hat{G},\hat{H})$ implies the isomorphism. Applying Algorithm 1 to $(G,H)$ and $(\hat{G},\hat{H})$ with a common set of hash functions, we denote the outcomes respectively by
$$\left(\{\{C_i^{L,V}\}\}_{i=1}^m, \{\{C_j^{L,W}\}\}_{j=1}^n\right) := \mathrm{WL}_{\mathrm{MILP}}((G,H), L), \quad \left(\{\{\hat{C}_i^{L,V}\}\}_{i=1}^m, \{\{\hat{C}_j^{L,W}\}\}_{j=1}^n\right) := \mathrm{WL}_{\mathrm{MILP}}((\hat{G},\hat{H}), L).$$
Then $(G,H) \sim (\hat{G},\hat{H})$ can be written as
$$\{\{C_i^{L,V}\}\}_{i=1}^m = \{\{\hat{C}_i^{L,V}\}\}_{i=1}^m \quad \text{and} \quad \{\{C_j^{L,W}\}\}_{j=1}^n = \{\{\hat{C}_j^{L,W}\}\}_{j=1}^n \tag{A.3}$$
for all $L \in \mathbb{N}$ and all hash functions. We now specify the hash functions used in this proof.

• We take hash functions $\mathrm{HASH}_{0,V}, \mathrm{HASH}_{0,W}$ that are injective on the domain $\{h_i^V, h_j^W, \hat{h}_i^V, \hat{h}_j^W : 1 \leq i \leq m,\ 1 \leq j \leq n\}$. Note that $\{\cdot\}$ denotes a set, not a multiset. This set collects all possible values of the vertex features in the two given graphs $(G,H)$ and $(\hat{G},\hat{H})$. Such hash functions exist because the number of vertices in the two graphs is finite.

• For $l \in \{1, 2, \ldots, L\}$, we define a set that collects all possible vertex colors at the $(l-1)$-th iteration: $\mathcal{C}_{l-1} := \{C_i^{l-1,V}, C_j^{l-1,W}, \hat{C}_i^{l-1,V}, \hat{C}_j^{l-1,W} : 1 \leq i \leq m,\ 1 \leq j \leq n\}$. Note that $|\mathcal{C}_{l-1}|$ is finite.
The images $\{\mathrm{HASH}'_{l,V}(C) : C \in \mathcal{C}_{l-1}\}$ and $\{\mathrm{HASH}'_{l,W}(C) : C \in \mathcal{C}_{l-1}\}$ can be made linearly independent for suitable hash functions $\mathrm{HASH}'_{l,V}, \mathrm{HASH}'_{l,W}$: one can take the output spaces of $\mathrm{HASH}'_{l,V}, \mathrm{HASH}'_{l,W}$ to be linear spaces of dimension greater than $|\mathcal{C}_{l-1}|$. Given such functions $\mathrm{HASH}'_{l,V}, \mathrm{HASH}'_{l,W}$, we consider the following set:
$$\left\{\left(C_i^{l-1,V}, \sum_{j=1}^n E_{i,j}\, \mathrm{HASH}'_{l,W}(C_j^{l-1,W})\right) : 1 \leq i \leq m\right\} \cup \left\{\left(\hat{C}_i^{l-1,V}, \sum_{j=1}^n \hat{E}_{i,j}\, \mathrm{HASH}'_{l,W}(\hat{C}_j^{l-1,W})\right) : 1 \leq i \leq m\right\}$$
$$\cup \left\{\left(C_j^{l-1,W}, \sum_{i=1}^m E_{i,j}\, \mathrm{HASH}'_{l,V}(C_i^{l-1,V})\right) : 1 \leq j \leq n\right\} \cup \left\{\left(\hat{C}_j^{l-1,W}, \sum_{i=1}^m \hat{E}_{i,j}\, \mathrm{HASH}'_{l,V}(\hat{C}_i^{l-1,V})\right) : 1 \leq j \leq n\right\},$$
which is a finite set for any two given graphs $(G,H)$ and $(\hat{G},\hat{H})$. Thus, one can pick hash functions $\mathrm{HASH}_{l,V}, \mathrm{HASH}_{l,W}$ that are injective on the above set. Since $(G,H)$ and $(\hat{G},\hat{H})$ are both unfoldable by the assumption of the statement, with the hash functions defined above one obtains
$$C_i^{L,V} \neq C_{i'}^{L,V}\ \forall\, i \neq i', \quad C_j^{L,W} \neq C_{j'}^{L,W}\ \forall\, j \neq j', \quad \hat{C}_i^{L,V} \neq \hat{C}_{i'}^{L,V}\ \forall\, i \neq i', \quad \hat{C}_j^{L,W} \neq \hat{C}_{j'}^{L,W}\ \forall\, j \neq j' \tag{A.4}$$
for some $L \in \mathbb{N}$. In other words, in either $(G,H)$ or $(\hat{G},\hat{H})$, each vertex has a unique color. Combining (A.3) and (A.4), there exist unique $\sigma_V \in S_m$ and $\sigma_W \in S_n$ such that
$$\hat{C}_i^{L,V} = C_{\sigma_V(i)}^{L,V}, \quad \hat{C}_j^{L,W} = C_{\sigma_W(j)}^{L,W}, \quad \forall\, i = 1, 2, \ldots, m,\ j = 1, 2, \ldots, n.$$
In addition, vertices with the same color must have the same neighborhood information, including the colors of the neighbors and the weights of the associated edges. Rigorously, since $\mathrm{HASH}_{L,V}$ and $\mathrm{HASH}_{L,W}$ are both injective, one obtains
$$\hat{C}_i^{L-1,V} = C_{\sigma_V(i)}^{L-1,V}, \quad \hat{C}_j^{L-1,W} = C_{\sigma_W(j)}^{L-1,W},$$
$$\sum_{j=1}^n \hat{E}_{i,j}\, \mathrm{HASH}'_{L,W}(\hat{C}_j^{L-1,W}) = \sum_{j=1}^n E_{\sigma_V(i),j}\, \mathrm{HASH}'_{L,W}(C_j^{L-1,W}),$$
$$\sum_{i=1}^m \hat{E}_{i,j}\, \mathrm{HASH}'_{L,V}(\hat{C}_i^{L-1,V}) = \sum_{i=1}^m E_{i,\sigma_W(j)}\, \mathrm{HASH}'_{L,V}(C_i^{L-1,V}).$$
Now let us consider the second line of the above equations:
$$\sum_{j=1}^n \hat{E}_{i,j}\, \mathrm{HASH}'_{L,W}(\hat{C}_j^{L-1,W}) = \sum_{j=1}^n E_{\sigma_V(i),j}\, \mathrm{HASH}'_{L,W}(C_j^{L-1,W}) = \sum_{j=1}^n E_{\sigma_V(i),\sigma_W(j)}\, \mathrm{HASH}'_{L,W}(C_{\sigma_W(j)}^{L-1,W}) = \sum_{j=1}^n E_{\sigma_V(i),\sigma_W(j)}\, \mathrm{HASH}'_{L,W}(\hat{C}_j^{L-1,W}).$$
Therefore, it holds that
$$\sum_{j=1}^n \left(\hat{E}_{i,j} - E_{\sigma_V(i),\sigma_W(j)}\right) \mathrm{HASH}'_{L,W}(\hat{C}_j^{L-1,W}) = 0.$$
This implies that $\hat{E}_{i,j} = E_{\sigma_V(i),\sigma_W(j)}$, because the values $\{\mathrm{HASH}'_{L,W}(C) : C \in \mathcal{C}_{L-1}\}$ are linearly independent. One can then conclude that $(\sigma_V, \sigma_W) * (G,H) = (\hat{G},\hat{H})$.
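The color refinement used throughout this proof can be sketched concretely. In the sketch below (an illustration of an Algorithm 1 style refinement, not the paper's implementation), canonical nested tuples play the role of injective hash functions, and sorted tuples of (edge weight, neighbor color) pairs play the role of the hashed neighborhood aggregation:

```python
import numpy as np

def wl_milp(E, h_v, h_w, rounds=None):
    """Color refinement on an MILP bipartite graph with edge weights E.
    h_v / h_w are lists of hashable feature tuples.  Returns the final
    color lists for the two vertex sides.  (Illustrative sketch; the
    canonical-tuple encoding replaces abstract HASH functions.)"""
    E = np.asarray(E, dtype=float)
    m, n = E.shape
    cv = [(h_v[i],) for i in range(m)]   # initial colors = features
    cw = [(h_w[j],) for j in range(n)]
    rounds = rounds if rounds is not None else m + n
    for _ in range(rounds):
        # each new color pairs the old color with the sorted multiset of
        # (edge weight, neighbor color) over nonzero edges
        new_cv = [(cv[i], tuple(sorted((E[i, j], cw[j]) for j in range(n) if E[i, j] != 0)))
                  for i in range(m)]
        new_cw = [(cw[j], tuple(sorted((E[i, j], cv[i]) for i in range(m) if E[i, j] != 0)))
                  for j in range(n)]
        cv, cw = new_cv, new_cw
    return cv, cw
```

An unfoldable instance (distinct vertex features) ends with every vertex holding a unique color, while a foldable one keeps duplicate colors forever, mirroring (A.4).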

Now we can proceed to present the proof of Theorem A.2.

Proof of Theorem A.2. According to Lemma A.5, the mapping $\Phi_{\mathrm{feas}} : \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \to \{0,1\}$ is measurable. By Lusin's theorem, there exists a compact set $E \subset X \subset \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \backslash D_{\mathrm{foldable}}$ such that $\Phi_{\mathrm{feas}}|_E$ is continuous and $\mathrm{Meas}(X \backslash E) < \epsilon$. For any $(G,H), (\hat{G},\hat{H}) \in E$, if $(G,H) \sim (\hat{G},\hat{H})$, Lemma A.9 guarantees that $\Phi_{\mathrm{feas}}(G,H) = \Phi_{\mathrm{feas}}(\hat{G},\hat{H})$. Then, using Theorem A.8, one can conclude that there exists $F \in \mathcal{F}_{\mathrm{GNN}}$ with
$$\sup_{(G,H) \in E} |F(G,H) - \Phi_{\mathrm{feas}}(G,H)| < \frac{1}{2}.$$

This implies that

$$\mathrm{Meas}\left(\left\{(G,H) \in X : \mathbb{I}_{F(G,H)>1/2} \neq \Phi_{\mathrm{feas}}(G,H)\right\}\right) \leq \mathrm{Meas}(X \backslash E) < \epsilon.$$

Similar techniques can be used to prove a sequence of results:

Proof of Theorem 4.2. $D$ is finite, hence compact, and $\Phi_{\mathrm{feas}}|_D$ is continuous. The rest of the proof follows the same argument as in the proof of Theorem A.2, using Lemma A.9 and Theorem A.8.

Proof of Theorem A.3. $\Phi_{\mathrm{obj}}$ is measurable by Lemma A.6, which implies that $\mathbb{I}_{\Phi_{\mathrm{obj}}(\cdot,\cdot)\in\mathbb{R}} : \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \to \{0,1\}$ is also measurable. Then (i) and (ii) can be proved using the same argument as in the proof of Theorem A.2, applied to $X$ and $\mathbb{I}_{\Phi_{\mathrm{obj}}(\cdot,\cdot)\in\mathbb{R}}$, and to $X \cap \Phi_{\mathrm{obj}}^{-1}(\mathbb{R})$ and $\Phi_{\mathrm{obj}}$, respectively.

Proof of Theorem 4.3. $D$ is compact and $\mathbb{I}_{\Phi_{\mathrm{obj}}(\cdot,\cdot)\in\mathbb{R}}$ restricted to $D$ is continuous. In addition, as long as $D \cap \Phi_{\mathrm{obj}}^{-1}(\mathbb{R})$ is nonempty, it is compact and $\Phi_{\mathrm{obj}}$ restricted to it is continuous. One can then use Lemma A.9 and Theorem A.8 to obtain the desired results.

Now we consider the optimal solution mapping $\Phi_{\mathrm{solu}}$ and Theorem A.4, whose proof requires a universal approximation result of $\mathcal{F}_{\mathrm{GNN}}^W$ for equivariant functions. To state the theorem rigorously, we define another equivalence relation on $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n$: $(G,H) \stackrel{W}{\sim} (\hat{G},\hat{H})$ if $(G,H) \sim (\hat{G},\hat{H})$ and, in addition, $C_j^{L,W} = \hat{C}_j^{L,W}$ for all $j \in \{1, 2, \ldots, n\}$, for any $L \in \mathbb{N}$ and any hash functions.

The universal approximation result of $\mathcal{F}_{\mathrm{GNN}}^W$ is stated as follows. It guarantees that $\mathcal{F}_{\mathrm{GNN}}^W$ can approximate any continuous equivariant function, on a compact set, whose separation power is less than or equal to that of the WL test, and it is based on a generalized Stone-Weierstrass theorem established in Azizian & Lelarge (2021).

Theorem A.10 (Theorem E.1 in Chen et al. (2022)). Let $X \subset \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n$ be a compact subset that is closed under the action of $S_m \times S_n$. Suppose that $\Phi \in C(X, \mathbb{R}^n)$ satisfies the following:
(i) For any $\sigma_V \in S_m$, $\sigma_W \in S_n$, and $(G,H) \in X$, it holds that $\Phi((\sigma_V, \sigma_W) * (G,H)) = \sigma_W(\Phi(G,H))$.
(ii) $\Phi(G,H) = \Phi(\hat{G},\hat{H})$ holds for all $(G,H), (\hat{G},\hat{H}) \in X$ with $(G,H) \stackrel{W}{\sim} (\hat{G},\hat{H})$.
(iii) Given any $(G,H) \in X$ and any $j, j' \in \{1, 2, \ldots, n\}$, if $C_j^{l,W} = C_{j'}^{l,W}$ holds for any $l \in \mathbb{N}$ and any choice of hash functions, then $\Phi(G,H)_j = \Phi(G,H)_{j'}$.
Then for any $\epsilon > 0$, there exists $F_W \in \mathcal{F}_{\mathrm{GNN}}^W$ such that $\sup_{(G,H) \in X} \|\Phi(G,H) - F_W(G,H)\| < \epsilon$.

Similar to Theorem A.8, Theorem A.10 is also true for all topological linear spaces $\mathcal{H}_V$ and $\mathcal{H}_W$. We refer to Azizian & Lelarge (2021); Geerts & Reutter (2022) for other results on the closure of equivariant GNNs. Now we can prove Theorem A.4 and Theorem 4.4.

Proof of Theorem A.4. We can assume that $X \subset D_{\mathrm{solu}} \backslash D_{\mathrm{foldable}} \subset \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \backslash D_{\mathrm{foldable}}$ is closed under the action of $S_m \times S_n$, since otherwise one can replace $X$ with $\bigcup_{(\sigma_V,\sigma_W) \in S_m \times S_n} (\sigma_V, \sigma_W) * X$. The solution mapping $\Phi_{\mathrm{solu}} : D_{\mathrm{solu}} \backslash D_{\mathrm{foldable}} \to \mathbb{R}^n$ is measurable by Lemma C.5. According to Lusin's theorem, there exists a compact set $E' \subset X \subset D_{\mathrm{solu}} \backslash D_{\mathrm{foldable}}$ with $\mathrm{Meas}(X \backslash E') < \epsilon / |S_m \times S_n|$ such that $\Phi_{\mathrm{solu}}|_{E'}$ is continuous. Let us set $E = \bigcup_{(\sigma_V,\sigma_W) \in S_m \times S_n} (\sigma_V, \sigma_W) * E' \subset X$, which is compact and closed under the action of $S_m \times S_n$. Furthermore, it holds that
$$\mathrm{Meas}(X \backslash E) = \mathrm{Meas}\left(X \backslash \bigcup_{(\sigma_V,\sigma_W) \in S_m \times S_n} (\sigma_V, \sigma_W) * E'\right) \leq \sum_{(\sigma_V,\sigma_W) \in S_m \times S_n} \mathrm{Meas}(X \backslash (\sigma_V, \sigma_W) * E') = |S_m \times S_n| \cdot \mathrm{Meas}(X \backslash E') < \epsilon.$$
For any $(G,H), (\hat{G},\hat{H}) \in E$, if $(G,H) \stackrel{W}{\sim} (\hat{G},\hat{H})$, then, similarly to Lemma A.9, we know that there exists $\sigma_V \in S_m$ such that $(\sigma_V, \mathrm{Id}) * (G,H) = (\hat{G},\hat{H})$. Then, by the construction of $\Phi_{\mathrm{solu}}$ in Section C, one has $\Phi_{\mathrm{solu}}(G,H) = \Phi_{\mathrm{solu}}(\hat{G},\hat{H}) \in \mathbb{R}^n$. Condition (i) in Theorem A.10 is satisfied by the definition of $\Phi_{\mathrm{solu}}$ (see Section C), and Condition (iii) in Theorem A.10 follows from the fact that the WL test yields a discrete coloring for any graph in $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \backslash D_{\mathrm{foldable}}$. Applying Theorem A.10, one can conclude that for any $\delta > 0$, there exists $F_W \in \mathcal{F}_{\mathrm{GNN}}^W$ with
$$\sup_{(G,H) \in E} \|F_W(G,H) - \Phi_{\mathrm{solu}}(G,H)\| < \delta.$$
Therefore, it holds that
$$\mathrm{Meas}\left(\left\{(G,H) \in X : \|F_W(G,H) - \Phi_{\mathrm{solu}}(G,H)\| > \delta\right\}\right) \leq \mathrm{Meas}(X \backslash E) < \epsilon,$$
which completes the proof.

Proof of Theorem 4.4. Without loss of generality, we can assume that the finite dataset $D \subset D_{\mathrm{solu}} \backslash D_{\mathrm{foldable}}$ is closed under the action of $S_m \times S_n$; otherwise, we can replace $D$ by $\bigcup_{(\sigma_V,\sigma_W) \in S_m \times S_n} (\sigma_V, \sigma_W) * D$. Note that $D$ is compact and $\Phi_{\mathrm{solu}}|_D$ is continuous. The rest of the proof can be done using similar techniques as in the proof of Theorem A.4, with Theorem A.10.

B PROOFS FOR SECTION 5

We collect the proofs of Theorem 5.1, Theorem 5.2, and Theorem 5.3 in this section. Before we proceed, we remark that the following concepts are modified in this section due to the random-feature technique:

• The space of MILP-graphs. Each vertex in the MILP-graph is equipped with an additional feature, and the entire random vector $\omega$ is sampled from $\Omega := [0,1]^m \times [0,1]^n$. More specifically, we write $\omega := (\omega^V, \omega^W) := (\omega_1^V, \ldots, \omega_m^V, \omega_1^W, \ldots, \omega_n^W)$. Equipped with a random feature, the new vertex features are defined by $h_i^{V,R} := (h_i^V, \omega_i^V)$ and $h_j^{W,R} := (h_j^W, \omega_j^W)$ for all $1 \leq i \leq m$, $1 \leq j \leq n$. The corresponding vertex feature spaces are $\mathcal{H}_m^{V,R} := \mathcal{H}_V^m \times [0,1]^m = (\mathcal{H}_V \times [0,1])^m$ and $\mathcal{H}_n^{W,R} := \mathcal{H}_W^n \times [0,1]^n = (\mathcal{H}_W \times [0,1])^n$. The corresponding graph space is $\mathcal{G}_{m,n} \times \mathcal{H}_m^{V,R} \times \mathcal{H}_n^{W,R}$, and it is isomorphic to the space defined in the main text:
$$\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \times \Omega \cong \mathcal{G}_{m,n} \times \mathcal{H}_m^{V,R} \times \mathcal{H}_n^{W,R}. \tag{B.1}$$
Note that the two spaces are actually the same, merely defined in different ways. In the main text, we adopt the former due to its simplicity; in the proofs, we use either one according to the context.

• Topology and measure. We equip $\Omega$ with the standard topology and the Lebesgue measure. The overall space $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \times \Omega$ is then naturally equipped with the product topology and the product measure.

• Key mappings. On the space (with random features) $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \times \Omega$, the three key mappings are defined by $\Phi_{\mathrm{feas}}(G,H,\omega) := \Phi_{\mathrm{feas}}(G,H)$, $\Phi_{\mathrm{obj}}(G,H,\omega) := \Phi_{\mathrm{obj}}(G,H)$, and $\Phi_{\mathrm{solu}}(G,H,\omega) := \Phi_{\mathrm{solu}}(G,H)$ for all $\omega \in \Omega$. This reflects the fact that feasibility, the optimal objective value, and the optimal solution depend only on the MILP instance and are independent of the appended random features.

• Invariance and equivariance.
For any permutations $\sigma_V \in S_m$ and $\sigma_W \in S_n$, $(\sigma_V, \sigma_W) * (G,H,\omega)$ denotes the reordered MILP-graph (equipped with random features). We say a function $F_R$ is permutation invariant if $F_R((\sigma_V, \sigma_W) * (G,H,\omega)) = F_R(G,H,\omega)$ for all $\sigma_V \in S_m$, $\sigma_W \in S_n$, and $(G,H,\omega) \in \mathcal{G}_{m,n} \times \mathcal{H}_m^{V,R} \times \mathcal{H}_n^{W,R}$, and we say a function $F_{W,R}$ is permutation equivariant if $F_{W,R}((\sigma_V, \sigma_W) * (G,H,\omega)) = \sigma_W(F_{W,R}(G,H,\omega))$ for all $\sigma_V \in S_m$, $\sigma_W \in S_n$, and $(G,H,\omega) \in \mathcal{G}_{m,n} \times \mathcal{H}_m^{V,R} \times \mathcal{H}_n^{W,R}$. Clearly, $\Phi_{\mathrm{feas}}$ and $\Phi_{\mathrm{obj}}$ are permutation invariant, and $\Phi_{\mathrm{solu}}$ is permutation equivariant. Furthermore, any $F_R \in \mathcal{F}_{\mathrm{GNN}}^R$ is invariant and any $F_{W,R} \in \mathcal{F}_{\mathrm{GNN}}^{W,R}$ is equivariant.

• The WL test. In the WL test, the vertex features $h_i^V, h_j^W$ are replaced with $h_i^{V,R}, h_j^{W,R}$, respectively, and the corresponding vertex colors in the $l$-th iteration are denoted by $C_i^{l,V,R}, C_j^{l,W,R}$. Initialization: $C_i^{0,V,R} = \mathrm{HASH}_{0,V}(h_i^{V,R})$ and $C_j^{0,W,R} = \mathrm{HASH}_{0,W}(h_j^{W,R})$. For $l = 1, 2, \ldots, L$:
$$C_i^{l,V,R} = \mathrm{HASH}_{l,V}\left(C_i^{l-1,V,R}, \sum_{j=1}^n E_{i,j}\, \mathrm{HASH}'_{l,W}(C_j^{l-1,W,R})\right), \quad C_j^{l,W,R} = \mathrm{HASH}_{l,W}\left(C_j^{l-1,W,R}, \sum_{i=1}^m E_{i,j}\, \mathrm{HASH}'_{l,V}(C_i^{l-1,V,R})\right). \tag{B.2}$$
Note that the update rule (B.2) is the same as that of Algorithm 1; the only difference is the initial vertex features.

• The equivalence relations. Now let us define the equivalence relations $\sim$ and $\stackrel{W}{\sim}$ on the space equipped with random features, $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \times \Omega$. Given two MILP-graphs with random features $(G,H,\omega)$ and $(\hat{G},\hat{H},\omega)$, the outcomes of the WL test are $(\{\{C_i^{L,V,R}\}\}_{i=1}^m, \{\{C_j^{L,W,R}\}\}_{j=1}^n)$ and $(\{\{\hat{C}_i^{L,V,R}\}\}_{i=1}^m, \{\{\hat{C}_j^{L,W,R}\}\}_{j=1}^n)$, respectively. We say $(G,H,\omega) \sim (\hat{G},\hat{H},\omega)$ if $\{\{C_i^{L,V,R}\}\}_{i=1}^m = \{\{\hat{C}_i^{L,V,R}\}\}_{i=1}^m$ and $\{\{C_j^{L,W,R}\}\}_{j=1}^n = \{\{\hat{C}_j^{L,W,R}\}\}_{j=1}^n$ for any $L \in \mathbb{N}$ and any hash functions. We say $(G,H,\omega) \stackrel{W}{\sim} (\hat{G},\hat{H},\omega)$ if $(G,H,\omega) \sim (\hat{G},\hat{H},\omega)$ and, in addition, $C_j^{L,W,R} = \hat{C}_j^{L,W,R}$ for all $j \in \{1, 2, \ldots, n\}$, for any $L \in \mathbb{N}$ and any hash functions.

• Foldability. Suppose the vertex colors corresponding to $(G,H,\omega)$ are $C_i^{l,V,R}, C_j^{l,W,R}$. Then we say $(G,H,\omega)$ is foldable if there exist $1 \leq i \neq i' \leq m$ or $1 \leq j \neq j' \leq n$ such that $C_i^{l,V,R} = C_{i'}^{l,V,R}$ or $C_j^{l,W,R} = C_{j'}^{l,W,R}$ for all $l \in \mathbb{N}$ and any hash functions.

By merely replacing $h_i^V, h_j^W$ with $h_i^{V,R}, h_j^{W,R}$, we define all the required preliminaries. According to the remarks following Theorem A.8, Lemma A.9, and Theorem A.10, we can directly apply these theorems and lemmas to the space $\mathcal{G}_{m,n} \times \mathcal{H}_m^{V,R} \times \mathcal{H}_n^{W,R}$, where $\mathcal{H}_m^{V,R}$ plays the role of $\mathcal{H}_V^m$ and $\mathcal{H}_n^{W,R}$ plays the role of $\mathcal{H}_W^n$.

Proof of Theorem 5.1. Define
$$\Omega_F = \left\{\omega \in \Omega : \exists\, i \neq i' \in \{1, 2, \ldots, m\} \text{ s.t. } \omega_i^V = \omega_{i'}^V, \text{ or } \exists\, j \neq j' \in \{1, 2, \ldots, n\} \text{ s.t. } \omega_j^W = \omega_{j'}^W\right\}. \tag{B.3}$$
Given any $(G,H) \in \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n$, as long as $\omega \notin \Omega_F$, different vertices in $V$ or $W$ are equipped with different vertex features: $h_i^{V,R} \neq h_{i'}^{V,R}$ for all $i \neq i'$ and $h_j^{W,R} \neq h_{j'}^{W,R}$ for all $j \neq j'$. Consequently, each vertex possesses a unique initial color: $C_i^{0,V,R} \neq C_{i'}^{0,V,R}$ for all $i \neq i'$ and $C_j^{0,W,R} \neq C_{j'}^{0,W,R}$ for all $j \neq j'$, for some $\mathrm{HASH}_{0,V}$ and $\mathrm{HASH}_{0,W}$. In other words, although $(G,H)$ may be foldable, $(G,H,\omega)$ must be unfoldable as long as $\omega \notin \Omega_F$. Therefore, for any $(G,H,\omega), (\hat{G},\hat{H},\omega) \in \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \times (\Omega \backslash \Omega_F)$, both $(G,H,\omega)$ and $(\hat{G},\hat{H},\omega)$ are unfoldable. Applying Lemma A.9, we have $(G,H,\omega) \sim (\hat{G},\hat{H},\omega)$ if and only if $(\sigma_V, \sigma_W) * (G,H,\omega) = (\hat{G},\hat{H},\omega)$ for some $\sigma_V \in S_m$ and $\sigma_W \in S_n$. The invariance of $\Phi_{\mathrm{feas}}$ implies that $\Phi_{\mathrm{feas}}(G,H,\omega) = \Phi_{\mathrm{feas}}(\hat{G},\hat{H},\omega)$ for all $(G,H,\omega) \sim (\hat{G},\hat{H},\omega) \in \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \times (\Omega \backslash \Omega_F)$. In addition, the probability of $\omega \in \Omega_F$ is zero since $\omega$ is uniformly distributed: $\mathbb{P}(\Omega_F) = 0$. For any $\epsilon > 0$, there exists a compact subset $\Omega_\epsilon \subset \Omega \backslash \Omega_F$ such that $\mathbb{P}(\Omega \backslash \Omega_\epsilon) = \mathbb{P}((\Omega \backslash \Omega_F) \backslash \Omega_\epsilon) < \epsilon$. Note that $D \times \Omega_\epsilon$ is compact and that $\Phi_{\mathrm{feas}}$ is continuous on $D \times \Omega_\epsilon$. The continuity of $\Phi_{\mathrm{feas}}|_{D \times \Omega_\epsilon}$ follows from the fact that $D$ is a finite dataset and that any mapping with a discrete finite domain is continuous (note that $\Phi_{\mathrm{feas}}(G,H,\omega)$ is independent of $\omega \in \Omega_\epsilon$).
Applying Theorem A.8 to $\Phi_{\mathrm{feas}}$ and $D \times \Omega_\epsilon \subset \mathcal{G}_{m,n} \times \mathcal{H}_m^{V,R} \times \mathcal{H}_n^{W,R}$, one can conclude the existence of $F_R \in \mathcal{F}_{\mathrm{GNN}}^R$ with
$$\sup_{(G,H,\omega) \in D \times \Omega_\epsilon} |F_R(G,H,\omega) - \Phi_{\mathrm{feas}}(G,H,\omega)| < \frac{1}{2}.$$
Since $\Phi_{\mathrm{feas}}$ is independent of the random features $\omega$, the above can be rewritten as
$$\sup_{(G,H,\omega) \in D \times \Omega_\epsilon} |F_R(G,H,\omega) - \Phi_{\mathrm{feas}}(G,H)| < \frac{1}{2}.$$
It thus holds for any $(G,H) \in D$ that $\mathbb{P}\left(\mathbb{I}_{F_R(G,H)>1/2} \neq \Phi_{\mathrm{feas}}(G,H)\right) \leq \mathbb{P}(\Omega \backslash \Omega_\epsilon) < \epsilon$.

Proof of Theorem 5.2. The results can be obtained by applying the techniques in the proof of Theorem 5.1 to $\mathbb{I}_{\Phi_{\mathrm{obj}}(\cdot,\cdot)\in\mathbb{R}}$ and $\Phi_{\mathrm{obj}}$.

Proof of Theorem 5.3. Without loss of generality, we can assume that $D$ is closed under the action of $S_m \times S_n$; otherwise, we can use $\{(\sigma_V, \sigma_W) * (G,H) : (G,H) \in D,\ \sigma_V \in S_m,\ \sigma_W \in S_n\}$ instead of $D$. Let $\Omega_F \subset \Omega$ be the set defined in (B.3), which is clearly closed under the action of $S_m \times S_n$. There exists a compact $\Omega'_\epsilon \subset \Omega \backslash \Omega_F$ with $\mathbb{P}(\Omega \backslash \Omega'_\epsilon) = \mathbb{P}((\Omega \backslash \Omega_F) \backslash \Omega'_\epsilon) < \epsilon / |S_m \times S_n|$. Define $\Omega_\epsilon = \bigcup_{(\sigma_V,\sigma_W) \in S_m \times S_n} (\sigma_V, \sigma_W) * \Omega'_\epsilon \subset \Omega \backslash \Omega_F$, which is compact and closed under the action of $S_m \times S_n$. One has that
$$\mathbb{P}(\Omega \backslash \Omega_\epsilon) \leq \sum_{(\sigma_V,\sigma_W) \in S_m \times S_n} \mathbb{P}(\Omega \backslash (\sigma_V, \sigma_W) * \Omega'_\epsilon) = |S_m \times S_n| \cdot \mathbb{P}(\Omega \backslash \Omega'_\epsilon) < \epsilon.$$
In addition, $D \times \Omega_\epsilon$ is compact in $\mathcal{G}_{m,n} \times (\mathcal{H}_V \times [0,1])^m \times (\mathcal{H}_W \times [0,1])^n$ and is closed under the action of $S_m \times S_n$. We then verify the three conditions in Theorem A.10 for $\Phi_{\mathrm{solu}}$ and $D \times \Omega_\epsilon$. Condition (i) holds automatically by the definition of the optimal solution mapping. For Condition (ii), given any $(G,H,\omega)$ and $(\hat{G},\hat{H},\omega)$ in $D \times \Omega_\epsilon$ with $(G,H,\omega) \stackrel{W}{\sim} (\hat{G},\hat{H},\omega)$, since graphs with vertex features in $D \times \Omega_\epsilon \subset \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \times (\Omega \backslash \Omega_F)$ cannot be folded, one can conclude, similarly to Lemma A.9, that $(\hat{G},\hat{H},\omega) = (\sigma_V, \mathrm{Id}) * (G,H,\omega)$ for some $\sigma_V \in S_m$, which leads to $\Phi_{\mathrm{solu}}(G,H) = \Phi_{\mathrm{solu}}(\hat{G},\hat{H})$.
Condition (iii) also follows from the fact that no graph in $D \times \Omega_\epsilon$ can be folded, so the WL test yields a discrete coloring. Therefore, Theorem A.10 applies, and there exists $F_{W,R} \in \mathcal{F}_{\mathrm{GNN}}^{W,R}$ such that
$$\sup_{(G,H,\omega) \in D \times \Omega_\epsilon} \|F_{W,R}(G,H,\omega) - \Phi_{\mathrm{solu}}(G,H)\| < \delta,$$
which implies that $\mathbb{P}\left(\|F_{W,R}(G,H) - \Phi_{\mathrm{solu}}(G,H)\| > \delta\right) \leq \mathbb{P}(\Omega \backslash \Omega_\epsilon) < \epsilon$ for all $(G,H) \in D$.

C THE OPTIMAL SOLUTION MAPPING Φ SOLU

In this section, we define the equivariant optimal solution mapping $\Phi_{\mathrm{solu}}$ and prove its measurability. The definition consists of several steps.

C.1 THE SORTING MAPPING

We first define a sorting mapping $\Phi_{\mathrm{sort}} : \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \backslash D_{\mathrm{foldable}} \to S_n$ that returns a permutation on $\{w_1, w_2, \ldots, w_n\}$, where $D_{\mathrm{foldable}}$ is the collection of all foldable MILP problems as in Definition 4.1. This can be done via an order refinement procedure, similar to the WL test. The initial order and the order refinement are defined in the following definitions.

Definition C.1. We define total orders on $\mathcal{H}_V$ and $\mathcal{H}_W$ using the lexicographic order:
(i) For any $(b_i, \circ_i), (b_{i'}, \circ_{i'}) \in \mathcal{H}_V = \mathbb{R} \times \{\leq, =, \geq\}$, we say $(b_i, \circ_i) < (b_{i'}, \circ_{i'})$ if one of the following holds:
  - $b_i < b_{i'}$;
  - $b_i = b_{i'}$ and $\iota(\circ_i) < \iota(\circ_{i'})$, where $\iota(\leq) = -1$, $\iota(=) = 0$, and $\iota(\geq) = 1$.
(ii) For any $(c_j, l_j, u_j, \tau_j), (c_{j'}, l_{j'}, u_{j'}, \tau_{j'}) \in \mathcal{H}_W = \mathbb{R} \times (\mathbb{R} \cup \{-\infty\}) \times (\mathbb{R} \cup \{+\infty\}) \times \{0,1\}$, we say $(c_j, l_j, u_j, \tau_j) < (c_{j'}, l_{j'}, u_{j'}, \tau_{j'})$ if one of the following holds:
  - $c_j < c_{j'}$;
  - $c_j = c_{j'}$ and $l_j < l_{j'}$;
  - $c_j = c_{j'}$, $l_j = l_{j'}$, and $u_j < u_{j'}$;
  - $c_j = c_{j'}$, $l_j = l_{j'}$, $u_j = u_{j'}$, and $\tau_j < \tau_{j'}$.

Definition C.2. Let $X = \{\{x_1, x_2, \ldots, x_k\}\}$ and $X' = \{\{x'_1, x'_2, \ldots, x'_{k'}\}\}$ be two multisets whose elements are taken from the same totally ordered set. We say $X \leq X'$ if $(x_{\sigma(1)}, x_{\sigma(2)}, \ldots, x_{\sigma(k)}) \leq (x'_{\sigma'(1)}, x'_{\sigma'(2)}, \ldots, x'_{\sigma'(k')})$ in the sense of the lexicographic order, where $\sigma \in S_k$ and $\sigma' \in S_{k'}$ are permutations such that $x_{\sigma(1)} \leq x_{\sigma(2)} \leq \cdots \leq x_{\sigma(k)}$ and $x'_{\sigma'(1)} \leq x'_{\sigma'(2)} \leq \cdots \leq x'_{\sigma'(k')}$.

Definition C.3. Suppose that $V = \{v_1, v_2, \ldots, v_m\}$ and $W = \{w_1, w_2, \ldots, w_n\}$ are already ordered, and that $E \in \mathbb{R}^{m \times n}$. The order refinement is defined lexicographically:
(i) For any $i, i' \in \{1, 2, \ldots, m\}$, we say $(v_i, \{\{(E_{i,j}, w_j) : E_{i,j} \neq 0\}\}) < (v_{i'}, \{\{(E_{i',j}, w_j) : E_{i',j} \neq 0\}\})$ if one of the following holds:
  - $v_i < v_{i'}$;
  - $v_i = v_{i'}$ and $\{\{(E_{i,j}, w_j) : E_{i,j} \neq 0\}\} < \{\{(E_{i',j}, w_j) : E_{i',j} \neq 0\}\}$ in the sense of Definition C.2, where $(E_{i,j}, w_j) < (E_{i',j'}, w_{j'})$ if and only if $E_{i,j} < E_{i',j'}$, or $E_{i,j} = E_{i',j'}$ and $w_j < w_{j'}$.
(ii) For any $j, j' \in \{1, 2, \ldots, n\}$, we say $(w_j, \{\{(E_{i,j}, v_i) : E_{i,j} \neq 0\}\}) < (w_{j'}, \{\{(E_{i,j'}, v_i) : E_{i,j'} \neq 0\}\})$ if one of the following holds:
  - $w_j < w_{j'}$;
  - $w_j = w_{j'}$ and $\{\{(E_{i,j}, v_i) : E_{i,j} \neq 0\}\} < \{\{(E_{i,j'}, v_i) : E_{i,j'} \neq 0\}\}$ in the sense of Definition C.2, where $(E_{i,j}, v_i) < (E_{i',j'}, v_{i'})$ if and only if $E_{i,j} < E_{i',j'}$, or $E_{i,j} = E_{i',j'}$ and $v_i < v_{i'}$.

With the preparations above, we can now define $\Phi_{\mathrm{sort}}$ in Algorithm 2, which takes as input a weighted bipartite graph $G = (V \cup W, E) \in \mathcal{G}_{m,n}$ with vertex features $H = (h_1^V, h_2^V, \ldots, h_m^V, h_1^W, h_2^W, \ldots, h_n^W) \in \mathcal{H}_V^m \times \mathcal{H}_W^n$ such that $(G,H) \notin D_{\mathrm{foldable}}$.

Remark C.4. The output of Algorithm 2 is well defined and unique because we use unfoldable $(G,H)$ as input. Note that the order refinement in Definition C.3 is stricter than the color refinement in the WL test. Therefore, after $m+n$ iterations, there are no $j \neq j' \in \{1, 2, \ldots, n\}$ with $w_j = w_{j'}$, since the WL test returns a discrete coloring (if there are no collisions) for $(G,H) \notin D_{\mathrm{foldable}}$.

The sorting mapping $\Phi_{\mathrm{sort}}$ has some straightforward properties:
• $\Phi_{\mathrm{sort}}$ is equivariant: $\Phi_{\mathrm{sort}}((\sigma_V, \sigma_W) * (G,H)) = (\sigma_V, \sigma_W) * \Phi_{\mathrm{sort}}(G,H) = \sigma_W \circ \Phi_{\mathrm{sort}}(G,H)$ for any $\sigma_V \in S_m$, $\sigma_W \in S_n$, and $(G,H) \in \mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \backslash D_{\mathrm{foldable}}$. This is because Definition C.3 defines the total order and its refinement solely in terms of the vertex features, independently of the input order.
• $\Phi_{\mathrm{sort}}$ is measurable, where the range $S_n$ is equipped with the discrete measure. This is because $\Phi_{\mathrm{sort}}$ is defined via finitely many comparisons.
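Definitions C.1 and C.2 translate directly into code. The sketch below (function names are ours) realizes the total order on $\mathcal{H}_V$ via a sort key and the multiset order of Definition C.2 via sorting followed by lexicographic sequence comparison:

```python
# Total order on H_V per Definition C.1(i): compare b first, then iota(o).
IOTA = {'<=': -1, '=': 0, '>=': 1}

def hv_key(h):
    """Sort key realizing the total order on H_V.  h = (b, o) with o a
    relation symbol.  (Illustrative; names are ours.)"""
    b, circ = h
    return (b, IOTA[circ])

def multiset_leq(X, Y):
    """Multiset order of Definition C.2: sort both multisets into
    nondecreasing sequences, then compare them lexicographically.
    Python's built-in list comparison is exactly lexicographic."""
    return sorted(X) <= sorted(Y)
```

For instance, `multiset_leq([3, 1, 2], [1, 2, 4])` holds because the sorted sequence `[1, 2, 3]` precedes `[1, 2, 4]` lexicographically.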



In Gasse et al. (2019) and other related empirical studies, the feature spaces $\mathcal{H}_V, \mathcal{H}_W$ are more complicated than those defined in this paper; we keep only the most basic features here for simplicity of analysis. If we remove the integer constraints ($I = \emptyset$), MILP reduces to linear programming (LP), and the solution mapping becomes easier to define. In this case, as long as the optimal objective value is finite, there must exist an optimal solution, and the optimal solution with the smallest $\ell_2$-norm is unique (Chen et al., 2022). Therefore, a mapping $\Phi_{\mathrm{solu}}$ that maps an LP to its optimal solution with the smallest $\ell_2$-norm is well defined on $\Phi_{\mathrm{obj}}^{-1}(\mathbb{R})$.



Figure 3: Feasibility

Then we take the feasible instances from sets $D_1$ and $D_2$ and form new datasets $D_1^{\mathrm{feasible}}$ and $D_2^{\mathrm{feasible}}$.

Figure 4: GNN can approximate $\Phi_{\mathrm{obj}}$ and $\Phi_{\mathrm{solu}}$

by Theorem 3.1. By the Stone-Weierstrass theorem, there exists $F \in \mathcal{F}_{\mathrm{GNN}}$ with $\sup_{(G,H) \in X} |\bar{\Phi}(\pi(G,H)) - \bar{F}(\pi(G,H))| < \epsilon$, which then implies (A.2).

Remark B.1. A deeper observation is that Theorem 5.1, Theorem 5.2, and Theorem 5.3 remain true even if we only allow one message-passing layer in the GNN structure. This is because the separation power of GNNs with one message-passing layer is the same as that of the WL test with one iteration (see Chen et al. (2022, Appendix C)), and one iteration of the WL test suffices to yield a discrete coloring for MILP-graphs in $\mathcal{G}_{m,n} \times \mathcal{H}_V^m \times \mathcal{H}_W^n \times (\Omega \backslash \Omega_F)$.

Algorithm 2 proceeds as follows:
1: Order V and W according to Definition C.1.
2: for l = 1 : m + n do
3:   Refine the ordering on V and W according to Definition C.3.
4: end for
5: Return σ_W ∈ S_n such that w_{σ_W(1)} < w_{σ_W(2)} < · · · < w_{σ_W(n)}.

n}, for any L ∈ N and any hash functions.
• Foldability. Suppose the vertex colors corresponding to (

C.2 THE OPTIMAL SOLUTION MAPPING

We then define the optimal solution mapping Φ_solu : D_solu\D_foldable → R^n, based on Φ_sort, where D_solu is the collection of all feasible MILP problems in which every component of l and u is finite. We have mentioned before that any MILP problem in D_solu admits at least one optimal solution. For any (G, H) ∈ D_solu\D_foldable, we denote by X_solu(G, H) ⊂ R^n the set of optimal solutions to the MILP problem associated with (G, H). One can see that X_solu(G, H) is compact, since for every (G, H) ∈ D_solu\D_foldable both l and u are finite, which leads to the boundedness (and hence the compactness) of X_solu(G, H).

Given any permutation σ ∈ S_n, let us define a total order ≺_σ on R^n: x ≺_σ x′ if and only if there exists j ∈ {1, 2, . . . , n} with x_{σ(j)} < x′_{σ(j)} and x_{σ(j′)} = x′_{σ(j′)} for all j′ < j. We define Φ_solu(G, H) as the smallest element of X_solu(G, H) with respect to the order ≺_σ, where σ = Φ_sort(G, H). The existence and the uniqueness are true since X_solu(G, H) is compact. More explicitly, the components of Φ_solu(G, H) can be determined recursively:

Φ_solu(G, H)_{σ(1)} = min{ x_{σ(1)} : x ∈ X_solu(G, H) },

and

Φ_solu(G, H)_{σ(j)} = min{ x_{σ(j)} : x ∈ X_solu(G, H), x_{σ(j′)} = Φ_solu(G, H)_{σ(j′)}, ∀ j′ < j },

for j = 2, 3, . . . , n. It follows from the equivariance of Φ_sort(G, H) and X_solu(G, H) that Φ_solu(G, H) is also equivariant, i.e., Φ_solu((σ_V, σ_W) ∗ (G, H)) = σ_W ∗ Φ_solu(G, H).

We then show the measurability of Φ_solu.

Lemma C.5. The optimal solution mapping Φ_solu : D_solu\D_foldable → R^n is measurable.

Proof. It suffices to show that for any fixed ◦ ∈ {≤, =, ≥}^m, τ ∈ {0, 1}^n, and σ ∈ S_n, the mapping Φ_j(A, b, c, l, u) = Φ_solu(ι(A, b, c, l, u))_{σ(j)} is measurable for all j ∈ {1, 2, . . . , n}, where ι is the embedding map when ◦ and τ are fixed. Without loss of generality, we assume that ◦ = (≤, . . . , ≤, =, . . . , =, ≥, . . . , ≥), where ≤, =, and ≥ appear k_1, k_2 − k_1, and m − k_2 times, respectively, and that τ = (0, . . . , 0, 1, . . . , 1), where 0 and 1 appear k and n − k times, respectively. Note that the domain of Φ_j, i.e., ι^{-1}(D_solu\D_foldable) ∩ (Φ_sort ∘ ι)^{-1}(σ), is measurable. One can define a verification function V_solu, combining the objective gap |c⊤x − Φ_obj(ι(A, b, c, l, u))| with the feasibility-violation function V_feas, such that x ∈ R^k × Z^{n−k} is an optimal solution to the problem ι(A, b, c, l, u) if and only if V_solu(A, b, c, l, u, x) = 0.
In addition, V_solu is measurable with respect to (A, b, c, l, u) for any x, by the measurability of Φ_obj (see Lemma A.6) and the continuity of V_feas, and is continuous with respect to x. Then we proceed to prove that Φ_j is measurable by induction. We first consider the case j = 1. For any (A, b, c, l, u) ∈ ι^{-1}(D_solu\D_foldable) ∩ (Φ_sort ∘ ι)^{-1}(σ) and any φ ∈ R, the following are equivalent:
• There exist r ∈ N_+ and x ∈ X_solu(ι(A, b, c, l, u)) such that x_{σ(1)} ≤ φ − 1/r.
• There exists r ∈ N_+ such that, for any r′ ∈ N_+, x_{σ(1)} ≤ φ − 1/r holds for some x ∈ R^k × Z^{n−k} with V_solu(A, b, c, l, u, x) ≤ 1/r′.
Therefore, Φ_1^{-1}((−∞, φ)) can be written as a countable union over r of countable intersections over r′ of measurable sets, and is hence measurable, which implies the measurability of Φ_1. Then we assume that Φ_1, . . . , Φ_{j−1} (j ≥ 2) are all measurable and show that Φ_j is also measurable. Define V^j_solu(A, b, c, l, u, x) = V_solu(A, b, c, l, u, x) + Σ_{j′<j} |x_{σ(j′)} − Φ_{j′}(A, b, c, l, u)|; then V^j_solu is also measurable with respect to (A, b, c, l, u) for any x and is continuous with respect to x. For any (A, b, c, l, u) ∈ ι^{-1}(D_solu\D_foldable) ∩ (Φ_sort ∘ ι)^{-1}(σ), the following are equivalent:
• There exist r ∈ N_+ and x ∈ R^k × Z^{n−k} such that V^j_solu(A, b, c, l, u, x) = 0 and x_{σ(j)} ≤ φ − 1/r.
• There exists r ∈ N_+ such that, for any r′ ∈ N_+, x_{σ(j)} ≤ φ − 1/r holds for some x ∈ R^k × Z^{n−k} with V^j_solu(A, b, c, l, u, x) ≤ 1/r′.
Therefore, Φ_j^{-1}((−∞, φ)) can be expressed in a similar format as Φ_1^{-1}((−∞, φ)), and is hence measurable.
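The component-wise recursion defining Φ_solu can be illustrated with a brute-force toy over a finite candidate set; the tiny binary problem below is hypothetical and stands in for X_solu(G, H):

```python
# Lexicographically smallest optimal solution in the variable order sigma:
# first minimize the objective, then fix x_{sigma(1)}, x_{sigma(2)}, ... in turn.
import itertools

def lex_min_optimal(candidates, objective, feasible, sigma):
    opt = min(objective(x) for x in candidates if feasible(x))
    X_opt = [x for x in candidates if feasible(x) and objective(x) == opt]
    for j in sigma:                      # recursion over sigma(1), sigma(2), ...
        best_j = min(x[j] for x in X_opt)
        X_opt = [x for x in X_opt if x[j] == best_j]
    return X_opt[0]                      # unique by construction
```

For min x0 + x1 subject to x0 + x1 ≥ 1 over {0, 1}^2, both (0, 1) and (1, 0) are optimal; the order σ decides which one Φ_solu returns, which is why Φ_solu must be defined relative to Φ_sort to be well defined and equivariant.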

D DETAILS OF THE NUMERICAL EXPERIMENTS AND EXTRA EXPERIMENTS

MILP instance generation. Each instance in D_1 has 20 variables and 6 constraints and is generated as follows:
• For each variable, c_j ∼ N(0, 0.01) and l_j, u_j ∼ N(0, 10); if l_j > u_j, then switch l_j and u_j. The probability that x_j is an integer variable is 0.5.
• For each constraint, ◦_i ∼ U({≤, =, ≥}) and b_i ∼ N(0, 1).
• A has 60 nonzero elements, each distributed as N(0, 1).

Each instance in D_2 has 20 variables and 6 equality constraints, and we construct the (2k − 1)-th and 2k-th problems via the following approach (1 ≤ k ≤ 500):
• Sample J = {j_1, j_2, . . . , j_6} as a random subset of {1, 2, . . . , 20} with 6 elements. For j ∈ J, x_j ∈ {0, 1}. For j ∉ J, x_j is a continuous variable with bounds l_j, u_j ∼ N(0, 10); if l_j > u_j, then switch l_j and u_j.
• The constraints for the (2k − 1)-th problem (feasible) and for the 2k-th problem (infeasible) are constructed correspondingly.

MLP architectures. As mentioned in the main text, all the learnable functions in the GNN are parameterized as multilayer perceptrons (MLPs) with two hidden layers. The embedding sizes d_0, . . . , d_L are uniformly taken as a value d chosen from {2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048}. All the activation functions are ReLU.

Training settings. We use Adam (Kingma & Ba, 2014) as our training optimizer with a learning rate of 0.0001. The loss function is the mean squared error. All experiments are conducted on a Linux server with an Intel Xeon Platinum 8163 CPU and eight NVIDIA Tesla V100 GPUs.
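The D_1 generation recipe above can be sketched as follows. This is an illustrative reconstruction, not the authors' script, and it assumes the N(mean, variance) convention for the stated distributions:

```python
# Generate one random MILP instance in the style of D_1:
# 20 variables, 6 constraints, 60 nonzeros in A.
import numpy as np

def generate_d1_instance(rng, n_vars=20, n_cons=6, nnz=60):
    c = rng.normal(0.0, np.sqrt(0.01), n_vars)     # c_j ~ N(0, 0.01)
    l = rng.normal(0.0, np.sqrt(10.0), n_vars)     # l_j ~ N(0, 10)
    u = rng.normal(0.0, np.sqrt(10.0), n_vars)     # u_j ~ N(0, 10)
    l, u = np.minimum(l, u), np.maximum(l, u)      # switch if l_j > u_j
    is_int = rng.random(n_vars) < 0.5              # integer variable w.p. 0.5
    senses = rng.choice(["<=", "=", ">="], n_cons) # o_i ~ U({<=, =, >=})
    b = rng.normal(0.0, 1.0, n_cons)               # b_i ~ N(0, 1)
    # 60 nonzero entries of A, placed uniformly, each ~ N(0, 1).
    A = np.zeros((n_cons, n_vars))
    flat = rng.choice(n_cons * n_vars, size=nnz, replace=False)
    A.flat[flat] = rng.normal(0.0, 1.0, nnz)
    return A, b, c, l, u, is_int, senses
```

Each returned tuple corresponds to one bipartite MILP-graph (G, H) with vertex features (b_i, ◦_i) on the constraint side and (c_j, l_j, u_j, integrality) on the variable side.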

Extra experiments on generalization

We conduct some numerical experiments on generalization for the random-feature approach. The GNN/MLP architecture is the same as described before, with embedding size d = 8. The size of the training set is chosen from {10, 100, 1000} and the size of the testing set is 1000, where all instances are generated in the same way as D_2, with the only difference in the choice of c: setting c = 0 would make every feasible solution optimal and thus the labels would depend on the solver's choice, while setting c_1 = · · · = c_20 = 0.01 makes the label depend only on the MILP problem itself, which better fits our purpose, i.e., representing characteristics of MILP problems. In addition, our datasets for feasibility consist of 50% feasible instances and 50% infeasible ones, while the other datasets are obtained by removing infeasible samples.

We report the errors on the training set and the testing set in Table 1. The generalization performance is good in our setting when the training size is relatively large. Although our experiments are still small-scale, these results indicate the potential of the random-feature approach for solving MILP problems in practice.
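The random-feature augmentation used in these experiments can be sketched as follows; the choice of one extra coordinate drawn uniformly from [0, 1] per vertex is an illustrative assumption:

```python
# Append one i.i.d. random coordinate to every vertex feature vector, so
# that almost surely no two vertices share identical features. This breaks
# the symmetries that make foldable instances indistinguishable to GNNs.
import numpy as np

def append_random_feature(h_v, h_w, rng):
    # h_v: (m, d_v) constraint features; h_w: (n, d_w) variable features.
    omega_v = rng.random((h_v.shape[0], 1))
    omega_w = rng.random((h_w.shape[0], 1))
    return np.hstack([h_v, omega_v]), np.hstack([h_w, omega_w])
```

At test time a fresh ω is drawn per instance, which is why the approximation guarantees in Theorem 5.1 and 5.2 are stated in probability over ω rather than uniformly.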

