LEARNING CONTEXT-AWARE ADAPTIVE SOLVERS TO ACCELERATE CONVEX QUADRATIC PROGRAMMING

Abstract

Convex quadratic programming (QP) is an important sub-field of mathematical optimization. The alternating direction method of multipliers (ADMM) is a successful method for solving QPs. Although ADMM shows promising results on various types of QP, its convergence speed is known to be highly dependent on the step-size parameter ρ. Due to the absence of a general rule for setting ρ, it is often tuned manually or heuristically. In this paper, we propose CA-ADMM (Context-aware Adaptive ADMM), which learns to adaptively adjust ρ to accelerate ADMM. CA-ADMM extracts the spatio-temporal context, which captures the dependency of the primal and dual variables of the QP and their temporal evolution during the ADMM iterations, and chooses ρ based on the extracted context. Through extensive numerical experiments, we validated that CA-ADMM effectively generalizes to unseen QP problems of different sizes and classes (i.e., having different QP parameter structures). Furthermore, we verified that CA-ADMM can dynamically adjust ρ considering the stage of the optimization process to further accelerate convergence.

1. INTRODUCTION

Among optimization classes, the quadratic program (QP) is widely used due to its mathematical tractability (e.g., convexity) in fields such as portfolio optimization (Boyd et al., 2017; Cornuéjols et al., 2018; Boyd et al., 2014; Markowitz, 1952), machine learning (Kecman et al., 2001; Sha et al., 2002), control (Buijs et al., 2002; Krupa et al., 2022; Bartlett et al., 2002), and communication applications (Luo & Yu, 2006; Hons, 2001). As the need to solve large optimization problems grows, it is becoming increasingly important to ensure the scalability of QP solvers so that large problems can be solved accurately and quickly. Among solution methods for QP, first-order methods (Frank & Wolfe, 1956) owe their popularity to their efficiency relative to other approaches, such as active-set (Wolfe, 1959) and interior-point methods (Nesterov & Nemirovskii, 1994). The alternating direction method of multipliers (ADMM) (Gabay & Mercier, 1976; Mathematique et al., 2004) is commonly used because it returns high-quality solutions with relatively small computational expense (Stellato et al., 2020b). Even though ADMM shows satisfactory results in various applications, its convergence speed is highly dependent on both the QP parameters and the user-given step-size ρ. In an attempt to resolve this issue, numerous studies have proposed heuristic (Boyd et al., 2011; He et al., 2000; Stellato et al., 2020a) or theory-driven (Ghadimi et al., 2014) methods for deciding ρ, but a strategy for selecting the best-performing ρ has yet to be found (Stellato et al., 2020b). Usually, ρ is tuned in a case-dependent manner (Boyd et al., 2011; Stellato et al., 2020a; Ghadimi et al., 2014). Instead of relying on hand-tuning ρ, a recent study (Ichnowski et al., 2021) utilizes reinforcement learning (RL) to learn a policy that adaptively adjusts ρ to accelerate the convergence of ADMM.
They model the iterative procedure of ADMM as a Markov decision process (MDP) and apply a generic RL method to train a policy that maps the current ADMM solution state to a scalar value of ρ. This approach is effective relative to heuristic methods (e.g., OSQP (Stellato et al., 2020a)), but it has several limitations. It uses a scalar ρ and therefore cannot adjust ρ differently for individual constraints. Moreover, it only considers the current state, without its history, to determine ρ, and therefore cannot capture the non-stationary aspects of ADMM iterations. This method inspired us to model the iterations of ADMM as a non-stationary networked system and to develop a more flexible and effective policy that adjusts ρ for all constraints simultaneously while considering the evolution of the ADMM states. In this study, we propose Context-aware Adaptive ADMM (CA-ADMM), an RL-based adaptive ADMM algorithm that increases the convergence speed of ADMM. To overcome the mentioned limitations, we model the iterative solution-finding process of ADMM as an MDP whose context is determined by the QP structure (or parameters). We then utilize a graph recurrent neural network (GRNN) to extract (1) the relationship among the primal and dual variables of the QP problem, i.e., its spatial context, and (2) the temporal evolution of the primal and dual variables, i.e., its temporal context. The policy network then utilizes the extracted spatio-temporal context to adjust ρ. From extensive numerical experiments, we verified that CA-ADMM adaptively adjusts ρ in consideration of the QP structure and the stage of the ADMM iteration to further accelerate convergence. We evaluated CA-ADMM on various QP benchmark datasets and found it to be significantly more efficient than heuristic and learning-based baselines in the number of iterations until convergence.
CA-ADMM shows remarkable generalization to changes in problem size and, more importantly, to different benchmark datasets. Through ablation studies, we also confirmed that both the spatial and temporal context extraction schemes are crucial to learning a generalizable ρ policy. The contributions of the proposed method are summarized below:
• Spatial relationships: We propose a heterogeneous graph representation of the QP and ADMM state that captures spatial context, and verify its effect on the generalization of the learned policy.
• Temporal relationships: We propose a temporal context extraction scheme that captures the relationship of ADMM states over the iterations, and verify its effect on the generalization of the learned policy.
• Performance/Generalization: CA-ADMM outperforms state-of-the-art heuristics (i.e., OSQP) and learning-based baselines on the training QP problems and, more importantly, on out-of-training QP problems, which include large problems from a variety of domains.

2. RELATED WORKS

Methods for selecting ρ in ADMM. In the ADMM algorithm, the step-size ρ plays a vital role in determining convergence speed and accuracy. For some special cases of QP, there is a method to compute the optimal ρ (Ghadimi et al., 2014). However, this method requires linear independence of the constraints (e.g., nonzero nullity of A), which is difficult to satisfy in general QP problems. Thus, various heuristics have been proposed to choose ρ (Boyd et al., 2011; He et al., 2000; Stellato et al., 2020a). Typically, adaptive methods that utilize a state-dependent step-size ρ_t show relatively faster convergence than non-adaptive methods. He et al. (2000) and Boyd et al. (2011) suggest a rule for adjusting ρ_t depending on the ratio of the residuals. OSQP (Stellato et al., 2020a) extends this heuristic rule by adjusting ρ with the values of the primal and dual optimization variables. Even though OSQP shows improved performance, designing such adaptive rules requires tremendous effort. Furthermore, a rule designed for a specific QP problem class is hard to generalize to different QP classes having different sizes, objectives, and constraints. Recently, Ichnowski et al. (2021) employed RL to learn a policy for adaptively adjusting ρ depending on the states of the ADMM iterations. This method outperforms other baselines, showing that an effective rule for adjusting ρ can be learned from data without problem-specific knowledge. However, it still does not sufficiently reflect the structural characteristics of the QP or the temporal evolution of ADMM iterations. Both limitations make capturing the proper problem context challenging, limiting its generalization to unseen problems of different sizes and with alternate objectives and constraints. Graph neural networks for optimization problems. An optimization problem comprises an objective function, decision variables, and constraints.
When the optimization variable is a vector, there are typically interactions among the components of the decision vector with respect to the objective or constraints. To capture such interactions, many studies have proposed graph representations of optimization problems. Gasse et al. (2019) express mixed integer programming (MIP) using a bipartite graph with two node types, decision-variable nodes and constraint nodes. They express the relationships among variables and between decision variables and constraints using different kinds of edges whose edge values are the optimization coefficients. Ding et al. (2019) extend the bipartite graph with a new node type, 'objective'. The objective node represents the linear objective term and is connected to each variable node, with the edge feature given by the coefficient of that variable in the objective term. They use the tripartite graph as the input of a GCN to predict the solution of MIP. ML-accelerated scientific computing. ML models are employed to predict the results of complex computations, e.g., solutions of ODEs, PDEs, and optimization problems, and are trained by supervised learning. As these methods tend to be sample-inefficient, other approaches learn operators that expedite computation for ODEs (Poli et al., 2020; Berto et al., 2021), fixed-point iterations (Bai et al., 2021; Park et al., 2021b), and matrix decomposition (Donon et al., 2020). Poli et al. (2020) use ML to predict the residuals between accurate and inaccurate ODE solvers and use the predicted residual to expedite ODE solving. Bai et al. (2021) and Park et al. (2021b) train networks that serve as alternate fixed-point iterators, trained to give the same solution while requiring fewer iterations to converge.
However, the training objective of all the listed methods is to minimize the discrepancy between the network prediction and the ground-truth solution, meaning that those methods risk producing invalid solutions. CA-ADMM instead learns to provide a ρ that minimizes the number of ADMM iterations, rather than directly predicting the solution of the QP. This approach, i.e., choosing ρ for ADMM, still guarantees that the found solution remains equal to the original one.

3. BACKGROUND

Quadratic programming (QP). A quadratic program (QP) with N variables and M constraints is an optimization problem of the form

min_x (1/2) x^⊤ P x + q^⊤ x subject to l ≤ Ax ≤ u,   (1)

where x ∈ R^N is the decision variable, P ∈ S^N_+ is the positive semi-definite cost matrix, q ∈ R^N is the linear cost term, A ∈ R^{M×N} is an M × N matrix that describes the constraints, and l and u are their lower and upper bounds. QP generally has no closed-form solution except in special cases, e.g., A = 0; thus, QP is generally solved via iterative methods. First-order QP solvers. Among the iterative methods for solving QP, the alternating direction method of multipliers (ADMM) has attracted considerable interest due to its simplicity and suitability for various large-scale problems, including statistics, machine learning, and control applications (Boyd et al., 2011). ADMM solves a given QP(P, q, A, l, u) through the following iterative scheme. At each step t, ADMM solves the linear system

[P + σI, A^⊤; A, −diag(ρ)^{−1}] [x_{t+1}; ν_{t+1}] = [σx_t − q; z_t − diag(ρ)^{−1} y_t],   (2)

where σ > 0 is a regularization parameter that assures a unique solution of the linear system, ρ ∈ R^M is the step-size parameter, and diag(ρ) ∈ R^{M×M} is the diagonal matrix with elements ρ. By solving Eq. (2), ADMM finds x_{t+1} and ν_{t+1}. Then, it updates z_t and y_t with the following equations:

z̃_{t+1} ← z_t + diag(ρ)^{−1}(ν_{t+1} − y_t)   (3)
z_{t+1} ← Π(z̃_{t+1} + diag(ρ)^{−1} y_t)   (4)
y_{t+1} ← y_t + diag(ρ)(z̃_{t+1} − z_{t+1}),   (5)

where Π is the projection onto the hyper-box [l, u]. ADMM iterates until the primal residual r^primal_t = Ax_t − z_t ∈ R^M and the dual residual r^dual_t = P x_t + q + A^⊤ y_t ∈ R^N are sufficiently small. For instance, the termination criteria are ||r^primal_t||_∞ ≤ ϵ^primal and ||r^dual_t||_∞ ≤ ϵ^dual, where ϵ^primal and ϵ^dual are sufficiently small positive numbers.
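The iterative scheme above can be sketched in a few lines of NumPy. This is a didactic sketch of the splitting in Eqs. (2)-(5) with a scalar ρ and without the relaxation step, matrix factorization caching, or sparse linear algebra that a production solver such as OSQP uses; the function name and default parameters are illustrative.

```python
import numpy as np

def solve_qp_admm(P, q, A, l, u, rho=1.0, sigma=1e-6,
                  eps=1e-4, max_iter=4000):
    """ADMM sketch for min 0.5 x'Px + q'x  s.t.  l <= Ax <= u.

    Follows Eqs. (2)-(5): solve the KKT system, then update the
    auxiliary and dual variables, until both residuals are small.
    """
    N, M = P.shape[0], A.shape[0]
    x, z, y = np.zeros(N), np.zeros(M), np.zeros(M)
    R = rho * np.eye(M)                         # diag(rho), scalar rho here
    Rinv = (1.0 / rho) * np.eye(M)
    K = np.block([[P + sigma * np.eye(N), A.T],
                  [A, -Rinv]])                  # KKT matrix of Eq. (2)
    t = 0
    for t in range(max_iter):
        rhs = np.concatenate([sigma * x - q, z - Rinv @ y])
        sol = np.linalg.solve(K, rhs)           # re-solved each step for brevity
        x, nu = sol[:N], sol[N:]
        z_tilde = z + Rinv @ (nu - y)           # Eq. (3)
        z = np.clip(z_tilde + Rinv @ y, l, u)   # Eq. (4): projection onto [l, u]
        y = y + R @ (z_tilde - z)               # Eq. (5)
        r_primal = A @ x - z                    # primal residual
        r_dual = P @ x + q + A.T @ y            # dual residual
        if np.abs(r_primal).max() <= eps and np.abs(r_dual).max() <= eps:
            break
    return x, t + 1
```

For example, minimizing 0.5||x||² − 1ᵀx over the box 0 ≤ x ≤ (0.4, 0.6) should return x ≈ (0.4, 0.6), the projection of the unconstrained optimum onto the feasible set.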

4. METHOD

In this section, we present Context-aware Adaptive ADMM (CA-ADMM) that automatically adjusts ρ parameters considering the spatial and temporal contexts of the ADMM to accelerate the convergence of ADMM. We first introduce the contextual MDP formulation of ADMM iterations for solving QP problems and discuss two mechanisms devised to extract the spatial and temporal context from the ADMM iterations.

4.1. CONTEXTUAL MDP FORMULATION

As shown in Eqs. (2) to (5), ADMM solves a QP by iteratively updating the intermediate variables x_t, y_t, and z_t, and the updating rule is parameterized by the QP parameters P, q, A, l, u. We aim to learn a QP solver that can solve general QP instances whose structures differ from the QPs used for training. In this perspective, we can consider the ADMM iteration for each QP as a different MDP whose dynamics are characterized by the structure of the QP and the ADMM updating rule. Based on this perspective, our objective, learning a QP solver that can solve general QP instances, is equivalent to learning an adaptive policy (ADMM operator) that can solve a contextual MDP (a QP instance with different parameters). Assume we have a QP(P, q, A, l, u), where P, q, A, l, u are the parameters of the QP problem, and ADMM(ϕ), where ϕ denotes the parameters of the ADMM algorithm. In our study, ϕ is ρ, but it can be any configurable parameter that affects the solution updating procedure. We then define the contextual MDP M = (X, U, T^c_ρ, R, γ), where X = {(x_t, y_t, z_t)}_{t=1,2,...} is the set of states, U = {ρ_t}_{t=1,2,...} is the set of actions, T^c_ρ(x, x′) = P(x_{t+1} = x′ | x_t = x, ρ_t = ρ, c) is the transition function whose behaviour is contextualized by the context c = {P, q, A, l, u, ϕ}, R is the reward function that returns 0 when x_t is the terminal state (i.e., the QP problem is solved) and −1 otherwise, and γ ∈ [0, 1) is a discount factor. We make an MDP transition every 10 ADMM steps to balance computational cost and acceleration performance. That is, we observe the MDP state every 10 ADMM steps and change ρ_t; during these ten steps, the same action ρ_t is used for ADMM.
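The contextual-MDP interface described above can be sketched as follows. To keep the snippet self-contained, a simple contraction toward a known fixed point stands in for the actual ADMM dynamics of Eqs. (2)-(5); the class name and the stand-in update are assumptions, not the paper's implementation. What the sketch shows is the bookkeeping: one MDP transition per 10 inner solver iterations, with reward −1 per step and 0 at the terminal state.

```python
import numpy as np

class ADMMEnv:
    """Contextual-MDP sketch around an iterative solver.

    A contraction toward `x_star` stands in for the ADMM updates, so
    the MDP structure (action rho, 10 inner iterations per MDP step,
    reward 0 at the terminal state and -1 otherwise) can be shown
    without a full QP solver.
    """
    def __init__(self, x_star, eps=1e-6, solver_steps_per_mdp_step=10):
        self.x_star = np.asarray(x_star, dtype=float)  # plays the role of c
        self.x = np.zeros_like(self.x_star)
        self.eps = eps
        self.k = solver_steps_per_mdp_step

    def step(self, rho):
        # rho plays the role of the step-size action; the same action
        # is reused for all k inner solver iterations, as in the paper.
        for _ in range(self.k):
            self.x += rho * (self.x_star - self.x)  # stand-in solver update
        done = np.abs(self.x - self.x_star).max() <= self.eps
        reward = 0.0 if done else -1.0              # R of the contextual MDP
        return self.x.copy(), reward, done
```

Running a fixed action ρ = 0.5 on this toy environment halves the error at every inner iteration, so the episode terminates after three MDP steps with the reward sequence (−1, −1, 0).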
As shown in the definition of T^c_ρ, the context c alters the solution updating procedure of ADMM (i.e., T^c_ρ ̸= T^{c′}_ρ for a given state x). Therefore, it is crucial to effectively process the context information c when deriving the policy π_θ(x_t) that accelerates the ADMM iteration. One possible approach to accommodate c is to manually design a feature that serves as a (minimal) sufficient statistic separating T^c_ρ from the other T^{c′}_ρ. However, designing such a feature can require enormous effort and thus limits the applicability of learned solvers to new problem classes. Alternatively, we propose to learn a context encoder that extracts useful context information not only from the structure of the target QP but also from the solution updating history of ADMM, so that we can use a context-dependent adaptive policy for adjusting ρ. To successfully extract the context information from the states, we consider the relationships among the primal and dual variables (i.e., the spatial relationships) and the evolution of the primal/dual variables over the ADMM iterations (i.e., the temporal relationships). To capture such spatio-temporal correlations, we propose an encoder network that is parameterized via a GNN followed by an RNN. Fig. 1 visualizes the overall network architecture.

4.2. EXTRACTING SPATIAL CONTEXT VIA GRAPH NEURAL NETWORK

Heterogeneous graph representation of QP. The primal variables x_t and dual variables y_t of a QP are related via the QP parameters (P, q, A, l, u). As shown in Fig. 2, their relationship can be modeled as a heterogeneous graph where the nodes represent the primal and dual variables and the edges represent the relationships among these variables. As the roles of the primal and dual variables in solving QP are distinct, the graph representation must reflect the variable type information. Accordingly, we represent the relationships of the variables at the t-th MDP step with a directed heterogeneous graph G_t = (V_t, E_t). As the graph construction is only associated with the current t, we omit the step index t for notational brevity. The node set V consists of the primal node set V^primal and the dual node set V^dual. The n-th primal node and m-th dual node have node features s^primal_n and s^dual_m defined as:

s^primal_n = [log_10 r^dual_n, log_10 ||r^dual||_∞, 1_{r^dual_n < ϵ^dual}]  (encoding the ADMM state),
s^dual_m = [log_10 r^primal_m, log_10 ||r^primal||_∞, 1_{r^primal_m < ϵ^primal}, y_m, ρ_m, min(z_m − l_m, u_m − z_m)  (encoding the ADMM state), 1_equality, 1_inequality  (encoding the QP problem)],

where r^primal_m and r^dual_n denote the m-th and n-th elements of the primal and dual residuals, respectively, ||·||_∞ denotes the infinity norm, and 1_{r^primal_m < ϵ^primal} and 1_{r^dual_n < ϵ^dual} indicate whether the m-th primal and n-th dual residuals are smaller than ϵ^primal and ϵ^dual, respectively. 1_equality and 1_inequality indicate whether the m-th constraint is an equality or an inequality, respectively. The primal and dual node features are designed to capture the ADMM state information and the QP structure. However, these node features do not contain information about P, q, or A; we encode such QP problem structure in the edges.
The edge set E consists of:
• The primal-to-primal edge set E^p2p, defined as {e^p2p_ij | P_ij ̸= 0, ∀(i, j) ∈ [[1, N]] × [[1, N]]}, where [[1, N]] = {1, 2, . . . , N}; i.e., the edge from the i-th primal node to the j-th primal node exists when the corresponding entry of P is not zero. The edge e^p2p_ij has the feature s^p2p_ij = [P_ij, q_i].
• The primal-to-dual edge set E^p2d, defined as {e_ij | A_ji ̸= 0, ∀(i, j) ∈ [[1, N]] × [[1, M]]}. The edge e^p2d_ij has the feature s^p2d_ij = [A_ji].
• The dual-to-primal edge set E^d2p, defined as {e_ij | A_ij ̸= 0, ∀(i, j) ∈ [[1, M]] × [[1, N]]}. The edge e^d2p_ij has the feature s^d2p_ij = [A_ij].
Graph preprocessing for a unique QP representation. QP problems have the same optimal decision variables under scale transformations of the objective function and constraints. However, such transformations alter the QP graph representation. To make the graph representation invariant to scale transformations, we preprocess QP problems by dividing the elements of P, q, l, A, u by problem-dependent constants. The details of the preprocessing steps are given in Appendix A. Graph embedding with hetero GNN. The heterogeneous graph G models the various types of information among the primal and dual variables. To take such type information into account for the node and edge embeddings, we propose heterogeneous graph attention (HGA), a variant of type-aware graph attention (Park et al., 2021a), to embed G. HGA employs separate node and edge encoders for each node and edge type. For each edge type, it computes edge embeddings, applies an attention mechanism to compute the weight factors, and then aggregates the resulting edge embeddings via weighted summation. Then, HGA sums the per-type aggregated edge embeddings to form a type-independent edge embedding. Lastly, HGA updates the node embeddings by using the type-independent edge embeddings as inputs to the node encoders. We provide the details of HGA in Appendix B.
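The typed edge construction above can be sketched as follows. This is a minimal illustration of the sparsity-based edge sets only; the function name and the dense edge-list representation are assumptions for exposition (the node features of Section 4.2 are omitted for brevity).

```python
import numpy as np

def build_qp_graph(P, q, A):
    """Sketch of the heterogeneous QP graph.

    One primal node per decision variable, one dual node per
    constraint. Typed edges follow the sparsity of P and A, and the
    edge features carry the corresponding QP coefficients.
    """
    N, M = P.shape[0], A.shape[0]
    # primal i -> primal j when P_ij != 0, feature [P_ij, q_i]
    p2p = [((i, j), [P[i, j], q[i]])
           for i in range(N) for j in range(N) if P[i, j] != 0]
    # primal i -> dual j when A_ji != 0, feature [A_ji]
    p2d = [((i, j), [A[j, i]])
           for i in range(N) for j in range(M) if A[j, i] != 0]
    # dual i -> primal j when A_ij != 0, feature [A_ij]
    d2p = [((i, j), [A[i, j]])
           for i in range(M) for j in range(N) if A[i, j] != 0]
    return {"p2p": p2p, "p2d": p2d, "d2p": d2p}
```

For a dense 2×2 matrix P and a single constraint touching one variable, the construction yields four primal-to-primal edges and one edge in each direction between the touched primal node and the dual node.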

4.3. EXTRACTING TEMPORAL CONTEXT VIA RNN

The previous section discussed extracting context by considering the spatial correlation among the variables within an MDP step. We now discuss extracting the temporal correlations. As verified in the optimization literature, deciding ρ based on multi-stage information (e.g., momentum methods) helps increase the stability and solution quality of solvers. Inspired by this, we consider the time evolution of the ADMM states to extract temporal context, using an RNN over the ADMM state observations. At the t-th MDP step, we consider l historical state observations. Specifically, we first apply HGA to G_{t−l}, . . . , G_{t−1} and then apply a GRU (Chung et al., 2014) as follows:

G′_{t−l+δ} = HGA(G_{t−l+δ}) for δ = 0, 1, 2, · · · , l − 1,   (9)
H_{t−l+δ} = GRU(V′_{t−l+δ}, H_{t−l+δ−1}) for δ = 1, 2, · · · , l − 1,   (10)

where G′_{t−l+δ} is the updated graph at t − l + δ, V′_{t−l+δ} is the set of updated node embeddings of G′_{t−l+δ}, and H_{t−l+δ} are the hidden embeddings. We define the node-wise context C_t at time t as the last hidden embedding H_{t−1}.
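The recurrence of Eq. (10) can be sketched with a minimal GRU cell applied node-wise over the embedding history. This is a stand-alone NumPy sketch, assuming randomly initialized (untrained) weights and stand-in arrays in place of the HGA outputs; the hidden size and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: the HGA node embeddings V'_{t-l+delta} are fed
    in step by step and the last hidden state serves as the node-wise
    temporal context C_t."""
    def __init__(self, d_in, d_h):
        s = 1.0 / np.sqrt(d_h)
        self.Wz = rng.uniform(-s, s, (d_in + d_h, d_h))  # update gate
        self.Wr = rng.uniform(-s, s, (d_in + d_h, d_h))  # reset gate
        self.Wh = rng.uniform(-s, s, (d_in + d_h, d_h))  # candidate

    def __call__(self, x, h):
        xh = np.concatenate([x, h], axis=-1)
        z = sigmoid(xh @ self.Wz)
        r = sigmoid(xh @ self.Wr)
        h_tilde = np.tanh(np.concatenate([x, r * h], axis=-1) @ self.Wh)
        return (1 - z) * h + z * h_tilde     # convex mix of old and new

def temporal_context(node_embedding_history, d_h=8):
    """node_embedding_history: list of (num_nodes, d_in) arrays, oldest
    first, standing in for HGA outputs over the last l MDP steps."""
    cell = GRUCell(node_embedding_history[0].shape[-1], d_h)
    h = np.zeros((node_embedding_history[0].shape[0], d_h))
    for v in node_embedding_history:
        h = cell(v, h)                       # Eq. (10), node-wise
    return h                                 # C_t = last hidden state
```

Because the hidden state starts at zero and each update is a convex combination of the previous state and a tanh candidate, the resulting context stays bounded in (−1, 1) per coordinate.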

4.4. ADJUSTING ρ t USING EXTRACTED CONTEXT

We parameterize the policy π_θ(G_t, C_t) via HGA followed by an MLP. Since the action ρ_t is not defined at t, we exclude ρ_t from the node features of G_t and instead concatenate C_t node-wise to the node features. For the action-value network Q_ϕ(G_t, ρ_t, C_t), we use G_t, ρ_t, and C_t as inputs; Q_ϕ(G_t, ρ_t, C_t) has a similar network architecture to π_θ(G_t, C_t). We train π_θ(G_t, C_t) and Q_ϕ(G_t, ρ_t, C_t) on QP problems of size 10∼15 (i.e., N ∈ [10, 15]) with the DDPG (Lillicrap et al., 2015) algorithm. Appendix C details the network architecture and training procedure. Comparison to RLQP. Our closest baseline, RLQP (Ichnowski et al., 2021), is also trained using RL. At step t, it utilizes the state x = ({s^dual_m}_{m=1,...,M}), with s^dual_m = [log_10 ||r^primal||_∞, log_10 ||r^dual||_∞, y_m, ρ_m, min(z_m − l_m, u_m − z_m), z_m − (Ax)_m]. It then uses a shared MLP π_θ(s^dual_m) to compute ρ_m. We observe that this state representation and policy network are insufficient to capture the contextual information needed to differentiate MDPs. The state representation does not include information about P and q, and the policy network does not explicitly consider the relations among the dual variables. However, as shown in Eq. (2), the ADMM iteration yields different (x_t, ν_t) as P, q, and A change, meaning that T^c_ρ also changes. Therefore, such state representations and network architectures may result in an inability to capture contextual changes among QP problems. Additionally, using a single state observation as the input for π_θ may also hinder the capture of temporal context.
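The actor head that maps per-node features to one positive step-size per constraint can be sketched as follows. The HGA trunk is omitted; the weights W1/W2, the ReLU hidden layer, and the bounded log-space squashing are assumptions for illustration (the paper's exact head and ρ range are in its Appendix C).

```python
import numpy as np

def rho_policy(dual_node_embed, context, W1, W2,
               rho_min=1e-6, rho_max=1e6):
    """Sketch of the policy head: per-dual-node embeddings are
    concatenated node-wise with the temporal context C_t and mapped
    through a shared MLP to one positive rho per constraint."""
    h = np.concatenate([dual_node_embed, context], axis=-1)
    h = np.maximum(h @ W1, 0.0)                 # ReLU hidden layer
    raw = (h @ W2).squeeze(-1)                  # one unbounded logit per node
    # squash into a bounded log-space interval, then exponentiate,
    # so the resulting step-sizes are positive and well-conditioned
    log_lo, log_hi = np.log(rho_min), np.log(rho_max)
    log_rho = log_lo + (np.tanh(raw) + 1.0) * 0.5 * (log_hi - log_lo)
    return np.exp(log_rho)
```

Parameterizing ρ in log-space is a common choice for step-sizes spanning many orders of magnitude: the tanh keeps the action bounded for DDPG-style training while the exponential guarantees ρ > 0.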

5. EXPERIMENTS

We evaluate CA-ADMM on various canonical QP problems: Random QP, EqQP, Portfolio, SVM, Huber, Control, and Lasso. We use RLQP, another learned ADMM solver (Ichnowski et al., 2021), and OSQP, a heuristic-based solver (Stellato et al., 2020a), as baselines. Training and testing problems are generated according to Appendix D. In-domain results. It is often the case that a similar optimization problem is repeatedly solved with only parameter changes. We train the solver using a particular class of QP instances and apply the trained solver to the same class of QP problems with different parameters. We use the number of ADMM iterations until convergence as the performance metric for quantifying solver speed. Table 1 compares the means and standard deviations of the ADMM iterations required by the different models in solving the QP benchmark problem sets. As shown in Table 1, CA-ADMM exhibits a 2∼8x acceleration compared to OSQP, one of the best heuristic methods. We also observed that CA-ADMM consistently outperforms RLQP, generally with smaller standard deviations. We evaluate the size-transferability and scalability of CA-ADMM by examining whether models trained on smaller QP problems still perform well on large QPs. We trained CA-ADMM on small Random QP instances having 10∼15 variables and applied the trained solver to large Random QP instances having up to 1000 variables. Fig. 3 shows how the number of ADMM iterations varies with the size of the QP. As shown in the figure, CA-ADMM requires the fewest iterations for all sizes of test Random QP instances despite having been trained on the smallest setting of 10∼15. The gap between our method and RLQP widens as the QP size increases. It is interesting to note that OSQP requires relatively few ADMM iterations when the problem size becomes larger, possibly because the parameters of OSQP are tuned for medium-sized QPs.
We conclude that CA-ADMM learns how to effectively adjust the ADMM parameters such that it can effectively solve even large QP instances. Cross-domain results. We hypothesize that our learned QP solvers transfer to new domains more easily than the baselines. We also observed that for EqQP and Huber, CA-ADMM exhibits better performance than in the in-domain cases. These results indicate that the context encoder plays a crucial role in transferring to different problem classes. Benchmark QP results. From the above results, we confirm that CA-ADMM outperforms the baseline algorithms on various synthetic datasets. We then evaluate CA-ADMM on 134 public QP benchmark instances (Maros & Mészáros, 1999), which have different distributions of the QP parameters (i.e., P, q, A, l, and u). For these tasks, we trained CA-ADMM and RLQP on Random QP, EqQP, Portfolio, SVM, Huber, Control, and Lasso. As shown in Table A.8, CA-ADMM requires the fewest iterations for 85 instances. These results indicate that CA-ADMM can generalize to QP problems that follow completely different generating distributions. Application to linear programming. Linear programming (LP) is an important class of mathematical optimization. By setting P to the zero matrix, a QP problem becomes an LP problem. To further probe the generalization capability of CA-ADMM, we apply CA-ADMM trained on Random QP instances with 10 ≤ N < 15 to 100 random LPs of size 10 ≤ N < 15. As shown in Table A.7, CA-ADMM requires ∼1.75 and ∼30.78 times fewer iterations than RLQP and OSQP, respectively, while solving all LP instances within 5000 iterations. These results again highlight the generalization capability of CA-ADMM to different problem classes. Computational times. We measure solver performance by the number of iterations to converge (i.e., solve), as it is independent of the implementation and computational resources.
However, having low computation time is also vital. On a desktop computer equipped with an AMD Threadripper 2990WX CPU and an Nvidia Titan X GPU, we measure the computational times of CA-ADMM, OSQP on GPU, and OSQP on CPU. At each MDP step, CA-ADMM and the baseline algorithms compute ρ via the following three steps: (1) configuring and solving the linear system - linear system solving; (2) constructing the state (e.g., graph construction) - state construction; and (3) computing ρ from the state - ρ computation. We first measure the average unit computation times (i.e., per-MDP-step computation time) of the three steps on various-sized QP problems. As shown in Table A.5, the dominating factor is the linear system solving step rather than the state construction and ρ computation steps. We then measure the total computation times of each algorithm. As shown in Table A.6, when the problem size is small, CA-ADMM takes longer than the other baselines, especially OSQP. However, as the problem size increases, CA-ADMM takes less time than the other baselines due to its reduced number of iterations.

6. ABLATION STUDY

Effect of spatial context extraction methods. To understand the contribution of the spatial and temporal context extraction schemes to the performance of our networks, we conduct an ablation study. The variants are HGA, which uses the spatial context extraction scheme, and RLQP, which does not. As shown in Table 3, the models using the spatial context (i.e., CA-ADMM and HGA) outperform RLQP.

Effect of temporal context extraction methods.

To further investigate the effects of temporal context, we evaluate CA-ADMM against HGA, which excludes the GRU (i.e., the temporal context extraction scheme) from our model. We also consider RLQP. Table 3 shows that for N ≤ 15, HGA produces better performance than CA-ADMM, while for N ≥ 15 the opposite is true. We conclude that the temporal extraction mechanism is crucial for generalizing to larger QP problems. Qualitative analysis. Additionally, we visualize the ρ_t of CA-ADMM and HGA while solving a QP problem of size 2,000. As shown in Fig. 4, both models gradually decrease ρ_t over the course of optimization. This behavior is similar to the well-known step-size scheduling heuristics implemented in OSQP. We also observed that the ρ_t suggested by CA-ADMM generally decreases over the optimization steps, whereas HGA plateaus when scheduling ρ_t. We observed similar patterns on different QP instances of different sizes. Effect of the MDP step interval. We update ρ every N = 10 ADMM iterations. In principle, we can adjust ρ at every iteration while solving a QP. In general, a smaller N allows the algorithm to change ρ more often and potentially further decrease the number of iterations. However, it tends to increase the computation time due to more frequent linear system solving (i.e., solving Eq. (2)). A higher N shows the opposite tendency in the iteration/computation-time trade-off. Table A.9 summarizes the effect of N on the ADMM iterations and computational times. From the results, we confirm that N = 10 attains a nice balance between iterations and computational time.

7. CONCLUSION

To enhance the convergence of ADMM, we introduced CA-ADMM, a learning-based policy that adaptively adjusts the step-size parameter ρ. We employ a spatio-temporal context encoder that extracts the spatial context, i.e., the correlation among the primal and dual variables, and the temporal context, i.e., the temporal evolution of the variables, from the ADMM states. The extracted context contains information about the given QP and the optimization procedure, and is used as input to the policy for adjusting ρ for the given QP. We empirically demonstrated that CA-ADMM significantly outperforms heuristic and learning-based baselines, and that it generalizes effectively to different QP benchmark classes and to large QPs.

REFERENCES

Bartolomeo Stellato, Goran Banjac, Paul Goulart, Alberto Bemporad, and Stephen Boyd. OSQP: An operator splitting solver for quadratic programs. Mathematical Programming Computation, 12(4):637-672, 2020b.

Philip Wolfe. The simplex method for quadratic programming. Econometrica, 27:170, 1959.

A GRAPH PREPROCESSING

In this section, we provide the details of the graph preprocessing scheme that makes the heterogeneous graph representation scale-invariant. The preprocessing is done by scaling the objective function and the constraints. Algorithms 1 and 2 describe the preprocessing procedures.

Algorithm 1: Objective scaling
Input: quadratic cost matrix P ∈ S^N_+, linear cost term q ∈ R^N
Output: scaled quadratic cost matrix P′ ∈ S^N_+, scaled linear cost term q′ ∈ R^N
1: p* = max(max_n |P_nn|, 2 max_{(i,j), i̸=j} |P_ij|)  // compute scaler
2: P′ = P / p*
3: q′ = q / p*

Algorithm 2: Constraint scaling
Input: constraint matrix A ∈ R^{M×N}, lower bounds l ∈ R^M, upper bounds u ∈ R^M
Output: scaled constraint matrix A′ ∈ R^{M×N}, scaled lower bounds l′ ∈ R^M, scaled upper bounds u′ ∈ R^M
1: for m = 1, 2, . . . , M do
2:   a*_m = max_n |A_mn|  // compute scaler for each row
3:   A′_m = A_m / a*_m
4:   l′_m = l_m / a*_m
5:   u′_m = u_m / a*_m
6: end for
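Algorithms 1 and 2 translate directly into NumPy. This sketch assumes a nonzero objective matrix and constraint rows with at least one nonzero entry (so that the scalers are nonzero); the function names are illustrative.

```python
import numpy as np

def scale_objective(P, q):
    """Algorithm 1: divide P and q by the larger of the largest
    diagonal magnitude and twice the largest off-diagonal magnitude,
    so scaled versions of the same objective map to one graph."""
    off = np.abs(P - np.diag(np.diag(P)))          # off-diagonal part
    p_star = max(np.abs(np.diag(P)).max(), 2.0 * off.max())
    return P / p_star, q / p_star

def scale_constraints(A, l, u):
    """Algorithm 2: divide each row of A, and the matching entries of
    l and u, by the row's largest absolute entry."""
    a_star = np.abs(A).max(axis=1)                 # per-row scaler
    return A / a_star[:, None], l / a_star, u / a_star
```

For P = [[4, 1], [1, 2]], the scaler is p* = max(4, 2·1) = 4, so the scaled matrix has a unit leading entry; each constraint row is normalized to have maximum absolute entry 1.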

B DETAILS OF HGA

In this section, we provide the details of the HGA layer. HGA is composed of four steps, edge update, edge aggregation, edge type-aware node update, and hetero node aggregation, to extract an informative context. Each step considers the edge types while computing the updated node and edge embeddings. We denote the source and destination nodes with the indices i and j and the edge type with k. h_i and h_j denote the source and destination node embeddings, and h^k_{ij} denotes the embedding of the type-k edge between nodes i and j.

Edge update. HGA computes the edge embedding h′_{ij} and the corresponding attention logit z_{ij} as:

h′_{ij} = HGA^k_E([h_i, h_j, h^k_{ij}])   (A.1)
z_{ij} = HGA^k_A([h_i, h_j, h^k_{ij}])   (A.2)

where HGA^k_E and HGA^k_A are the hetero edge update function and the hetero attention function for edge type k, respectively. Both functions are parameterized with multilayer perceptrons (MLPs).

Edge aggregation. HGA first computes the attention weights over the typed edges as:

α_{ij} = exp(z^k_{ij}) / Σ_{i∈N_k(j)} exp(z^k_{ij})   (A.3)

where N_k(j) = {v_i ∈ N(j) | type of e_{ij} = k} is the set of neighbors of v_j whose edge to v_j has type k. Second, HGA aggregates the messages and produces the type-k message m^k_j for node v_j.

Edge type-aware node update. The aggregated message m^k_j is used to compute the updated node embedding h^k_j for edge type k as:

m^k_j = Σ_{i∈N_k(j)} α_{ij} h′_{ij}   (A.4)
h^k_j = HGA^k_V([h_j, m^k_j])   (A.5)

where HGA^k_V is the hetero node update function, composed of an MLP. The above three steps (edge update, edge aggregation, edge type-aware node update) are performed separately for each edge type to preserve the characteristics of each type and extract meaningful embeddings.

Hetero node aggregation. HGA aggregates the updated hetero node features h^k_j to produce the updated node embedding h′_j as follows.
First, HGA computes the attention logit d^k_j corresponding to h^k_j as:

d^k_j = HGA_N(h^k_j)   (A.6)

where HGA_N is the hetero node attention function, composed of an MLP. Second, HGA computes the attention weight β^k_j using the softmax over d^k_j:

β^k_j = exp(d^k_j) / Σ_{k∈E_dst(j)} exp(d^k_j)   (A.7)

where E_dst(j) is the set of edge types that have a destination node at j. Finally, HGA aggregates the per-type updated messages to compute the updated node feature h′_j for v_j:

h′_j = Σ_{k∈E_dst(j)} β^k_j h^k_j   (A.8)
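The softmax-based aggregation steps (Eqs. (A.3)-(A.4) and (A.7)-(A.8)) can be illustrated with a minimal numpy sketch for a single destination node. The helper names are hypothetical, and the learned MLPs (HGA_E, HGA_A, HGA_V, HGA_N) are assumed to have already produced the logits and embeddings that are passed in.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate_type_k(z, h_edge):
    """Aggregate type-k edge messages for one destination node (Eqs. A.3-A.4).

    z:      (E,) attention logits z^k_ij for the incoming type-k edges
    h_edge: (E, d) updated edge embeddings h'_ij
    Returns m^k_j, the aggregated type-k message of dimension d.
    """
    alpha = softmax(z)          # attention weights alpha_ij over incoming edges
    return alpha @ h_edge       # m^k_j = sum_i alpha_ij h'_ij

def hetero_node_aggregate(d_logits, h_types):
    """Combine per-type node embeddings with a softmax over edge types (Eqs. A.7-A.8).

    d_logits: (K,) logits d^k_j, one per edge type in E_dst(j)
    h_types:  (K, d) per-type node embeddings h^k_j
    Returns the updated node embedding h'_j.
    """
    beta = softmax(d_logits)    # attention weights beta^k_j over edge types
    return beta @ h_types       # h'_j = sum_k beta^k_j h^k_j
```

With uniform logits both functions reduce to a plain mean over their inputs, which is a useful sanity check of the attention normalization.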

C DETAILS OF NETWORK ARCHITECTURE AND TRAINING

In this section, we provide the architectures of the policy network π_θ and the action-value network Q_ϕ.

Policy network π_θ architecture. As explained in Section 4.4, the policy network consists of an HGA layer and an MLP. We parameterize π_θ as follows:

ρ_t = MLP_θ(HGA_θ(CONCAT(G_t, C_t)))   (A.9)

where MLP_θ is a two-layered MLP with hidden dimensions [64, 32], LeakyReLU hidden activations, and the ExpTanh (Eq. (A.10)) last activation, and HGA_θ is an HGA layer.

ExpTanh(x) = (tanh(x) + 1) × 3 + (−3)   (A.10)

Q network Q_ϕ architecture. As mentioned in Section 4.4, Q_ϕ has a similar architecture to π_θ. We parameterize Q_ϕ as follows:

Q_t = MLP_ϕ(READOUT(HGA_ϕ(CONCAT(G_t, C_t, ρ_t))))   (A.11)

where MLP_ϕ is a two-layered MLP with hidden dimensions [64, 32], LeakyReLU hidden activations, and the identity last activation, HGA_ϕ is an HGA layer, and READOUT is the weighted-sum, min, and max readout function that summarizes the node embeddings into a single vector.

Training details. We train all models with mini-batches of size 128 for 5,000 epochs using Adam with fixed learning rates. For π_θ, we set the learning rate to 10^{-4} and, for Q_ϕ, to 10^{-3}. We set the history length l to 3.
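The ExpTanh last activation of Eq. (A.10) can be written directly; note that (tanh(x) + 1) × 3 + (−3) simplifies algebraically to 3 tanh(x), so the policy output is bounded in (−3, 3). The function name is ours.

```python
import numpy as np

def exp_tanh(x):
    """ExpTanh activation from Eq. (A.10): (tanh(x) + 1) * 3 + (-3).

    Equivalent to 3 * tanh(x); outputs are bounded in the open interval (-3, 3).
    """
    return (np.tanh(x) + 1.0) * 3.0 - 3.0
```

The bound keeps the network's raw output in a fixed range regardless of the input magnitude, which is a common way to stabilize a continuous action before it is mapped to a step size.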

D DETAILS OF PROBLEM GENERATION

In this section, we provide the generation scheme of the problem classes in OSQP, which are used to train and test our model. Every problem class is based on OSQP (Stellato et al., 2020a), but we modified some settings, such as the percentage of nonzero elements in the matrices or the matrix sizes, to match the graph sizes among the problem classes.

D.1 RANDOM QP

min_x (1/2) x^⊤ P x + q^⊤ x  subject to  l ≤ Ax ≤ u   (A.12)

Problem instances. We set a random positive semidefinite matrix P ∈ R^{n×n} by multiplying a random matrix M with its transpose, where each element M_ij ∼ N(0, 1) and only 15% of the elements are nonzero. We generate the constraint matrix A ∈ R^{m×n} with A_ij ∼ N(0, 1) and only 45% nonzero elements. We choose the upper bound u with u_i ∼ (Ax)_i + N(0, 1), where x is a randomly chosen vector, and set the lower bound l_i to −∞.

D.2 EQUALITY CONSTRAINED QP

min_x (1/2) x^⊤ P x + q^⊤ x  subject to  Ax = u (obtained by setting l = u)   (A.13)

Problem instances. We set a random positive semidefinite matrix P ∈ R^{n×n} by multiplying a random matrix M with its transpose, where each element M_ij ∼ N(0, 1) and only 30% of the elements are nonzero. We generate the constraint matrix A ∈ R^{m×n} with A_ij ∼ N(0, 1) and only 75% nonzero elements. We choose the bound u with u_i = (Ax)_i, where x is a randomly chosen vector, and set the lower bound l_i equal to u_i.

D.3 PORTFOLIO OPTIMIZATION

min_x (1/2) x^⊤ D x + (1/2) y^⊤ y − (1/(2γ)) µ^⊤ x  subject to  y = F^⊤ x, 1^⊤ x = 1, x ≥ 0   (A.14)

Problem instances. We set the factor loading matrix F ∈ R^{n×m}, whose elements F_ij ∼ N(0, 1) with 90% nonzero elements. We generate a diagonal matrix D ∈ R^{n×n} of asset-specific risks, whose elements D_ii ∼ U(0, 1) × √m. We choose a random vector µ ∈ R^n of expected returns, whose elements µ_i ∼ N(0, 1).
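As an illustration, the Random QP generator of Appendix D.1 can be sketched with dense numpy arrays. The original benchmark uses sparse matrices; `random_qp` and its argument names are our own.

```python
import numpy as np

def random_qp(n, m, p_density=0.15, a_density=0.45, seed=0):
    """Sketch of the Random QP generator (Appendix D.1), dense version.

    P = M M^T with a sparse standard-normal M (positive semidefinite by
    construction), A sparse standard-normal, u_i = (A x)_i + N(0, 1) for a
    random x, and l_i = -inf (one-sided inequality constraints).
    """
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, n)) * (rng.random((n, n)) < p_density)
    P = M @ M.T
    q = rng.standard_normal(n)
    A = rng.standard_normal((m, n)) * (rng.random((m, n)) < a_density)
    x = rng.standard_normal(n)
    u = A @ x + rng.standard_normal(m)
    l = np.full(m, -np.inf)
    return P, q, A, l, u
```

Because u is generated from a feasible point x plus noise, every instance is guaranteed to have a nonempty feasible set.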

D.4 SUPPORT VECTOR MACHINE (SVM)

min_x x^⊤ x + λ 1^⊤ t  subject to  t ≥ diag(b) Ax + 1, t ≥ 0   (A.15)

Problem instances. We set the matrix A ∈ R^{m×n} with elements

A_ij ∼ N(1/n, 1/n)  if i ≤ m/2,  and  A_ij ∼ N(−1/n, 1/n)  otherwise.

We generate the random vector b ∈ R^m with b_i = +1 if i ≤ m/2 and b_i = −1 otherwise, and we choose the scalar λ = 1/2.

D.5 HUBER FITTING

min_x u^⊤ u + 2M 1^⊤ (r + s)  subject to  Ax − b − u = r − s, r ≥ 0, s ≥ 0   (A.16)

Problem instances. We set the matrix A ∈ R^{m×n} with elements A_ij ∼ N(0, 1) and 50% nonzero elements. To generate the vector b ∈ R^m, we first choose a random vector v ∈ R^n with

v_i ∼ N(0, 1)  with probability p = 0.95,  and  v_i ∼ U[0, 10]  otherwise.

Then, we set b = A(v + ϵ), where ϵ_i ∼ N(0, 1/n).

D.6 OPTIMAL CONTROL

min x^⊤_T Q_T x_T + Σ_{t=0}^{T−1} (x^⊤_t Q x_t + u^⊤_t R u_t)  subject to  x_{t+1} = A x_t + B u_t, x_t ∈ X, u_t ∈ U, x_0 = x_init   (A.17)

Problem instances. We set the dynamics matrix A ∈ R^{n×n} as A = I + ∆, where ∆_ij ∼ N(0, 0.01). We generate the matrix B ∈ R^{n×m}, whose elements B_ij ∼ N(0, 1). We choose the state cost Q as diag(q), where q_i ∼ U(0, 10) with 30% zero elements in q. We generate the input cost R as 0.1 I. We set the horizon T to 5.
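Among the generators above, the SVM data construction (D.4) is easy to sketch in numpy. The text is ambiguous about whether 1/n denotes the variance or the standard deviation of the Gaussians; the sketch below assumes it is the variance, and the function name `svm_qp_data` is ours.

```python
import numpy as np

def svm_qp_data(n, m, seed=0):
    """Sketch of the SVM instance generator (Appendix D.4).

    The first m/2 rows of A are drawn from N(1/n, 1/n) and the rest from
    N(-1/n, 1/n); labels b are +1 for the first half and -1 otherwise.
    Assumes 1/n is the variance of the Gaussian.
    """
    rng = np.random.default_rng(seed)
    half = m // 2
    std = np.sqrt(1.0 / n)                 # standard deviation from variance 1/n
    A = np.vstack([
        rng.normal(+1.0 / n, std, size=(half, n)),
        rng.normal(-1.0 / n, std, size=(m - half, n)),
    ])
    b = np.concatenate([np.ones(half), -np.ones(m - half)])
    lam = 0.5                              # lambda = 1/2 as in the text
    return A, b, lam
```

The two Gaussian clusters give a linearly separable tendency with overlap, so the hinge-loss slack variables t in Eq. (A.15) are genuinely active.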

D.7 LASSO

min_x (1/2) y^⊤ y + γ 1^⊤ t  subject to  y = Ax − b, −t ≤ x ≤ t   (A.18)

Problem instances. We set the matrix A ∈ R^{m×n}, whose elements A_ij ∼ N(0, 1) with 90% nonzero elements. To generate the vector b ∈ R^m, we first set a sparse vector v ∈ R^n with

v_i = 0  with probability p = 0.5,  and  v_i ∼ N(0, 1/n)  otherwise.

Then, we choose b = Av + ϵ, where ϵ_i ∼ N(0, 1). We set the weighting parameter γ = (1/5) ||A^⊤ b||_∞.

For the seven problem classes above, the ranges of n and m are given in Table A.3, which also describes the relationship between n and m for each class.

D.8 ENTIRE RANDOM QP

min_x (1/2) x^⊤ P x + q^⊤ x  subject to  l ≤ Ax ≤ u, Bx = b   (A.19)

Problem instances. We set a random positive semidefinite matrix P ∈ R^{n×n} by multiplying a random matrix M with its transpose, where each element M_ij ∼ N(0, 1) and only 15% of the elements are nonzero. We generate the inequality constraint matrix A ∈ R^{m1×n} with A_ij ∼ N(0, 1) and only 60% nonzero elements. We choose the upper bound u ∈ R^{m1} with u_i ∼ (Ax)_i + N(0, 1), where x is a randomly chosen vector, and set l_i to −∞. We generate the equality constraint matrix B ∈ R^{m2×n} with B_ij ∼ N(0, 1) and only 60% nonzero elements. We set the constant b ∈ R^{m2} with b_i = (Bx)_i, where x is the same randomly chosen vector used to generate u. We set m1 and m2 to ⌊m/7⌋ and ⌊6m/7⌋, respectively.
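As another example, the Lasso construction of Appendix D.7 can be sketched as follows (dense numpy; interpreting N(0, 1/n) as a Gaussian with variance 1/n; the function name is ours).

```python
import numpy as np

def lasso_qp_data(n, m, density=0.9, seed=0):
    """Sketch of the Lasso instance generator (Appendix D.7).

    v is sparse (each entry zero with probability 0.5, else N(0, 1/n)),
    b = A v + eps with eps_i ~ N(0, 1), and gamma = ||A^T b||_inf / 5.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((m, n)) * (rng.random((m, n)) < density)
    v = rng.normal(0.0, np.sqrt(1.0 / n), size=n) * (rng.random(n) >= 0.5)
    b = A @ v + rng.standard_normal(m)
    gamma = np.abs(A.T @ b).max() / 5.0
    return A, b, gamma
```

Tying γ to ||A^⊤ b||_∞ keeps the regularization strength comparable across instances of different scales, since ||A^⊤ b||_∞ is the smallest γ for which the Lasso solution is exactly zero.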



Figure 2: Heterograph representation of QP.

Figure 3: QP size generalization results. Smaller is better (↓). We plot the average and standard deviation of the iterations as bold lines and error bars, respectively. The gray area indicates the training QP sizes.

Table 1: In-domain results (in # iterations). Smaller is better (↓). Best in bold. We measure the average and standard deviation of ADMM iterations for each QP benchmark. All QP problems are generated to have 10 ≤ N ≤ 15.

Cross-domain results: zero-shot transfer (Random QP + EqQP) → X, where X ∈ {Random QP, ...}. Smaller is better (↓). Best in bold. We measure the average and standard deviation of ADMM iterations for each QP benchmark. The gray-colored cells denote the training problems. Green-colored values mark the cases where the transferred model outperforms the in-domain model. All QP problems are generated to have 40 ≤ N + M ≤ 70.

Ablation study results. Best in bold. We measure the average and standard deviation of ADMM iterations.

Example of log10 ρ_t over MDP steps on a QP of N = 2,000 and M = 6,000. The gray lines visualize the time evolution of log10 ρ_t. The black lines and markers visualize the averages of log10 ρ_t over the dual variables. CA-ADMM terminates at the 6th MDP step (i.e., solving the QP in 60 ADMM steps). HGA takes two additional MDP steps to solve the given QP.



Table A.2: The MLP structure of the functions in HGA's n-th layer (n ≥ 2).

Table A.3: Ranges of n and m used for generating each problem class.

Table A.4: QP size generalization results in table format. Smaller is better (↓). Best in bold. We measure the average and standard deviation of the iterations for each QP size.

Training data generation. For the training set, the problems of each class are generated randomly using the instance generation rules in Appendix D. The QP size for each problem class is described in Table A.3.

Evaluation data generation. The evaluation data of Table 1 is generated in the same way as the training data. To ensure an empty intersection between the train and test sets, each set is generated from a different random seed. For the QP size generalization test (Fig. 3, Table A.4), problems are generated randomly from the Random QP class with the dimension of x in [100, 1000].

Additional results. We provide Table A.4 to report the numerical results that are difficult to read from Fig. 3.

Training data generation. We create a new problem class named Entire Random QP that contains both randomly generated inequality and equality constraints, as no existing problem class in OSQP has both. The form of this QP class and its instance generation rule are described in Appendix D.8. With the new problem class, we generate the training set with the sizes in Table A.3.

Evaluation data generation. The problems of each class are generated randomly using the instance generation rules in Appendix D; the size of each problem is described in Table A.3.

Table A.7: LP experiment results. Smaller is better (↓). Best in bold. We measure the average and standard deviation of the iterations and the average total computational time for Random LP problems. The solve ratio is the ratio of solved problems to the total number of generated problems.

F.4 RESULTS FOR MAROS & MESZAROS PROBLEMS

Table A.8: QP-benchmark results: (Random QP + EqQP + Portfolio + SVM + Huber + Control + Lasso) → X, where X ∈ {Maros & Meszaros}. Smaller is better (↓). Best in bold. We measure the iterations and the total computational time for each QP benchmark.

Table A.9: Step-interval ablation results. Smaller is better (↓). Best in bold. We measure the average and standard deviation of the iterations for each step interval.

F ADDITIONAL EXPERIMENTS

F.1 RESULTS FOR UNIT COMPUTATIONAL TIME OF THE SIZE GENERALIZATION EXPERIMENT

Table A.5: Unit computational time results for each QP size in table format. Smaller is better (↓). Best in bold. We measure the average unit time per MDP step for each QP size. The unit of every element is seconds.

