HOMOTOPY LEARNING OF PARAMETRIC SOLUTIONS TO CONSTRAINED OPTIMIZATION PROBLEMS

Abstract

Building deep learning (DL) alternatives to constrained optimization solvers has been proposed as a cheaper solution approach than classical constrained optimization solvers. However, these approximate learning-based solutions still suffer from constraint violations. From this perspective, reaching a reliable convergence remains an open challenge for DL models even with state-of-the-art methods for imposing constraints, especially when facing a large set of nonlinear constraints forming a non-convex feasible set. In this paper, we propose the use of homotopy meta-optimization heuristics, which create a continuous transformation of the objective and constraints during training, to promote a more reliable convergence in which the solution feasibility can be further improved. The method developed in this work includes 1) general-purpose homotopy heuristics based on the relaxation of objectives and constraint bounds to enlarge the basin of attraction, and 2) physics-informed transformations of the domain problem that yield trivial starting points lying within the basin of attraction. Experimentally, we demonstrate the efficacy of the proposed method on a set of abstract constrained optimization problems and real-world power grid optimal power flow problems of increasing complexity. Results show that constrained deep learning models with homotopy heuristics can improve the feasibility of the resulting solutions while achieving near-optimal objective values when compared with non-homotopy counterparts.

1. INTRODUCTION

Recent years have seen a rich literature of deep learning (DL) models for solving constrained optimization problems in real-world tasks such as power grid, traffic, or wireless system optimization. These applications can benefit greatly from data-driven alternatives that enable fast real-time inference. The problem is that these applications commonly include a large set of nonlinear system constraints, leading to non-convex parametric nonlinear programming (pNLP) problems that are NP-hard. Earlier attempts simply adopt imitation learning (i.e., supervised learning) to train function approximators by minimizing the prediction error on labeled data of pre-computed solutions. Unfortunately, these models can hardly perform well on unseen data, as the outputs are not trained to satisfy the physical constraints, leading to infeasible solutions. To address the feasibility issues, existing methods have explored imposing constraints on the output space of deep learning models; Section 2 provides an overview of the existing techniques. Imposing constraints has inspired end-to-end learning approaches that directly consider the original objectives and constraints in the NN training process without the need for expert-labeled data. However, even the state-of-the-art methods for imposing constraints can hardly guarantee a reliable convergence with perfect feasibility on unseen data for large problems. The penalty method, which treats the constraints as a form of regularization, requires careful selection of penalty weights, and such a soft-constraint treatment cannot guarantee satisfying the constraints to machine precision. The primal-dual Lagrangian-based formulation theoretically provides a hard-constraint methodology, whereas empirical evidence indicates it can perform worse than the penalty method (the reason remains unclear; see (Márquez-Neila et al., 2017)).
Another strategy (Donti et al., 2021) adds a completion layer after the NN model to reconstruct the complete solution from an incomplete one given by the NN, using the equality constraints. This enables a hard-constraint method for equality constraints; however, when facing nonlinear constraints, the completion layer, as an iterative solver, adds to the computational complexity and can potentially diverge when a bad incomplete output from the NN means that no feasible solution can be reconstructed. Due to the lack of consensus in the community, these new approaches are often called by different names, such as constrained deep learning, end-to-end neural networks, differentiable optimization layers, or deep declarative networks. In this paper we contribute to this diversity by referring to the proposed method as differentiable parametric programming (DPP), to emphasize the connection with sensitivity analysis developed in the context of operations research (Gal & Nedoma, 1972; Gal & Greenberg) and later adopted in control theory applications (Bemporad et al., 2000; Herceg et al., 2013). As a main contribution, we present a novel method that combines homotopy, deep learning, and parametric programming formulations into one coherent algorithmic framework. The aim is to obtain a more reliable convergence of constrained deep learning models, whose solution feasibility can be further improved. Homotopy-based meta-optimization heuristics are developed to create a continuous transformation of the objective and constraint sets, forming a homotopy path that drives the training of the NN to gradually learn from easy problems to harder problems.
Our contribution includes two types of homotopy heuristics, which utilize the basin of attraction in different ways: 1) homotopy heuristics based on relaxation of the objective and constraints to manipulate the basin of attraction, and 2) domain-aware homotopy heuristics based on physics-informed transformations of the problem that make trivial starting points available within the basin of attraction.

2. RELATED WORK

2.1. CONSTRAINED NEURAL NETWORKS

Imposing constraints onto the output space of NNs can be done via supervised learning (where labels are used to express the constraints) or unsupervised learning, using either soft-constraint methods (which usually treat the constraints as a regularization) or hard-constraint methods (which usually means enforcing the constraints to machine precision, i.e., perfect satisfaction). We briefly describe the categories of existing methods according to the type of constraints being imposed. General equality and inequality constraints can be imposed via augmented objective functions, reprojection (as hard constraints), completion layers (as hard constraints), etc. Among augmented-objective-function methods, the penalty method (Yang et al., 2019; Hu et al., 2020; Donti et al., 2021; Pan et al., 2019) augments the objective function with additional terms that penalize the violation of constraints, treating the constraints as a regularization with pre-defined weights controlling the regularization strength, whereas the primal-dual (or Lagrangian) formulation (Nandwani et al., 2019; Fioretto et al., 2020; Márquez-Neila et al., 2017) exploits Lagrangian duality and iteratively updates both primal and dual variables to minimize a Lagrangian loss function. The penalty method, as a soft-constraint method, has the theoretical deficits of requiring extra weight tuning for the multi-objective loss function and offering no guarantee of constraint satisfaction. However, evidence (Márquez-Neila et al., 2017) has shown that the Lagrangian formulation, as a hard-constraint method, can be empirically worse. Reprojection methods correct out-of-constraint-set outputs by projecting them onto the feasible region, either during the training cycle using variants of projected gradient descent (Donti et al., 2021; Márquez-Neila et al., 2017), or at test time as a post-processing step (e.g., (Pan et al., 2019) passed outputs to a physical equation solver).
Completion-layer methods (e.g., DC3 (Donti et al., 2021) and ACnet (Beucler et al., 2021)) train the NN to produce only a subset of the target output variables; an extra constraint layer attached after the NN then computes the remaining outputs according to the constraints. These methods have pros and cons, as discussed in Section 1. Domain-specific constraints. Some real-world applications work with graphical structures, necessitating the encoding of network topology constraints i) in the model architecture as a hard constraint (e.g., Graph Neural Networks (GNNs) (Kundacina et al., 2022; Donon et al., 2019; Owerko et al., 2020; Diehl, 2019) and other graphical models (Li et al., 2022)), ii) in a prior as a soft constraint (e.g., an adjacency matrix as a prior (Yang et al., 2019; Hu et al., 2020)), or iii) in the input features (topology-related NN inputs). Further attempts to impose dynamic and recursive constraints include i) unrolled neural networks (soft constraint), where recurrent neural networks (RNNs) and their variants unroll differentiable dynamic models (Tuor et al., 2022; Skomski et al., 2021; Drgoňa et al., 2021) and iterative physical solvers (Zhang et al., 2019; Yang et al., 2020; Zhang et al., 2018), or ii) encoding temporal and spatial constraints in latent representations (Yuan et al., 2021).

2.2. CONSTRAINED OPTIMIZATION

Consider a general constrained optimization problem, with ξ denoting the known parameters representing input data, and x denoting the solution of the corresponding optimization problem. Given a parameter instance ξ, the aim is to obtain the optimal x by solving:

min_x f_obj(x, ξ)  s.t.  g(x, ξ) ≤ 0,  h(x, ξ) = 0    (1)

When there are non-linear objectives or constraints, (1) defines a family of parametric non-linear programming (pNLP) problems. pNLP problems are NP-hard and are handled poorly by state-of-the-art optimization solvers due to the non-convexities in the solution space. Existing works have explored a large number of techniques (Wächter & Biegler, 2005; 2006; Byrd et al., 2000; Liao, 2004), including filter methods, line search, corrections, trust region methods, and homotopy methods, to improve (local) convergence, and have also developed many heuristics to speed up online optimization solvers. The local (and global) convergence properties of optimization methods have also been extensively studied in order to develop tools that improve convergence guarantees for nonlinear programming. However, scalable solution approaches for generic pNLP problems remain a challenge. Reasons include numerical difficulties (caused by uncertain behaviors of initialization procedures, stopping criteria, etc.), ill-conditioning, and the fact that the assumptions required to ensure convergence are easily violated in practice.

2.3. POWER GRID OPTIMIZATION PROBLEM

One domain-specific problem of interest to this paper is the AC optimal power flow (ACOPF) problem (Capitanescu, 2016; Pandey et al., 2020), which is hard to solve due to non-convexities. ACOPF is the fundamental optimization problem that determines the optimal control of generator outputs to meet demand with maximal cost-efficiency while safely operating the system within its technical limits. A simple definition of ACOPF is given below, where we minimize the generation cost subject to power balance equations and variable bounds:

min_{x = [V_real, V_imag, P_g, Q_g]}  Σ_{i=1}^{ng} (α_i P_gi² + β_i P_gi + γ_i)    (2a)
s.t.
Power balance:  (P_g − P_d) + j(Q_g − Q_d) = V ⊙ (Y_bus V)*    (2b)
Reference bus angle:  V_imag,ref = 0    (2c)
Voltage magnitude bounds:  |V|_i⁻ ≤ |V|_i ≤ |V|_i⁺,  i ∈ nb    (2d)
Pgen bounds:  P_gi⁻ ≤ P_gi ≤ P_gi⁺,  i ∈ ng    (2e)
Qgen bounds:  Q_gi⁻ ≤ Q_gi ≤ Q_gi⁺,  i ∈ ng    (2f)

where P_d, Q_d are parameters of load demand (given as input information and varying across instances), α_i, β_i, γ_i are the generator cost coefficients of the i-th generator, P_gi, Q_gi denote the real and reactive power output of the i-th generator, V_real, V_imag denote the vectors of real and imaginary voltage at all buses, V is the vector of complex bus voltages V = V_real + jV_imag, |V| denotes the voltage magnitude, and x⁺, x⁻ denote the upper and lower variable bounds.
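To make the power-balance constraint (2b) concrete, the sketch below evaluates its per-bus residual using plain Python complex arithmetic; the function name and the 2-bus numbers in the test are illustrative stand-ins, not values from the paper.

```python
# Per-bus residual of (2b): (Pg - Pd) + j(Qg - Qd) - V .* (Ybus V)*
# All arguments are plain lists; Ybus is a nested list (dense matrix).
def balance_residual(Pg, Qg, Pd, Qd, V, Ybus):
    n = len(V)
    inj = [sum(Ybus[i][k] * V[k] for k in range(n)) for i in range(n)]  # Ybus V
    return [(Pg[i] - Pd[i]) + 1j * (Qg[i] - Qd[i]) - V[i] * inj[i].conjugate()
            for i in range(n)]
```

With a flat voltage profile and zero injections the residual vanishes, which is the trivial operating point exploited later by the load-stepping heuristic.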

2.4. HOMOTOPY METHOD FOR CONSTRAINED OPTIMIZATION

The homotopy method is a type of meta-heuristic for handling hard problems that can otherwise easily diverge or converge to a bad point. It decomposes the original (nonlinear) problem F(x) into a series of sub-problems H(x, λ_H), creating a path of optimizers driven by the change of the homotopy parameter λ_H. As λ_H shifts from 0 to 1, the sub-problem H(x, λ_H) continuously transforms from a simple problem into the original one. Commonly, homotopy-based heuristics have been used for local optimization of nonlinear problems. The most popular way of designing H(x, λ_H) is via a linear combination of a trivial problem H_0(x) and the original one, such that H(x, λ_H) = (1 − λ_H)H_0(x) + λ_H F(x). Existing works have developed perturbation techniques for general nonlinear problems (He, 1999; Liao, 2004), multi-objective problems (Hillermeier et al., 2001), etc. However, since each sub-problem along the path must be solved in sequence, the use of homotopy in classical optimization solvers can still suffer from limited time efficiency.
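As a concrete toy illustration of this linear-combination scheme, the sketch below traces a 1-D homotopy path, warm-starting each sub-problem from the previous minimizer; the functions F and H0, the step size, and the solver settings are all invented for illustration, not taken from the paper.

```python
def F(x):
    # nonconvex original objective (illustrative)
    return x**4 - 3 * x**2 + x

def H0(x):
    # trivial convex surrogate with known minimizer x = 1
    return (x - 1.0) ** 2

def num_grad(f, x, eps=1e-6):
    # central-difference gradient
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def minimize(f, x0, lr=1e-2, steps=2000):
    # plain gradient descent as the local solver
    x = x0
    for _ in range(steps):
        x -= lr * num_grad(f, x)
    return x

x, lam = 1.0, 0.0              # start at the trivial minimizer of H0
while lam <= 1.0:
    H = lambda z, l=lam: (1 - l) * H0(z) + l * F(z)
    x = minimize(H, x)         # warm start from the previous sub-problem
    lam += 0.1
```

The final x sits at a local minimizer of F reached by continuously deforming the easy problem, which is exactly the mechanism the training procedure in Section 3 imitates.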

3. HOMOTOPY LEARNING FOR DIFFERENTIABLE PARAMETRIC PROGRAMMING

Here we present a novel method that combines homotopy, deep learning, and parametric programming in one coherent algorithmic framework. Specifically, using a neural network as an approximation of a constrained optimization solver allows fast real-time inference for any new input, while integrating homotopy into the training process helps the NN reach a more reliable convergence with improved feasibility.

3.1. DIFFERENTIABLE PARAMETRIC PROGRAMMING

To deal with the scalability challenge in generic pNLP problems, as well as the drawbacks of imitation learning, we can build a data-driven alternative to the traditional optimization solver, with the target objectives and constraints directly integrated within the training cycle. This can be achieved by differentiable parametric programming, which adopts an unsupervised learning of the NN model, mathematically defined as:

min_Θ f_obj(x, ξ)  s.t.  g(x, ξ) ≤ 0,  h(x, ξ) = 0,  x = π_Θ(ξ),  ∀ξ ∈ Ξ    (3)

with π_Θ denoting a NN model mapping from the input ξ to the output solution x, and Θ being the NN weights. Figure 1 illustrates the difference from imitation learning. One methodology of interest to this paper is the penalty method, which imposes the constraints by reformulating (3) into an unconstrained form, leading to the NN loss function below:

min_Θ f_obj(π_Θ(ξ), ξ) + Σ_i w_i · P_i(h_i(π_Θ(ξ), ξ)) + Σ_i w_i · P_i(g_i(π_Θ(ξ), ξ))    (4)

with P_i(·) penalizing the violation of constraints, and the hyperparameters w denoting pre-defined penalty weights. Popular choices of P_i(·) include a residual-norm penalty (typically the L2 norm) for equality constraints and a ReLU operator for inequality constraints. Beyond these popular forms, work in (Zhu et al., 2019) further explored using variational functionals of partial differential equations (PDEs) as penalty terms to impose PDE constraints.
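The penalty loss in (4) can be sketched in a few lines. Here a scalar "prediction" x stands in for the NN output π_Θ(ξ), and the objective, constraints, and weights are invented toy choices (squared residual for the equality, ReLU for the inequality, as in the popular forms above).

```python
# Toy instance of the penalty loss (4): objective (x - 2)^2,
# equality h(x, xi) = x - xi, inequality g(x, xi) = -x (i.e. x >= 0).
def penalty_loss(x, xi, w_eq=50.0, w_ineq=50.0):
    f_obj = (x - 2.0) ** 2                # objective term
    h = x - xi                            # equality residual, target h = 0
    g = -x                                # inequality, target g <= 0
    eq_pen = w_eq * h ** 2                # squared-norm penalty for equality
    ineq_pen = w_ineq * max(g, 0.0)       # ReLU penalty for inequality
    return f_obj + eq_pen + ineq_pen
```

A feasible prediction pays only the objective; an infeasible one pays an extra, weight-scaled price, which is the soft-constraint trade-off discussed in Section 2.1.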

3.2. METHOD OVERVIEW AND THEORETICAL FOUNDATIONS

As discussed earlier, when facing the non-convex objective function and constraints in (3), even state-of-the-art methods can have difficulty reaching a good convergence with small violations of constraints. In this paper we develop a novel optimization heuristic for these types of problems based on the idea of homotopy (He, 1999; Liao, 2004). Specifically, we apply homotopy to the problem by creating a continuous transformation of the objective and constraints in equation (3), such that a sub-problem in the homotopy path can be expressed as:

min_Θ f_λH(x, ξ)  s.t.  g_λH(x, ξ) ≤ 0,  h_λH(x, ξ) = 0,  x = π_Θ(ξ),  ∀ξ ∈ Ξ    (5)

As the homotopy parameter λ_H incrementally changes from 0 to 1, the objective and constraints f_λH, g_λH, h_λH gradually return to f_obj, g, h, shifting the task from an easy-to-solve problem to the original one. Driven by this transformation, the neural network model π_Θ gradually learns to approximate the solutions of harder and harder problems. Next we describe the conceptual idea of our homotopy heuristics along with some theoretical foundations. Let H(x, λ_H) denote a sub-problem in the homotopy path that transforms the original optimization problem F(x) in (3). We can use a certain local minimization method as the solver of H(x, λ_H) to get x*_λH, a local minimizer of H(x, λ_H). For that local minimization method, we can define the basin of attraction B(λ_H) of a local minimizer of H(x, λ_H) as the set of points x such that the local minimization method started at x ∈ B(λ_H) will converge to that minimizer x*. Assume that the original problem F(x) has a global minimizer x* that is unique and isolated. According to studies in (Dunlavy & O'Leary, 2005), the Implicit Function Theorem then guarantees that there exists a unique curve of isolated minimizers passing through (x*, 1), meaning the desired optimum is theoretically accessible through a curve of homotopy-problem minimizers (x*_λH, λ_H).
Importantly, when taking a small enough homotopy step, i.e., λ_H^(k) = λ_H^(k−1) + ∆λ_H with a small enough ∆λ_H, there is a high likelihood that the minimizer of subproblem H(x, λ_H^(k−1)) lies in the basin of attraction of the next subproblem H(x, λ_H^(k)). The initial problem H_0(x) can be designed in two ways:

• TYPE I: expanding the basin of attraction B(λ_H) to a larger volume in the initial homotopy step H_0(x). In this case, the initial problem is easy to solve in the sense that there exist more choices of starting point that enable solution trajectories to a (local) minimizer; the basin of attraction then gradually shrinks as λ_H increases.

• TYPE II: designing H_0(x) such that a trivial starting point x_0 ∈ B(λ_H = 0) is available (before any training starts). In this case, H_0(x) is easy in the sense that we have a starting point within its basin of attraction, which guarantees a solution trajectory to a minimizer.

A second design choice is the selection of a proper homotopy step ∆λ_H, trading off the likelihood that x*^(k−1) ∈ B(λ_H^(k)) against an acceptable time complexity. Eventually, the overall training process can be described as Algorithm 1:

Algorithm 1 Homotopy Learning of a Neural Network
1: Initialize: homotopy parameter λ_H ← 0, NN model π_Θ
2: while λ_H ≤ 1 do
3:   Update objective and constraints f_λH, h_λH(·) = 0, g_λH(·) ≤ 0
4:   Train the NN using the penalty method:
5:   while not converged do, for all ξ ∈ Ξ:
6:     Forward pass: x ← π_Θ(ξ)
7:     Loss: L_λH ← f_λH(x) + Σ_i w_eq ||h_λH,i(x)||² + Σ_i w_ineq relu(g_λH,i(x))
8:     Backward pass: update Θ by a gradient step on L_λH
9:   end while
10:  Update homotopy parameter: λ_H ← λ_H + ∆λ_H
11: end while
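Algorithm 1 can be exercised end-to-end on a toy parametric problem. In this sketch a one-parameter "network" x = θξ is trained by plain SGD under a penalty loss whose inequality bound is relaxed SBnds-style; the problem, weights, schedule, and squared penalty (a smooth variation on the relu term in Algorithm 1) are all invented stand-ins for the paper's setup.

```python
# Toy run of Algorithm 1: learn x = pi_theta(xi) = theta * xi for
#   min (x - 2*xi)^2   s.t.   x - xi <= 0,
# with the bound relaxed by slack = (1 - lam) * eps at each homotopy step.
def train_homotopy(epochs=200, lr=1e-3, w_ineq=10.0, eps=1.0, dlam=0.25):
    theta, lam = 0.0, 0.0
    data = [1.0, 2.0, 3.0]                  # toy parameter instances xi
    while lam <= 1.0:
        slack = (1 - lam) * eps             # SBnds-style relaxed bound
        for _ in range(epochs):
            for xi in data:
                x = theta * xi                       # forward pass
                viol = max(x - xi - slack, 0.0)      # relaxed g(x) <= 0
                # gradient of (x - 2*xi)^2 + w_ineq * viol^2 w.r.t. theta
                g = 2 * (x - 2 * xi) * xi + 2 * w_ineq * viol * xi
                theta -= lr * g                      # SGD step
        lam += dlam
    return theta
```

The learned θ settles slightly above the constraint boundary θ = 1, a small soft-constraint residual of exactly the kind the homotopy heuristics aim to shrink.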

3.3. TYPE I: RELAXATION BASED HEURISTICS FOR OBJECTIVE AND CONSTRAINTS

Let us first focus on TYPE I homotopy methods, which enlarge the basin of attraction to create easily solvable problems at the initial homotopy steps. We propose the use of relaxation-based heuristics, which work by convexifying the objective and expanding the feasible constraint set at the beginning of the homotopy process. We briefly illustrate the idea behind using relaxation. Mathematically, consider minimizing a nonconvex smooth objective function f_obj subject to a constraint set S, using the penalty method (or barrier method, etc.) to reformulate the problem. Let L denote the (smooth) augmented loss function, x* denote one of the (local) minimizers x* ∈ S, and let N(x*) denote a local neighborhood of x* on which L is a generic (Lipschitz-smooth) convex function. Now consider using gradient descent as the local optimization method, with a learning rate lower than twice the smallest optimal learning rate over all components (η < 2 min_i η_i,opt; studies have shown that learning diverges otherwise). Then, starting from any x ∈ N(x*), we are guaranteed to reach x* as the final solution. Therefore, N(x*) can be considered a lower bound of the basin of attraction for x*, i.e., B(x*) ⊇ N(x*). As we relax (convexify) the objective function, the resulting augmented loss function L⁺ yields a more convex optimization landscape.
Then, for a certain local minimizer x*, it is likely that x* has a larger neighborhood N⁺(x*) on which L⁺ is generic convex, in which case the lower bound of B(x*) increases:

N⁺(x*) ⊇ N(x*)    (6)

Further, when relaxing the constraint set from S to S⁺ with S ⊆ S⁺, and letting χ*(S) denote the set of all local minimizers in the constraint set S, the expanded constraint set is likely to contain more local minimizers:

χ*(S⁺) = χ*(S) ∪ χ*(S⁺ \ S) ⊇ χ*(S)    (7)

Therefore the total basin of attraction (the set of starting points that lead to any local minimizer), as the union of B(x*) over the individual minimizers x*, expands after relaxation:

B⁺_total = ∪_{x*∈χ*(S⁺)} B(x*) ⊇ ∪_{x*∈χ*(S)} B(x*) = B_total    (8)

Below we introduce the relaxation heuristics. To handle non-convex objective functions during the homotopy process, we apply the Convexify Objective (CObj) heuristic:

Convexify Objective (CObj): Given a non-convex objective function f_obj, the homotopy path starts from minimizing a convex objective function f_cvx (chosen close to the original f_obj), and gradually inserts non-convexity via a linear combination with the original objective: f_λH = λ_H f_obj + (1 − λ_H) f_cvx.

For inequality constraints g(·) ≤ 0 during the homotopy process, we propose two heuristics:

1) Shrink Bounds (SBnds): In the homotopy process, the constraints transform as g ≤ (1 − λ_H)ϵ⁺. As λ_H increases from 0 to 1, the inequality constraints gradually transform from g ≤ ϵ⁺ to the original constraints g ≤ 0. This is intuitive for variable bounds x⁻ ≤ x ≤ x⁺, which can be rewritten along the homotopy path as:

λ_H x⁻ + (1 − λ_H)ϵ⁻ ≤ x ≤ λ_H x⁺ + (1 − λ_H)ϵ⁺

where ϵ⁺, ϵ⁻ are pre-defined relaxations of the upper and lower bounds that make them easier to satisfy. During the homotopy process, the bounds are gradually tightened as λ_H increases.
2) Grow Penalty (GPen): Another homotopy heuristic exploits the penalty strength on the violation of constraints. The transformation of the constraints can be expressed as λ_H g ≤ 0, such that increasing λ_H leads to a growing penalty on the constraint violations when the constraints are included in the augmented loss function.

Finally, we propose the Split and Shrink (SaS) heuristic for equality constraints h(·) = 0:

Split and Shrink (SaS): Any equality constraint h(·) = 0 can be equivalently split into two inequality constraints, 0 ≤ h ≤ 0. These constraints can be relaxed via a perturbation of the bounds, −ϵ ≤ h ≤ ϵ with ϵ > 0; a larger perturbation ϵ makes the constraints easier to satisfy. This motivates us to design a homotopy path along which the perturbation gradually decreases. In more detail, with a pre-defined large perturbation ϵ_H and tiny perturbation ϵ_L, the split constraints are perturbed as:

−ϵ_L − (1 − λ_H)ϵ_H ≤ h ≤ ϵ_L + (1 − λ_H)ϵ_H

As the homotopy parameter λ_H shifts from 0 to 1, the decreasing perturbation leads to a tighter bound that enforces h(·) = 0 more closely.
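The three relaxation schedules above (SBnds, GPen, SaS) can be written directly as functions of the homotopy parameter; the sketch below does so with illustrative ϵ values, and the function names are ours, not the paper's.

```python
def sbnds(lam, lo, hi, eps_lo, eps_hi):
    # Shrink Bounds: relaxed box (eps_lo, eps_hi) tightens back to (lo, hi)
    return lam * lo + (1 - lam) * eps_lo, lam * hi + (1 - lam) * eps_hi

def gpen_violation(g, lam):
    # Grow Penalty: violation of lam * g <= 0 grows with lam
    return max(lam * g, 0.0)

def sas_violation(h, lam, eps_H=0.01, eps_L=0.0):
    # Split and Shrink: equality h = 0 relaxed to a band of half-width eps
    eps = eps_L + (1 - lam) * eps_H
    return max(h - eps, 0.0) + max(-h - eps, 0.0)   # zero inside the band
```

At λ_H = 0 each schedule is at its loosest; at λ_H = 1 the original bounds, penalties, and equalities are recovered exactly.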

3.4. TYPE II: DOMAIN-AWARE TRANSFORMATION WITH TRIVIAL SOLUTIONS AVAILABLE

This section further explores TYPE II homotopy heuristics, which make a trivial starting point available within the basin of attraction B(λ_H = 0). Specifically, considering the minimization of an objective function f_obj subject to a constraint set S, we aim to transform the problem via a carefully designed manipulation of the objective (if non-convex) and the constraint functions, such that the perturbed problem H(x, λ_H = 0): min f_perturbed s.t. S_perturbed is easy to solve in the sense that, before any training starts, a trivial starting point x_0 ∈ B(λ_H = 0) is available to guarantee a solution trajectory towards a minimizer. Instead of a simple relaxation of the constraint bounds, transforming an entire problem into one that has trivial solutions (or trivial starting points) often requires some domain knowledge. In this paper, we work in the context of the power grid and show the design of two domain-specific homotopy heuristics, load-stepping and Tx-stepping, to impose highly non-linear equality constraints. We consider the power grid optimization control problem defined in (2); the power-system symbols used here follow the definitions in Section 2.3.

3.4.1. LOAD-STEPPING

The homotopy method of load stepping creates a path of power-flow balance constraints induced by a gradual increase of the load demand:

h_λH = (P_g − λ_H P_d) + j(Q_g − λ_H Q_d) − V ⊙ (Y_bus V)* = 0

From a domain perspective, when λ_H = 0, all load demands are zeroed (λ_H P_d = λ_H Q_d = 0) and all variable bounds can be removed or relaxed using the heuristics designed in Section 3.3 for inequalities. A trivial solution to this problem exists: x_0 = [V_real, V_imag, P_g, Q_g] = 0, representing a power grid that is closed off everywhere, with no supply and no demand. We make use of this trivial solution via a warm-homotopy loss:

l_warm = w_warm (x − x_0)ᵀ(x − x_0)    (14)

to guide the update of the deep learning model towards a quick convergence to a feasible point for (only) the first homotopy step. A physics-informed homotopy optimization using the penalty method can therefore be written as H(x, λ_H): min L_λH + I_0(λ_H) · l_warm, with I_0(λ_H) being an indicator function of the first homotopy step, and L_λH the augmented loss function defined in Algorithm 1.
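A scalar sketch of the load-stepping residual and the warm loss (14) follows, with a single complex number standing in for the vector power-balance equation; the function names and signatures are illustrative assumptions.

```python
# Load-stepping residual: (Pg - lam*Pd) + j(Qg - lam*Qd) - V * conj(Ybus*V),
# reduced to a one-"bus" grid so plain complex arithmetic suffices.
def load_step_residual(Pg, Qg, Pd, Qd, V, Ybus, lam):
    return (Pg - lam * Pd) + 1j * (Qg - lam * Qd) - V * (Ybus * V).conjugate()

# Warm-homotopy loss (14): quadratic pull toward the trivial solution x0.
def warm_loss(x, x0, w_warm=1e6):
    return w_warm * sum((a - b) ** 2 for a, b in zip(x, x0))
```

At λ_H = 0 the all-zeros state satisfies the balance exactly and pays zero warm loss, which is precisely why it serves as the trivial starting point.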

3.4.2. TX-STEPPING

Unlike load stepping, which manipulates the load demand, Tx-stepping instead manipulates the branches, through a continuous transformation of the bus admittance matrix Y:

h_λH = (P_g − P_d) + j(Q_g − Q_d) − V ⊙ (Y_λH V)* = 0

where Y_λH = λ_H Y_bus + (1 − λ_H)Y_0. From a domain perspective, at the first homotopy step λ_H = 0, we replace all branches with zero-resistance, low-impedance lines (e.g., impedance = 0 + 10^-4 j), giving a bus admittance matrix Y_0. This makes all branches nearly shorted, with the nice properties that 1) lossless lines lead to no real power loss during transmission, i.e., P_g = P_d, and 2) v_i ≈ v_ref for any bus i; specifically, all bus voltage magnitudes are close to the reference bus values due to the low voltage drops, and all bus voltage angles lie within an ϵ-small radius around the reference bus angle. These properties make the problem easily solvable, with a trivial solution x_0 available.
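The admittance interpolation Y_λH can be sketched for a single branch; the near-short impedance 10^-4 j follows the example in the text, while the function name and the sample admittance in the test are ours.

```python
# Tx-stepping interpolation for one branch admittance:
# Y_lam = lam * Y_bus + (1 - lam) * Y0, with Y0 the near-short admittance.
def tx_step_admittance(Y_bus, lam, z0=1e-4j):
    Y0 = 1.0 / z0                  # admittance of the low-impedance line
    return lam * Y_bus + (1 - lam) * Y0
```

At λ_H = 0 every branch looks like a near-short (a very large, almost purely imaginary admittance); at λ_H = 1 the original network is recovered.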

4. NUMERICAL RESULTS

We evaluate the efficacy of the homotopy heuristics on both general non-convex constrained optimization problems and the real-world problem of power grid optimal power flow. The use of homotopy is expected to enable a more reliable convergence of the neural network models, outperforming non-homotopy results on the following criteria: • Optimality: the objective value f_obj achieved by the solution. For the ACOPF problem it represents the per-hour cost ($/h) of the generation dispatch. • Feasibility: how much the solution x violates the equality and inequality constraints. Feasibility is quantified by the mean and maximum violation of constraints: mean(h(x)), max(h(x)), mean(relu(g(x))), max(relu(g(x))). In real-world tasks, smaller constraint violations mean the neural network outputs a more practical solution for real-world optimization and control. We compare different versions of homotopy optimization with non-homotopy settings. The non-homotopy baseline is the vanilla penalty method as formulated in (4), whereas the homotopy settings are combinations of the different heuristics added to the vanilla method. Appendix A describes the details of the experiment settings and hyper-parameter tuning used to ensure a fair comparison.
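The feasibility metrics above reduce to a few reductions over constraint residuals; a minimal sketch (function and key names are ours) is:

```python
# Mean/max equality violation |h(x)| and inequality violation relu(g(x)),
# computed from lists of residuals for one solution.
def feasibility(h_vals, g_vals):
    eq = [abs(h) for h in h_vals]
    ineq = [max(g, 0.0) for g in g_vals]
    return {
        "mean_eq": sum(eq) / len(eq), "max_eq": max(eq),
        "mean_ineq": sum(ineq) / len(ineq), "max_ineq": max(ineq),
    }
```

In the tables that follow, these four numbers are then averaged over 100 test instances, with the standard deviation reported in parentheses.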

4.1. NON-CONVEX OPTIMIZATION WITH RANDOM LINEAR CONSTRAINTS

First consider a problem with a non-convex objective and random linear inequality constraints:

min_x Σ_{i=1}^{n−1} [(1 − x_i)² + 2(x_{i+1} − x_i²)²]  s.t.  A x ≤ b + C ξ    (16)

where n is the problem size (complexity), x is the solution vector of n variables, ξ is a round(0.4·n) × 1 vector representing the (known) input parameter that varies across instances, and A (n × n), b (n × 1), and C (n × dim(ξ)) are randomly generated matrices defining n random linear constraints.

Table 1: Non-convex problem with n variables and n random linear constraints, for n = 5, 25, 50, 100. Results over 100 test instances are listed, using the metrics of objective value and mean and max inequality-constraint violations. Violations are formatted as average value (std) in this table. The results show that the homotopy heuristics yield smaller violations than the vanilla penalty method.
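The objective and random constraint data in (16) can be generated in a few lines; the Gaussian sampling below is an assumption of ours, since the paper does not specify the random distribution used for A, b, and C.

```python
import random

def objective(x):
    # Rosenbrock-like nonconvex objective from (16)
    return sum((1 - x[i]) ** 2 + 2 * (x[i + 1] - x[i] ** 2) ** 2
               for i in range(len(x) - 1))

def random_constraints(n, seed=0):
    # random data for A x <= b + C xi, with dim(xi) = round(0.4 * n)
    rng = random.Random(seed)
    m = max(1, round(0.4 * n))
    A = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    C = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(n)]
    return A, b, C
```

The unconstrained minimum of the objective sits at x = (1, …, 1); the random constraints generally push the constrained optimum away from it, which is what makes the instances nontrivial.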

4.2. POWER GRID AC OPTIMAL POWER FLOW PROBLEM

This subsection evaluates the method on the real-world ACOPF problem (defined in Section 2.3). Table 2 shows ACOPF results on a 30-bus system. Homotopy methods improve the feasibility of the NN outputs, with smaller violations of the equality and inequality constraints. See Appendix A for our experiment settings and hyper-parameter tuning.

Method                Obj   Mean eq.          Max eq.         Mean ineq.        Max ineq.
SaS, SBnds            666   0.0027 (0.0015)   0.010 (0.005)   0.0000 (0.0000)   0.000 (0.001)
SaS, GPen             666   0.0023 (0.0014)   0.008 (0.005)   0.0000 (0.0000)   0.000 (0.000)
Tx-stepping, SBnds    665   0.0024 (0.0013)   0.008 (0.005)   0.0000 (0.0000)   0.000 (0.000)
Tx-stepping, GPen     665   0.0017 (0.0009)   0.006 (0.003)   0.0000 (0.0000)   0.000 (0.000)
Load-stepping, SBnds  666   0.0020 (0.0012)   0.007 (0.004)   0.0000 (0.0000)   0.000 (0.000)
Load-stepping, GPen   667   0.0024 (0.0016)   0.009 (0.005)   0.0000 (0.0000)   0.000 (0.000)
vanilla penalty       673   0.0036 (0.0027)   0.013 (0.010)   0.0000 (0.0000)   0.001 (0.003)

5. CONCLUSION

This work proposed the use of homotopy optimization for the unsupervised learning of deep learning models constrained by a large set of (nonlinear) equality and inequality constraints. The homotopy heuristics developed in this paper include general-purpose heuristics based on the relaxation of constraint bounds to enlarge the basin of attraction, as well as physics-informed transformations of the domain problem that yield trivial starting points lying within the basin of attraction. Our numerical case studies, covering a family of abstract and real-world problems, indicate that the developed homotopy heuristics achieve a more reliable convergence, giving predictions with improved feasibility on unseen data.



Figure 1: Imitation learning vs. end-to-end learning using differentiable parametric programming

When the homotopy step is small enough, the minimizer of H(x, λ_H^(k−1)) is in the basin of attraction of the next subproblem H(x, λ_H^(k)), i.e., x*^(k−1) ∈ B(λ_H^(k)). This motivates us to carefully design H(x, λ_H) and select ∆λ_H to generate an easy homotopy path where, starting with a proper H(x, 0), each subsequent sub-problem H(x, λ_H^(k)) is solved without difficulty thanks to a good starting point provided by H(x, λ_H^(k−1)). Figure 2 illustrates such a homotopy path.

Figure 2: A desirable theoretical homotopy path that gradually leads to a minimizer of the original problem: each subproblem is easily solvable by starting from a point in the basin of attraction B(λ_H).

Based on these findings, the main idea of our homotopy heuristics is to create a proper transformation of the objective and constraints via a linear combination H(x, λ_H) = (1 − λ_H)H_0(x) + λ_H F(x), with the initial problem H(x, λ_H = 0) = H_0(x) being easily solvable. The illustration in Figure 2 reveals two different ways to design H_0(x) (TYPE I and TYPE II, Section 3.2).

Domain-specific homotopy heuristics have also been developed to design H(x, λ_H) whose solutions are trivial when λ_H = 0. For example, in the domain of circuit simulation (Najm, 2010), Gmin-stepping initially shorts all nodes to ground and gradually removes the short-circuit effect, while Tx-stepping initially shorts all transmission lines and then gradually returns to the original branches. Works in (Pandey et al., 2018; Pandey et al., 2020; Jereminov et al., 2019) further applied these circuit-theoretic homotopy ideas to power grid simulation and optimization tasks. On the other hand, some works (Dunlavy & O'Leary, 2005) also extended homotopy methods to global optimization through an ensemble of solution points at each homotopy step and investigated probability bounds on convergence to the global minimizer. However, many homotopy methods are empirically slower than other convergence heuristics (e.g., line search, trust region methods).





Table 2: Results of the ACOPF problem on case30, over 100 test instances. Mean and max violations are computed across the test instances and we list the average value (std). The vanilla penalty method has larger constraint violations, whereas the homotopy methods have smaller violations.

A.1 EXPERIMENT SETTINGS AND HYPER-PARAMETER TUNING

To create a fair comparison in each optimization problem, experiments with and without homotopy heuristics train with the same NN architecture, Adam optimizer, and learning rate scheduler (StepLR with step=100, gamma=0.1, and a minimal learning rate of 10^-5; in homotopy methods, the lr scheduler is only applied to the last homotopy step). Early stopping is also applied in each experiment to avoid overfitting: each vanilla method trains for 1,000 epochs with warmup=50, patience=200, and each homotopy method trains for 100 epochs per homotopy step with warmup=50, patience=50. All NNs are trained in PyTorch.

Experiment settings for the non-convex problem with random linear constraints (see problem definition (16) in Section 4.1) are listed below:
• Dataset: 50,000 instances (train/validation/test ratio 8:1:1)
• NN architecture: cylinder NN with 4 layers; the hidden layer size increases with problem size n as hidden layer size = 30 n/5
• Penalty weights: w_eq = 50, w_ineq = 50
• Homotopy settings: CObj has ...
These hyper-parameters are tuned and kept fixed across all experiments on this non-convex problem.

For the ACOPF problem, we propose an additional trick of Pg pull-up and use it in all experiments. (Pg pull-up) Based on power-system domain knowledge, any decision with supply lower than demand is always technically infeasible. To avoid bad predictions of this type, we apply the Pg pull-up heuristic, where an additional domain-specific constraint P_g − P_d ≥ ϵ is added to pull the generation up and thus promote convergence to a point with total supply higher than demand. This constraint is not subject to the homotopy heuristics and remains the same along the homotopy path. As with the other constraints, a weight w_pullup imposes this additional constraint in the penalty method. The ACOPF experiment settings are as follows:
• Dataset: 50,000 instances (train/validation/test ratio 8:1:1); data are generated by randomly sampling load profiles in the range of 75%-150% of the base load profile (the base load is the load in the case data)
• Batch size: 1024
• NN architecture: cylinder NN with 2 layers and hidden layer size = 200
• Penalty weights: w_eq = 10^5, w_ineq = 10^6
• General homotopy settings: ∆λ_H = 0.05; SaS has ϵ_H = 0.01, ϵ_L = 0; SBnds has ϵ⁺ = 1, ϵ⁻ = 0
• Domain-specific homotopy settings: warm homotopy loss with w_warm = 10^6
• Others: Pg pull-up has w_pullup = 10^6, ϵ = 0.01
These hyper-parameters are kept fixed across all experiments.

