SURCO: LEARNING LINEAR SURROGATES FOR COMBINATORIAL NONLINEAR OPTIMIZATION PROBLEMS

Abstract

Optimization problems with expensive nonlinear cost functions and combinatorial constraints appear in many real-world applications, but remain challenging to solve efficiently. Existing combinatorial solvers like Mixed Integer Linear Programming can be fast in practice but cannot readily optimize nonlinear cost functions, while general nonlinear optimizers like gradient descent often do not handle complex combinatorial structures, may require many queries of the cost function, and are prone to local optima. To bridge this gap, we propose SurCo that learns linear Surrogate costs which can be used by existing Combinatorial solvers to output good solutions to the original nonlinear combinatorial optimization problem, combining the flexibility of gradient-based methods with the structure of linear combinatorial optimization. We learn these linear surrogates end-to-end with the nonlinear loss by differentiating through the linear surrogate solver. Three variants of SurCo are proposed: SurCo-zero operates on individual nonlinear problems, SurCo-prior trains a linear surrogate predictor on distributions of problems, and SurCo-hybrid uses a model trained offline to warm start online solving for SurCo-zero. We analyze our method theoretically and empirically, showing smooth convergence and improved performance. Experiments show that compared to state-of-the-art approaches and expert-designed heuristics, SurCo obtains lower cost solutions with comparable or faster solve time for two real-world industry-level applications: embedding table sharding and inverse photonic design.

1. INTRODUCTION

Combinatorial optimization problems with linear objective functions, like linear programming (LP) (Chvatal et al., 1983) and mixed integer linear programming (MILP) (Wolsey, 2007), have been extensively studied in operations research (OR).
The resulting high-performance solvers like Gurobi (Gurobi Optimization, LLC, 2022) can solve industrial-scale optimization problems with tens of thousands of variables in a few minutes. However, even with perfect solvers, one issue remains: the cost functions $f(x)$ in many practical problems are nonlinear, and the highly optimized solvers mainly handle linear or convex formulations, while real-world problems have less constrained objectives. For example, in embedding table sharding (Zha et al., 2022a) one needs to distribute embedding tables to multiple GPUs for the deployment of recommendation systems. Due to the batching behaviors within a single GPU and communication cost among different GPUs, the overall latency (cost function) in this application depends on interactions of multiple tables and thus can be highly nonlinear (Zha et al., 2022a). To obtain useful solutions to the real-world problems, one may choose to directly optimize the nonlinear cost, which is either a black-box output of a simulator (Gosavi et al., 



particular when the cost function is expensive to evaluate, and they often ignore the combinatorial nature of most real-world applications (encoded in the feasible set $x \in \Omega$).

In this work, we propose a systematic framework SurCo that leverages existing efficient combinatorial solvers to find solutions to nonlinear combinatorial optimization problems arising in real-world scenarios. There are three versions of SurCo: SurCo-zero, SurCo-prior, and SurCo-hybrid. In SurCo-zero, given a nonlinear differentiable cost $f(x)$ to be minimized, we optimize a linear surrogate cost $\hat{c}$ so that the surrogate optimizer (SO) $\min_{x \in \Omega} \hat{c}^\top x$ outputs a solution that is expected to be optimal w.r.t. the original nonlinear cost $f(x)$. Due to its linear nature, SO can be solved efficiently with existing solvers, and the surrogate cost $\hat{c}$ can be optimized in an end-to-end manner by back-propagating through the solver (Pogančić et al., 2019; Niepert et al., 2021; Berthet et al., 2020). In SurCo-prior, we consider a family of nonlinear differentiable functions $f(x; y)$, where $y$ parameterizes problem descriptions. We train the linear surrogate $\hat{c}(y)$ on a set of optimization problems (called the training set $\{y_i\}$), and evaluate on a held-out problem $y'$ by directly solving SO: $x^*(y') := \arg\min_{x \in \Omega(y')} \hat{c}^\top(y')\,x$, which avoids optimizing the cost $f(x; y')$ from scratch. Finally, in SurCo-hybrid we use initial surrogate costs predicted by a fully trained SurCo-prior and then fine-tune the surrogate costs further using SurCo-zero to leverage both domain knowledge synthesized offline and information about the specific instance. All versions of SurCo are evaluated on two real-world nonlinear optimization problems: embedding table sharding (Zha et al., 2022a) and photonic inverse design (Schubert et al., 2022).
In both cases, we show that in the on-the-fly setting, SurCo-zero achieves higher-quality solutions in comparable or less runtime, thanks to the help of an efficient combinatorial solver; in the offline setting, SurCo-prior obtains better solutions on held-out problems than other methods that require training (e.g., reinforcement learning). More specifically, in table sharding SurCo-zero obtains between 14% and 85% improvement in solution quality with between 2% and 23% increase in runtime overhead compared to the greedy baseline, and SurCo-prior obtains between 47% and 71% solution quality improvement against the state-of-the-art RL-based table sharding algorithm (Zha et al., 2022b). SurCo-hybrid obtains better solutions than either SurCo-zero or SurCo-prior, with a runtime overhead similar to that of SurCo-zero. In photonic inverse design, SurCo-zero finds 21% more viable solutions for the beam splitter and twice as many solutions for the wavelength demultiplexers, with all problems solving successfully for the mode converter and bend problems, taking between 10% and 64% less time than the pass-through approach from previous work (Schubert et al., 2022). While the offline-trained SurCo-prior misses some optimal solutions in the different settings, it frequently obtains solutions in 0.5% to 2% of the runtime because it does not need to evaluate the objective or perform gradient steps. Again, SurCo-hybrid is able to obtain solutions more often than the other approaches, with a runtime overhead comparable to SurCo-zero. We additionally present theoretical results that help motivate why training a model to predict surrogate linear coefficients exhibits better sample complexity than directly predicting the optimal solution (Li et al., 2018; Ban & Rudin, 2019).

2. PROBLEM SPECIFICATION

Our goal is to solve the following nonlinear optimization problem described by $y$:

$$\min_{x} f(x; y) \quad \text{s.t.} \quad x \in \Omega(y) \qquad (1)$$

where $x \in \mathbb{R}^n$ are the variables to be optimized, $f(x; y)$ is the nonlinear differentiable cost function to be minimized, $\Omega(y)$ is the feasible region, typically specified by linear (in)equalities and integer constraints, and $y \in Y$ are the problem instance parameters drawn from a distribution $D$ over $Y$. For example, in the traveling salesman problem, $y$ can be the distance matrix among cities. We often consider solving a family of optimization problems, described as $y \in Y$.

Differentiable cost function. The nonlinear cost function $f(x; y)$ can either be the result of a simulator made differentiable via finite differencing or automatic differentiation (e.g., JAX (Bradbury et al., 2018)), or a cost model that is learned from an offline dataset, often generated by sampling multiple feasible solutions within $\Omega(y)$ and recording their costs. The cost model often takes the form of a deep neural network. In this work, we assume the following property of $f(x; y)$:

Assumption 2.1 (Cost function). During optimization, the cost function $f(x; y)$ and its partial derivative $\partial f / \partial x$ are accessible.

Learning a good nonlinear cost model $f$ is highly non-trivial for practical applications (e.g., AlphaFold (Jumper et al., 2021), Density Functional Theory (Nagai et al., 2020), cost model for embedding tables (Zha et al., 2022a)) and is beyond the scope of this work.

Evaluation Metric. In real-world applications, querying $f$ can be slow and expensive, so the goal is to obtain better-quality solutions with fewer queries. We mainly focus on two aspects: how good the solution $\hat{x}$ is, by checking the value of $f(\hat{x}; y)$, and how many queries of the nonlinear function $f$ are needed during optimization in order to achieve the solution $\hat{x}$.

Linear/nonlinear cost function.
When $f(x; y)$ is linear w.r.t. $x$, and the feasible region can be encoded using mixed integer programs or other mathematical programs, the problem can be solved efficiently using existing scalable optimization solvers. When $f(x; y)$ is nonlinear, we propose SurCo, which learns a surrogate linear objective function. This allows us to leverage these existing scalable optimization solvers, and results in a solution that has high quality with respect to the original hard-to-encode objective function $f(x; y)$. We will elaborate in the following sections.

We start from the simplest case in which we focus on a single instance with $f(x) = f(x; y)$ and $\Omega = \Omega(y)$. SurCo-zero aims to optimize the following objective:

$$\text{(SurCo-zero)}: \quad \min_{c} L_{\text{zero}}(c) := f(g_\Omega(c)) \qquad (2)$$

where the surrogate optimizer $g_\Omega : \mathbb{R}^n \to \mathbb{R}^n$ is the output of certain combinatorial solvers with linear cost weight $c \in \mathbb{R}^n$ and feasible region $\Omega \subseteq \mathbb{R}^n$. For example, $g_\Omega$ can be the following ($n$ is the number of variables to be optimized):

$$g_\Omega(c) := \arg\min_{x} c^\top x \quad \text{s.t.} \quad x \in \Omega := \{Ax \le b,\; x \in \mathbb{Z}^n\} \qquad (3)$$

which is the output of a MILP solver. Thanks to previous works (Ferber et al., 2020; Pogančić et al., 2019), we can efficiently compute the partial derivative $\partial g_\Omega(c) / \partial c$. Intuitively, this means that $g_\Omega(c)$ can be backpropagated through. Since $f$ is also differentiable with respect to the solution it is evaluating, we can optimize Eqn. 2 in an end-to-end manner using any gradient-based optimizer. That is,

$$c(t+1) = c(t) - \alpha \, \frac{\partial g_\Omega}{\partial c} \frac{\partial f}{\partial x},$$

where $\alpha$ is the learning rate. The procedure starts from a randomly initialized $c(0)$ and converges to a locally optimal $c$. While Eqn. 2 is still a nonlinear optimization problem and there is no guarantee about the quality of the final solution $c$, we argue that optimizing Eqn. 2 is better than optimizing the original nonlinear cost $\min_{x \in \Omega} f(x)$. Furthermore, while we cannot guarantee optimality, we are able to guarantee feasibility by leveraging a linear combinatorial solver.
We note that SurCo is somewhat limited to problems without interior integer solutions, since the linear surrogate cannot yield interior points. However, many real-world settings, such as our two domains, consider making binary decisions which lack interior integer points. Intuitively, instead of optimizing directly over the solution space x, we optimize over the space of surrogate costs c, and delegate the combinatorial feasibility requirements of the nonlinear problem to SoTA combinatorial solvers. Compared to naive approaches that directly optimize f (x) via general optimization techniques, our method readily handles complex constraints of the feasible regions, and thus makes the optimization procedure easier. Furthermore, it also helps escape from local minima, thanks to the embedded search component of existing combinatorial solvers (e.g., branch-and-bound (Land & Doig, 2010) in MILP solvers). As we see in the experiments, this is particularly important when the problem becomes large-scale with more local optima. This approach works well when we are optimizing individual instances and may not have access to offline training data or the training time is cost-prohibitive.
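To make the SurCo-zero update concrete, the following is a minimal sketch under our own assumptions, not the implementation used in the experiments. The feasible region (choose exactly `k` of `n` binary variables), the solver `solve_linear`, the toy cost `f`, and the hyperparameters `lr` and `lam` are all hypothetical stand-ins. The solver gradient is approximated with the two-call blackbox differentiation scheme of Pogančić et al. (2019) rather than with a MILP-specific method.

```python
import numpy as np

def solve_linear(c, k):
    """Toy surrogate optimizer g_Omega(c): argmin c^T x over binary x with exactly k ones."""
    x = np.zeros_like(c)
    x[np.argsort(c)[:k]] = 1.0
    return x

def f(x, w, target):
    """Hypothetical nonlinear cost: squared gap between selected weight and a target."""
    return float((w @ x - target) ** 2)

def grad_f(x, w, target):
    """Partial derivative of the toy cost with respect to x."""
    return 2.0 * (w @ x - target) * w

def surco_zero(w, target, k, steps=200, lr=0.1, lam=5.0, seed=0):
    rng = np.random.default_rng(seed)
    c = rng.normal(size=w.shape)             # randomly initialized surrogate cost c(0)
    best_x = solve_linear(c, k)
    init_cost = best_cost = f(best_x, w, target)
    for _ in range(steps):
        x = solve_linear(c, k)                # forward pass through the linear solver
        cost = f(x, w, target)
        if cost < best_cost:                  # keep the best feasible solution seen so far
            best_x, best_cost = x, cost
        # Second solver call with costs perturbed along df/dx (blackbox differentiation).
        x_pert = solve_linear(c + lam * grad_f(x, w, target), k)
        g_c = (x_pert - x) / lam              # approximate gradient of f(g(c)) w.r.t. c
        c = c - lr * g_c                      # gradient step on the surrogate cost
    return best_x, best_cost, init_cost

if __name__ == "__main__":
    x, cost, init_cost = surco_zero(np.arange(1.0, 9.0), target=10.0, k=3)
    print(x, cost, init_cost)
```

Every iterate is produced by the solver, so each candidate solution is feasible by construction; only the surrogate cost `c` is updated by gradient steps.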

3.2. SURCO-PRIOR: OFFLINE SURROGATE TRAINING

We now discuss more general cases, where the nonlinear loss function $f(x; y)$ represents a family of cost functions to be optimized. Here the description of each problem instance $y$ is drawn from a fixed problem distribution $D$. We then ask the following question: how can we find solutions to a batch of training instances $D_{\text{train}} := \{y_i\}_{i=1}^{N}$, gain useful knowledge of the cost functions, and leverage such knowledge in held-out evaluation problem instances $D_{\text{eval}}$ to accelerate the optimization procedure?

Following standard machine learning practice, let us first consider a naive two-stage approach. In the data collection stage, we simply apply SurCo-zero (Eqn. 2) to every $y_i$ separately to get $N$ surrogate cost vectors $c_i$. Then in the training stage, we train a regressor $\hat{c} = \hat{c}(y; \theta)$ on the dataset $\{(y_i, c_i)\}$ to learn to predict the surrogate costs from the problem features. Here $\hat{c}$ is a parameterized model (e.g., a deep network) with the parameters $\theta$ to be learned. This learned regressor $\hat{c}(y; \theta)$ can thus be used for a held-out problem instance $y'$ to directly predict $c' = \hat{c}(y'; \theta)$ and get the solution $x' = g_{\Omega(y')}(c')$ via the surrogate optimizer (SO).

While this approach is simple, the $N$ optimization procedures in the data collection stage are independent of each other, and can lead to an excessive number of calls to $f$ that are not helpful. For example, if an optimization procedure converges to a bad local solution, then even if it achieves perfect convergence, which requires many function calls, the resulting data point is still of low quality. This motivates us to add a regularizer to the optimization:

$$\text{(SurCo-prior-}\lambda\text{)}: \quad \min_{\theta, \{c_i\}} L_{\text{prior}}(\theta, \{c_i\}; \lambda) := \sum_{i=1}^{N} \Big[ f(g_{\Omega(y_i)}(c_i); y_i) + \lambda \|c_i - \hat{c}(y_i; \theta)\|_2 \Big] \qquad (4)$$

Note that when $\lambda = 0$, it reduces to $N$ independent optimizations, while when $\lambda > 0$, the surrogate costs $\{c_i\}$ interact with each other.
The intuition is that the regressor $\hat{c}(y; \theta)$, even if not fully trained, can be more useful for guiding $c_i$ than a randomly initialized value. Furthermore, if $\hat{c}$ is a mapping to the globally optimal solution of $x$, then it will pull the solutions out of local optima and re-target them towards global ones, even when starting from poor initialization, yielding fast convergence and better final solutions for individual optimization instances. In the special case $\lambda \to +\infty$, we arrive at a novel objective that jointly learns the surrogate cost function, given the training set $D_{\text{train}}$:

$$\text{(SurCo-prior)}: \quad \min_{\theta} L_{\text{prior}}(\theta) := \sum_{i=1}^{N} f(g_{\Omega(y_i)}(\hat{c}(y_i; \theta)); y_i) \qquad (5)$$

This approach is useful when the goal is to find high-quality solutions for unseen instances of a problem distribution, the upfront cost of offline training is acceptable, and the cost of optimizing on-the-fly is prohibitive. Here, we require access to a distribution of training optimization problems, but at test time we only require the feasible region and not the nonlinear objective.
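The SurCo-prior objective (Eqn. 5) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the instance description `y` is a weight vector, the predictor $\hat{c}(y; \theta)$ is linear in two hand-picked per-variable features, the feasible region is "choose exactly `k` variables", and the solver gradient again uses the two-call scheme of Pogančić et al. (2019). In practice $\hat{c}$ would typically be a deep network trained with an automatic differentiation framework.

```python
import numpy as np

def solve_linear(c, k):
    """Toy surrogate optimizer: argmin c^T x over binary x with exactly k ones."""
    x = np.zeros_like(c)
    x[np.argsort(c)[:k]] = 1.0
    return x

def f(x, y, target=10.0):
    """Hypothetical nonlinear cost for instance y."""
    return float((y @ x - target) ** 2)

def grad_f(x, y, target=10.0):
    return 2.0 * (y @ x - target) * y

def features(y):
    """Hypothetical per-variable features; c_hat(y; theta) = features(y) @ theta."""
    return np.stack([y, y ** 2], axis=1)       # shape (n, 2)

def total_loss(theta, ys, k):
    """L_prior(theta): sum of nonlinear costs of the solutions induced by c_hat."""
    return sum(f(solve_linear(features(y) @ theta, k), y) for y in ys)

def train_prior(ys, k, epochs=50, lr=0.01, lam=5.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=2)
    best_theta, best_loss = theta.copy(), total_loss(theta, ys, k)
    for _ in range(epochs):
        for y in ys:
            phi = features(y)
            c = phi @ theta                     # predicted surrogate costs
            x = solve_linear(c, k)
            x_pert = solve_linear(c + lam * grad_f(x, y), k)
            g_c = (x_pert - x) / lam            # approximate d f / d c
            theta = theta - lr * (phi.T @ g_c)  # chain rule through the linear predictor
        loss = total_loss(theta, ys, k)
        if loss < best_loss:                    # keep the best parameters seen so far
            best_theta, best_loss = theta.copy(), loss
    return best_theta, best_loss

if __name__ == "__main__":
    data_rng = np.random.default_rng(1)
    train = [data_rng.uniform(1, 8, size=8) for _ in range(5)]
    theta, loss = train_prior(train, k=3)
    y_new = data_rng.uniform(1, 8, size=8)      # held-out instance
    print(loss, solve_linear(features(y_new) @ theta, 3))
```

At test time the held-out instance needs only one call to the linear solver: the nonlinear cost `f` is never queried.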

3.3. SURCO-HYBRID: FINE-TUNING A PREDICTED SURROGATE

Naturally, we consider SurCo-hybrid, a hybrid approach which initializes the coefficients of SurCo-zero with the coefficients predicted by SurCo-prior, which was trained on offline data. This allows SurCo-hybrid to start optimization from an initial prediction that performs well for the distribution at large and is then fine-tuned for the specific instance. Formally, we initialize $c(0) = \hat{c}(y_i; \theta)$ and then continue optimizing $c$ based on the update from SurCo-zero. This approach is geared towards optimizing the nonlinear objective using a high-quality initial prediction that is based on the problem distribution and then fine-tuning the objective coefficients based on the specific problem instance at test time. Here, high performance comes at the runtime cost of both training offline on a problem distribution and performing fine-tuning steps on-the-fly. However, this additional cost is often worthwhile when the main goal is to find the best possible solutions by leveraging synthesized domain knowledge in combination with individual problem instances, as arises in chip design (Mirhoseini et al., 2021) and compiler optimization (Zhou et al., 2020).

3.4. COST REGRESSION VERSUS SOLUTION REGRESSION: A THEORETICAL ANALYSIS

We also want to compare SurCo with previous work on ML optimizers (Ban & Rudin, 2019) that tries to directly learn the mapping from problem description $y$ to the solution $x^*(y)$. While this is conceptually simple, there are fundamental difficulties in learning such a direct mapping. First, as mentioned above, it can be quite expensive to obtain the optimal solution $x^*(y)$ due to the nature of nonlinear optimization and the query cost. Second, even if a perfect dataset $D_{\text{direct}}$ is accessible, the number of samples needed to learn a mapping that directly predicts $x^*(y)$ is related to the Lipschitz constant $L$ of the mapping, and for a direct mapping, $L$ can be very large.

3.4.1. LIPSCHITZ CONSTANT AND SAMPLE COMPLEXITY

Let us first consider the sample complexity of solution regression methods as described above. Formally, consider fitting any function $\phi : \mathbb{R}^d \supseteq Y \to \mathbb{R}^m$ with a dataset $\{y_i, \phi_i\}$. Here $Y$ is a compact region with finite volume $\mathrm{vol}(Y)$. The Lipschitz constant $L$ is the smallest number so that $\|\phi(y_1) - \phi(y_2)\|_2 \le L \|y_1 - y_2\|_2$ holds for any $y_1, y_2 \in Y$. The following theorem shows that if the dataset covers the space $Y$, we could achieve high accuracy prediction: $\|\phi(y) - \hat{\phi}(y)\|_2 \le \epsilon$ for any $y \in Y$.

Definition 3.1 ($\delta$-cover). A dataset $D_{\text{direct}} := \{(y_i, \phi_i)\}_{i=1}^{N}$ $\delta$-covers the space $Y$ if for any $y \in Y$, there exists at least one $y_i$ so that $\|y - y_i\|_2 \le \delta$.

Please find all proofs in the Appendix. While we do not rule out a more advanced regressor than 1-nearest-neighbor that leads to better sample complexity, the lemmas demonstrate that the Lipschitz constant $L$ plays an important role in sample complexity.
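The role of $L$ can be seen from a short worked bound. This is our own sketch, assuming a noise-free dataset ($\phi_i = \phi(y_i)$) and writing $\hat{\phi}$ for the 1-nearest-neighbor predictor, which returns $\phi_{i^*}$ for the closest training point $y_{i^*}$:

```latex
\begin{aligned}
i^* &= \arg\min_i \|y - y_i\|_2, \qquad \hat{\phi}(y) := \phi_{i^*} = \phi(y_{i^*}), \\
\|\phi(y) - \hat{\phi}(y)\|_2
  &= \|\phi(y) - \phi(y_{i^*})\|_2
  \;\le\; L \,\|y - y_{i^*}\|_2
  \;\le\; L\,\delta .
\end{aligned}
```

Under these assumptions, a $\delta$-cover with $\delta \le \epsilon / L$ suffices for $\epsilon$-accurate prediction, so a larger Lipschitz constant demands a finer cover of $Y$ and therefore more samples.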

3.4.2. DIFFERENCE BETWEEN COST AND SOLUTION REGRESSION

In the following we will show that in certain cases, the direct prediction y → x * (y) could have an infinitely large Lipschitz constant L. Note that this theorem applies to a wide variety of combinatorial optimization problems. For example, when Y is a connected region and the optimization problem can be formulated as an integer program, the optimal solution set x * (Y ) := {x * (y) : y ∈ Y } is a discrete set of integral vertices, so the theorem applies. Combined with analysis in Sec. 3.4.1, we know the mapping y → x * (y) is hard to learn even with a lot of samples. On the other hand, the mapping y → c(y) can avoid too many connected components in its image c(Y ), by connecting disjoint components of x * (Y ) together.
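A one-dimensional toy example, which is our own illustration and not taken from the experiments, shows the effect. Consider $\min_{x \in \{0, 1\}} y \cdot x$ with $y \in [-1, 1]$. The optimal solution jumps from 1 to 0 as $y$ crosses zero, so the ratio defining the Lipschitz constant grows without bound, whereas the cost mapping $c(y) = y$ is 1-Lipschitz. The function names below are hypothetical.

```python
def x_star(y):
    """Optimal solution of min_{x in {0,1}} y * x (ties broken towards x = 0)."""
    return 1.0 if y < 0 else 0.0

def cost_map(y):
    """Surrogate cost mapping c(y) = y, which induces the same solutions via the solver."""
    return y

def lipschitz_ratio(phi, y1, y2):
    """|phi(y1) - phi(y2)| / |y1 - y2|, a lower bound on the Lipschitz constant of phi."""
    return abs(phi(y1) - phi(y2)) / abs(y1 - y2)

if __name__ == "__main__":
    for eps in (1e-1, 1e-3, 1e-6):
        print(eps,
              lipschitz_ratio(x_star, -eps, eps),    # grows like 1 / (2 * eps)
              lipschitz_ratio(cost_map, -eps, eps))  # stays at 1
```

As `eps` shrinks, the ratio for the solution mapping grows like $1/(2\epsilon)$, while the ratio for the cost mapping stays at one.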

4. EMPIRICAL EVALUATION

We evaluate the variants of SurCo on two real-world settings, embedding table sharding and inverse photonic design. Both have industrial applications. Each setting consists of a family of problem instances with varying feasible region and nonlinear objective function.

4.1. EMBEDDING TABLE SHARDING

The task of sharding embedding tables arises in the deployment of large-scale neural network models which operate over both sparse and dense inputs (e.g., in recommendation systems (Zha et al., 2022a; b; Sethi et al., 2022)). Given $T$ embedding tables and $D$ homogeneous devices, the goal is to distribute the tables among the devices such that no device's memory limit is exceeded, while the tables are processed efficiently. Formally, let $x_{t,d}$ be the binary variable indicating whether table $t$ is assigned to device $d$, and $x := \{x_{t,d}\} \in \{0, 1\}^{TD}$ be the collection of the variables. The optimization problem is:

$$\min_{x} f(x; y) \quad \text{s.t.} \quad x \in \Omega(y) := \Big\{ x : \forall t, \; \sum_{d} x_{t,d} = 1, \quad \forall d, \; \sum_{t} m_t x_{t,d} \le M \Big\} \qquad (6)$$

Here the problem description $y$ includes table memory usage $\{m_t\}$ and capacity $M$ of each device. $\sum_d x_{t,d} = 1$ means each table $t$ should be assigned to exactly one device, and $\sum_t m_t x_{t,d} \le M$ means the memory consumption at each device $d$ should not exceed its capacity. The nonlinear cost function $f(x; y)$ is the latency, i.e., the runtime of the longest-running device. Due to shared computation (e.g., batching) among the group of assigned tables, and communication costs across devices, the objective is highly nonlinear. $f(x; y)$ is well approximated by a sharding plan runtime estimator proposed by DreamShard (Zha et al., 2022b).

SurCo learns to predict $T \times D$ surrogate costs $\hat{c}_{t,d}$, one for each potential table-device assignment. During training, the gradients through the combinatorial solver $\partial g / \partial c$ are computed via CVXPYLayers (Agrawal et al., 2019a) and the integrality constraints are relaxed. We found that in practice, we obtained solutions that were mostly integral in that only one table on any given device was fractional. At test time we solve for the integer solution using SCIP (Achterberg, 2009).

Settings. We evaluate SurCo on the publicly available Deep Learning Recommendation Model (DLRM) dataset (Naumov et al., 2019). We consider 6 settings: 10, 20, 30, 40, 50, and 60 tables are placed on 4 devices, with each GPU device having a 5GB memory limit. Each setting has 100 problem instances (50 training and 50 test).

Baselines. For SurCo-zero baselines, we use Greedy, which greedily allocates tables to devices while observing memory limits according to the predicted latency $f$, and Domain-Heuristic, the domain-expert algorithm of allocating tables to balance the aggregate dimension (Zha et al., 2022b). For SurCo-prior, we use DreamShard, the SoTA embedding table sharding algorithm that requires training an offline RL policy.

Results. In Fig. 2, SurCo-zero finds lower latency sharding plans than the baselines, while it takes slightly longer than Domain-Heuristic and DreamShard due to taking optimization steps rather than selecting based on a heuristic feature or reinforcement learned policy. SurCo-prior obtains lower latency solutions in about the same time as DreamShard, with a slight increase in overhead due to using SCIP (Achterberg, 2009), a branch-and-bound MILP solver. Lastly, SurCo-hybrid obtains the best solutions in terms of solution quality and has runtime comparable to SurCo-zero since at test time it performs similar operations. In smaller problem instances ($T = 10$ to $T = 40$), SurCo-prior obtains better quality solutions than its impromptu counterpart, SurCo-zero, likely due to training on a variety of examples and being able to better escape the local optima that the impromptu solver may encounter on any given problem instance. However, as the problem size increases and more tables are available for placement, SurCo-zero gives better performance by optimizing for the test instances in question, as opposed to SurCo-prior, which only uses training data to obtain surrogate costs. Using SurCo-hybrid, we are able to obtain the best quality solutions but incur the upfront cost of pretraining and the deployment-time cost of optimizing the coefficients on-the-fly.

[Figure 3: ... runtime in log scale. For both, lower is better. We compare against the pass-through gradient approach proposed in Schubert et al. (2022). We observe that SurCo-prior achieves similar success rates to the previous approach Pass-Through with a substantially improved runtime. Additionally, SurCo-zero runs comparably or faster, while finding more valid solutions than Pass-Through. SurCo-hybrid obtains valid solutions most often and is faster than SurCo-zero at the expense of pretraining. Striped approaches use pretraining.]
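The feasible region of Eqn. 6 is easy to check for a candidate sharding plan. The sketch below is illustrative only: `is_feasible` encodes the two constraints of Eqn. 6, while `toy_latency` is a hypothetical stand-in for the learned latency estimator (it is not the DreamShard cost model) that merely mimics a max-over-devices cost with a nonlinear batching term.

```python
import numpy as np

def is_feasible(x, m, M):
    """Check Eqn. 6 for x of shape (T, D), table memory m of shape (T,), capacity M."""
    one_device_each = np.all(x.sum(axis=1) == 1)   # each table on exactly one device
    within_memory = np.all(m @ x <= M)             # per-device memory within capacity
    return bool(one_device_each and within_memory)

def toy_latency(x, base, batch_penalty=0.1):
    """Hypothetical nonlinear latency: runtime of the slowest device."""
    load = base @ x                                # summed per-table cost on each device
    count = x.sum(axis=0)                          # number of tables on each device
    per_device = load + batch_penalty * count ** 2 # made-up interaction between tables
    return float(per_device.max())

if __name__ == "__main__":
    m = np.array([2.0, 2.0, 3.0])                  # table memory (GB)
    base = np.array([1.0, 2.0, 3.0])               # made-up per-table cost
    plan = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
    print(is_feasible(plan, m, M=5.0), toy_latency(plan, base))
```

Because the cost is a maximum over devices and depends on which tables share a device, it cannot be written as a single linear function of $x$, which is the situation SurCo targets.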
4.2. INVERSE PHOTONIC DESIGN

Photonic devices play an important role in high-speed communication (Marpaung et al., 2019), quantum computing (Arrazola et al., 2021), and machine learning hardware acceleration (Wetzstein et al., 2020). The photonic components can be formulated as a binary 2D grid, with each cell being filled or void. There are constraints on the binary patterns: only those that can be drawn by a physical brush instrument with a certain cross shape can be manufactured. It remains challenging to find designs that are manufacturable and satisfy design specifications (e.g., beam splitting). An example solution developed by SurCo is shown in Figure 4b: coming from the top, beams are routed to the left or right, depending on wavelength. The solution is also manufacturable: a 3-by-3 brush cross can fit in all filled and void space. Given the design, existing work (Hughes et al., 2019) enables differentiation of the design misspecification cost, evaluated as how far off the transmission of the wavelengths of interest is from the desired locations, with zero design loss meaning that the specification is satisfied. Researchers have also developed a standard benchmark of inverse photonic design problems (Schubert et al., 2022).

Settings. We compare our approaches against the "Pass-Through" method (Schubert et al., 2022) on randomly generated instances of the four types of problems in Schubert et al. (2022): Waveguide Bend, Mode Converter, Wavelength Division Multiplexer, and Beam Splitter. We generate 50 instances in each setting (25 training/25 test). Further generation details are in the appendix. We evaluated several algorithms described in the appendix, such as genetic algorithms and derivative-free optimization, which failed to find physically feasible solutions. We consider two wavelengths (1270nm/1290nm), and optimize at a resolution of 40nm, visualizing the test results in Fig. 3.

Results. In Fig. 3, SurCo-zero consistently finds as many or more valid devices compared to the Pass-Through baseline (Schubert et al., 2022). Additionally, since the on-the-fly solvers stop when they either find a valid solution or reach a maximum of 200 steps, the runtime of SurCo-zero is slightly lower than that of the Pass-Through baseline. SurCo-prior obtains similar success rates as Pass-Through while taking two orders of magnitude less time, as it does not require expensive impromptu optimization, making SurCo-prior a promising approach for large-scale settings or when solving many slightly varied instances. Lastly, SurCo-hybrid performs best in terms of solution loss, finding valid solutions more often than the other approaches. It also takes less runtime than the other on-the-fly approaches since it is able to reach valid solutions faster, although it still requires optimization on-the-fly, so it takes longer than SurCo-prior. We visualize the convergence of impromptu solvers in Fig. 4a, where SurCo-zero has smoother and faster convergence compared to the Pass-Through approach (Schubert et al., 2022).

[Figure 4: ... SurCo-zero smoothly lowers the loss while the pass-through baseline converges noisily. Also, SurCo-hybrid starts out with a high-quality solution and fine-tunes until an optimal solution is reached. We also visualize the SurCo-zero solution with magnitudes of the two wavelengths of interest, which we successfully route from the input at the top to the two different waveguides at the bottom.]

5. RELATED WORK

Differentiable Optimization. Previous work differentiated through several optimization problems, calculating how changes in input parameters impact the optimal solution. Initially, a differentiable convex quadratic programming solver called OptNet (Amos & Kolter, 2017) proposed to implicitly differentiate the optimal solution with respect to input parameters through the KKT optimality conditions, a set of linear equations that determined the optimal solution. Following this, researchers differentiated through linear programs (Wilder et al., 2019a), submodular optimization problems (Djolonga & Krause, 2017; Wilder et al., 2019a), cone programs (Agrawal et al., 2019a; b), MaxSAT (Wang et al., 2019), Mixed Integer Linear Programming (Ferber et al., 2020; Mandi et al., 2020), Integer Linear Programming (Mandi et al., 2020), dynamic programming solvers (Demirovic et al., 2020), blackbox discrete linear optimizers (Pogančić et al., 2019; Rolínek et al., 2020a; b), maximum likelihood estimation (Niepert et al., 2021), k-means clustering (Wilder et al., 2019b), knapsack (Guler et al., 2022; Demirović et al., 2019), the cross-entropy method (Amos & Yarats, 2020), and SVM training (Lee et al., 2019). Additionally, Wang et al. (2020a) learned to linearly combine LP variables. SurCo can use these differentiable surrogates based on the problem domain.

Task Based Learning. Task-based learning solves distributions of linear or quadratic optimization problems with the true objective hidden at test time but available for training (Elmachtoub & Grigas, 2022; Donti et al., 2017; El Balghiti et al., 2019; Liu & Grigas, 2021; Hu et al., 2022). Donti et al. (2021) predict and correct solutions for continuous nonlinear optimization. Bayesian optimization (BO) (Shahriari et al., 2016) optimizes blackbox functions by approximating the objective with a learned model that can be optimized over.
Recent work optimizes individual instances over discrete spaces like hypercubes (Baptista & Poloczek, 2018), graphs (Deshwal et al., 2021), and MILP (Papalexopoulos et al., 2022). Data reuse from previous runs is proposed to optimize multiple correlated instances (Swersky et al., 2013; Feurer et al., 2018). However, the surrogate Gaussian Process (GP) models are memory and time intensive in high-dimensional settings. Recent work has addressed GP scalability via gradient updates (Ament & Gomes, 2022); however, it is unclear whether GP can scale in conjunction with combinatorial solvers. Machine learning is also used to guide combinatorial algorithms. Several approaches produce combinatorial solutions (Zhang & Dietterich, 1995; Khalil et al., 2017; Kool et al., 2018; Nazari et al., 2018; Zha et al., 2022a; b). These approaches iteratively build solutions for problems like routing, assignment, or covering, which limits them to simple feasible regions, and they fail to handle more complex constraints. Other approaches set parameters that improve solver runtime (Khalil et al., 2016; Bengio et al., 2021).

Learning Latent Space for Optimization. As we learn latent linear objectives to optimize nonlinear functions, other approaches learn latent embeddings for faster solving. Faloutsos & Lin (1995) proposed FastMap, which learns latent object embeddings for efficient search. Variants of FastMap are used in graph optimization and shortest path (Cohen et al., 2018; Hu et al., 2022; Li et al., 2019). Wang et al. (2020b; 2021a), Yang et al. (2021), and Zhao et al. (2022) use Monte Carlo tree search to perform single and multi-objective blackbox optimization by learning to split the search space.

Mixed Integer Nonlinear Programming (MINLP). SurCo-zero falls into the broad family of MINLP solvers, optimizing nonlinear and nonconvex objectives over discrete linear feasible regions.
Specialized solvers handle many problem variants in the MINLP space (Burer & Letchford, 2012; Belotti et al., 2013); however, scalability in the nonconvex setting is usually obtained by optimization experts who rely on problem-specific solving techniques such as making piecewise linear approximations, convexifying the objective, or exploiting special structure.

6. CONCLUSION

We introduced SurCo, a method for learning linear surrogates for combinatorial nonlinear optimization problems. SurCo learns linear objective coefficients for a surrogate solver which results in solutions that minimize the nonlinear loss via gradient descent. At its core, SurCo differentiates through the surrogate solver which maps the predicted coefficients to a combinatorially feasible solution, combining the flexibility of gradient-based optimization with the structure of combinatorial solvers. We presented three variants of SurCo: SurCo-zero, which optimizes individual instances; SurCo-prior, which trains a coefficient prediction model offline; and SurCo-hybrid, which fine-tunes the coefficients predicted by SurCo-prior on individual test instances. While SurCo's performance is somewhat limited to binary problems due to the lack of interior integer points, we find that many real-world domains operate on binary decision variables. We evaluated variants of SurCo on two domains against the state-of-the-art approaches used in industry, obtaining better solution quality for similar or better runtime in the embedding table sharding domain, and quickly identifying viable photonic devices.

On the other hand, by continuity of the curve $\gamma$, there exists a constant $C(t_0)$ so that $\|\gamma(t') - \gamma(t'')\|_2 \le C(t_0) \|t' - t''\|_2 \le 2 C(t_0) \epsilon$. Then we have

$$L = \max_{y, y' \in Y} \frac{\|\phi(y) - \phi(y')\|_2}{\|y - y'\|_2} \ge \frac{\|\phi(\gamma(t')) - \phi(\gamma(t''))\|_2}{\|\gamma(t') - \gamma(t'')\|_2} \ge \frac{d_{\min}}{2 C(t_0) \epsilon} \to +\infty$$

B EXPERIMENT DETAILS

This is motivated by previous work (Schubert et al., 2022) that also uses the fixed brush shape filter and tanh operation to transform the latent parameters into a continuous solution that is projected onto the space of physically feasible solutions. In each setting, optimization is done on a binary grid of different sizes to meet fabrication constraints, namely that a 3 by 3 cross must fit inside each filled and void location.
In the beam splitter the design is an 80 × 60 grid, in the mode converter it is a 40 × 40 grid, in the waveguide bend it is a 40 × 40 grid, and in the wavelength division multiplexer it is an 80 × 80 grid. Previous work formulated the projection as finding a discrete solution that minimized the dot product of the input continuous solution and the proposed discrete solution. The authors then updated the continuous solution by computing gradients of the loss with respect to the discrete solution and using pass-through gradients to update the continuous solution. By comparison, our approach treats the projection as an optimization problem and updates the objective coefficients so that the resulting projected solution moves in the direction of the desired gradient.

| Task | Randomization |
| --- | --- |
| mode converter | randomize the right and left waveguide width |
| bend setting | randomize the waveguide width and length |
| beam splitter | randomize the waveguide separation, width and length |
| wavelength division multiplexer | randomize the input and output waveguide locations |

Table 1: Task randomization of 4 different tasks in inverse photonic design.

To compute the gradient of this blackbox projection solver, we leverage the approach suggested by Pogančić et al. (2019), which calls the solver twice: once with the original coefficients, and again with coefficients that are perturbed in the direction of the incoming solution gradient, yielding an "improved solution". The gradient with respect to the input coefficients is then the difference between the "improved solution" and the solution for the current objective coefficients.
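The two-call gradient described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the toy solver `solve_top_k` (pick the `k` cheapest items), the perturbation strength `lam`, and the division by `lam` (as in the formulation of Pogančić et al.) are assumptions introduced here.

```python
def solve_top_k(costs, k):
    """Toy linear solver (assumption): choose the k items with the smallest cost."""
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    chosen = set(order[:k])
    return [1.0 if i in chosen else 0.0 for i in range(len(costs))]


def blackbox_gradient(costs, grad_x, k, lam=10.0):
    """Estimate d(loss)/d(costs) by calling the solver twice.

    costs:  current linear objective coefficients
    grad_x: incoming gradient of the loss with respect to the solution x
    lam:    hypothetical perturbation strength
    """
    # First call: solution for the current coefficients.
    x = solve_top_k(costs, k)
    # Second call: coefficients perturbed along the incoming solution gradient.
    perturbed = [c + lam * g for c, g in zip(costs, grad_x)]
    x_improved = solve_top_k(perturbed, k)
    # Gradient: difference between the improved and the current solution.
    grad_c = [(xi_imp - xi) / lam for xi, xi_imp in zip(x, x_improved)]
    return x, grad_c


# The loss would rather select item 2 than item 0.
x, grad_c = blackbox_gradient([1.0, 2.0, 3.0], grad_x=[1.0, 0.0, -1.0], k=1)
```

A gradient descent step on the coefficients then raises the cost of item 0 and lowers the cost of item 2, which moves the solver's output toward the improved solution.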

C PSEUDOCODE

Here is the pseudocode for the different variants of our algorithm. Each of these leverages a differentiable optimization solver to differentiate through the surrogate optimization problem.

c ← grad_update(c, ∇_c loss)
7: end while
8: return x

D ADDITIONAL FAILED BASELINES

SOGA - Single Objective Genetic Algorithm. Using PyGAD (Gad, 2021), we attempted several approaches for both the table sharding and inverse photonics settings. While we were able to obtain feasible table sharding solutions, they underperformed the greedy baseline by 20%. Additionally, they were unable to find physically feasible inverse photonics solutions. We varied between random, swap, inversion, and scramble mutations and used all parent selection methods, but were unable to find viable solutions.

DFL - A Derivative-Free Library. We could not easily integrate DFLGEN (Liuzzi et al., 2015) into our pipelines since it operates in Fortran and we needed to specify the feasible region with Python in the Ceviche challenges. DFLINT works in Python but took more than 24 hours to run on individual instances, reaching the timeout limit. We found that the much longer runtime made this inapplicable for the domains of interest.

Nevergrad. We enforced integrality in Nevergrad (Rapin & Teytaud, 2018) using choice variables which selected between 0 and 1. This approach was unable to find feasible solutions for inverse photonics in less than 10 hours. For table sharding we obtained solutions by using a choice variable for each table, selecting one of the available devices. This approach did not outperform the greedy baseline and took longer, so it was strictly dominated by the greedy approach.



e. solution regression. Given a set of training instances $D_{\text{train}}$ from distribution $D$, these approaches first collect a set of training samples $D_{\text{direct}} := \{y, x^*(y) : y \in D_{\text{train}}\}$, and then learn a function $x^*(y)$ to fit the training samples.

Lemma 3.1 (Sufficient condition of prediction with $\epsilon$-accuracy). If the dataset $D_{\text{direct}}$ is an $(\epsilon/L)$-cover of $Y$, then for any $y \in Y$, a 1-nearest-neighbor regressor $\hat\phi$ leads to $\|\hat\phi(y) - \phi(y)\|_2 \le \epsilon$.

Lemma 3.2 (Lower bound of sample complexity for $\epsilon/L$-cover). To achieve an $\epsilon/L$-cover of $Y$, the size of the training set must satisfy $N \ge N_0(\epsilon) := \frac{\mathrm{vol}(Y)}{\mathrm{vol}_0} \left(\frac{L}{\epsilon}\right)^d$, where $\mathrm{vol}_0$ is the volume of the unit ball in $d$ dimensions.
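To make the two lemmas concrete, the following sketch evaluates the bound $N_0(\epsilon)$ and checks the 1-nearest-neighbor guarantee numerically. It is an illustration under stated assumptions, not part of the paper: the one-dimensional domain $Y = [0, 1]$, the mapping $\phi(y) = \sin(3y)$ with Lipschitz constant $L = 3$, and all function names are hypothetical.

```python
import math


def unit_ball_volume(d):
    """Volume of the unit ball in d dimensions (vol_0 in Lemma 3.2)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def covering_lower_bound(vol_y, lipschitz, eps, d):
    """N_0(eps) = vol(Y) / vol_0 * (L / eps)^d from Lemma 3.2."""
    return vol_y / unit_ball_volume(d) * (lipschitz / eps) ** d


def nearest_neighbor_predict(y, train_ys, phi):
    """1-nearest-neighbor regressor: return phi at the closest training point."""
    y_nn = min(train_ys, key=lambda yi: abs(y - yi))
    return phi(y_nn)


def phi(y):
    """Hypothetical target mapping on Y = [0, 1]; its Lipschitz constant is 3."""
    return math.sin(3 * y)


L, eps = 3.0, 0.3
radius = eps / L  # each training point covers an interval of this radius

# Five evenly spaced points (0.1, 0.3, ..., 0.9) form an (eps / L)-cover of [0, 1].
train_ys = [radius * (2 * i + 1) for i in range(5)]

# Worst-case 1-NN error over a fine grid of test points in Y.
max_err = max(
    abs(nearest_neighbor_predict(t / 1000, train_ys, phi) - phi(t / 1000))
    for t in range(1001)
)

# Lower bound on the number of samples needed for the cover (about 5 here).
n0 = covering_lower_bound(vol_y=1.0, lipschitz=L, eps=eps, d=1)
```

In this example the cover uses as many points as the lower bound requires. Because the bound grows as $(L/\epsilon)^d$, the same accuracy in higher dimensions or with a larger Lipschitz constant quickly becomes expensive.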

To show this, let us consider a general mapping $\phi : \mathbb{R}^d \supseteq Y \to \mathbb{R}^m$. Let $\phi(Y)$ be the image of $Y$ under the mapping $\phi$ and $\kappa(Y)$ be the number of connected components of the region $Y$.

Theorem 3.1 (A case of infinite Lipschitz constant). If the minimal distance $d_{\min}$ between different connected components of $\phi(Y)$ is strictly positive, and $\kappa(\phi(Y)) > \kappa(Y)$, then the Lipschitz constant of the mapping $\phi$ is infinite.
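A one-dimensional example may help to see why the condition of Theorem 3.1 forces an infinite Lipschitz constant. The step function below is chosen purely for illustration and does not appear in the paper: its domain is connected, but its image splits into two components separated by distance 1.

```latex
% Illustrative example (assumption, not from the paper): a step function on a connected domain.
\begin{align*}
Y &= [-1, 1], \qquad
\phi(y) = \begin{cases} 0, & y < 0, \\ 1, & y \ge 0, \end{cases} \\
\phi(Y) &= \{0, 1\}, \qquad
\kappa(\phi(Y)) = 2 > 1 = \kappa(Y), \qquad d_{\min} = 1, \\
\frac{|\phi(0) - \phi(-\epsilon)|}{|0 - (-\epsilon)|} &= \frac{1}{\epsilon} \to +\infty
\quad \text{as } \epsilon \to 0^{+}.
\end{align*}
```

Any mapping from problem descriptions to discrete optimal solutions that switches between separated solutions within one connected region of inputs behaves in the same way.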

Figure 2: Table placement plan latency (a) and solver runtime (b). We evaluate SurCo against Dreamshard (Zha et al., 2022b), a SoTA offline RL sharding tool; a domain heuristic of assigning tables based on dimension; and a greedy heuristic based on the estimated runtime increase. Striped approaches require pre-training.

Figure 3: (a) The solution loss (% of failed instances when the design loss is not 0), and (b) test time solver

Figure 4: Inverse photonic design convergence for a single instance (Schubert et al., 2022). SurCo-zero smoothly lowers the loss while the pass-through baseline converges noisily. Also, SurCo-hybrid starts out with a high-quality solution and fine-tunes until an optimal solution is reached. We also visualize the SurCo-zero solution with magnitudes of the two wavelengths of interest, which we successfully route from the input at the top to the two different waveguides at the bottom.

Algorithm 1 SurCo-zero
Input: Ω, y, f
1: c ← init_surrogate_coefs(y)
2: while not converged do
3:     x ← arg min_{x ∈ Ω(y)} c^⊤ x
4:     loss ← f(x; y)
5:     c ← grad_update(c, ∇_c loss)
6: end while
7: return x

Algorithm 2 SurCo-prior Training
Input: Ω, D_train = {y_i}_{i=1}^N, f
1: θ ← init_surrogate_model()
2: while not converged do
3:     Sample batch B = {y_i}_i^k
       …
9:     θ ← grad_update(θ, ∇_θ loss)
10: end while

Algorithm 3 SurCo-prior Deployment
Input: Ω, D_train = {y_i}_{i=1}^N, f, y_test
1: θ ← train_SurCo-prior(Ω, D_train, f)
2: c ← ĉ(y; θ)
3: x ← arg min_{x ∈ Ω(y)} c^⊤ x
4: return x

Algorithm 4 SurCo-hybrid
Input: Ω, D_train = {y_i}_{i=1}^N, f, y_test
1: θ ← train_SurCo-prior(Ω, D_train, f)
2: c ← ĉ(y; θ)
3: while not converged do
4:     x ← arg min_{x ∈ Ω(y)} c^⊤ x
5:     loss ← f(x; y)
6:     c ← grad_update(c, ∇_c loss)
7: end while
8: return x
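The SurCo-zero loop can be sketched end to end on a toy problem. Everything below is an assumption made for illustration: the feasible set (choose exactly `k` of `n` items), the quadratic cost, the two-call gradient estimate, the fixed step budget that replaces the convergence test, and the choice to return the best solution seen rather than the last one.

```python
def solve_top_k(costs, k):
    """Surrogate solver (toy assumption): arg min of c^T x over 'choose exactly k items'."""
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    chosen = set(order[:k])
    return [1.0 if i in chosen else 0.0 for i in range(len(costs))]


def nonlinear_cost(x, weights, target):
    """Hypothetical nonlinear cost f(x; y): squared gap between selected weight and target."""
    total = sum(w * xi for w, xi in zip(weights, x))
    return (total - target) ** 2


def nonlinear_cost_grad(x, weights, target):
    """Gradient of the cost above with respect to x."""
    total = sum(w * xi for w, xi in zip(weights, x))
    return [2.0 * (total - target) * w for w in weights]


def surco_zero(weights, target, k, steps=20, lr=1.0, lam=1.0):
    c = [0.0] * len(weights)  # init_surrogate_coefs
    best_x, best_loss = None, float("inf")
    for _ in range(steps):  # fixed budget stands in for "while not converged"
        x = solve_top_k(c, k)
        loss = nonlinear_cost(x, weights, target)
        if loss < best_loss:
            best_x, best_loss = x, loss
        # Differentiate through the solver with a second, perturbed solver call.
        grad_x = nonlinear_cost_grad(x, weights, target)
        x_improved = solve_top_k([ci + lam * g for ci, g in zip(c, grad_x)], k)
        grad_c = [(xi_imp - xi) / lam for xi, xi_imp in zip(x, x_improved)]
        # grad_update: plain gradient descent on the surrogate coefficients.
        c = [ci - lr * g for ci, g in zip(c, grad_c)]
    return best_x, best_loss


best_x, best_loss = surco_zero([1.0, 2.0, 3.0, 4.0], target=7.0, k=2)
```

On this instance the first solve picks items 0 and 1 (loss 16), and a single coefficient update moves the solver to items 2 and 3, which hit the target exactly. Every iterate is feasible by construction because it always comes from the combinatorial solver.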

[Figure axis labels: Loss (Latency); Runtime (s)]

B.1 SETUPS

Experiments are performed on a cluster of identical machines, each with 4 Nvidia A100 GPUs and 32 CPU cores, with 1 TB of RAM and 40 GB of GPU memory. Additionally, we perform all operations in Python (Van Rossum & Drake, 2009) using PyTorch (Paszke et al., 2019). For embedding table placement, the nonlinear cost estimator is trained for 200 iterations, and the offline-trained models of Dreamshard and SurCo-prior are trained against the pretrained cost estimator for 200 iterations. The DLRM dataset (Naumov et al., 2019) is available at https://github.com/facebookresearch/dlrm_datasets, and the Dreamshard (Zha et al., 2022b) code is available at https://github.com/daochenzha/dreamshard. Additional details on Dreamshard's model architecture and features can be found in the paper and codebase. Training time for the networks used in SurCo-prior and SurCo-hybrid is on average 8 hours for the inverse photonic design settings and 6, 21, 39, 44, 50, and 63 minutes for the DLRM 10, 20, 30, 40, 50, and 60 settings, respectively.

The table features are the same as those used in Zha et al. (2022b), and sinusoidal positional encoding (Vaswani et al., 2017) is used as device features so that the learning model is able to break symmetries between the different tables and effectively group them onto homogeneous devices. The table and device features are concatenated and then fed into Dreamshard's initial fully-connected table encoding module to obtain scalar predictions ĉ_{t,d} for each desired objective coefficient. The architecture is trained with the Adam optimizer with learning rate 0.0005.

The input design specification (a 2D image) is passed through a 3-layer convolutional neural network with ReLU activations and a final layer composed of filtering with the known brush shape. Then a tanh activation is used to obtain surrogate coefficients ĉ, one component for each binary input variable. The architecture is trained with the Adam optimizer with learning rate 0.001.
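The sinusoidal device features used for table sharding can be sketched as follows. This follows the standard formula of Vaswani et al. (2017); the feature dimension of 4, the number of devices, and the function name are assumptions, since the text does not state them.

```python
import math


def sinusoidal_encoding(position, dim):
    """Sinusoidal positional encoding (Vaswani et al., 2017) for one index.

    Even entries use sin and odd entries use cos, with wavelengths that grow
    geometrically across the feature dimension.
    """
    encoding = []
    for j in range(dim):
        pair = j // 2  # index of the (sin, cos) pair this entry belongs to
        angle = position / (10000 ** (2 * pair / dim))
        encoding.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return encoding


# Hypothetical setup: 4 identical devices, each described by a 4-dimensional feature.
device_features = [sinusoidal_encoding(d, 4) for d in range(4)]
```

Each device receives a distinct feature vector even though the devices are homogeneous, which is what allows a learned model to break the symmetry between otherwise interchangeable devices.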


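A PROOFS

Lemma 3.1 (Sufficient condition of prediction with $\epsilon$-accuracy). If the dataset $D_{\text{direct}}$ is an $(\epsilon/L)$-cover of $Y$, then for any $y \in Y$, a 1-nearest-neighbor regressor $\hat\phi$ leads to $\|\hat\phi(y) - \phi(y)\|_2 \le \epsilon$.

Proof. Since the dataset is an $\epsilon/L$-cover, for any $y \in Y$, there exists at least one $y_i$ so that $\|y - y_i\|_2 \le \epsilon/L$. Let $y_{nn}$ be the nearest neighbor of $y$, and we have:

$$\|y - y_{nn}\|_2 \le \|y - y_i\|_2 \le \epsilon/L.$$

From the Lipschitz condition and the definition of the 1-nearest-neighbor regressor ($\hat\phi(y) = \phi(y_{nn})$), we know that

$$\|\hat\phi(y) - \phi(y)\|_2 = \|\phi(y_{nn}) - \phi(y)\|_2 \le L\|y_{nn} - y\|_2 \le \epsilon.$$

Proof (of Lemma 3.2). We prove by contradiction. If $N < N_0(\epsilon)$, then for each training sample $(y_i, \phi_i)$, we create a ball $B_i$ of radius $\epsilon/L$ centered at $y_i$. The total volume of these balls is at most $N \, \mathrm{vol}_0 (\epsilon/L)^d < \mathrm{vol}(Y)$. Therefore, there exists at least one $y \in Y$ so that $y \notin B_i$ for any $1 \le i \le N$. This means that $y$ is not $\epsilon/L$-covered.

Proof (of Theorem 3.1). Let $R_1, R_2, \ldots, R_K$ be the $K = \kappa(\phi(Y))$ connected components of $\phi(Y)$, and $Y_1, Y_2, \ldots, Y_J$ be the $J = \kappa(Y)$ connected components of $Y$. From the condition $K > J$, we know that, by the pigeonhole principle, there exists one $Y_j$ that contains at least part of the two pre-images $S_k := \phi^{-1}(R_k)$ and $S_{k'} := \phi^{-1}(R_{k'})$ with $k \ne k'$. This means that $S_k \cap Y_j \ne \emptyset$ and $S_{k'} \cap Y_j \ne \emptyset$. Then we pick $y \in S_k \cap Y_j$ and $y' \in S_{k'} \cap Y_j$. Since $y, y' \in Y_j$ and $Y_j$ is a connected component, there exists a continuous path $\gamma : [0, 1] \to Y_j$ so that $\gamma(0) = y$ and $\gamma(1) = y'$. Therefore, we have $\phi(\gamma(0)) \in R_k$ and $\phi(\gamma(1)) \in R_{k'}$. Let $t_0 := \sup\{t \in [0, 1] : \phi(\gamma(t)) \in R_k\}$. For any sufficiently small $\epsilon > 0$, we have:

• By the definition of sup, we know there exists $t_0 - \epsilon \le t' \le t_0$ so that $\phi(\gamma(t')) \in R_k$.
• Picking $t'' = t_0 + \epsilon < 1$, then $\phi(\gamma(t'')) \in R_{k''}$ with some $k'' \ne k$.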

