SURCO: LEARNING LINEAR SURROGATES FOR COMBINATORIAL NONLINEAR OPTIMIZATION PROBLEMS

Abstract

Optimization problems with expensive nonlinear cost functions and combinatorial constraints appear in many real-world applications, but remain challenging to solve efficiently. Existing combinatorial solvers, such as those for Mixed Integer Linear Programming, can be fast in practice but cannot readily optimize nonlinear cost functions, while general nonlinear optimizers like gradient descent often do not handle complex combinatorial structures, may require many queries of the cost function, and are prone to local optima. To bridge this gap, we propose SurCo, which learns linear Surrogate costs that can be used by existing Combinatorial solvers to output good solutions to the original nonlinear combinatorial optimization problem, combining the flexibility of gradient-based methods with the structure of linear combinatorial optimization. We learn these linear surrogates end-to-end against the nonlinear loss by differentiating through the linear surrogate solver. We propose three variants of SurCo: SurCo-zero operates on individual nonlinear problems, SurCo-prior trains a linear surrogate predictor on distributions of problems, and SurCo-hybrid uses a model trained offline to warm start online solving with SurCo-zero. We analyze our method theoretically and empirically, showing smooth convergence and improved performance. Experiments show that compared to state-of-the-art approaches and expert-designed heuristics, SurCo obtains lower-cost solutions with comparable or faster solve time on two real-world, industry-level applications: embedding table sharding and inverse photonic design.

1. INTRODUCTION

Combinatorial optimization problems with linear objective functions, like linear programming (LP) (Chvatal et al., 1983) and mixed integer linear programming (MILP) (Wolsey, 2007), have been extensively studied in operations research (OR). The resulting high-performance solvers like Gurobi (Gurobi Optimization, LLC, 2022) can solve industrial-scale optimization problems with tens of thousands of variables in a few minutes. However, even with perfect solvers, one issue remains: the cost functions f(x) in many practical problems are nonlinear, and the highly-optimized solvers mainly handle linear or convex formulations, while real-world problems have less structured objectives. For example, in embedding table sharding (Zha et al., 2022a), one needs to distribute embedding tables across multiple GPUs for the deployment of recommendation systems. Due to batching behaviors within a single GPU and communication costs among different GPUs, the overall latency (cost function) in this application depends on interactions of multiple tables and can thus be highly nonlinear (Zha et al., 2022a). To obtain useful solutions to such real-world problems, one may choose to directly optimize the nonlinear cost, which is either a black-box output of a simulator (Gosavi et al., 2015; Ye et al., 2019), or a cost estimator learned by machine learning techniques (e.g., deep models) from offline data (Steiner et al., 2021; Koziel et al., 2021; Wang et al., 2021b; Cozad et al., 2014). However, many of these direct optimization approaches either rely on human-defined heuristics (e.g., greedy (Korte & Hausmann, 1978; Reingold & Tarjan, 1981; Wolsey, 1982), local improvement (Voß et al., 2012; Li et al., 2021)), or resort to general nonlinear optimization techniques like gradient descent (Ruder, 2016), reinforcement learning (Mazyavkina et al., 2021), or evolutionary algorithms (Simon, 2013).
While these approaches can work in practice, they may lead to a slow optimization process, in particular when the cost function is expensive to evaluate, and they often ignore the combinatorial nature of most real-world applications (encoded in the feasible set x ∈ Ω). In this work, we propose a systematic framework, SurCo, that leverages existing efficient combinatorial solvers to find solutions to nonlinear combinatorial optimization problems arising in real-world scenarios. There are three versions of SurCo: SurCo-zero, SurCo-prior, and SurCo-hybrid. In SurCo-zero, given a nonlinear differentiable cost f(x) to be minimized, we optimize a linear surrogate cost ĉ so that the surrogate optimizer (SO) min_{x∈Ω} ĉ⊤x outputs a solution that is expected to be optimal w.r.t. the original nonlinear cost f(x). Due to its linear nature, SO can be solved efficiently with existing solvers, and the surrogate cost ĉ can be optimized in an end-to-end manner by back-propagating through the solver (Pogančić et al., 2019; Niepert et al., 2021; Berthet et al., 2020). In SurCo-prior, we consider a family of nonlinear differentiable functions f(x; y), where y parameterizes problem descriptions. We train the linear surrogate predictor ĉ(y) on a set of optimization problems (the training set {y_i}), and evaluate on a held-out problem y′ by directly optimizing SO: x*(y′) := argmin_{x∈Ω(y′)} ĉ(y′)⊤x, which avoids optimizing the cost f(x; y′) from scratch. Finally, in SurCo-hybrid we use initial surrogate costs predicted by a fully-trained SurCo-prior and then fine-tune them using SurCo-zero, leveraging both domain knowledge synthesized offline and information about the specific instance. All versions of SurCo are evaluated on two real-world nonlinear optimization problems: embedding table sharding (Zha et al., 2022a) and photonic inverse design (Schubert et al., 2022).
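The SurCo-zero loop can be illustrated with a minimal sketch. Everything below is an illustrative stand-in, not the paper's implementation: Ω is a tiny enumerable set so the surrogate solver can be exact, f is a toy quadratic in place of an expensive simulator, and the solver "gradient" is estimated by re-solving under a perturbed cost and differencing the two solutions, in the spirit of blackbox solver differentiation (Pogančić et al., 2019).

```python
import numpy as np

# Toy feasible set Omega: the four one-hot vectors (pick exactly one item).
# In SurCo this role is played by a combinatorial (e.g., MILP) solver over a
# large implicit Omega; here we enumerate so the surrogate solver is exact.
OMEGA = np.eye(4)

def solver(c):
    """Surrogate optimizer SO: argmin over x in Omega of c^T x."""
    return OMEGA[int(np.argmin(OMEGA @ c))]

# Stand-in nonlinear cost f (imagine an expensive simulator or learned cost
# model); over Omega its minimizer is the one-hot vector e_2.
TARGET = np.array([0.0, 0.0, 1.0, 0.0])

def f(x):
    return float(np.sum((x - TARGET) ** 2))

def grad_f(x):
    return 2.0 * (x - TARGET)

def surco_zero(steps=30, lr=0.2, lam=1.0, seed=0):
    """Descend f(solver(c)) in the surrogate cost c: re-solve with c
    perturbed along df/dx and difference the two solutions to get an
    approximate gradient of the solver output with respect to c."""
    c = np.random.default_rng(seed).normal(size=4)
    for _ in range(steps):
        x = solver(c)
        x_pert = solver(c + lam * grad_f(x))  # informed perturbation of c
        c -= lr * (x_pert - x) / lam          # approximate gradient step on c
    return solver(c)
```

Note that the only learned object is the 4-dimensional surrogate cost c; feasibility is handled entirely by the linear solver, so every iterate is a valid combinatorial solution.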
In both cases, we show that in the on-the-fly setting, SurCo achieves higher-quality solutions in comparable or less wall-clock time, thanks to the help of an efficient combinatorial solver; in the prior setting, our method obtains better solutions on held-out problems than other methods that require training (e.g., reinforcement learning). More specifically, in table sharding, SurCo-zero obtains between 14% and 85% improvement in solution quality with between 2% and 23% runtime overhead compared to the greedy baseline, while SurCo-prior obtains between 47% and 71% solution quality improvement over the state-of-the-art RL-based table sharding algorithm of Zha et al. (2022b). SurCo-hybrid obtains better solutions than either SurCo-zero or SurCo-prior, with a runtime overhead similar to SurCo-zero. In photonic inverse design, SurCo-zero finds 21% more viable solutions for the beam splitter and twice as many solutions for the wavelength demultiplexer, solving all instances of the mode converter and bend problems, while taking between 10% and 64% less time than the pass-through approach from previous work (Schubert et al., 2022). While the offline-trained SurCo-prior misses some optimal solutions across settings, it frequently obtains solutions in 0.5% to 2% of the runtime, since it needs neither to evaluate the objective nor to perform gradient steps. Again, SurCo-hybrid obtains solutions more often than the other approaches, with a runtime overhead comparable to SurCo-zero. We additionally present theoretical results that help explain why training a model to predict surrogate linear coefficients exhibits better sample complexity than directly predicting the optimal solution (Li et al., 2018; Ban & Rudin, 2019).

2. PROBLEM SPECIFICATION

Our goal is to solve the following nonlinear optimization problem, described by y:

min_x f(x; y)  s.t.  x ∈ Ω(y),

where x ∈ R^n are the variables to be optimized, f(x; y) is the nonlinear differentiable cost function to be minimized, Ω(y) is the feasible region, typically specified by linear (in)equalities and integer constraints, and y ∈ Y are the problem instance parameters drawn from a distribution D over Y. For example, in the traveling salesman problem, y can be the distance matrix among cities. We often consider solving a family of optimization problems, described as y ∈ Y.

Differentiable cost function. The nonlinear cost function f(x; y) can either be the output of a simulator made differentiable via finite differencing (e.g., in JAX (Bradbury et al., 2018)), or a cost model learned from an offline dataset, often generated by sampling multiple feasible solutions within Ω(y) and recording their costs. The cost model often takes the form of a deep neural network. In this work, we assume the following property of f(x; y):

Assumption 2.1 (Cost function). During optimization, the cost function f(x; y) and its partial derivative ∂f/∂x are accessible.
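The specification above can be made concrete with a small illustrative interface. The class name and the particular choices of f and Ω below are hypothetical, picked so that Ω(y) is enumerable; a real instance would expose the same three pieces (feasible region, cost, gradient) with Ω(y) handed implicitly to a combinatorial solver.

```python
import numpy as np

class NonlinearCombinatorialProblem:
    """A toy instance of  min_x f(x; y)  s.t.  x in Omega(y),
    exposing f and df/dx as required by Assumption 2.1."""

    def __init__(self, y):
        self.y = np.asarray(y, dtype=float)  # instance parameters, y ~ D

    def feasible_set(self):
        """Omega(y): all binary vectors of length len(y). In practice Omega
        is specified implicitly via linear constraints plus integrality;
        we enumerate only because this toy instance is tiny."""
        n = len(self.y)
        return [np.array(bits, dtype=float) for bits in np.ndindex(*(2,) * n)]

    def cost(self, x):
        """Nonlinear cost f(x; y): a quadratic with cross-variable
        interactions, standing in for a simulator or learned cost model."""
        A = np.outer(self.y, self.y)
        return float(x @ A @ x - self.y @ x)

    def grad(self, x):
        """Partial derivative df/dx, accessible per Assumption 2.1."""
        A = np.outer(self.y, self.y)
        return 2.0 * A @ x - self.y
```

The quadratic here couples variables through the outer product y y⊤, so the objective cannot be written as a linear function of x, which is exactly the situation where a learned linear surrogate becomes useful.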

