LINEARLY CONSTRAINED BILEVEL OPTIMIZATION: A SMOOTHED IMPLICIT GRADIENT APPROACH

Abstract

This work develops analysis and algorithms for solving a class of bilevel optimization problems in which the lower-level (LL) problems have linear constraints. Most existing approaches for constrained bilevel problems rely on value-function-based approximate reformulations, which suffer from issues such as nonconvex and non-differentiable constraints. In contrast, in this work we develop an implicit gradient-based approach, which is easy to implement and is suitable for machine learning applications. We first provide an in-depth understanding of the problem by showing that the implicit objective for such problems is in general non-differentiable. However, if we add a small (linear) perturbation to the LL objective, the resulting implicit objective becomes differentiable almost surely. This key observation opens the door to developing (deterministic and stochastic) gradient-based algorithms similar to the state-of-the-art ones for unconstrained bilevel problems. We show that when the implicit function is strongly convex, convex, or weakly convex, the resulting algorithms converge with guaranteed rates. Finally, we experimentally corroborate the theoretical findings and evaluate the performance of the proposed framework on numerical and adversarial learning problems. To our knowledge, this is the first time that (implicit) gradient-based methods have been developed and analyzed for the considered class of bilevel problems.

1. INTRODUCTION

Bilevel optimization problems (Colson et al., 2005; Dempe & Zemkoho, 2020) can be used to model an important class of hierarchical optimization tasks with two levels of hierarchy, the upper-level (UL) and the lower-level (LL). The key characteristics of bilevel problems are: 1) the solution of the UL problem requires access to the solution of the LL problem, and 2) the LL problem is parametrized by the UL variable. Bilevel optimization problems arise in a wide range of machine learning applications, such as meta-learning (Rajeswaran et al., 2019; Franceschi et al., 2018), data hypercleaning (Shaban et al., 2019), hyperparameter optimization (Sinha et al., 2020; Franceschi et al., 2018; 2017; Pedregosa, 2016), adversarial learning (Li et al., 2019; Liu et al., 2021a; Zhang et al., 2021), as well as in other application domains such as network optimization (Migdalas, 1995), economics (Cecchini et al., 2013), and transport research (Didi-Biha et al., 2006; Kalashnikov et al., 2010). In this work, we focus on a special class of stochastic bilevel optimization problems, where the LL problem involves the minimization of a strongly convex objective over a set of linear inequality constraints. More precisely, we consider the following formulation:

$$\min_{x \in \mathcal{X}} \; G(x) := f(x, y^*(x)) := \mathbb{E}_{\xi}\big[f(x, y^*(x); \xi)\big], \tag{1a}$$
$$\text{s.t.} \quad y^*(x) \in \arg\min_{y \in \mathbb{R}^{d_\ell}} \big\{\, h(x, y) \;:\; Ay \le b \,\big\}, \tag{1b}$$

where $\xi \sim \mathcal{D}$ represents a stochastic sample of the objective $f(\cdot, \cdot)$, $\mathcal{X} \subseteq \mathbb{R}^{d_u}$ is a convex and closed set, $f : \mathcal{X} \times \mathbb{R}^{d_\ell} \to \mathbb{R}$ is the UL objective, $h : \mathcal{X} \times \mathbb{R}^{d_\ell} \to \mathbb{R}$ is the LL objective, and $f, h$ are smooth functions. We focus on problems where $h(x, y)$ is strongly convex with respect to $y$. The matrix $A \in \mathbb{R}^{k \times d_\ell}$ and vector $b \in \mathbb{R}^k$ define the linear constraints. In the following, we refer to (1a) as the UL problem and to (1b) as the LL one.
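To make the formulation concrete, the following toy sketch instantiates (1) in one dimension; all the specific functions here are illustrative choices of ours, not from the paper. With $h(x,y) = (y-x)^2$, the single constraint $y \le 1$ (i.e., $A = [1]$, $b = [1]$), and $f(x,y) = y^2$, the LL solution map is $y^*(x) = \min(x, 1)$, and a one-sided finite-difference check already shows the implicit objective $G(x) = \min(x,1)^2$ losing differentiability at the point where the constraint becomes active:

```python
# Toy instance of problem (1); all functions are illustrative, not from the paper.
# LL: y*(x) = argmin_y { (y - x)^2 : y <= 1 }  ->  y*(x) = min(x, 1)
# UL: G(x) = f(x, y*(x)) with f(x, y) = y^2   ->  G(x)  = min(x, 1)^2

def y_star(x):
    """Closed-form LL solution: the unconstrained minimizer x projected onto {y <= 1}."""
    return min(x, 1.0)

def G(x):
    """Implicit UL objective."""
    return y_star(x) ** 2

eps = 1e-6
# One-sided finite differences of G at x = 1, where the constraint y <= 1 activates.
left = (G(1.0) - G(1.0 - eps)) / eps    # constraint inactive: y*(x) = x, slope ~ 2
right = (G(1.0 + eps) - G(1.0)) / eps   # constraint active:   y*(x) = 1, slope ~ 0
print(left, right)
```

The two one-sided slopes disagree (roughly 2 versus 0), so $G$ is non-differentiable at $x = 1$ even though both $f$ and $h$ are smooth and $h$ is strongly convex in $y$.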
The success of the bilevel formulation and its algorithms in many machine learning applications can be attributed to the use of efficient (stochastic) gradient-based methods (Liu et al., 2021a). These methods take the following form, in which an (approximate) gradient direction of the UL problem is computed (using the chain rule), and then the UL variable is updated using gradient descent (GD):

$$\nabla G(x) \approx \nabla_x f(x, y^*(x)) + [\nabla y^*(x)]^T \nabla_y f(x, y^*(x)), \qquad \text{GD update:} \;\; x^+ = x - \beta \nabla G(x). \tag{2}$$

The gradient of $G(x)$ is often referred to as the implicit gradient. However, computing this implicit gradient not only requires access to the optimal $y^*(x)$, but also assumes differentiability of the mapping $y^*(x) : \mathcal{X} \to \mathbb{R}^{d_\ell}$. One can potentially solve the LL problem approximately, obtain an approximate solution $\hat{y}(x) \approx y^*(x)$, and use it to compute the implicit gradient (Ghadimi & Wang, 2018). Unfortunately, not all solution maps $y^*(x)$ are differentiable, and when they are not, the above approach cannot be applied. It is known that when the LL problem is strongly convex and unconstrained, $\nabla y^*(x)$ can be easily evaluated using the implicit function theorem (Ghadimi & Wang, 2018). This is the reason that the majority of recent works have focused on developing algorithms for the class of unconstrained bilevel problems (Ghadimi & Wang, 2018; Hong et al., 2020; Ji et al., 2021; Khanduri et al., 2021b; Chen et al., 2021a). However, when the LL problem is constrained, $\nabla y^*(x)$ might not even exist. In that case, most works adopt a value-function-based approach to solve problems with LL constraints (Liu et al., 2021b; Sow et al., 2022; Liu et al., 2021c). Value-function-based methods typically transform the original problem into a single-level problem with non-convex and non-differentiable constraints. To resolve the latter issue, these approaches regularize the problem by adding a strongly convex penalty term, altering the problem's structure.
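In the unconstrained strongly convex case, the implicit function theorem gives $\nabla y^*(x) = -[\nabla^2_{yy} h(x, y^*)]^{-1} \nabla^2_{yx} h(x, y^*)$, which can be plugged into (2). The sketch below checks this on a one-dimensional example of our own choosing (not from the paper): $h(x,y) = (y - \sin x)^2$, so $y^*(x) = \sin x$, and $f(x,y) = x^2 + y^2$:

```python
import math

# Illustrative unconstrained example (our construction) of the implicit
# gradient in Eq. (2), using the implicit function theorem
#   dy*/dx = -[d²h/dy²]^{-1} d²h/dydx  evaluated at y = y*(x).
# LL: h(x, y) = (y - sin x)^2  ->  y*(x) = sin x, so dy*/dx = cos x.
# UL: f(x, y) = x^2 + y^2      ->  G(x)  = x^2 + sin^2 x.

def y_star(x):
    return math.sin(x)

def implicit_grad(x):
    y = y_star(x)
    d2h_yy = 2.0                    # d²h/dy²  = 2
    d2h_yx = -2.0 * math.cos(x)     # d²h/dydx = -2 cos x
    dy_dx = -d2h_yx / d2h_yy        # implicit function theorem: cos x
    df_dx = 2.0 * x                 # partial derivative of f w.r.t. x
    df_dy = 2.0 * y                 # partial derivative of f w.r.t. y
    return df_dx + dy_dx * df_dy    # chain rule, as in Eq. (2)

def G(x):
    return x ** 2 + y_star(x) ** 2

x0, eps = 0.7, 1e-6
fd = (G(x0 + eps) - G(x0 - eps)) / (2 * eps)  # central finite-difference check
print(implicit_grad(x0), fd)
```

The implicit gradient from Eq. (2) matches the finite-difference derivative of $G$, which is exactly what fails once LL constraints make $y^*(x)$ non-differentiable.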
In contrast, we introduce a perturbation-based smoothing technique which, at any given $x \in \mathcal{X}$, makes $y^*(x)$ differentiable almost surely, without practically changing the landscape of the original problem (see (Lu et al., 2020, pg. 5)). It is important to note that value-function-based approaches are better suited to deterministic implementations, and it is therefore difficult to use such algorithms for large-scale applications and/or when the data sizes are large. On the other hand, the gradient-based algorithms developed in our work can easily handle stochastic problems. Finally, there is a line of work (Amos & Kolter, 2017; Agrawal et al., 2019; Donti et al., 2017; Gould et al., 2021) on implicit differentiation in the deep learning literature. However, in these works both the setting (e.g., layers of neural networks described by optimization tasks) and the focus (e.g., on gradient computation and implementation, rather than on algorithms and analysis) are different. For more details see Appendix A.
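A one-dimensional toy version of this smoothing idea (our own construction, not the paper's full algorithm) can be sketched as follows. The constrained LL problem $y^*(x) = \arg\min_y \{(y-x)^2 : y \ge 0\} = \max(x, 0)$ has a kink at $x = 0$. Adding a random linear term $q\,y$ to the LL objective moves the minimizer to $y_q^*(x) = \max(x - q/2, 0)$, shifting the kink to $x = q/2$; since $q$ is drawn from a continuous distribution, any fixed $x$ is almost surely a point of differentiability:

```python
import random

# Toy illustration of linear-perturbation smoothing (our construction).
# Unperturbed LL: y*(x) = argmin_y { (y - x)^2 : y >= 0 } = max(x, 0),
# which is non-differentiable at x = 0. With the perturbed objective
# (y - x)^2 + q*y, the solution becomes y_q*(x) = max(x - q/2, 0):
# the kink moves to x = q/2, away from any fixed point almost surely.

def y_star_perturbed(x, q):
    """Closed-form solution of min_y { (y - x)^2 + q*y : y >= 0 }."""
    return max(x - q / 2.0, 0.0)

random.seed(0)
x0 = 0.0                       # the kink location of the unperturbed problem
q = random.uniform(0.1, 1.0)   # random linear perturbation coefficient

eps = 1e-6
left = (y_star_perturbed(x0, q) - y_star_perturbed(x0 - eps, q)) / eps
right = (y_star_perturbed(x0 + eps, q) - y_star_perturbed(x0, q)) / eps
print(left, right)  # the one-sided slopes now agree at x0
```

After the perturbation, the two one-sided slopes at $x_0 = 0$ coincide, so the perturbed solution map is differentiable there, while for small $q$ the solution map is only slightly shifted.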

Contributions.

In this work, we study a class of bilevel optimization problems with a strongly convex objective and linear constraints in the LL. The major challenges in solving such problems are the following: 1) How can we ensure that the implicit function $G(x)$ is differentiable? and 2) Even if the implicit function is differentiable, how can we compute its (approximate) gradient in order to develop first-order methods? Our work addresses these challenges and develops first-order methods to tackle such constrained bilevel problems. Specifically, our contributions are the following:

- We provide an in-depth understanding of bilevel problems with strongly convex, linearly constrained LL problems. Specifically, we first show with an example that the implicit objective $G(x)$ is in general non-differentiable. To address the non-differentiability, we propose a perturbation-based smoothing technique that makes the implicit objective $G(x)$ differentiable in an almost sure sense, and we provide a closed-form expression for the (approximate) implicit gradient.

- The smoothed problem we obtain is challenging, since its implicit objective does not have Lipschitz continuous gradients. Therefore, conventional gradient-based algorithms may no longer work. To address this issue, we propose the Deterministic Smoothed Implicit Gradient ([D]SIGD) method, which utilizes an (approximate) line-search-based algorithm, and we establish asymptotic convergence guarantees. We also analyze [S]SIGD for the stochastic version of problem (1) (with fixed/diminishing step-sizes) and establish finite-time convergence guarantees for the cases when the implicit function is weakly convex, strongly convex, and convex (but not Lipschitz smooth).

- Finally, we evaluate the performance of the proposed algorithmic framework via experiments on quadratic bilevel and adversarial learning problems.

Bilevel problem (1) captures several important applications; below we provide two of them.

Adversarial Training.
Consider the problem of robustly training a model $\phi(x; c)$, where $x$ denotes the model parameters and $c$ the input to the model; let $\{(c_i, d_i)\}_{i=1}^N$, with $c_i \in \mathbb{R}^{d_\ell}$ and $d_i \in \mathbb{R}$, be the training set (Zhang et al., 2021; Goodfellow et al., 2014). Robust training can be formulated as the following bilevel problem:

