LINEARLY CONSTRAINED BILEVEL OPTIMIZATION: A SMOOTHED IMPLICIT GRADIENT APPROACH

Abstract

This work develops an analysis and algorithms for solving a class of bilevel optimization problems in which the lower-level (LL) problem has linear constraints. Most existing approaches for constrained bilevel problems rely on value function based approximate reformulations, which suffer from issues such as non-convex and non-differentiable constraints. In contrast, in this work we develop an implicit gradient-based approach, which is easy to implement and is suitable for machine learning applications. We first provide an in-depth understanding of the problem by showing that the implicit objective for such problems is in general non-differentiable. However, if we add a small (linear) perturbation to the LL objective, the resulting implicit objective becomes differentiable almost surely. This key observation opens the door to developing (deterministic and stochastic) gradient-based algorithms similar to the state-of-the-art ones for unconstrained bilevel problems. We show that when the implicit function is strongly convex, convex, or weakly convex, the resulting algorithms converge with guaranteed rates. Finally, we experimentally corroborate the theoretical findings and evaluate the performance of the proposed framework on numerical and adversarial learning problems. To our knowledge, this is the first time that (implicit) gradient-based methods have been developed and analyzed for the considered class of bilevel problems.

1. INTRODUCTION

Bilevel optimization problems (Colson et al., 2005; Dempe & Zemkoho, 2020) can be used to model an important class of hierarchical optimization tasks with two levels of hierarchy, the upper level (UL) and the lower level (LL). The key characteristics of bilevel problems are: 1) the solution of the UL problem requires access to the solution of the LL problem, and 2) the LL problem is parametrized by the UL variable. Bilevel optimization problems arise in a wide range of machine learning applications, such as meta-learning (Rajeswaran et al., 2019; Franceschi et al., 2018), data hypercleaning (Shaban et al., 2019), hyperparameter optimization (Sinha et al., 2020; Franceschi et al., 2018; 2017; Pedregosa, 2016), and adversarial learning (Li et al., 2019; Liu et al., 2021a; Zhang et al., 2021), as well as in other application domains such as network optimization (Migdalas, 1995), economics (Cecchini et al., 2013), and transport research (Didi-Biha et al., 2006; Kalashnikov et al., 2010). In this work, we focus on a special class of stochastic bilevel optimization problems, where the LL problem involves the minimization of a strongly convex objective over a set of linear inequality constraints. More precisely, we consider the following formulation:

$$\min_{x \in \mathcal{X}} \; G(x) := f(x, y^*(x)) := \mathbb{E}_{\xi}\big[f(x, y^*(x); \xi)\big], \tag{1a}$$

$$\text{s.t.} \quad y^*(x) \in \arg\min_{y \in \mathbb{R}^{d_\ell}} \big\{\, h(x, y) \;\big|\; Ay \le b \,\big\}, \tag{1b}$$

where $\xi \sim \mathcal{D}$ represents a stochastic sample of the objective $f(\cdot, \cdot)$, $\mathcal{X} \subseteq \mathbb{R}^{d_u}$ is a convex and closed set, $f : \mathcal{X} \times \mathbb{R}^{d_\ell} \to \mathbb{R}$ is the UL objective, $h : \mathcal{X} \times \mathbb{R}^{d_\ell} \to \mathbb{R}$ is the LL objective, and $f, h$ are smooth functions. We focus on problems where $h(x, y)$ is strongly convex with respect to $y$. The matrix $A \in \mathbb{R}^{k \times d_\ell}$ and the vector $b \in \mathbb{R}^k$ define the linear constraints. In the following, we refer to (1a) as the UL problem and to (1b) as the LL one.
The success of the bilevel formulation and its algorithms in many machine learning applications can be attributed to the use of efficient (stochastic) gradient-based methods (Liu et al., 2021a). These methods take the following form, in which an (approximate) gradient direction of the UL problem is computed (using the chain rule), and then the UL variable is updated using gradient descent (GD):

$$\nabla G(x) \approx \nabla_x f(x, y^*(x)) + [\nabla y^*(x)]^T \nabla_y f(x, y^*(x)), \qquad \text{GD update:} \quad x^+ = x - \beta \nabla G(x). \tag{2}$$

The gradient of $G(x)$ is often referred to as the implicit gradient. However, computing this implicit gradient not only requires access to the optimal $y^*(x)$, but also assumes differentiability of the mapping $y^*(x) : \mathcal{X} \to \mathbb{R}^{d_\ell}$. One can potentially solve the LL problem approximately, obtain an approximation $\hat{y}(x) \approx y^*(x)$, and use it to compute the implicit gradient (Ghadimi & Wang, 2018). Unfortunately, not all solutions $y^*(x)$ are differentiable, and when they are not, the above approach cannot be applied. It is known that when the LL problem is strongly convex and unconstrained, $\nabla y^*(x)$ can be easily evaluated using the implicit function theorem (Ghadimi & Wang, 2018). This is the reason that the majority of recent works have focused on developing algorithms for the class of unconstrained bilevel problems (Ghadimi & Wang, 2018; Hong et al., 2020; Ji et al., 2021; Khanduri et al., 2021b; Chen et al., 2021a). However, when the LL problem is constrained, $\nabla y^*(x)$ might not even exist. In that case, most works adopt a value function-based approach to solve problems with LL constraints (Liu et al., 2021b; Sow et al., 2022; Liu et al., 2021c). Value function-based methods typically transform the original problem into a single-level problem with non-convex and non-differentiable constraints. To resolve the latter issue, these approaches regularize the problem by adding a strongly convex penalty term, altering the problem's structure.
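To make the update rule (2) concrete, the sketch below runs the implicit-gradient GD loop on a minimal one-dimensional instance of our own construction (not from the paper), in the unconstrained strongly convex case where the implicit function theorem gives $\nabla y^*(x)$ in closed form:

```python
def y_star(x):
    # argmin_y h(x, y) for the toy LL objective h(x, y) = (y - x)^2 / 2,
    # which is strongly convex in y with closed-form solution y*(x) = x
    return x

def implicit_grad(x):
    # implicit function theorem for the unconstrained LL problem:
    # dy*/dx = -[d^2 h / dy^2]^{-1} * d^2 h / (dy dx) = -(1)^{-1} * (-1) = 1
    dy_dx = 1.0
    y = y_star(x)
    # chain rule as in (2), with UL objective f(x, y) = (y - 1)^2/2 + x^2/2
    return x + dy_dx * (y - 1.0)

x = 0.0
for _ in range(200):              # GD update: x <- x - beta * grad G(x)
    x -= 0.1 * implicit_grad(x)
print(round(x, 4))                # 0.5, the minimizer of G(x) = (x-1)^2/2 + x^2/2
```

Here $G(x) = (x-1)^2/2 + x^2/2$ is the implicit objective, so the iterates contract toward $x = 1/2$; the constrained case treated in this paper is exactly the setting where such a closed-form $\nabla y^*(x)$ is no longer guaranteed to exist.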
In contrast, we introduce a perturbation-based smoothing technique which, at any given $x \in \mathcal{X}$, makes $y^*(x)$ differentiable almost surely, without practically changing the landscape of the original problem (see Lu et al. (2020, pg. 5)). It is important to note that value function-based approaches are more suited to deterministic implementations, and it is therefore difficult to use such algorithms for large-scale applications and/or when the data sizes are large. On the other hand, the gradient-based algorithms developed in our work can easily handle stochastic problems. Finally, there is a line of work (Amos & Kolter, 2017; Agrawal et al., 2019; Donti et al., 2017; Gould et al., 2021) on implicit differentiation in the deep learning literature. However, in these works both the setting (e.g., layers of a neural network described by optimization tasks) and the focus (e.g., on gradient computation and implementation, rather than on algorithms and analysis) are different. For more details see Appendix A.

Contributions.

In this work, we study a class of bilevel optimization problems with a strongly convex objective and linear constraints in the LL. The major challenges in solving such problems are the following: 1) How can we ensure that the implicit function $G(x)$ is differentiable? 2) Even if the implicit function is differentiable, how can we compute its (approximate) gradient in order to develop first-order methods? Our work addresses these challenges and develops first-order methods to tackle such constrained bilevel problems. Specifically, our contributions are the following:
- We provide an in-depth understanding of bilevel problems with strongly convex, linearly constrained LL problems. Specifically, we first show with an example that the implicit objective $G(x)$ is in general non-differentiable. To address the non-differentiability, we propose a perturbation-based smoothing technique that makes the implicit objective $G(x)$ differentiable in an almost sure sense, and we provide a closed-form expression for the (approximate) implicit gradient.
- The smoothed problem we obtain is challenging, since its implicit objective does not have Lipschitz continuous gradients; therefore, conventional gradient-based algorithms may no longer work. To address this issue, we propose the Deterministic Smoothed Implicit Gradient Descent ([D]SIGD) method, which utilizes an (approximate) line-search-based scheme, and we establish asymptotic convergence guarantees. We also analyze [S]SIGD for the stochastic version of problem (1) (with fixed/diminishing step-sizes) and establish finite-time convergence guarantees for the cases where the implicit function is weakly convex, strongly convex, or convex (but not Lipschitz smooth).
- Finally, we evaluate the performance of the proposed algorithmic framework via experiments on quadratic bilevel and adversarial learning problems.

Bilevel problem (1) captures several important applications; below we provide two of them.

Adversarial Training.
Consider the problem of robustly training a model $\phi(x; c)$, where $x$ denotes the model parameters and $c$ the input to the model; let $\{(c_i, d_i)\}_{i=1}^N$ with $c_i \in \mathbb{R}^{d_{\ell_i}}$, $d_i \in \mathbb{R}$ be the training set (Zhang et al., 2021; Goodfellow et al., 2014). Robust training can be formulated as the following bilevel problem:

$$\min_{x \in \mathbb{R}^{d_u}} \sum_{i=1}^N f_i\big(\phi(x; c_i + y_i^*(x)), d_i\big) \quad \text{s.t.} \quad y^*(x) \in \arg\min_{y} \Big\{ \sum_{i=1}^N h_i\big(\phi(x; c_i + y_i), d_i\big) \;\Big|\; -b \le y \le b \Big\}, \tag{3}$$

where $y = [y_1^T, \ldots, y_N^T]^T \in \mathbb{R}^{d_\ell}$, $y_i \in \mathbb{R}^{d_{\ell_i}}$ denotes the attack on the $i$-th example, and $\sum_{i=1}^N d_{\ell_i} = d_\ell$. Moreover, $f_i : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ denotes the loss function for learning the model parameters $x$, while $h_i : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ denotes the adversarial objective used to design the optimal attack $y$. Note that the linear constraints $-b \le y \le b$ in the LL problem model the attack budget.

Distributed Optimization. In distributed optimization (Chang et al., 2020; Yang et al., 2019), a set of $N$ agents aim to jointly minimize an objective function $G(x)$ over an undirected graph $\mathcal{G} = (V, E)$. We consider the following distributed bilevel problem:

$$\min_{\{x_i \in \mathcal{X} \,|\, Ax = 0\}} G(x) := \sum_{i=1}^N f_i\big(x_i, y_i^*(x_i)\big) \quad \text{s.t.} \quad y^*(x) \in \arg\min_{y \in \mathbb{R}^{d_\ell}} \Big\{ \sum_{i=1}^N h_i(x_i, y_i) \;\Big|\; Ay = 0 \Big\},$$

where $x = [x_1, \ldots, x_N]$ and $y = [y_1, \ldots, y_N]$. Each agent $i \in [N]$ has access to $f_i$ and $h_i$. The constraint $Ay = 0$ (resp. $Ax = 0$) is introduced to ensure consensus among the LL (resp. UL) variables. Such problems arise in signal processing and sensor networks (Yousefian, 2021). This formulation also models a decentralized meta-learning problem in which the training and validation data are distributed among agents while each agent aims to solve the meta-learning problem globally (Ji et al., 2021).
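As an illustration of the LL attack problem in (3), the sketch below runs projected gradient ascent on the loss of a linear model under the box budget $-b \le y \le b$. The model, the single data point, and the choice $h_i = -f_i$ are our own illustrative assumptions (the experiments later use the regularized variant $h_i = -f_i + \lambda \|y_i\|^2$):

```python
import numpy as np

x = np.array([1.0, -2.0])            # fixed model parameters (illustrative)
c, d = np.array([0.5, 0.2]), 1.0     # one training example (features, label)
b = 0.1                              # attack budget: -b <= y <= b

def loss(x, inp, d):
    # logistic loss of the linear model on input `inp`
    return np.log1p(np.exp(-d * (x @ inp)))

y = np.zeros_like(c)
for _ in range(100):
    margin = d * (x @ (c + y))
    grad = -d * x / (1.0 + np.exp(margin))   # gradient of the loss w.r.t. y
    y = np.clip(y + 0.05 * grad, -b, b)      # ascent step, then project to the box

assert loss(x, c + y, d) > loss(x, c, d)     # the attack increases the loss
print(y)                                     # saturates the budget: [-0.1, 0.1]
```

Because the model is linear, the loss gradient with respect to $y$ has a fixed sign pattern, so the optimal attack saturates the box constraint; this is exactly the situation where the active box rows enter the implicit-gradient formulas developed in Section 2.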

2. PROPERTIES AND IMPLICIT GRADIENT OF BILEVEL PROBLEM (1)

2.1. PRELIMINARIES

In this section we study the properties of problem (1). First, let us define the necessary notation. Let $\bar{A}(y)$ be the matrix containing the rows $S(y) \subseteq \{1, \ldots, k\}$ of $A$ that correspond to the active constraints of the inequality $Ay \le b$ in the LL problem; that is, $\bar{A}(y)\, y = \bar{b}(y)$, where $\bar{b}(y)$ contains the elements of $b$ with indices in $S(y)$. Also, we denote by $\bar{\lambda}^*(x)$ the vector of Lagrange multipliers corresponding to the active constraints at $y^*(x)$. Next, we introduce some basic assumptions.

Assumption 1. We assume that the following conditions hold for problem (1):
(a) $f(x, y)$ is continuously differentiable, and $h(x, y)$ is twice continuously differentiable.
(b) $\mathcal{X}$ is closed and convex; $\mathcal{Y} = \{ y \in \mathbb{R}^{d_\ell} \mid Ay \le b \}$ is a compact set.
(c) $h(x, y)$ is strongly convex in $y$, for every $x \in \mathcal{X}$, with modulus $\mu_h$.
(d) There exists $y \in \mathbb{R}^{d_\ell}$ such that $Ay < b$.
(e) $\bar{A}(y^*(x))$ is full row rank, for every $x \in \mathcal{X}$.

Assumptions 1(a), (b), and (c) are standard in the bilevel optimization literature and are required to ensure the continuity of the implicit function (Proposition 1). Assumption 1(c) ensures that the implicit function $G(x)$ is well defined, since the LL problem returns a single point. Assumption 1(d) ensures strict feasibility of the LL problem, while Assumption 1(e) implies that the rows of $A$ corresponding to the active constraints are linearly independent; this assumption is necessary to ensure the differentiability of the implicit function (Lemmas 1 and 2). Note also that there are special cases in which Assumption 1(e) is automatically satisfied. For instance, consider a problem where the LL problem has box constraints, i.e., $a \le y \le b$. Then, for any $y \in \mathcal{Y}$, the only possible non-zero values in the matrix $\bar{A}(y)$ are $+1$ and $-1$, and there is only one non-zero value in each column; therefore $\bar{A}(y)$ is full row rank. Next, we utilize the above assumptions to analyze the properties of the mapping $y^*(x)$.

Proposition 1 (Appendix D.1.1).
Under Assumption 1, the mapping $y^*(x) : \mathcal{X} \to \mathbb{R}^{d_\ell}$ and the implicit function $G(x)$ are both continuous.

Proposition 1 ensures that $y^*(x)$ and $G(x)$ are both continuous. Now, if we can also ensure differentiability of $y^*(x)$, then we should be able to implement a gradient-based update rule to solve (1). However, as the following example illustrates, $y^*(x)$, and thus $G(x)$, are not differentiable in general.

Example. Consider the following problem:

$$\min_{x \in [0,1]} \; x + y^*(x) \quad \text{s.t.} \quad y^*(x) \in \arg\min_{y \in \mathbb{R}} \big\{ (y^2 - x^2)^2 \;\big|\; 3/5 \le y \le 1 \big\}. \tag{4}$$

The mapping $y^*(x)$ is $y^*(x) = x$ if $x \in [3/5, 1]$, and $y^*(x) = 3/5$ if $x \in [0, 3/5)$. In Figure 1, we plot this mapping. Notice that at the point $x = 3/5$ the mapping (and thus the implicit function) is non-differentiable.

To address the non-differentiability issue, we introduce a perturbation-based "smoothing" technique. Specifically, for any fixed $x \in \mathcal{X}$, we modify the LL objective $h(x, y)$ by adding the linear perturbation term $q^T y$, where $q$ is a random vector sampled from some continuous distribution $\mathcal{Q}$. We use the following notation for the "smoothed" LL objective:

$$g(x, y) := h(x, y) + q^T y \quad \text{and} \quad y^*(x) := \arg\min_{y \in \mathbb{R}^{d_\ell}} \{\, g(x, y) \mid Ay \le b \,\}. \tag{5}$$

Also, we denote by $F(x) := f(x, y^*(x))$ the respective "smoothed" implicit function. Such a perturbation is used to ensure that, at a given $x \in \mathcal{X}$, the strict complementarity (SC) property holds for the LL problem with probability 1 (w.p. 1); see the lemma below for the formal statement.

Lemma 1 ((Lu et al., 2020, Proposition 1)). For a given $x \in \mathcal{X}$, if $y^*(x)$ is a KKT point of the problem $\min_{y \in \mathbb{R}^{d_\ell}} \{ g(x, y) \mid Ay \le b \}$, $q$ is generated from a continuous measure, and $\bar{A}(y^*(x))$ is full row rank, then the SC condition holds at $x$ w.p. 1.
Combining the SC property ensured by Lemma 1 with Assumption 1, we can show that the mapping $y^*(x)$ is (almost surely) differentiable, which further implies that the implicit function $F(x)$ is differentiable at a given $x \in \mathcal{X}$, and we can obtain a closed-form expression for its gradient (Lemma 2). We would like to stress that the properties mentioned above (i.e., SC and differentiability) are defined locally, at a given point $x \in \mathcal{X}$. These properties will be used later to design algorithms that approximately optimize the original problem (1). Finally, it is worth noting that, in the absence of such a perturbation term, we would have to introduce the SC property as an assumption.
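The kink in the example above can be checked numerically. The snippet below uses the closed-form LL solution $y^*(x) = \max(x, 3/5)$ of the example and compares one-sided finite differences at the kink $x = 3/5$:

```python
import numpy as np

# LL solution of the example: min_y (y^2 - x^2)^2 s.t. 3/5 <= y <= 1 gives
# y*(x) = max(x, 3/5) on [0, 1], i.e. a clip of x to [0.6, 1.0].
def y_star(x):
    return np.clip(x, 0.6, 1.0)

eps = 1e-6
left = (y_star(0.6) - y_star(0.6 - eps)) / eps    # constraint active: slope 0
right = (y_star(0.6 + eps) - y_star(0.6)) / eps   # constraint inactive: slope 1
print(left, right)  # the two one-sided derivatives (0 and ~1) disagree
```

Note that with the linear perturbation $q\,y$ added for some $q > 0$, the LL gradient at $y = 3/5$ becomes strictly positive at $x = 3/5$, so the constraint is active with a strictly positive multiplier (strict complementarity) and $y^*(\cdot)$ is locally constant, hence differentiable, at that point; this is our reading of the mechanism behind Lemma 1 on this example.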

2.2. IMPLICIT GRADIENT

In this section, we derive a closed-form expression for the gradient of the implicit function $F(x)$.

Lemma 2 (Implicit Gradient, Appendix D.1.2). Under Assumption 1, for any given $x \in \mathcal{X}$, we have

$$\nabla y^*(x) = \big[\nabla^2_{yy} g(x, y^*(x))\big]^{-1} \Big( -\nabla^2_{xy} g(x, y^*(x)) - \bar{A}^T \nabla \bar{\lambda}^*(x) \Big), \tag{6}$$

$$\nabla \bar{\lambda}^*(x) = - \Big( \bar{A} \big[\nabla^2_{yy} g(x, y^*(x))\big]^{-1} \bar{A}^T \Big)^{-1} \bar{A} \big[\nabla^2_{yy} g(x, y^*(x))\big]^{-1} \nabla^2_{xy} g(x, y^*(x)), \tag{7}$$

where we set $\bar{A} := \bar{A}(y^*(x))$. Note that when the LL problem (1b) has no constraints, the implicit gradient derived in Lemma 2 becomes exactly the same as the one in Ghadimi & Wang (2018); Ji et al. (2021). Moreover, if the LL problem has only linear equality constraints, the differentiability of $y^*(x)$ follows from the implicit function theorem under Assumptions 1(a) and 1(c), along with full row rankness of $A$. In fact, the expression for the implicit gradient stays the same as in Lemma 2 with $\bar{A}$ and $\bar{\lambda}^*(x)$ replaced by $A$ and $\lambda^*(x)$, respectively (i.e., we use the full matrix $A$). Finally, using Lemma 2 above, we now have an expression for the implicit gradient:

$$\nabla F(x) = \nabla_x f(x, y^*(x)) + [\nabla y^*(x)]^T \nabla_y f(x, y^*(x)). \tag{8}$$
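As a sanity check of Lemma 2, the following sketch evaluates (6) and (7) on a small quadratic LL problem of our own construction, where the Jacobian of $y^*(x)$ is known in closed form:

```python
import numpy as np

# LL instance: g(x, y) = ||y||^2/2 - x^T y (+ q^T y, which does not affect the
# Hessians) with the single constraint y_1 <= 0.5.  For x_1 > 0.5 this
# constraint is active and y*(x) = (0.5, x_2), so the true Jacobian is known.

Hyy = np.eye(2)                  # Hessian grad2_yy g
Hxy = -np.eye(2)                 # cross term grad2_xy g (d/dx of grad_y g = y - x)
A_bar = np.array([[1.0, 0.0]])   # rows of A active at y*(x)

Hyy_inv = np.linalg.inv(Hyy)
# eq. (7): gradient of the active-constraint multipliers
dlam = -np.linalg.inv(A_bar @ Hyy_inv @ A_bar.T) @ A_bar @ Hyy_inv @ Hxy
# eq. (6): Jacobian of y*(x)
dy = Hyy_inv @ (-Hxy - A_bar.T @ dlam)

print(dy)  # [[0. 0.], [0. 1.]] -- matches d/dx of (0.5, x_2)
```

The active row of the constraint zeroes out the corresponding direction of the Jacobian, exactly as the clipped coordinate of $y^*(x)$ is locally insensitive to $x$.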

2.2.1. APPROXIMATE IMPLICIT GRADIENT

Note that computing $\nabla F(x)$ requires precise knowledge of $y^*(x)$, which is not available for many problems of interest. Therefore, in practice we define the approximate implicit gradient as

$$\widehat{\nabla} F(x) = \nabla_x f(x, \hat{y}(x)) + [\widehat{\nabla} y^*(x)]^T \nabla_y f(x, \hat{y}(x)), \tag{9}$$

where $\widehat{\nabla} y^*(x)$ is defined by substituting the approximate LL solution $\hat{y}(x)$ for the exact one $y^*(x)$ in expressions (6) and (7). In order to ensure that (9) returns a useful approximation of the (exact) implicit gradient, we impose a few assumptions on the quality of the estimate $\hat{y}(x)$.

Assumption 2. The approximate solution $\hat{y}(x)$ of the (perturbed) LL problem (5) satisfies the following for all $x \in \mathcal{X}$:
(a) $\|\hat{y}(x) - y^*(x)\| \le \delta$ for some $\delta > 0$;
(b) $\hat{y}(x)$ is a feasible point, i.e., $A \hat{y}(x) \le b$;
(c) it holds that $\bar{A}(y^*(x)) = \bar{A}(\hat{y}(x))$.

The LL problem is a strongly convex, linearly constrained task. As a result, Assumptions 2(a) and (b) can be easily satisfied. Specifically, we can obtain feasible approximate solutions of any given accuracy with known methods, such as projected gradient descent, or by using a convex optimization solver; in Section B of the Appendix we provide one such method. Moreover, Assumption 2(c) will be satisfied if we find a "sufficiently accurate" solution $\hat{y}(x)$. Specifically, from Calamai & Moré (1987, Theorem 4.1) we know that if $\hat{y}_k(x) \in \mathcal{Y}$ is an arbitrary sequence that converges to a non-degenerate (i.e., Assumption 1(e) and SC hold) stationary solution $y^*(x)$, then there exists an integer $k_0$ such that $\bar{A}(y^*(x)) = \bar{A}(\hat{y}_k(x))$ for all $k > k_0$.

Remark 1. There are certain special cases where we can obtain an upper bound on $k_0$. For instance, in the case of non-negativity constraints $y \ge 0$, it can be shown that $\frac{L_h}{\mu_h} \log\big( 2 L_h \|y^0 - y^*(x)\| / \tau \big)$ iterations of the projected gradient descent method suffice to ensure that the active set of the approximate solution $\hat{y}(x)$ coincides with the active set of the exact one $y^*(x)$ (see Nutini et al. (2019, Corollary 1)), where $\tau = \min_{i \in S(y^*(x))} \nabla_{y_i} g(x, y^*(x))$ and $y^0$ is the algorithm's initialization. A similar result can be derived for the case with bound constraints $a \le y \le b$.

Next, we introduce additional assumptions that are required to analyze the properties of (9).

Assumption 3. We assume that the following holds for problem (1), for all $x, \bar{x} \in \mathcal{X}$ and $y, \bar{y} \in \mathbb{R}^{d_\ell}$: (a) $f$ has bounded gradients, i.e., $\|\nabla f(x, y)\| \le L_f$. Assumption 3 is standard in the bilevel optimization literature (Ghadimi & Wang, 2018; Hong et al., 2020; Chen et al., 2021a; Ji et al., 2021) and is used to derive some useful properties of the (approximate) implicit gradient (Lemma 3, Appendix D.1.3). It is easy to see that Assumptions 1(a), (c) and 3 hold directly for the perturbed objective (5) with constants $\mu_g = \mu_h$, $L_g = L_h$, $L_{g_{yy}} = L_{h_{yy}}$, $L_{g_{xy}} = L_{h_{xy}}$; we also assume that Assumption 1(e) holds for the perturbed LL problem (5).

Lemma 3 (Appendix D.1.3). Suppose that Assumptions 1, 2, and 3 hold. Then, for every $x \in \mathcal{X}$, the following holds: $\|\widehat{\nabla} F(x) - \nabla F(x)\| \le \hat{L}_F \,\delta$, $\|\nabla F(x)\| \le L_F$, and $\|\widehat{\nabla} F(x)\| \le L_F$, where $\hat{L}_F = L_f + L_{y^*} L_f + L_f \hat{L}_{y^*}$ and $L_F = (1 + L_{y^*}) L_f$; the constants $L_{y^*}$ and $\hat{L}_{y^*}$ are defined in Lemmas 7 and 9, respectively, provided in the Appendix.
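The following sketch illustrates how Assumption 2 can be met in practice with projected gradient descent on a box-constrained instance (the instance, step-size, and iteration count are our own choices):

```python
import numpy as np

# Strongly convex LL instance: min_y ||y||^2/2 - c^T y s.t. -1 <= y <= 1,
# whose exact solution is y* = clip(c, -1, 1).  We check the three parts of
# Assumption 2: accuracy, feasibility, and active-set identification.

c = np.array([2.0, 0.3])
y_exact = np.clip(c, -1.0, 1.0)       # y* = (1.0, 0.3); constraint y_1 <= 1 active

y = np.zeros(2)
for _ in range(100):
    y = np.clip(y - 0.5 * (y - c), -1.0, 1.0)    # gradient step, then projection

assert np.linalg.norm(y - y_exact) <= 1e-6       # 2(a): delta-accurate
assert np.all(np.abs(y) <= 1.0)                  # 2(b): feasible (projection)
assert np.array_equal(np.isclose(np.abs(y), 1.0),
                      np.isclose(np.abs(y_exact), 1.0))  # 2(c): same active set
print("Assumption 2 verified for this instance")
```

Consistent with Remark 1, the active coordinate here is identified after finitely many iterations (in fact immediately, since the projection clips it to the bound), after which only the inactive coordinate keeps contracting toward its optimum.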

2.2.2. STOCHASTIC IMPLICIT GRADIENT

In the stochastic setting, the (approximate) stochastic implicit gradient is computed as

$$\widehat{\nabla} F(x; \xi) = \nabla_x f(x, \hat{y}(x); \xi) + [\widehat{\nabla} y^*(x)]^T \nabla_y f(x, \hat{y}(x); \xi). \tag{10}$$

Also, we make the following assumption on the stochastic gradients of the UL problem.

Assumption 4. We assume that the stochastic gradients are unbiased, i.e., $\mathbb{E}_{\xi}[\nabla f(x, y; \xi)] = \nabla f(x, y)$, and have bounded variance, i.e., $\mathbb{E}_{\xi} \|\nabla f(x, y; \xi) - \nabla f(x, y)\|^2 \le \sigma_f^2$ for some $\sigma_f > 0$.

Algorithm 1: [Deterministic] Smoothed Implicit Gradient Descent ([D]SIGD)
1: Input: initial point $x_0$, number of iterations $T$, LL solution accuracies $\{\delta_r\}$, $\sigma$, measure $\mathcal{Q}$, $s$
2: Sample $q \sim \mathcal{Q}$ and perturb the LL problem
3: for $r = 0, 1, \ldots, T-1$ do
4: Find an approximate solution $\hat{y}(x_r)$ such that Assumption 2 is satisfied
5: Compute $\widehat{\nabla} F(x_r)$ using (9) and set $d_r = \bar{x}_r - x_r$, where $\bar{x}_r = \mathrm{proj}_{\mathcal{X}}\big(x_r - s \widehat{\nabla} F(x_r)\big)$
6: Select $a_r$ such that the following Armijo-type condition is satisfied:
$$\hat{F}(x_r) - \hat{F}(x_r + a_r d_r) \ge -\sigma \, a_r \, [\widehat{\nabla} F(x_r)]^T d_r - \epsilon(\delta; r), \tag{11}$$
where $\epsilon(\delta; r)$ depends on $\delta_r$, $a_r$, and problem-dependent parameters, and $\hat{F}(\cdot) = f(\cdot, \hat{y}(\cdot))$
7: Perform the update $x_{r+1} = x_r + a_r \, d_r$
8: end for

Algorithm 2: [Stochastic] Smoothed Implicit Gradient Descent ([S]SIGD)
1: Input: initial point $x_0$, number of iterations $T$, step-sizes $\{\beta_r\}_{r=0}^{T-1}$, LL solution accuracy $\delta$
2: Sample $q \sim \mathcal{Q}$ and perturb the LL problem
3: for $r = 0, 1, \ldots, T-1$ do
4: Find an approximate solution $\hat{y}(x_r)$ such that Assumption 2 is satisfied
5: Compute $\widehat{\nabla} F(x_r; \xi_r)$ using (10)
6: Perform one stochastic projected gradient step: $x_{r+1} = \mathrm{proj}_{\mathcal{X}}\big(x_r - \beta_r \widehat{\nabla} F(x_r; \xi_r)\big)$
7: end for

Assumption 4 is a typical assumption required to ensure that the approximate implicit stochastic gradient is also unbiased and has finite variance (Ghadimi & Wang, 2018; Hong et al., 2020; Chen et al., 2021a), as shown in Lemma 4 below.

Lemma 4 (Appendix D.1.4). Under Assumptions 1, 2, 3 and 4, the stochastic gradient estimate in (10) is unbiased, i.e., $\mathbb{E}_{\xi}[\widehat{\nabla} F(x; \xi)] = \widehat{\nabla} F(x)$, and has bounded variance, i.e., $\mathbb{E}_{\xi} \|\widehat{\nabla} F(x; \xi) - \widehat{\nabla} F(x)\|^2 \le \sigma_F^2$, where $\sigma_F^2 = 2\sigma_f^2 + 2 L_{y^*} \sigma_f^2$ and $L_{y^*}$ is defined in Lemma 7 in the Appendix.
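A minimal sketch of the [S]SIGD loop (Algorithm 2) on a toy instance of our own construction, chosen so that the perturbed LL solution is available in closed form and step 4 is exact: $f(x, y) = \|x\|^2/2 + \|y - \mathbf{1}\|^2/2$, $h(x, y) = \|y - x\|^2/2$ with box constraints $-1 \le y \le 1$, giving $y^*(x) = \mathrm{clip}(x - q, -1, 1)$ and implicit minimizer $(q + \mathbf{1})/2$ coordinate-wise:

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(scale=1e-2, size=2)          # step 2: sample the perturbation once

x = np.zeros(2)
for r in range(2000):
    y = np.clip(x - q, -1.0, 1.0)           # step 4: LL solution (exact here)
    inactive = np.abs(y) < 1.0              # box rows NOT active at y
    dy = np.diag(inactive.astype(float))    # Jacobian of y*(x): identity on inactive coords
    noise = rng.normal(scale=0.01, size=2)  # stands in for the sampling noise xi_r
    grad = x + dy.T @ (y - 1.0) + noise     # step 5: stochastic implicit gradient (10)
    x = np.clip(x - 0.05 * grad, -2.0, 2.0) # step 6: projected step, X = [-2, 2]^2

print(x, (q + 1.0) / 2.0)  # x settles near the implicit minimizer (q + 1)/2
```

In a general instance, step 4 would itself run a projected-gradient LL solver to the accuracy required by Assumption 2, and the Jacobian would come from (6)-(7) rather than the simple diagonal form that holds for this separable box-constrained toy.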

3.1. THE PROPOSED ALGORITHMS

In this section, we develop gradient-based methods for solving problem (1) by leveraging the smoothing technique introduced in the previous section. Recall that, for any $x \in \mathcal{X}$, we can introduce a perturbation that makes the optimal solution $y^*(x)$ of the perturbed LL problem differentiable. To proceed with the algorithm design, two options are available: first, generate a perturbation for each $x \in \mathcal{X}$ encountered by the algorithm; second, generate a single perturbation at the beginning and use it throughout the execution of the algorithm. It is worth mentioning that both approaches perform equally well in our numerical experiments; however, for ease of analysis, we adopt the second approach. To justify it, we show below that sampling a single $q$ suffices to make $F(x)$ differentiable (almost surely) at any countable sequence of points.

Lemma 5 (Appendix D.2.1). Let $\{x_r\}_{r=0}^{\infty} \subseteq \mathcal{X}$ be an arbitrary sequence. Consider the implicit function $F(x) := f(x, y^*(x))$, where $y^*(x)$ is defined in (5) with a single perturbation $q$. Then $F(\cdot)$ is differentiable at all the points $\{x_r\}_{r=0}^{\infty}$ w.p. 1.

Due to this result, in the following analysis we assume that almost sure differentiability also holds for the iterates generated by our algorithms, as suggested by Lemma 5. Further, our algorithm design is guided by the fact that, unlike bilevel programs with unconstrained LL tasks (see Lemma 2.2(c) in Ghadimi & Wang (2018)), the implicit gradient $\nabla F(x)$ in (8) is not Lipschitz continuous in general. This implies that algorithms that provably converge only under the Lipschitz assumption will not work in our case, particularly when the implicit function is non-convex. Towards this end, we propose the [Deterministic] Smoothed Implicit Gradient Descent ([D]SIGD) method (Alg. 1), a deterministic line-search-based method that does not require Lipschitz smoothness or other special structure (e.g., convexity), and show its asymptotic convergence (Theorem 1). Moreover, for the cases where the implicit function is weakly convex, convex, or strongly convex (but still not Lipschitz smooth), we develop the [Stochastic] Smoothed Implicit Gradient Descent ([S]SIGD) method (Alg. 2), a stochastic gradient-based method for which finite-time convergence guarantees are derived (Theorems 2, 3, and 4).

3.2. CONVERGENCE ANALYSIS

As discussed above, in the context of algorithm design and analysis we sample a single perturbation $q$ and keep it fixed during the algorithm's execution. As a result, the algorithm is effectively optimizing the following smoothed surrogate of the original problem (1):

$$\min_{x \in \mathcal{X}} \; F(x) = f(x, y^*(x)) = \mathbb{E}_{\xi}\big[f(x, y^*(x); \xi)\big] \tag{12a}$$

$$\text{s.t.} \quad y^*(x) \in \arg\min_{y \in \mathbb{R}^{d_\ell}} \big\{\, g(x, y) = h(x, y) + q^T y \;\big|\; Ay \le b \,\big\}, \tag{12b}$$

where $q \in \mathbb{R}^{d_\ell}$ is generated from a continuous measure only once and is thus considered fixed. Next, we show that the original problem (1) and the smoothed surrogate problem (12) are "close". Specifically, we show below that the original implicit function $G(x)$ and the "smoothed" implicit function $F(x)$ differ by a quantity that is controlled by the size of the perturbation vector $q$.

Proposition 2 (Appendix D.2.2). Under Assumptions 1 and 3, we have $|G(x) - F(x)| \le \frac{L_f \|q\|}{\mu_g}$, for all $x \in \mathcal{X}$.

Note that the only requirement on $q$ is that it is generated from a continuous measure; therefore, we can always choose a distribution such that $\|q\|$ is arbitrarily small. Next, let us analyze Alg. 1. We have the following asymptotic result.

Theorem 1 (Appendix D.2.3). Suppose that Assumptions 1 and 3 hold. At each iteration $r$ of Alg. 1, we find $0 < a_r < 1$ such that the Armijo-type condition (11) is satisfied with

$$\epsilon(\delta; r) = L_f \delta_r + L_F \hat{L}_F a_r \delta_r + L_f \delta_{r+1} + \hat{L}_F^2 \sigma a_r (\delta_r)^2 + 2 L_F \hat{L}_F \sigma a_r \delta_r.$$

Further, we select $\delta_r$ such that Assumption 2 is satisfied, $\lim_{r \to \infty} \delta_r = 0$, and $\delta_r / a_r \sim \mathcal{O}(c_r)$, where $c_r$ is some sequence with $\lim_{r \to \infty} c_r = 0$. In addition, the sequence $d_r$ is selected such that it is gradient related to $\nabla F(x_r)$, i.e., "for any subsequence $\{x_r\}_{r \in R}$ converging to a non-stationary point, the corresponding subsequence $\{d_r\}_{r \in R}$ is bounded and satisfies $\limsup_{r \to \infty, r \in R} \nabla F(x_r)^T d_r < 0$" (Bertsekas, 1998, eq. 1.13). Then, w.p. 1, every limit point $\bar{x}$ of the sequence of iterates generated by the [D]SIGD Alg. 1 is a stationary point.
Note that Theorem 1 only guarantees asymptotic convergence. However, this is the best we can do, since we do not impose any Lipschitz smoothness or convexity assumptions. On the other hand, in the special cases where the implicit function is weakly convex, strongly convex, or convex (but still not Lipschitz smooth), it is possible to derive the finite-time convergence guarantees presented next. Towards this end, we impose the additional assumption that the set $\mathcal{X}$ is bounded; combined with Assumption 1, this implies that $\mathcal{X}$ is compact. So, in the following results, we assume that $\mathcal{X}$ is a compact set with diameter $D_{\mathcal{X}} := \sup_{x, \bar{x} \in \mathcal{X}} \|x - \bar{x}\|$.

Weakly Convex Objective. We make the following assumption on the implicit function $F(\cdot)$.

Assumption 5. We assume that for some $\rho > 0$ the implicit function $F(x)$ satisfies
$$F(z) \ge F(x) + \langle \nabla F(x), z - x \rangle - \frac{\rho}{2} \|z - x\|^2 \quad \forall\, x, z \in \mathbb{R}^{d_u}.$$

Assumption 5 implies that the function $F(x) + \frac{\bar{\rho}}{2}\|x\|^2$ is convex for $\bar{\rho} = \rho$, and strongly convex with modulus $\bar{\rho} - \rho$ for $\bar{\rho} > \rho$. Many problems of practical interest satisfy weak convexity, for example, phase retrieval (Davis et al., 2020), covariance matrix estimation (Chen et al., 2015), dictionary learning (Davis & Drusvyatskiy, 2019), and robust PCA (Candès et al., 2011); see (Davis & Drusvyatskiy, 2019) and (Drusvyatskiy, 2017) for more details. To provide guarantees for the [S]SIGD algorithm, we utilize a Moreau envelope based analysis. For this purpose, we first rephrase the UL problem as an unconstrained one: $\min_{x \in \mathbb{R}^{d_u}} H(x) := F(x) + I_{\mathcal{X}}(x)$, where $I_{\mathcal{X}}(x)$ is the indicator function of the set $\mathcal{X}$, defined as $I_{\mathcal{X}}(x) := 0$ if $x \in \mathcal{X}$ and $I_{\mathcal{X}}(x) := \infty$ if $x \notin \mathcal{X}$. Below we define the Moreau envelope of $H(x)$.

Definition 1. Given $\lambda > 0$, the Moreau envelope of $H(x)$ is defined as
$$H_\lambda(x) := \min_{z \in \mathbb{R}^{d_u}} \Big\{ H(z) + \frac{1}{2\lambda} \|x - z\|^2 \Big\} = \min_{z \in \mathcal{X}} \Big\{ F(z) + \frac{1}{2\lambda} \|x - z\|^2 \Big\},$$
where the second equality follows from the definition of $H(x)$.
Moreover, we denote the proximal map of $H(x)$ by $\hat{x} := \mathrm{prox}_{\lambda H}(x)$, defined as
$$\hat{x} := \arg\min_{z \in \mathbb{R}^{d_u}} \Big\{ H(z) + \frac{1}{2\lambda} \|x - z\|^2 \Big\} = \arg\min_{z \in \mathcal{X}} \Big\{ F(z) + \frac{1}{2\lambda} \|x - z\|^2 \Big\}.$$
The gradient of the Moreau envelope satisfies the following: $\|x - \hat{x}\| = \lambda \|\nabla H_\lambda(x)\|$, $H(\hat{x}) \le H(x)$, and $\mathrm{dist}(0; \partial H(\hat{x})) \le \|\nabla H_\lambda(x)\|$, where $\mathrm{dist}(0; \partial H(\hat{x})) = -\inf_{v : \|v\| \le 1} H'(\hat{x}; v)$ and $H'(\hat{x}; v)$ denotes the directional derivative of $H$ at $\hat{x}$ in the direction $v$. Note that a small gradient $\|\nabla H_\lambda(x)\|$ implies that $x$ is near some point $\hat{x}$ that is nearly stationary (Davis & Drusvyatskiy, 2019). Then we have the following result.

Theorem 2 (Appendix D.2.4). Under Assumptions 1, 2, 3, 4 and 5, with step-sizes $\beta_r = \beta$ for all $r \in \{0, \ldots, T-1\}$ and for any constant $\bar{\rho} > \frac{3\rho}{2}$, the iterates generated by Algorithm 2 satisfy (w.p. 1)
$$\frac{1}{T} \sum_{r=0}^{T-1} \mathbb{E}\|\nabla H_{1/\bar{\rho}}(x_r)\|^2 \le \frac{2\bar{\rho}}{2\bar{\rho} - 3\rho} \left[ \frac{H_{1/\bar{\rho}}(x_0) - H^*}{\beta T} + \beta \bar{\rho} \big( \sigma_F^2 + L_F^2 \big) + \frac{\bar{\rho}}{2\rho} \hat{L}_F^2 \delta^2 \right].$$

Theorem 2 implies that, with the choice $\beta = \mathcal{O}(1/\sqrt{T})$, the averaged gradient norm of the Moreau envelope decreases at a rate of $\mathcal{O}(1/\sqrt{T})$, up to an error determined by the LL accuracy $\delta$.

Strongly Convex and Convex Objective. Next, we provide guarantees for the case when the implicit function is strongly convex. We make the following assumption.

Assumption 6. We assume that the objective $F(x)$ is $\mu_F$-strongly convex, i.e., $F(z) \ge F(x) + \langle \nabla F(x), z - x \rangle + \frac{\mu_F}{2} \|x - z\|^2$ for all $z, x \in \mathcal{X}$. Note that for $\mu_F = 0$ the objective becomes convex.

Theorem 3 (Appendix D.2.5). Under Assumptions 1, 2, 3, 4 and 6 with $\mu_F > 0$, and with the choice of step-sizes $\beta_r = \frac{1}{\mu_F (r+1)}$, the iterates generated by Algorithm 2 satisfy (w.p. 1)
$$\mathbb{E}[F(\bar{x}) - F^*] \le \frac{\sigma_F^2 + L_F^2}{\mu_F} \cdot \frac{\log(T)}{T} + D_{\mathcal{X}} \hat{L}_F \delta.$$

Theorem 4 (Appendix D.2.6). Under Assumptions 1, 2, 3, 4 and 6 with $\mu_F = 0$, and with step-sizes $\beta_r = \beta$ for $r \in \{0, \ldots, T-1\}$, the iterates generated by Algorithm 2 satisfy (w.p. 1)
$$\mathbb{E}[F(\bar{x}) - F^*] \le \frac{\|x_1 - x^*\|^2}{\beta T} + 2\beta \big( \sigma_F^2 + L_F^2 \big) + D_{\mathcal{X}} \hat{L}_F \delta, \quad \text{where } \bar{x} = \frac{1}{T} \sum_{r=0}^{T-1} x_r.$$
The results of Theorems 3 and 4 imply that the implicit function $F(x)$ converges to the optimal value at a rate of $\mathcal{O}(\log(T)/T)$ for strongly convex objectives with diminishing step-sizes, and at a rate of $\mathcal{O}(1/\sqrt{T})$ for convex objectives with $\beta = \mathcal{O}(1/\sqrt{T})$. Note that convergence is established to a neighborhood of the optimal solution, whose size is determined by the LL error $\delta$.
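The Moreau-envelope quantities used in the weakly convex analysis above can be checked numerically. The sketch below does so for the simple choice $H(x) = |x|$ (our own illustration; it is convex, hence trivially weakly convex), whose proximal map is the soft-thresholding operator:

```python
lam = 0.5
x = 1.3

def prox_abs(v, lam):
    # argmin_z |z| + (1 / (2*lam)) * (v - z)^2  (soft-thresholding)
    return (1 if v > 0 else -1) * max(abs(v) - lam, 0.0)

x_hat = prox_abs(x, lam)          # proximal point, here 0.8
grad_env = (x - x_hat) / lam      # gradient of the Moreau envelope H_lam at x

# identities from the text: ||x - x_hat|| = lam * ||grad H_lam(x)||, H(x_hat) <= H(x)
assert abs(abs(x - x_hat) - lam * abs(grad_env)) < 1e-12
assert abs(x_hat) <= abs(x)
print(x_hat, grad_env)
```

A small envelope gradient would force $x$ to lie near the nearly stationary point $\hat{x}$, which is exactly the stationarity measure driven to zero by Theorem 2.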

4. EXPERIMENTS

In this section, we evaluate the performance of Algorithms 1 and 2 via numerical experiments. First, we compare [D]SIGD to the recently proposed PDBO (Sow et al., 2022) for constrained bilevel optimization on a quadratic bilevel problem. Then, in the second set of experiments, we evaluate the performance of [S]SIGD against popular adversarial training algorithms.

Quadratic Bilevel Optimization. Consider the quadratic bilevel problem of the form (1) with

$$f(x, y) = \tfrac{1}{4}\|x\|^2 + 10 x^T y - \tfrac{1}{4}\|y\|^2 + \mathbf{1}^T x + \mathbf{1}^T y + 1, \qquad h(x, y) = x^T y + \tfrac{1}{2}\|y\|^2 + x_1 + y_2,$$

and linear constraints of the form $|y_i| \le 1$, $i \in \{1, 2\}$. Here, $x = [x_1, x_2]^T$ and $y = [y_1, y_2]^T$ with $x_i, y_i \in \mathbb{R}$ for $i \in \{1, 2\}$, and $\mathbf{1} = [1, 1]^T$. The evaluation criterion is the stationarity gap $\|\nabla F(x)\|$. On this problem we execute [D]SIGD (Algorithm 1), [S]SIGD (Algorithm 2), and PDBO (Sow et al., 2022). In the first two cases, we solve the LL problem using 10 steps of projected gradient descent with step-size $10^{-1}$. For the step-size of [S]SIGD we choose $\beta = 0.1$, while in [D]SIGD we find a suitable Armijo step-size by successively adapting (increasing $m$ in) the quantity $a_r = (0.9)^m$ until condition (11) is met. In PDBO we select $10^{-1}$ for the step-sizes of both the primal and dual steps, and the number of inner iterations is set to 10. In Figure 2, we plot the convergence curves of the three algorithms with respect to the number of iterations; the results are averaged over 10 runs. Note that the line-search [D]SIGD method outperforms the fixed step-size [S]SIGD and PDBO, while [S]SIGD performs similarly to PDBO.

Adversarial Learning. We consider an adversarial learning problem of the form given in (3). For the perturbation, we focus on the $\epsilon$-tolerant $\ell_\infty$-norm attack constraint, namely $\mathcal{Y} = \{ y \in \mathbb{R}^{d_\ell} \mid \|y\|_\infty \le \epsilon \}$, which can easily be expressed as a linear inequality constraint as in the LL problem of (3).
We consider two widely accepted adversarial learning methods as our baselines, namely AT (Madry et al., 2017) and TRADES (Zhang et al., 2019b). Also, we consider two representative datasets, CIFAR-10/100 (Krizhevsky et al., 2009), and adopt the ResNet-18 (He et al., 2016) model; the results for CIFAR-10 are provided in Appendix C. In particular, we study two widely used attack budget choices (Madry et al., 2017; Wong et al., 2020), $\epsilon \in \{8/255, 16/255\}$. In the implementation of our [S]SIGD method, we adopt a perturbation generated by a Gaussian random vector $q$ with variances from the list $\sigma^2 \in \{2\mathrm{e}{-5}, 4\mathrm{e}{-5}, 6\mathrm{e}{-5}, 8\mathrm{e}{-5}, 1\mathrm{e}{-4}\}$, in order to study different levels of smoothing. Moreover, for solving the LL problem in each iteration we select a fixed batch of samples. We choose $f_i$ to be the cross-entropy loss and $h_i = -f_i + \lambda \|y_i\|^2$ for a hyperparameter $\lambda > 0$. For [S]SIGD, we follow the implementation of (Zhang et al., 2021) but with perturbations in the LL problem. We evaluate the robustly trained model with two metrics, namely the standard accuracy (SA) and the robust accuracy (RA), where we evaluate the accuracy of the robustified model on the clean and attacked test sets, respectively; the attacked set is generated using PGD-50-10 (Madry et al., 2017) (i.e., a 50-step PGD attack with 10 restarts). Desirably, a well-trained model possesses high RA while simultaneously maintaining SA at a high level. Table 1 shows an overview of the performance in our experiments. We make the following observations. First, a low level of perturbation variance (e.g., $\sigma^2 \in \{2\mathrm{e}{-5}, 4\mathrm{e}{-5}\}$) in general improves both SA and RA, which yields an enhanced RA-SA trade-off. For example, in the setting (CIFAR-100, $\epsilon = 16/255$), our algorithm boosts the RA by over 0.3% and the SA by 2%. Second, a high level of perturbation variance harms robustness but results in high SA. This is reasonable, since the stochastic gradient becomes too noisy with large variances.
Third, our method outperforms AT and closely matches the performance of the stronger baseline TRADES. However, we would like to stress that the intent of our work is not to design a specialized adversarial learning method, and thus the robustness gap between our method and the strong baseline does not diminish the value of our method. Additional details are provided in Appendix C.2.

Bilevel optimization problems, which date back to the Stackelberg game (Von, 1952), find applications in a multitude of areas, including machine learning (Liu et al., 2021a), economics (Mirrlees, 1999), power systems (Abedi et al., 2021; Arias et al., 2008), the chemical industry (Raghunathan & Biegler, 2003), and transport research (Didi-Biha et al., 2006; Kalashnikov et al., 2010); see (Colson et al., 2005; Dempe & Zemkoho, 2020; Sinha et al., 2017; Liu et al., 2021a) for a number of surveys. The "classical" approaches for solving bilevel problems include approximate descent methods (Shaban et al., 2019; Ghadimi & Wang, 2018; Franceschi et al., 2017), penalty methods (Lin et al., 2014), KKT reformulation-based approaches (Allende & Still, 2013), value function-based methods (Ye & Zhu, 1995; Sow et al., 2022), and trust-region algorithms (Marcotte et al., 2001). In addition, bilevel problems are known to be related to mathematical programs with equilibrium constraints (MPEC) (Luo et al., 1996). Recently, motivated by machine learning applications, gradient-based approaches have gained popularity for solving bilevel optimization problems (Liu et al., 2021a), e.g., in hyperparameter optimization (Shaban et al., 2019; Franceschi et al., 2017; 2018) and meta-learning (Rajeswaran et al., 2019; Franceschi et al., 2018). The majority of those works focus on solving bilevel problems with an unconstrained strongly convex LL problem, for both stochastic and deterministic objectives (Ghadimi & Wang, 2018; Hong et al., 2020; Khanduri et al., 2021a;b; Chen et al., 2021a; Ji et al., 2021; Chen et al., 2021b; Yang et al., 2021).
An attractive property of such problems is the existence and easy computability of the implicit gradient. Moreover, under mild assumptions, the implicit gradient for these problems can be shown to be Lipschitz smooth (e.g., see (Ghadimi & Wang, 2018, Lemma 2.2) and (Khanduri et al., 2021b, Lemma 3.1)). In contrast, for bilevel problems with linear LL constraints the implicit gradient in general might not exist, and even if it exists, computing it in closed form is a challenging task. As discussed earlier, we develop a perturbation-based smoothing framework for the constrained LL problem that ensures the existence of the implicit gradient in an almost sure sense, and allows us to compute an expression for it. In Liu et al. (2021c) and Sow et al. (2022) the authors have considered bilevel optimization with (general) constraints in the LL problem. Both papers develop a value function-based framework that leads to a single-level problem with non-convex constraints. In Liu et al. (2021c) a sequential minimization approach is followed, where the value function and the LL constraints are incorporated into the objective using penalty or barrier functions. In Sow et al. (2022) a primal-dual-based framework is proposed, in which the problem is regularized with the addition of a strongly convex penalty term, while a constant error term is added to make the constraint set strictly feasible. In contrast, our approach relies only on a small linear perturbation, which can be made arbitrarily small without practically changing the landscape of the LL problem. There is also a line of works (Amos & Kolter, 2017; Agrawal et al., 2019; Donti et al., 2017; Gould et al., 2021) on implicit differentiation in the deep learning literature. These Deep-Learning-type (DL-type) works are indeed related to ours, in the sense that at the core of both lies the computation of the gradient/Jacobian of the solution of an optimization problem.
However, there are some key differences. First, in our work we consider a constrained bilevel optimization problem and we are interested in analyzing this problem from an optimization perspective. On the other hand, in the DL-type works the optimization problems that are studied describe the input-output relationships of neural network layers, and the main focus lies in deriving Jacobians for the backward pass. Second, in our work we study a special bilevel problem (the constraints are linear) and derive a closed-form expression for the implicit gradient. On the contrary, in the DL-type works the underlying problems have more general constraints and the Jacobian is usually computed using numerical methods (e.g., solving iteratively a system of KKT equations), rather than analytically. Third, in our work the focus is on studying the properties of the bilevel problem (e.g., differentiability, approximation errors), developing (deterministic and stochastic) algorithms, and performing a convergence analysis. On the other hand, DL-type works focus mainly on the Jacobian computation and its implementation. Finally, there are a number of works on implicit differentiation for non-smooth problems (Mairal et al., 2011; Bertrand et al., 2021; 2020). However, these works typically deal with special (non-smooth) LL problems; e.g., in Mairal et al. (2011); Bertrand et al. (2020) the non-smooth term in the LL is the ℓ1-norm, and in Bertrand et al. (2021) the non-smooth term is separable. On the contrary, in our work we consider smooth LL problems and general linear inequality constraints.

B SOLUTION METHODS FOR THE LL PROBLEM

The LL problem is a strongly convex, linearly constrained optimization task. As a result, there exist many efficient ways to find its solution. To discuss them, we consider two different classes of problems, depending on the exact form of the linear constraints and the difficulty of computing the respective projection operator: 1) the projection has a closed-form solution; 2) the projection requires the solution of an optimization problem. Before we proceed, we would like to stress that the problem we are solving, i.e., the bilevel problem with linear constraints in the LL, is a very challenging one, regardless of the specific form and the exact way we approach the solution of the LL problem. The first class, where the projection can be computed in closed form, contains problems with special linear constraints. One characteristic example is box constraints, i.e., constraints of the form a ≤ y ≤ b, where the inequalities apply component-wise. These constraints appear in applications such as adversarial learning (see the motivating applications in the main text). In this case, we can use a first-order iterative algorithm to solve the LL problem and project each iterate onto the constraint set using the closed-form expression (which only incurs a constant cost per iteration). For instance, we can use the projected gradient descent method, which provably converges to the optimal solution at a linear rate. In the second class of problems, the projection operator does not possess a closed-form expression. In this case we can approach the LL problem as a convex optimization task and solve it using a convex optimization solver (e.g., employing interior-point methods) to obtain a highly accurate solution with a complexity of O(p(d_ℓ, k) log(d_ℓ/ϵ)), where p(·) is some polynomial and ϵ is the solution accuracy.
Alternatively, as mentioned in the previous case, we can use a projected gradient descent-type method that enjoys a linear convergence rate guarantee. Unlike the previous case, though, the projection computed at each iteration requires the solution of an optimization problem. Nonetheless, the projection task we are referring to is a (strongly convex) quadratic, linearly constrained problem, i.e., a special quadratic programming task, which is easy to solve in practice. In Algorithm 3 we describe the solution of the LL problem using a projected gradient descent algorithm.
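As a rough sketch of the second class (our own illustration, not the paper's Algorithm 3), the Euclidean projection onto {y : Ay ≤ b} can itself be computed iteratively by running projected gradient descent on the dual of the projection QP; `project_polyhedron` below is an assumed helper name.

```python
import numpy as np

def project_polyhedron(z, A, b, steps=2000, lr=None):
    """Euclidean projection of z onto {y : A y <= b}.

    Solves the dual problem min_{lam >= 0} 0.5*||A^T lam||^2 - lam^T (A z - b)
    by projected gradient descent; the primal solution is y = z - A^T lam.
    """
    if lr is None:
        # 1 / Lipschitz constant of the dual gradient (spectral norm squared)
        lr = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-12)
    lam = np.zeros(A.shape[0])
    for _ in range(steps):
        grad = A @ (A.T @ lam) - (A @ z - b)
        lam = np.maximum(lam - lr * grad, 0.0)  # projection onto lam >= 0
    return z - A.T @ lam

# Projecting [1, 1] onto the halfspace y_1 + y_2 <= 1 gives [0.5, 0.5].
y = project_polyhedron(np.array([1.0, 1.0]),
                       np.array([[1.0, 1.0]]), np.array([1.0]))
```

In practice a dedicated QP solver would be preferable, but this dual view shows why the projection subproblem is itself a simple strongly structured task.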

C.1 NUMERICAL RESULTS

We consider the following linearly constrained quadratic bilevel problems of the form (1), with the UL and LL objectives defined as:

f(x, y) = (1/4)∥x∥² + 5xᵀy − (1/4)∥y∥², h(x, y) = (1/4)∥x∥² + (1/2)xᵀy + (1/4)∥y∥² (16)
f(x, y) = (1/2)∥x∥² + 2xᵀy − (1/2)∥y∥², h(x, y) = xᵀy + (1/2)∥y∥² (17)
f(x, y) = (1/4)∥x∥² + 2xᵀy − (1/4)∥y∥² + 1, h(x, y) = xᵀy + (1/2)∥y∥² + 1ᵀx + 1ᵀy. (18)

In the first two cases, we have d_u = d_ℓ = 2, and the linear constraints in the LL are of the form −1 ≤ y_i ≤ 1, i ∈ {1, 2}. In the third example, we have d_u = d_ℓ = 2, and the linear constraints in the LL are of the form −5 ≤ y_i ≤ 5, i ∈ {1, 2}, −5 ≤ y₁ + y₂ ≤ 5. We compare the performance of the SIGD algorithms to the recently proposed PDBO (Sow et al., 2022). In Figures 3a, 3b and 3c, we present the evolution of the stationarity gap ∥∇F(x)∥ during the execution of the three algorithms, for problems (16), (17) and (18), respectively. The results are averaged over 10 random runs, and the variance across these runs is reflected in the shaded region around each convergence curve. In our experiments, we choose the step-size using the backtracking line search for [D]SIGD as stated in Algorithm 1, while for [S]SIGD we choose a constant step-size. Note that since all problems are deterministic, [S]SIGD utilizes a gradient estimator with zero variance. In problem (16), we solve the LL problem using 10 steps of projected gradient descent with stepsize 0.1; in the case of [D]SIGD the stepsize is 1. For the stepsize of [S]SIGD, we choose β = 0.1, while in [D]SIGD we find the proper Armijo step-size by successively adapting (by increasing m) the quantity a_r = (0.9)^m until condition (11) is met. In PDBO we select 0.1 for the stepsizes of both the primal and dual steps, and the number of inner iterations is set to 10.
In problem (17), we solve the LL problem using 20 steps of projected gradient descent with stepsize 0.1; in the case of [D]SIGD the number of steps is 10 and the stepsize is 1. For the stepsize of [S]SIGD, we choose β = 0.1, while in [D]SIGD we find the proper Armijo step-size by successively adapting (by increasing m) the quantity a_r = (0.95)^m until condition (11) is met. In PDBO we select 0.1 for the stepsizes of both the primal and dual steps, and the number of inner iterations is set to 20. In problem (18), we solve the LL problem (for both [D]SIGD and [S]SIGD) using 10 steps of projected gradient descent with stepsize 0.1. For the stepsize of [S]SIGD, we choose β = 0.1, while in [D]SIGD we find the proper Armijo step-size by successively adapting (by increasing m) the quantity a_r = (0.9)^m until condition (11) is met. In PDBO we select 0.1 for the stepsizes of both the primal and dual steps, and the number of inner iterations is set to 10.
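The Armijo step-size selection used above can be sketched as follows. Since condition (11) is not restated in this appendix, the snippet assumes the standard sufficient-decrease test F(x − a∇F(x)) ≤ F(x) − c·a·∥∇F(x)∥² as a stand-in; `armijo_stepsize` and the constants are our own illustration.

```python
import numpy as np

def armijo_stepsize(F, x, grad, c=1e-4, rho=0.9, max_m=100):
    """Backtracking line search: try a_r = rho**m for m = 0, 1, 2, ...
    until the (assumed) sufficient-decrease condition is satisfied."""
    a = 1.0
    for _ in range(max_m):
        if F(x - a * grad) <= F(x) - c * a * np.dot(grad, grad):
            return a
        a *= rho  # increase m, i.e., shrink the trial step
    return a

# On F(x) = 0.5*||x||^2 the full step a = 1 already satisfies the test.
x = np.array([2.0, -1.0])
F = lambda v: 0.5 * np.dot(v, v)
a = armijo_stepsize(F, x, x)  # the gradient of F is x itself
```

In the experiments, F would be evaluated through the (inexact) implicit objective, which is why the analysis uses an Armijo rule with an additional error term.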

C.2 ADVERSARIAL LEARNING

In this section, we present additional results along with the implementation details for the adversarial learning problem. As noted earlier, we consider the adversarial learning problem of the form (3). For learning the perturbation y*(x), we focus on the ϵ-tolerant ℓ∞-norm attack constraint, i.e., Y = {y ∈ R^{d_ℓ} : ∥y∥∞ ≤ ϵ}. Note that this constraint can easily be expressed as a linear inequality constraint as in the LL problem in (3). In particular, we evaluate the performance of [S]SIGD on two widely used attack budget choices ϵ ∈ {8/255, 16/255} (Madry et al., 2017; Zhang et al., 2019b; Wong et al., 2020; Andriushchenko & Flammarion, 2020; Zhang et al., 2019a). In the implementation of our [S]SIGD method, we adopt a perturbation generated by a Gaussian random vector q with variances from the list σ² ∈ {2e-5, 4e-5, 6e-5, 8e-5, 1e-4}, in order to study different levels of smoothness. We choose f_i to be the cross-entropy loss and h_i = −f_i + λ∥y_i∥² with λ > 0 as a hyper-parameter. For solving (3), in each iteration we select a fixed batch of samples for both the UL and LL problems. Also, note that ReLU-based neural networks commonly lead to a piece-wise linear decision boundary w.r.t. the inputs (Moosavi-Dezfooli et al., 2019). This implies that the implicit gradient in (10) can be further approximated using a Hessian-free implementation, where the Hessian of the LL problem is approximated by λI (Zhang et al., 2021, Eq. (25)). Note that these approximations are common in practice and do not lead to performance degradation compared to the case when the full Hessian is used to compute the implicit gradient (Zhang et al., 2021, Table 5). Next, we analyze the effect of adding different perturbations q in the LL problem on the performance of [S]SIGD. Specifically, we choose q ∼ N(0, σ²I) and evaluate the performance of [S]SIGD for different values of σ².
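The Hessian-free approximation mentioned above can be illustrated on a small synthetic problem (our own construction; the matrices below are arbitrary, and the LL objective is chosen quadratic so that its Hessian really equals λI and the approximation becomes exact). With no active constraints, implicit differentiation in the style of Lemma 2 gives dy*/dx = −[∇²_yy g]⁻¹∇²_xy g.

```python
import numpy as np

lam = 0.5
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 2))   # plays the role of grad2_xy g (d_l x d_u)
gx = rng.standard_normal(2)       # nabla_x f at the current point
gy = rng.standard_normal(3)       # nabla_y f at the current point

# Exact implicit gradient (no active constraints):
# dy*/dx = -[grad2_yy g]^{-1} grad2_xy g, then dF = gx + (dy*/dx)^T gy.
H = lam * np.eye(3)               # here grad2_yy g = lam * I by construction
dy_exact = -np.linalg.solve(H, B)
dF_exact = gx + dy_exact.T @ gy

# Hessian-free version: substitute lam * I directly, avoiding the linear solve.
dy_hf = -B / lam
dF_hf = gx + dy_hf.T @ gy
```

For a real ReLU network the LL Hessian only approximately equals λI, so the two quantities would differ slightly; the point of the snippet is the mechanics of the substitution, not the model.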
In Figure 4, we plot the robust accuracy (RA) and the standard accuracy (SA) with respect to the variance of the Gaussian perturbation vector used in the LL problem. As can be seen, the RA increases as the variance increases within a certain range. However, with stronger noise (i.e., σ² > 10⁻⁴), the RA drops sharply, while the SA increases. This is reasonable, since high variance makes the LL gradient noisy. For easier observation, in Figure 5 we zoom in on the part of Figure 4 where σ² ∈ [0, 8 · 10⁻⁵]. It can be clearly seen that adding a small perturbation q helps in improving the RA. Next, we compare the performance of [S]SIGD against two widely accepted adversarial learning methods as baselines, namely AT (Madry et al., 2017) and TRADES (Zhang et al., 2019b). Here, we present the results for the CIFAR-10 dataset (Krizhevsky et al., 2009) and adopt the ResNet-18 (He et al., 2016) model. In Table 2, we compare the performance of [S]SIGD for different perturbation variances with the classical AT (Madry et al., 2017) algorithm and TRADES (Zhang et al., 2019b). Note that for an appropriate choice of perturbation variance, [S]SIGD outperforms the classical AT algorithm while performing only slightly worse than TRADES, especially for the higher attack budget of ϵ = 16/255.

D.1.1 PROOF OF PROPOSITION 1

Note that the goal of Proposition 1 is to establish the continuity of the mapping y*(x) and the implicit function G(x) := f(x, y*(x)). In the following, we will show that under Assumption 1, y*(x) is in fact continuous, which we will then utilize to establish the continuity of G(x). Before starting the proof we need a few definitions. Consider the LL problem (1b) and denote the set Y = {y ∈ R^{d_ℓ} | Ay ≤ b}. Note that in general the constraint set Y can depend on the UL variable x ∈ X. For such cases, Y(x) is a set-valued map Y : X → R^{d_ℓ} and is referred to as a correspondence.
However, for the bilevel problem in (1a) and (1b), the correspondence Y is independent of x ∈ X and is a fixed set. We also define the upper semi-continuity (USC) and the lower semi-continuity (LSC) of the correspondence Y(x). To define these notions of continuity, we will utilize the notion of an ϵ-ball, defined below. Definition 2 (ϵ-Ball). For Y ⊂ R^{d_ℓ} and given ϵ > 0, we define the open ball about Y as B_ϵ(Y) := {y ∈ R^{d_ℓ} : ∥y − y′∥ < ϵ, for some y′ ∈ Y}, where ∥·∥ is the standard Euclidean norm. Using the ϵ-ball, we define the Upper Semi-Continuity (USC) of the correspondence Y. Definition 3 (Upper Semi-Continuity (USC)). The correspondence Y : X → R^{d_ℓ} is USC if for every x ∈ X and ϵ > 0, there exists a δ > 0 such that Y(x′) ⊂ B_ϵ(Y(x)) whenever x′ ∈ X and ∥x − x′∥ < δ. Next, we define the notion of Lower Semi-Continuity (LSC). Definition 4 (Lower Semi-Continuity (LSC)). The correspondence Y : X → R^{d_ℓ} is LSC if for any sequence x_n in X that converges to a point x ∈ X and any y ∈ Y(x), there exists a sequence y_n such that y_n ∈ Y(x_n) for all n ∈ N and lim_{n→∞} y_n = y. Theorem 5 (Berge's Theorem of the Maximum (Lecture, 2017)). Let X ⊂ R^{d_u} be a non-empty set. Also, let Y : X → R^{d_ℓ} be a correspondence such that the set Y(x) is compact and non-empty for all x ∈ X, and Y is USC and LSC. Then, if g : X × R^{d_ℓ} → R is a continuous function and y*(x) is defined as y*(x) ∈ arg min_{y ∈ R^{d_ℓ}} { g(x, y) : y ∈ Y(x) }, the correspondence y*(x) is non-empty for all x ∈ X and USC. Remark 2. If y*(x) is a singleton, then USC implies the continuity of the map y*(x) : X → Y. Next, we present the proof of Proposition 1. Proof. The proof of Proposition 1 follows from the application of Berge's theorem. To begin with, note that for our problem the set Y is a fixed set independent of x ∈ X. We verify the conditions of Theorem 5. First, note from Assumption 1(b) that the set Y is non-empty and compact.
Then, it is easy to see that Y(x′) = Y ⊂ B_ϵ(Y(x)) = B_ϵ(Y) for every ϵ > 0, which implies the USC of Y. Moreover, since the set Y is independent of x ∈ X and compact, for every sequence x_n → x in X we can always find a sequence y_n → y such that y_n, y ∈ Y. Therefore, Y is LSC. Finally, using Assumption 1(a) we see that the function g(x, y) is continuous. Then, Theorem 5 implies that the set y*(x) is non-empty and the correspondence is USC. Using the strong convexity of g(x, y) with respect to y (Assumption 1(c)), y*(x) is a singleton, and thereby a continuous mapping. Then, the continuity of y*(x) implies the continuity of G(x) := f(x, y*(x)), since the composition of two continuous functions is continuous. The proof is now complete.

D.1.2 PROOF OF LEMMA 2

Proof. In this proof we follow a reasoning similar to (Parise & Ozdaglar, 2017, Thm. 1). However, differently from that work, we consider bilevel problems rather than Nash games. To begin with, consider the Lagrangian of problem (5), i.e., L(x, y, λ) = g(x, y) + λᵀ(Ay − b). Then, for some fixed x ∈ X, consider a KKT point (y*(x), λ*(x)) of (5), for which it holds that:
• ∇_y L(x, y*(x), λ*(x)) = ∇_y g(x, y*(x)) + Aᵀλ*(x) = 0
• [λ*(x)]ᵀ(Ay*(x) − b) = 0
• λ*(x) ≥ 0
• Ay*(x) − b ≤ 0.
Now, consider the active constraints at (y*(x), λ*(x)), and to simplify notation let us set Ā := A(y*(x)). Using the notation defined in Section 2 and the SC property, the KKT conditions given above can be equivalently rewritten as
∇_y g(x, y*(x)) + Āᵀλ̄*(x) = 0, Āy*(x) − b̄ = 0, λ̄*(x) > 0, (19)
where λ̄*(x) is the subvector of λ*(x) that contains only the elements whose indices correspond to the active constraints at y = y*(x), and b̄ is the corresponding subvector of b. Moreover, notice that the point (y*(x), λ̄*(x)) is unique. The uniqueness of y*(x) follows from the strong convexity of g(x, ·); the uniqueness of λ̄*(x) results from the fact that the matrix Ā has full row rank (which guarantees regularity, e.g., see Bertsekas (1998)). As mentioned in Section 2, the SC condition (from Lemma 1) combined with Assumption 1 implies that the mapping y*(x) is differentiable almost surely (Friesz & Bernstein, 2015, Theorem 2.22). As a result, at any given point x, we can consider a sufficiently small neighborhood around it such that the active constraints Ā remain unchanged. Then, we can differentiate (19) using the implicit function theorem as follows:
∇²_xy g(x, y*(x)) + ∇²_yy g(x, y*(x)) ∇y*(x) + Āᵀ∇λ̄*(x) = 0 (20)
Ā∇y*(x) = 0. (21)
Solving (20) for ∇y*(x) yields
∇y*(x) = [∇²_yy g(x, y*(x))]⁻¹ ( −∇²_xy g(x, y*(x)) − Āᵀ∇λ̄*(x) ), (22)
where we exploited the fact that the Hessian matrix ∇²_yy g(x, y*(x)) is positive definite and thus invertible.
Substituting (22) into (21) gives
Ā[∇²_yy g(x, y*(x))]⁻¹ ( −∇²_xy g(x, y*(x)) − Āᵀ∇λ̄*(x) ) = 0
=⇒ ∇λ̄*(x) = −( Ā[∇²_yy g(x, y*(x))]⁻¹Āᵀ )⁻¹ Ā[∇²_yy g(x, y*(x))]⁻¹ ∇²_xy g(x, y*(x)).
Finally, note that the KKT point y*(x) corresponds to the unique global minimum of (5), due to the strong convexity of g(x, ·). The proof is now complete.
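The formulas above can be sanity-checked numerically. The snippet below (our own illustration; the matrices are arbitrary) evaluates the expressions for the gradients of the multipliers and of y*(x) for a positive definite LL Hessian and a full-row-rank active-constraint matrix, and verifies the tangency condition A∇y*(x) = 0 from (21).

```python
import numpy as np

H = np.array([[2.0, 0.5], [0.5, 1.5]])    # grad2_yy g(x, y*(x)), positive definite
Gxy = np.array([[1.0, 0.0], [0.0, 1.0]])  # grad2_xy g(x, y*(x))
A_act = np.array([[1.0, 1.0]])            # active constraints, full row rank

Hinv = np.linalg.inv(H)
S = A_act @ Hinv @ A_act.T                # invertible since A_act has full row rank
# Gradient of the active multipliers: -(A H^{-1} A^T)^{-1} A H^{-1} grad2_xy g
dlam = -np.linalg.solve(S, A_act @ Hinv @ Gxy)
# Formula (22): grad y*(x) = H^{-1} (-grad2_xy g - A^T dlam)
dy = Hinv @ (-Gxy - A_act.T @ dlam)
```

By construction, the solution of the LL problem moves only tangentially to the active constraints, which is exactly what the check A_act @ dy = 0 confirms.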

D.1.3 THE PROOF OF LEMMA 3

The proof of Lemma 3 requires several intermediate results, which we provide below. Note that under Assumption 2(c) it holds that A(y*(x)) = A(y(x)); for simplicity, we denote this common active-constraint matrix by Ā in the derivations of this subsection. Moreover, we denote by L_A the maximum value of the quantity ∥Ā(y(x))∥ across all x ∈ X. Lemma 6. Suppose that Assumptions 1, 2 and 3 hold. Then for any x ∈ X, we have:
(a) ∥[∇²_yy g(x, y)]⁻¹∥ ≤ 1/μ_g, ∀y ∈ R^{d_ℓ}.
(b) ∥[∇²_yy g(x, y*(x))]⁻¹ − [∇²_yy g(x, y(x))]⁻¹∥ ≤ (1/μ_g)² L_gyy δ.
(c) ∥( Ā[∇²_yy g(x, y)]⁻¹Āᵀ )⁻¹∥ ≤ L_Ā, ∀y ∈ R^{d_ℓ}.
(d) ∥( Ā[∇²_yy g(x, y*(x))]⁻¹Āᵀ )⁻¹ − ( Ā[∇²_yy g(x, y(x))]⁻¹Āᵀ )⁻¹∥ ≤ L_A² L_Ā² (1/μ_g)² L_gyy δ.
Proof. a) We know that g(x, y) is strongly convex in y with modulus μ_g. Therefore, for any x ∈ X we have
∇²_yy g(x, y) ⪰ μ_g I ≻ 0, ∀y ∈ R^{d_ℓ} =⇒ 0 ≺ [∇²_yy g(x, y)]⁻¹ ⪯ (1/μ_g) I, ∀y ∈ R^{d_ℓ} =⇒ ∥[∇²_yy g(x, y)]⁻¹∥ ≤ 1/μ_g, ∀y ∈ R^{d_ℓ}.
b) To begin with, notice that for arbitrary square invertible matrices P, Q we have
∥P⁻¹ − Q⁻¹∥ = ∥P⁻¹(Q − P)Q⁻¹∥ ≤ ∥P⁻¹∥ ∥P − Q∥ ∥Q⁻¹∥. (24)
Then, using the above inequality we get
∥[∇²_yy g(x, y*(x))]⁻¹ − [∇²_yy g(x, y(x))]⁻¹∥ ≤ ∥[∇²_yy g(x, y*(x))]⁻¹∥ ∥∇²_yy g(x, y*(x)) − ∇²_yy g(x, y(x))∥ ∥[∇²_yy g(x, y(x))]⁻¹∥ ≤ (1/μ_g)² L_gyy ∥y*(x) − y(x)∥ ≤ (1/μ_g)² L_gyy δ,
where in the second inequality we used the result from Lemma 6(a) and the Lipschitz Hessian property of g in y (Assumption 3(d)); in the third inequality we used Assumption 2(a) for y(x).
c) In our problem, g is strongly convex in y and has a Lipschitz gradient in y. Thus, for any x ∈ X we have
L_y I ⪰ ∇²_yy g(x, y) ⪰ μ_g I ≻ 0 =⇒ 0 ≺ (1/L_y) I ⪯ [∇²_yy g(x, y)]⁻¹ ⪯ (1/μ_g) I, ∀y ∈ R^{d_ℓ}.
Also, for every x ∈ X, we have that ∥Aᵀz∥² = zᵀAAᵀz ≥ λ_min(AAᵀ)∥z∥² for all z.
Using the above two lower bounds we get
zᵀA[∇²_yy g(x, y)]⁻¹Aᵀz ≥ (1/L_y) λ_min(AAᵀ)∥z∥² > 0 for all z ≠ 0,
where the last inequality follows from the fact that A is full row rank, which implies λ_min(AAᵀ) > 0. Since the above inequality holds for every z ≠ 0, it follows that for all x ∈ X:
A[∇²_yy g(x, y)]⁻¹Aᵀ ⪰ (λ_min(AAᵀ)/L_y) I ≻ 0,
( A[∇²_yy g(x, y)]⁻¹Aᵀ )⁻¹ ⪯ (L_y/λ_min(AAᵀ)) I,
∥( A[∇²_yy g(x, y)]⁻¹Aᵀ )⁻¹∥ ≤ L_y/λ_min(AAᵀ).
Finally, for the given matrix A, consider the submatrix Ā = Ā(y(x)) generated by keeping only the rows corresponding to the active constraints at y(x). From Assumption 1(c) we know that Ā(y(x)) is full row rank for every x ∈ X, so λ_min( Ā(y(x))Ā(y(x))ᵀ ) > 0 for all x ∈ X. Then, we denote by λ̄_min the minimum value of the quantity λ_min( Ā(y(x))Ā(y(x))ᵀ ) across all x ∈ X. Therefore, we conclude that
∥( Ā[∇²_yy g(x, y)]⁻¹Āᵀ )⁻¹∥ ≤ L_y/λ̄_min := L_Ā.
d) Applying formula (24) with P = Ā[∇²_yy g(x, y*(x))]⁻¹Āᵀ and Q = Ā[∇²_yy g(x, y(x))]⁻¹Āᵀ, we get
∥P⁻¹ − Q⁻¹∥ ≤ ∥P⁻¹∥ ∥Ā( [∇²_yy g(x, y*(x))]⁻¹ − [∇²_yy g(x, y(x))]⁻¹ )Āᵀ∥ ∥Q⁻¹∥ ≤ L_Ā · L_A² (1/μ_g)² L_gyy δ · L_Ā = L_A² L_Ā² (1/μ_g)² L_gyy δ,
where in the final inequality we used the bounds derived in Lemma 6(b), 6(c), and the bound ∥Ā∥ ≤ L_A. The proof is now complete. Now let us bound the norms of the gradients of the mappings λ*(x) and y*(x). Lemma 7.
Under Assumptions 1, 2 and 3, the gradients of the mappings λ*(x) and y*(x) satisfy the following bounds for every x ∈ X:
∥∇λ*(x)∥ ≤ L_λ* and ∥∇y*(x)∥ ≤ L_y*,
where L_λ* = (1/μ_g) L_A L_Ā L_gxy and L_y* = (1/μ_g)( L_gxy + L_A L_λ* ); here L_A bounds ∥Ā∥ and L_Ā is the constant from Lemma 6(c). The same bounds hold for the approximations ∇λ̂*(x) and ∇y(x), which are obtained by substituting the estimate y(x) in place of y*(x) in the corresponding expressions (see Lemma 2). Proof. From Lemma 2 we have
∇λ*(x) = −( Ā[∇²_yy g(x, y*(x))]⁻¹Āᵀ )⁻¹ Ā[∇²_yy g(x, y*(x))]⁻¹ ∇²_xy g(x, y*(x)).
Taking norms,
∥∇λ*(x)∥ ≤ ∥( Ā[∇²_yy g(x, y*(x))]⁻¹Āᵀ )⁻¹∥ ∥Ā∥ ∥[∇²_yy g(x, y*(x))]⁻¹∥ ∥∇²_xy g(x, y*(x))∥ ≤ L_Ā L_A (1/μ_g) L_gxy := L_λ*,
where in the last inequality we used Lemma 6(a), 6(c) and Assumption 3(f). The same argument applied at y(x) gives ∥∇λ̂*(x)∥ ≤ L_λ*. Moving to the bound on ∥∇y*(x)∥, we know from Lemma 2 that
∇y*(x) = [∇²_yy g(x, y*(x))]⁻¹ ( −∇²_xy g(x, y*(x)) − Āᵀ∇λ*(x) ).
Then,
∥∇y*(x)∥ ≤ ∥[∇²_yy g(x, y*(x))]⁻¹∥ ( ∥∇²_xy g(x, y*(x))∥ + ∥Ā∥ ∥∇λ*(x)∥ ) ≤ (1/μ_g)( L_gxy + L_A L_λ* ) := L_y*,
where we used Lemma 6(a), Assumption 3(f), and the bound on ∥∇λ*(x)∥ derived above. The same argument with y(x) in place of y*(x) yields ∥∇y(x)∥ ≤ L_y*. The proof is now complete. In the next two results we bound the difference between the exact and approximate gradients of the mappings λ*(x) and y*(x). Lemma 8.
Suppose that Assumptions 1, 2 and 3 hold. Then, the following bound holds:
∥∇λ*(x) − ∇λ̂*(x)∥ ≤ L̃_λ* δ, where L̃_λ* = (1/μ_g)³ L_A³ L_Ā² L_gyy L_gxy + (1/μ_g) L_A L_Ā L_gxy + (1/μ_g)² L_A L_Ā L_gyy L_gxy.
Proof. We use the expression for ∇λ*(x) from Lemma 2 and its approximation ∇λ̂*(x), obtained by substituting y(x) for y*(x), that is,
∇λ̂*(x) = −( Ā[∇²_yy g(x, y(x))]⁻¹Āᵀ )⁻¹ Ā[∇²_yy g(x, y(x))]⁻¹ ∇²_xy g(x, y(x)).
Below, we use the following notation to simplify the derivations:
H(x) = Ā[∇²_yy g(x, y*(x))]⁻¹Āᵀ, G(x) = [∇²_yy g(x, y*(x))]⁻¹, M(x) = ∇²_xy g(x, y*(x)),
Ĥ(x) = Ā[∇²_yy g(x, y(x))]⁻¹Āᵀ, Ĝ(x) = [∇²_yy g(x, y(x))]⁻¹, M̂(x) = ∇²_xy g(x, y(x)).
Then, we have
∥∇λ*(x) − ∇λ̂*(x)∥ = ∥H⁻¹(x)ĀG(x)M(x) − Ĥ⁻¹(x)ĀĜ(x)M̂(x)∥
(a) ≤ ∥H⁻¹(x)ĀG(x)M(x) − Ĥ⁻¹(x)ĀG(x)M(x)∥ + ∥Ĥ⁻¹(x)ĀG(x)M(x) − Ĥ⁻¹(x)ĀĜ(x)M̂(x)∥
≤ ∥H⁻¹(x) − Ĥ⁻¹(x)∥ ∥Ā∥ ∥G(x)∥ ∥M(x)∥ + ∥Ĥ⁻¹(x)∥ ∥Ā∥ ∥G(x)M(x) − Ĝ(x)M̂(x)∥
(b) ≤ ∥H⁻¹(x) − Ĥ⁻¹(x)∥ ∥Ā∥ ∥G(x)∥ ∥M(x)∥ + ∥Ĥ⁻¹(x)∥ ∥Ā∥ ( ∥G(x)∥ ∥M(x) − M̂(x)∥ + ∥G(x) − Ĝ(x)∥ ∥M̂(x)∥ )
(c) ≤ L_A² L_Ā² (1/μ_g)² L_gyy δ · L_A (1/μ_g) L_gxy + L_Ā L_A (1/μ_g) L_gxy δ + L_Ā L_A (1/μ_g)² L_gyy δ L_gxy
= [ (1/μ_g)³ L_A³ L_Ā² L_gyy L_gxy + (1/μ_g) L_A L_Ā L_gxy + (1/μ_g)² L_A L_Ā L_gyy L_gxy ] δ.
In (a) we add and subtract the term Ĥ⁻¹(x)ĀG(x)M(x) and apply the triangle inequality. In (b) we add and subtract the term G(x)M̂(x) and apply the triangle inequality. In (c) we use Lemma 6(d) for ∥H⁻¹(x) − Ĥ⁻¹(x)∥, the bound ∥Ā∥ ≤ L_A, Lemma 6(a) for ∥G(x)∥ and ∥Ĝ(x)∥, Lemma 6(c) for ∥Ĥ⁻¹(x)∥, Assumption 3(f) for ∥M(x)∥ and ∥M̂(x)∥, Assumption 3(e) for ∥M(x) − M̂(x)∥, and Lemma 6(b) for ∥G(x) − Ĝ(x)∥. The proof is now complete.
Lemma 9. Suppose that Assumptions 1, 2 and 3 hold. Then, the following bound holds:
∥∇y*(x) − ∇y(x)∥ ≤ L̃_y* δ, where L̃_y* = (1/μ_g)² L_gyy L_gxy + (1/μ_g) L_gxy + (1/μ_g)² L_gyy L_A L_λ* + (1/μ_g) L_A L̃_λ*, with L̃_λ* the constant of Lemma 8.
Proof. From Lemma 2 we have that
∇y*(x) = [∇²_yy g(x, y*(x))]⁻¹ ( −∇²_xy g(x, y*(x)) − Āᵀ∇λ*(x) ).
We obtain ∇y(x) by substituting y(x) in place of y*(x) in the above formula, i.e.,
∇y(x) = [∇²_yy g(x, y(x))]⁻¹ ( −∇²_xy g(x, y(x)) − Āᵀ∇λ̂*(x) ).
Then, we have
∥∇y*(x) − ∇y(x)∥
(a) ≤ ∥[∇²_yy g(x, y*(x))]⁻¹∇²_xy g(x, y*(x)) − [∇²_yy g(x, y(x))]⁻¹∇²_xy g(x, y(x))∥ + ∥[∇²_yy g(x, y*(x))]⁻¹Āᵀ∇λ*(x) − [∇²_yy g(x, y(x))]⁻¹Āᵀ∇λ̂*(x)∥
(b) ≤ ∥( [∇²_yy g(x, y*(x))]⁻¹ − [∇²_yy g(x, y(x))]⁻¹ )∥ ∥∇²_xy g(x, y*(x))∥ + ∥[∇²_yy g(x, y(x))]⁻¹∥ ∥∇²_xy g(x, y*(x)) − ∇²_xy g(x, y(x))∥ + ∥( [∇²_yy g(x, y*(x))]⁻¹ − [∇²_yy g(x, y(x))]⁻¹ )∥ ∥Āᵀ∇λ*(x)∥ + ∥[∇²_yy g(x, y(x))]⁻¹∥ ∥Āᵀ( ∇λ*(x) − ∇λ̂*(x) )∥
(c) ≤ (1/μ_g)² L_gyy δ L_gxy + (1/μ_g) L_gxy δ + (1/μ_g)² L_gyy δ L_A L_λ* + (1/μ_g) L_A L̃_λ* δ
= [ (1/μ_g)² L_gyy L_gxy + (1/μ_g) L_gxy + (1/μ_g)² L_gyy L_A L_λ* + (1/μ_g) L_A L̃_λ* ] δ.
In (a) the triangle inequality was used. In (b) we added and subtracted the terms [∇²_yy g(x, y(x))]⁻¹∇²_xy g(x, y*(x)) and [∇²_yy g(x, y(x))]⁻¹Āᵀ∇λ*(x) in the first and second norms, respectively, and applied the triangle inequality and submultiplicativity. In (c) we applied Lemma 6(a), 6(b), Assumptions 3(e), 3(f), the bound ∥Ā∥ ≤ L_A, and Lemmas 7 and 8. The proof is now complete. Now we have all the results needed to prove Lemma 3. Proof of Lemma 3.
To begin with, the exact and approximate (due to the inexact solution of the LL problem) implicit gradients of the objective F(x) are given below:
∇F(x) = ∇_x f(x, y*(x)) + [∇y*(x)]ᵀ ∇_y f(x, y*(x)),
∇̂F(x) = ∇_x f(x, y(x)) + [∇y(x)]ᵀ ∇_y f(x, y(x)).
Then, we can bound the norm of their difference:
∥∇̂F(x) − ∇F(x)∥ = ∥∇_x f(x, y(x)) + [∇y(x)]ᵀ∇_y f(x, y(x)) − ∇_x f(x, y*(x)) − [∇y*(x)]ᵀ∇_y f(x, y*(x))∥
(a) ≤ ∥∇_x f(x, y(x)) − ∇_x f(x, y*(x))∥ + ∥[∇y(x)]ᵀ∇_y f(x, y(x)) − [∇y*(x)]ᵀ∇_y f(x, y*(x))∥
(b) ≤ ∥∇_x f(x, y(x)) − ∇_x f(x, y*(x))∥ + ∥[∇y(x)]ᵀ∇_y f(x, y(x)) − [∇y*(x)]ᵀ∇_y f(x, y(x))∥ + ∥[∇y*(x)]ᵀ∇_y f(x, y(x)) − [∇y*(x)]ᵀ∇_y f(x, y*(x))∥
(c) ≤ L_f ∥y(x) − y*(x)∥ + ∥∇y(x) − ∇y*(x)∥ ∥∇_y f(x, y(x))∥ + ∥∇y*(x)∥ ∥∇_y f(x, y(x)) − ∇_y f(x, y*(x))∥
(d) ≤ L_f ∥y(x) − y*(x)∥ + L_f ∥∇y(x) − ∇y*(x)∥ + L_f ∥∇y*(x)∥ ∥y(x) − y*(x)∥
(e) ≤ L_f δ + L̃_y* δ L_f + L_f L_y* δ = ( L_f + L̃_y* L_f + L_f L_y* ) δ := L_F δ,
where L_F = L_f + L̃_y* L_f + L_f L_y* (L̃_y* is the constant of Lemma 9). Also,
∥∇F(x)∥ = ∥∇_x f(x, y*(x)) + [∇y*(x)]ᵀ ∇_y f(x, y*(x))∥ ≤ ∥∇_x f(x, y*(x))∥ + ∥∇y*(x)∥ ∥∇_y f(x, y*(x))∥ ≤ (1 + L_y*) L_f := L̄_F,
where we applied Assumption 3(a) and Lemma 7. Similarly, we can see that
∥∇̂F(x)∥ = ∥∇_x f(x, y(x)) + [∇y(x)]ᵀ ∇_y f(x, y(x))∥ ≤ ∥∇_x f(x, y(x))∥ + ∥∇y(x)∥ ∥∇_y f(x, y(x))∥ ≤ (1 + L_y*) L_f = L̄_F.
Therefore, the proof is complete.

D.1.4 PROOF OF LEMMA 4

Proof. From the definition of the stochastic gradient in (10) we have
∇̂F(x; ξ) = ∇_x f(x, y(x); ξ) + [∇y(x)]ᵀ ∇_y f(x, y(x); ξ).
Taking the expectation on both sides and utilizing Assumption 4, we get
E_ξ[∇̂F(x; ξ)] = E_ξ[ ∇_x f(x, y(x); ξ) + [∇y(x)]ᵀ ∇_y f(x, y(x); ξ) ] = E_ξ[∇_x f(x, y(x); ξ)] + [∇y(x)]ᵀ E_ξ[∇_y f(x, y(x); ξ)] = ∇_x f(x, y(x)) + [∇y(x)]ᵀ ∇_y f(x, y(x)) = ∇̂F(x).
Similarly, for the variance of the stochastic implicit gradient, we have
E_ξ ∥∇̂F(x; ξ) − ∇̂F(x)∥² = E_ξ ∥ ∇_x f(x, y(x); ξ) + [∇y(x)]ᵀ∇_y f(x, y(x); ξ) − ( ∇_x f(x, y(x)) + [∇y(x)]ᵀ∇_y f(x, y(x)) ) ∥²
(a) ≤ 2 E_ξ ∥∇_x f(x, y(x); ξ) − ∇_x f(x, y(x))∥² + 2 ∥∇y(x)∥² E_ξ ∥∇_y f(x, y(x); ξ) − ∇_y f(x, y(x))∥²
(b) ≤ 2σ_f² + 2L_y*² σ_f² := σ_F²,
where (a) follows from ∥x + y∥² ≤ 2∥x∥² + 2∥y∥², and (b) results from Assumption 4 and the application of Lemma 7. This completes the proof.

D.2 PROOFS OF SECTION 3

D.2.1 PROOF OF LEMMA 5

Proof. Let {x_r}_{r=0}^∞ with x_r ∈ X be a given arbitrary countable sequence of points. Lemma 1 (adapted from (Lu et al., 2020, Proposition 1)) states that for any given point x_r in the above sequence, the SC condition holds w.p. 1 for the LL problem in (12), assuming that q is generated from a continuous measure and A(y*(x_r)) is full row rank. This further implies that the mapping y*(x), and thereby the implicit function F(x), is differentiable w.p. 1 at each given x_r in the above sequence (see the discussion after Lemma 1), i.e., we have
P( F(x) is differentiable at x_r ) = 1 for each r ∈ {0, 1, 2, . . .}. (27)
This further implies that
P( F(x) is differentiable for all {x_r}_{r=0}^∞ ) = P( ∩_{r=0}^∞ {F(x) is differentiable at x_r} ) = 1 − P( ∪_{r=0}^∞ {F(x) is non-differentiable at x_r} ) ≥ 1 − Σ_{r=0}^∞ P( {F(x) is non-differentiable at x_r} ) = 1,
where the second equality follows from the fact that P(ω ∈ A) = 1 -P(ω ∈ A c ) where A c denotes the complement of a measurable event A; the inequality uses the union bound; and the final equality utilizes (27) above.
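Lemma 4's two claims, unbiasedness and bounded variance of the stochastic implicit gradient, are easy to check numerically when the stochasticity enters as additive zero-mean noise on the two partial gradients. The sketch below uses a hypothetical fixed matrix `J` standing in for $\widehat{\nabla} y^*(x)$ and made-up gradient vectors; all numbers are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# Monte-Carlo check: if the sampled partial gradients equal their means plus
# zero-mean noise of per-coordinate std sigma_f, the stochastic implicit gradient
#   g(xi) = grad_x f(. ; xi) + J^T grad_y f(. ; xi)
# is unbiased, and its variance is bounded independently of the iterate.
rng = np.random.default_rng(1)
J = np.array([[1.0, 0.5], [0.0, 2.0]])   # stand-in for the (approximate) Jacobian
gx = np.array([0.3, -1.2])               # stand-in for grad_x f(x, y(x))
gy = np.array([0.7, 0.4])                # stand-in for grad_y f(x, y(x))
sigma_f, n = 0.1, 200_000

grad_det = gx + J.T @ gy
noise_x = sigma_f * rng.standard_normal((n, 2))
noise_y = sigma_f * rng.standard_normal((n, 2))
samples = (gx + noise_x) + (gy + noise_y) @ J  # each row applies J^T to a noisy grad_y
bias = np.linalg.norm(samples.mean(axis=0) - grad_det)
var = np.mean(np.sum((samples - grad_det) ** 2, axis=1))
# bias -> 0 as n grows; var is below the Lemma-4-style constant bound
var_bound = 2 * 2 * sigma_f**2 + 2 * 2 * np.linalg.norm(J, 2)**2 * sigma_f**2
```

The empirical bias vanishes at the Monte-Carlo rate $O(1/\sqrt{n})$, while the empirical variance stays below the dimension-dependent analogue of the $2\sigma_f^2 + 2L_{y^*}^2\sigma_f^2$ bound.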

D.2.2 PROOF OF PROPOSITION 2

Proof. From Assumption 1 we know that $h(x,y)$ (and thus $g(x,y)$) is strongly convex in $y$ with modulus $\mu_g = \mu_h$. As a result, we have
$$h(x, \tilde y^*(x)) \ge h(x, y^*(x)) + \big\langle \nabla_y h(x, y^*(x)),\, \tilde y^*(x) - y^*(x) \big\rangle + \frac{\mu_g}{2}\|\tilde y^*(x) - y^*(x)\|^2, \tag{28}$$
$$g(x, y^*(x)) \ge g(x, \tilde y^*(x)) + \big\langle \nabla_y g(x, \tilde y^*(x)),\, y^*(x) - \tilde y^*(x) \big\rangle + \frac{\mu_g}{2}\|y^*(x) - \tilde y^*(x)\|^2, \tag{29}$$
where $y^*(x)$ and $\tilde y^*(x)$ denote the LL minimizers of $h(x,\cdot)$ and of the perturbed objective $g(x,\cdot) = h(x,\cdot) + q^T(\cdot)$, respectively. By the optimality of $y^*(x)$ for $h(x,\cdot)$ it holds that $\langle \nabla_y h(x, y^*(x)), \tilde y^*(x) - y^*(x)\rangle \ge 0$; similarly, by the optimality of $\tilde y^*(x)$ for $g(x,\cdot)$ it holds that $\langle \nabla_y g(x, \tilde y^*(x)), y^*(x) - \tilde y^*(x)\rangle \ge 0$. Then, using the above inequalities and adding (28) and (29), we get
$$h(x,\tilde y^*(x)) + g(x, y^*(x)) \ge h(x, y^*(x)) + g(x, \tilde y^*(x)) + \mu_g \|\tilde y^*(x) - y^*(x)\|^2,$$
which implies
$$\begin{aligned}
\mu_g\|\tilde y^*(x) - y^*(x)\|^2 &\le \big[h(x,\tilde y^*(x)) - g(x,\tilde y^*(x))\big] + \big[g(x,y^*(x)) - h(x,y^*(x))\big] \\
&= -q^T \tilde y^*(x) + q^T y^*(x) = q^T \big(y^*(x) - \tilde y^*(x)\big) \le \|q\| \, \|y^*(x) - \tilde y^*(x)\|,
\end{aligned}$$
and therefore $\|\tilde y^*(x) - y^*(x)\| \le \|q\| / \mu_g$. Using this bound and the fact that $f$ is Lipschitz continuous (which follows from the bounded gradient Assumption 3(a)), it is easy to see that
$$|F(x) - G(x)| = |f(x, \tilde y^*(x)) - f(x, y^*(x))| \le L_f \|\tilde y^*(x) - y^*(x)\| \le \frac{L_f \|q\|}{\mu_g}.$$
Therefore, the proof is complete.
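The bound $\|\tilde y^*(x) - y^*(x)\| \le \|q\|/\mu_g$ of Proposition 2 can be verified directly on a small instance in which both LL solutions have closed forms. The quadratic objective, half-space constraint, and perturbation vector below are assumptions chosen purely so that the projections are analytic.

```python
import numpy as np

def proj_halfspace(z, a, b):
    """Euclidean projection onto the half-space {y : a^T y <= b}."""
    viol = float(a @ z - b)
    return z - (max(viol, 0.0) / float(a @ a)) * a

# LL objective h(y) = 0.5 * ||y - c||^2 (mu_g = 1) over a^T y <= b: its minimizer
# is proj(c). Adding the linear perturbation q^T y shifts the unconstrained
# minimizer to c - q, so the perturbed minimizer is proj(c - q).
c = np.array([2.0, 2.0])
a = np.array([1.0, 1.0])
b = 1.0
mu_g = 1.0
q = 0.05 * np.array([1.0, -2.0])

y_star = proj_halfspace(c, a, b)        # unperturbed LL solution
y_tilde = proj_halfspace(c - q, a, b)   # solution of the q-perturbed LL problem
gap = np.linalg.norm(y_tilde - y_star)
assert gap <= np.linalg.norm(q) / mu_g + 1e-12  # Proposition 2
```

Because the projection is non-expansive, the gap here is in fact bounded by $\|q\|$ exactly, consistent with (but slightly tighter than) the proposition's general bound.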

D.2.3 PROOF OF THEOREM 1

Lemma 10. Under Assumptions 1, 2, 3, $\nabla F(x)$ is almost surely continuous in a neighborhood of any given $x \in \mathcal{X}$.

Proof. To begin with, we already established in Lemma 2 that $F$ is almost surely differentiable at any given $x \in \mathcal{X}$. Therefore, for any $x \in \mathcal{X}$ there exists (almost surely) a neighborhood around it in which the matrix $A$ corresponding to the active constraints at $y^*(x)$, which appears in the expression for the gradient $\nabla y^*(x)$ in eqs. (6), (7), remains unchanged. Further, since this $A$ is locally (i.e., around any given $x$) constant, and the formulas in (6), (7) are compositions of continuous operations on continuous functions, it follows that $\nabla y^*(x)$ is also continuous in a neighborhood of $x$ almost surely. As a result, $\nabla F(x) = \nabla_x f(x, y^*(x)) + [\nabla y^*(x)]^T \nabla_y f(x, y^*(x))$ is almost surely continuous locally around any given $x \in \mathcal{X}$.

Proof of Theorem 1. Here we follow a reasoning similar to the proof of (Bertsekas, 1998, Prop. 1.2.1). However, a number of differences make our proof more challenging. First, in our setting we optimize an inexact version of the objective, $\widehat F(x) = f(x, \hat y(x))$, using an approximate gradient $\widehat{\nabla} F(x)$. Since this approximate gradient is not the gradient of $\widehat F(x)$ (the gradient of this function might not even exist), we consider a modification of the standard Armijo rule in which an additional error term is present. Second, the (classical) mean value theorem (Bertsekas, 1998, Prop. 1.23) cannot be applied in the proof below, since we cannot ensure that $F$ is (surely) differentiable on any given interval over $x$. As we show below, we instead use an alternative mean value theorem that does not require such an assumption.
To begin with, we know that for the exact implicit objective $F(x)$ and gradient $\nabla F(x)$ (quantities to which we do not have access), we can find at each iteration $r$ a step-size $a^r$ such that the following condition holds:
$$F(x^r) - F(x^r + a^r d^r) \ge -\sigma a^r [\nabla F(x^r)]^T d^r, \tag{30}$$
where $d^r = \bar x^r - x^r$ with $\bar x^r = \mathrm{proj}_{\mathcal{X}}(x^r - \nabla F(x^r))$. Next, the difference between the (approximate) objective values of two successive iterates (for simplicity we use the notation $x^{r+1} = x^r + a^r d^r$ and $\hat x^{r+1} = x^r + a^r \hat d^r$; $\hat d^r$ is defined in Algorithm 1) is
$$\begin{aligned}
\widehat F(x^r) - \widehat F(\hat x^{r+1}) &= \widehat F(x^r) - F(x^r) + F(x^r) - F(x^{r+1}) + F(x^{r+1}) - F(\hat x^{r+1}) + F(\hat x^{r+1}) - \widehat F(\hat x^{r+1}) \\
&= f(x^r, \hat y(x^r)) - f(x^r, y^*(x^r)) + F(x^r) - F(x^{r+1}) + F(x^{r+1}) - F(\hat x^{r+1}) + f(\hat x^{r+1}, y^*(\hat x^{r+1})) - f(\hat x^{r+1}, \hat y(\hat x^{r+1})) \\
&\ge -L_f\|y^*(x^r) - \hat y(x^r)\| + F(x^r) - F(x^{r+1}) - L_F\|x^{r+1} - \hat x^{r+1}\| - L_f\|y^*(\hat x^{r+1}) - \hat y(\hat x^{r+1})\| \\
&\ge -L_f \delta^r - \sigma a^r[\nabla F(x^r)]^T d^r - L_F a^r \|d^r - \hat d^r\| - L_f \delta^{r+1} \\
&\ge -L_f \delta^r - \sigma a^r[\nabla F(x^r)]^T d^r - L_F \widehat L_F a^r \delta^r - L_f \delta^{r+1} \\
&= -\sigma a^r [\nabla F(x^r)]^T d^r - \epsilon_1(\delta; r), \tag{31}
\end{aligned}$$
where we set $\epsilon_1(\delta;r) = L_f\delta^r + L_F \widehat L_F a^r \delta^r + L_f \delta^{r+1}$. In the first inequality, we used the Lipschitz continuity of $f$ and $F$; in the second inequality Assumption 2(a) and condition (30) were applied; in the third inequality the non-expansiveness of the projection operator was used. Also, we have that
$$\begin{aligned}
[\nabla F(x^r)]^T d^r &= \big(\nabla F(x^r) - \widehat\nabla F(x^r) + \widehat\nabla F(x^r)\big)^T \big(d^r - \hat d^r + \hat d^r\big) \\
&= \big(\nabla F(x^r) - \widehat\nabla F(x^r)\big)^T (d^r - \hat d^r) + \big(\nabla F(x^r) - \widehat\nabla F(x^r)\big)^T \hat d^r + \widehat\nabla F(x^r)^T (d^r - \hat d^r) + \widehat\nabla F(x^r)^T \hat d^r \\
&\le \|\nabla F(x^r) - \widehat\nabla F(x^r)\|\,\|d^r - \hat d^r\| + \|\nabla F(x^r) - \widehat\nabla F(x^r)\|\,\|\hat d^r\| + \|\widehat\nabla F(x^r)\|\,\|d^r - \hat d^r\| + \widehat\nabla F(x^r)^T \hat d^r \\
&\le \widehat L_F^2 (\delta^r)^2 + L_F \widehat L_F \delta^r + L_F \widehat L_F \delta^r + \widehat\nabla F(x^r)^T \hat d^r = \widehat\nabla F(x^r)^T \hat d^r + \epsilon_2(\delta; r), \tag{32}
\end{aligned}$$
where $\epsilon_2(\delta;r) = \widehat L_F^2 (\delta^r)^2 + 2 L_F \widehat L_F \delta^r$. Notice that the second inequality follows from Lemma 3.
Then, combining (31) and (32) we get
$$\widehat F(x^r) - \widehat F(\hat x^{r+1}) \ge -\sigma a^r \widehat\nabla F(x^r)^T \hat d^r - \epsilon_1(\delta;r) - \sigma a^r \epsilon_2(\delta;r) = -\sigma a^r \widehat\nabla F(x^r)^T \hat d^r - \epsilon(\delta;r), \tag{33}$$
where $\epsilon(\delta;r) = \epsilon_1(\delta;r) + \sigma a^r \epsilon_2(\delta;r)$; notice that $\lim_{\delta \to 0} \epsilon(\delta;r) = 0$. In conclusion, we can follow this (inexact) Armijo-type rule in our inexact problem; the existence of the (Armijo) step-size is guaranteed by its existence for the exact problem (30).

Now let us move to the main part of the proof, which follows the reasoning used in (Bertsekas, 1998, Prop. 1.2.1). Let $\{x^r\} \subset \mathcal{X}$ be the iterate sequence of our algorithm, and let $\bar x \in \mathcal{X}$ be a limit point; the existence of such a point is guaranteed by the closedness of the set $\mathcal{X}$. Moreover, it is established in Proposition 1 that $F(x)$ is continuous, and as a result it holds that $\lim_{r\to\infty} F(x^r) = F(\bar x)$. The latter, combined with the fact that every convergent sequence is also a Cauchy sequence, implies that $\lim_{r \to \infty}\big(F(x^r) - F(x^{r+1})\big) = 0$. We want to show that $\bar x$ is a stationary point of $F(x)$. We do so by assuming that the opposite holds, i.e., that $\bar x$ is not a stationary point of $F(x)$, and arriving at a contradiction. From the Armijo rule of our problem we have that
$$\widehat F(x^r) - \widehat F(\hat x^{r+1}) \ge -\sigma a^r \widehat\nabla F(x^r)^T \hat d^r - \epsilon(\delta;r).$$
Then, consider the following:
$$\begin{aligned}
\widehat F(x^r) - \widehat F(\hat x^{r+1}) &= \widehat F(x^r) - F(x^r) + F(x^r) - F(x^{r+1}) + F(x^{r+1}) - F(\hat x^{r+1}) + F(\hat x^{r+1}) - \widehat F(\hat x^{r+1}) \\
&= f(x^r, \hat y(x^r)) - f(x^r, y^*(x^r)) + F(x^r) - F(x^{r+1}) + F(x^{r+1}) - F(\hat x^{r+1}) + f(\hat x^{r+1}, y^*(\hat x^{r+1})) - f(\hat x^{r+1}, \hat y(\hat x^{r+1})) \\
&\le L_f \|y^*(x^r) - \hat y(x^r)\| + F(x^r) - F(x^{r+1}) + L_F \|x^{r+1} - \hat x^{r+1}\| + L_f \|y^*(\hat x^{r+1}) - \hat y(\hat x^{r+1})\| \\
&\le L_f \delta^r + F(x^r) - F(x^{r+1}) + L_F a^r \|d^r - \hat d^r\| + L_f \delta^{r+1} \\
&\le F(x^r) - F(x^{r+1}) + L_f \delta^r + L_F \widehat L_F a^r \delta^r + L_f \delta^{r+1} = F(x^r) - F(x^{r+1}) + \epsilon_1(\delta;r).
\end{aligned}$$
In the first inequality, we used the Lipschitz continuity of $f$ and $F$; in the second inequality Assumption 2(a) and condition (30) were applied; in the third inequality the non-expansiveness of the projection operator was used. Using the above derivation, we can bound the left-hand side of inequality (33) as follows:
$$F(x^r) - F(x^{r+1}) + \epsilon_1(\delta;r) \ge \widehat F(x^r) - \widehat F(\hat x^{r+1}) \ge -\sigma a^r \widehat\nabla F(x^r)^T \hat d^r - \epsilon(\delta;r).$$
It is easy to see that the left-hand side in the above inequality tends to $0$. Therefore,
$$\lim_{r\to+\infty} \big(-\sigma a^r \widehat\nabla F(x^r)^T \hat d^r - \epsilon(\delta;r)\big) \le 0. \tag{34}$$
In addition, we know that $\lim_{r\to+\infty} \epsilon(\delta;r) = 0$ and $-\sigma a^r \widehat\nabla F(x^r)^T \hat d^r \ge 0$ for all $x^r \in \mathcal{X}$. From the above statements we can conclude that
$$\lim_{r\to+\infty} \sigma a^r \widehat\nabla F(x^r)^T \hat d^r = 0. \tag{35}$$
Moreover, from the gradient-related assumption we know that for a non-stationary point $\bar x$ we have
$$\limsup_{r\to\infty,\, r\in R} \widehat\nabla F(x^r)^T \hat d^r < 0,$$
where $\{x^r\}_R$ is a subsequence with $\lim_{r\to\infty,\, r\in R} x^r = \bar x$. Then, conditions (34), (35) imply that $\lim_{r\to\infty,\, r\in R} a^r = 0$. Hence, in the subsequence $R$ we can find an index $\bar r \ge 0$ such that
$$\widehat F(x^r) - \widehat F\Big(x^r + \frac{a^r}{\beta}\,\hat d^r\Big) < -\sigma\, \frac{a^r}{\beta}\, \widehat\nabla F(x^r)^T \hat d^r - \epsilon(\delta;r), \quad \forall r \in R,\ r \ge \bar r. \tag{36}$$
Similarly to the proof of (Bertsekas, 1998, Prop. 1.2.1), let us introduce the sequences
$$\hat p^r = \frac{\hat d^r}{\|\hat d^r\|}, \qquad \bar a^r = \frac{a^r \|\hat d^r\|}{\beta}.$$
The sequence $\{\hat p^r\}$ is bounded and so it admits a limit point $\bar p$ with $\|\bar p\| = 1$, that is, $\lim_{r\to+\infty,\, r\in \bar R} \hat p^r = \bar p$, where $\bar R$ denotes the indices of a subsequence of $R$. In addition, taking into account that $\lim_{r\to+\infty,\, r\in R} a^r = 0$ and that the sequence $\{\|\hat d^r\|\}_R$ is bounded, we can easily see that $\lim_{r\to+\infty,\, r\in \bar R} \bar a^r = 0$. Dividing both sides of (36) by $\bar a^r$ and using the definitions of $\hat p^r$ and $\bar a^r$ from above, we get
$$\frac{\widehat F(x^r) - \widehat F(x^r + \bar a^r \hat p^r)}{\bar a^r} < -\sigma\, \widehat\nabla F(x^r)^T \hat p^r - \epsilon(\delta;r), \quad \forall r \in \bar R,\ r > \bar r. \tag{37}$$
Then, we have that (for convenience we adopt the notation $\hat x^{r+1} = x^r + \bar a^r \hat p^r$)
$$\begin{aligned}
\widehat F(x^r) - \widehat F(\hat x^{r+1}) &= \widehat F(x^r) - F(x^r) + F(x^r) - F(\hat x^{r+1}) + F(\hat x^{r+1}) - \widehat F(\hat x^{r+1}) \\
&= f(x^r, \hat y(x^r)) - f(x^r, y^*(x^r)) + F(x^r) - F(\hat x^{r+1}) + f(\hat x^{r+1}, y^*(\hat x^{r+1})) - f(\hat x^{r+1}, \hat y(\hat x^{r+1})) \\
&\ge -L_f\|y^*(x^r) - \hat y(x^r)\| - L_f\|y^*(\hat x^{r+1}) - \hat y(\hat x^{r+1})\| + F(x^r) - F(\hat x^{r+1}) \\
&\ge F(x^r) - F(\hat x^{r+1}) - L_f \delta^r - L_f \delta^{r+1}, \tag{38}
\end{aligned}$$
where the first inequality above follows from the Lipschitz continuity of $f$, and the second inequality is an application of Assumption 2(a). Incorporating inequality (38) into (37) results in
$$\frac{F(x^r) - F(x^r + \bar a^r \hat p^r)}{\bar a^r} - L_f\, \frac{\delta^r + \delta^{r+1}}{\bar a^r} < -\sigma\, \widehat\nabla F(x^r)^T \hat p^r - \epsilon(\delta;r), \quad \forall r \in \bar R,\ r > \bar r. \tag{39}$$
Lebourg's mean value theorem (Lebourg, 1979, Theorem 1.7) implies that
$$\frac{F(x^r) - F(x^r + \bar a^r \hat p^r)}{\bar a^r} = -u^T \hat p^r \quad \text{with } u \in \partial F(x^r + \tilde a^r \hat p^r) \text{ and } \tilde a^r \in [0, \bar a^r],$$
where $\partial F(\cdot)$ denotes the Clarke subdifferential of $F$. We know that $F$ is almost surely continuously differentiable (Lemma 10) at any $x^r + \tilde a^r \hat p^r \in \mathcal{X}$, and so the Clarke subdifferential there reduces w.p. 1 to $\{\nabla F(x^r + \tilde a^r \hat p^r)\}$. Note that we cannot use here the (classical) mean value theorem (Bertsekas, 1998, Prop. 1.23), as in the proof of (Bertsekas, 1998, Prop. 1.2.1), because it requires that the function $F(x)$ is (surely) differentiable on the interval $[x^r, x^r + \bar a^r \hat p^r]$. Then, we can rewrite the expression in (39) as follows:
$$-L_f\, \frac{\delta^r + \delta^{r+1}}{\bar a^r} - \big[\nabla F(x^r + \tilde a^r \hat p^r)\big]^T \hat p^r < -\sigma\, \widehat\nabla F(x^r)^T \hat p^r - \epsilon(\delta;r), \quad \forall r \in \bar R,\ r > \bar r, \tag{40}$$
where $\tilde a^r \in [0, \bar a^r]$. Using the assumption that $0 \le \delta^r / a^r \sim \mathcal{O}(c^r)$, where $c^r$ is some sequence with $\lim_{r\to\infty,\, r\in \bar R} c^r = 0$, and the fact that $\lim_{r\to\infty,\, r\in \bar R} \|\hat d^r\| \ne 0$ (because of the assumption that the sequence $\{x^r\}$ converges to a non-stationary point), we can take the limit in (40) (the $L_f \beta (\delta^r + \delta^{r+1})/(a^r\|\hat d^r\|)$ term vanishes) and obtain
$$-[\nabla F(\bar x)]^T \bar p \le -\sigma\,[\nabla F(\bar x)]^T \bar p \;\Longrightarrow\; 0 \le (1-\sigma)\,[\nabla F(\bar x)]^T \bar p \;\Longrightarrow\; 0 \le [\nabla F(\bar x)]^T \bar p.$$
However, note that $\widehat\nabla F(x^r)^T \hat p^r = \widehat\nabla F(x^r)^T \hat d^r / \|\hat d^r\|$, and therefore, if we take limits on both sides, we obtain

$$[\nabla F(\bar x)]^T \bar p \le \frac{\limsup_{r\to\infty,\, r\in \bar R}\, \widehat\nabla F(x^r)^T \hat d^r}{\limsup_{r\to\infty,\, r\in \bar R}\, \|\hat d^r\|} < 0, \tag{41}$$
due to the gradient-related assumption. We notice that expressions (40) and (41) lead to a contradiction. Therefore, $\bar x$ is a stationary point of $F(x)$. The proof is now complete.

D.2.4 WEAKLY-CONVEX OBJECTIVE: PROOF OF THEOREM 2

Proof. Let $\hat x^r := \arg\min_{z \in \mathcal{X}} F(z) + \frac{\hat\rho}{2}\|x^r - z\|^2$. Using the definition of the Moreau envelope, we have
$$\begin{aligned}
\mathbb{E}[H_{1/\hat\rho}(x^{r+1})] &\overset{(a)}{\le} \mathbb{E}\Big[F(\hat x^r) + \frac{\hat\rho}{2}\big\|\mathrm{proj}_{\mathcal{X}}\big(x^r - \beta \widehat\nabla F(x^r;\xi^r)\big) - \mathrm{proj}_{\mathcal{X}}(\hat x^r)\big\|^2\Big] \\
&\overset{(b)}{\le} \mathbb{E}\Big[F(\hat x^r) + \frac{\hat\rho}{2}\big\|x^r - \beta\widehat\nabla F(x^r;\xi^r) - \hat x^r\big\|^2\Big] \\
&= F(\hat x^r) + \frac{\hat\rho}{2}\,\mathbb{E}\Big[\|x^r - \hat x^r\|^2 - 2\big\langle x^r - \hat x^r,\ \beta\widehat\nabla F(x^r;\xi^r)\big\rangle + \beta^2\|\widehat\nabla F(x^r;\xi^r)\|^2\Big] \\
&\overset{(c)}{\le} F(\hat x^r) + \frac{\hat\rho}{2}\,\mathbb{E}\Big[\|x^r - \hat x^r\|^2 - 2\big\langle x^r - \hat x^r,\ \beta\widehat\nabla F(x^r;\xi^r)\big\rangle + 2\beta^2\|\widehat\nabla F(x^r;\xi^r) - \widehat\nabla F(x^r)\|^2 + 2\beta^2 \|\widehat\nabla F(x^r)\|^2\Big] \\
&\overset{(d)}{\le} F(\hat x^r) + \frac{\hat\rho}{2}\|x^r - \hat x^r\|^2 - \hat\rho\beta\big\langle x^r - \hat x^r,\ \widehat\nabla F(x^r)\big\rangle + \hat\rho\beta^2\big(\sigma_F^2 + L_F^2\big) \\
&= F(\hat x^r) + \frac{\hat\rho}{2}\|x^r - \hat x^r\|^2 + \hat\rho\beta\underbrace{\big\langle \hat x^r - x^r,\ \nabla F(x^r)\big\rangle}_{\text{Term I}} + \hat\rho\beta\underbrace{\big\langle \hat x^r - x^r,\ \widehat\nabla F(x^r) - \nabla F(x^r)\big\rangle}_{\text{Term II}} + \hat\rho\beta^2\big(\sigma_F^2 + L_F^2\big), \tag{42}
\end{aligned}$$
where (a) follows from the fact that $x^{r+1} \in \mathcal{X}$ and $\hat x^r \in \mathcal{X}$; (b) results from the non-expansiveness of the projection operator; (c) uses $\|a - b\|^2 \le 2\|a\|^2 + 2\|b\|^2$; and (d) results from the application of Lemmas 3 and 4. Next, we consider Term I and Term II in (42) separately. For Term I, using the weak convexity of $F(\cdot)$, we get
$$\text{Term I} = \big\langle \hat x^r - x^r,\ \nabla F(x^r)\big\rangle \le F(\hat x^r) - F(x^r) + \frac{\rho}{2}\|x^r - \hat x^r\|^2.$$
We bound Term II using Young's inequality:
$$\text{Term II} = \big\langle \hat x^r - x^r,\ \widehat\nabla F(x^r) - \nabla F(x^r)\big\rangle \le \frac{\rho}{2}\|x^r - \hat x^r\|^2 + \frac{1}{2\rho}\|\widehat\nabla F(x^r) - \nabla F(x^r)\|^2 \overset{(e)}{\le} \frac{\rho}{2}\|x^r - \hat x^r\|^2 + \frac{1}{2\rho}\widehat L_F^2 \delta^2,$$
where (e) follows from Lemma 3. Next, substituting the bounds of Term I and Term II in (42) and using the definition of $\hat x^r$, we get
$$\begin{aligned}
\mathbb{E}[H_{1/\hat\rho}(x^{r+1})] &\le F(\hat x^r) + \frac{\hat\rho}{2}\|x^r - \hat x^r\|^2 + \hat\rho\beta\big(F(\hat x^r) - F(x^r) + \rho\|x^r - \hat x^r\|^2\big) + \frac{\hat\rho\beta}{2\rho}\widehat L_F^2\delta^2 + \hat\rho\beta^2(\sigma_F^2 + L_F^2) \\
&\le H_{1/\hat\rho}(x^r) + \hat\rho\beta\underbrace{\big(F(\hat x^r) - F(x^r) + \rho\|x^r - \hat x^r\|^2\big)}_{\text{Term III}} + \frac{\hat\rho\beta}{2\rho}\widehat L_F^2\delta^2 + \hat\rho\beta^2(\sigma_F^2 + L_F^2). \tag{43}
\end{aligned}$$
Next, we bound Term III in (43):
$$\text{Term III} = F(\hat x^r) - F(x^r) + \rho\|x^r - \hat x^r\|^2 = \Big(F(\hat x^r) + \frac{\hat\rho}{2}\|x^r - \hat x^r\|^2 - F(x^r)\Big) + \frac{2\rho - \hat\rho}{2}\|x^r - \hat x^r\|^2 \le \frac{3\rho - 2\hat\rho}{2}\|x^r - \hat x^r\|^2 = \frac{3\rho - 2\hat\rho}{2\hat\rho^2}\|\nabla H_{1/\hat\rho}(x^r)\|^2,$$
where the last equality follows from (13) and the first inequality follows from the fact that $F(x) + \frac{\hat\rho}{2}\|x^r - x\|^2$ is $(\hat\rho - \rho)$-strongly convex. The latter implies
$$F(\hat x^r) + \frac{\hat\rho}{2}\|x^r - \hat x^r\|^2 - F(x^r) \le -\big\langle \nabla F(\hat x^r) + \hat\rho(\hat x^r - x^r),\ x^r - \hat x^r\big\rangle - \frac{\hat\rho - \rho}{2}\|x^r - \hat x^r\|^2 \le \frac{\rho - \hat\rho}{2}\|x^r - \hat x^r\|^2,$$
where the second inequality results from the definition of $\hat x^r$. Finally, substituting Term III in (43) and rearranging the terms, we get
$$\beta\,\frac{2\hat\rho - 3\rho}{2\hat\rho}\,\|\nabla H_{1/\hat\rho}(x^r)\|^2 \le \mathbb{E}\big[H_{1/\hat\rho}(x^r) - H_{1/\hat\rho}(x^{r+1})\big] + \hat\rho\beta^2(\sigma_F^2 + L_F^2) + \frac{\hat\rho\beta}{2\rho}\widehat L_F^2 \delta^2.$$
Summing over all $r \in \{0, 1, \dots, T-1\}$ and dividing by $T$, we get
$$\frac{1}{T}\sum_{r=0}^{T-1}\mathbb{E}\|\nabla H_{1/\hat\rho}(x^r)\|^2 \le \frac{2\hat\rho}{2\hat\rho - 3\rho}\bigg[\frac{H_{1/\hat\rho}(x^0) - H^*}{\beta T} + \hat\rho\beta(\sigma_F^2 + L_F^2) + \frac{\hat\rho}{2\rho}\widehat L_F^2\delta^2\bigg].$$
Therefore, we have the result.
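For reference, the two facts about the Moreau envelope that the argument above leans on, its definition via the proximal point $\hat x^r$ and the gradient identity invoked through (13), are standard properties of the envelope of a weakly convex function over a closed convex set; in the notation of the proof they read:

```latex
% Moreau envelope and proximal point of the (rho-weakly convex) implicit
% objective F over X, with smoothing parameter \hat\rho > \rho:
H_{1/\hat\rho}(x) \;:=\; \min_{z \in \mathcal{X}} \Big\{ F(z) + \tfrac{\hat\rho}{2}\,\|x - z\|^2 \Big\},
\qquad
\hat{x} \;:=\; \operatorname*{arg\,min}_{z \in \mathcal{X}} \Big\{ F(z) + \tfrac{\hat\rho}{2}\,\|x - z\|^2 \Big\},
% gradient identity used via (13): the envelope is differentiable with
\nabla H_{1/\hat\rho}(x) \;=\; \hat\rho\,\big(x - \hat{x}\big),
% so \|x - \hat{x}\| = \|\nabla H_{1/\hat\rho}(x)\| / \hat\rho, which converts the
% Term III bound into a bound on the stationarity measure \|\nabla H_{1/\hat\rho}(x^r)\|^2.
```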

D.2.5 STRONGLY-CONVEX OBJECTIVE: PROOF OF THEOREM 3

Proof. Using the update rule of Algorithm 2, we have
$$\begin{aligned}
\mathbb{E}\|x^{r+1} - x^*\|^2 &= \mathbb{E}\big\|\mathrm{proj}_{\mathcal{X}}\big(x^r - \beta^r\widehat\nabla F(x^r;\xi^r)\big) - \mathrm{proj}_{\mathcal{X}}(x^*)\big\|^2 \\
&\overset{(a)}{\le} \mathbb{E}\big\|x^r - \beta^r \widehat\nabla F(x^r;\xi^r) - x^*\big\|^2 \\
&= \mathbb{E}\big[\|x^r - x^*\|^2 + (\beta^r)^2\|\widehat\nabla F(x^r;\xi^r)\|^2 - 2\beta^r\langle x^r - x^*,\ \widehat\nabla F(x^r;\xi^r)\rangle\big] \\
&\overset{(b)}{\le} \mathbb{E}\big[\|x^r - x^*\|^2 + 2(\beta^r)^2\|\widehat\nabla F(x^r;\xi^r) - \widehat\nabla F(x^r)\|^2 + 2(\beta^r)^2\|\widehat\nabla F(x^r)\|^2 - 2\beta^r \langle x^r - x^*,\ \widehat\nabla F(x^r)\rangle\big] \\
&\overset{(c)}{\le} \mathbb{E}\big[\|x^r - x^*\|^2 + 2(\beta^r)^2(\sigma_F^2 + B_F^2) - 2\beta^r\langle x^r - x^*,\ \nabla F(x^r)\rangle - 2\beta^r\langle x^r - x^*,\ \widehat\nabla F(x^r) - \nabla F(x^r)\rangle\big] \\
&\overset{(d)}{\le} \mathbb{E}\big[(1 - \mu_F\beta^r)\|x^r - x^*\|^2 + 2(\beta^r)^2(\sigma_F^2 + B_F^2) - 2\beta^r[F(x^r) - F^*] + 2\beta^r\|x^r - x^*\|\,\|\widehat\nabla F(x^r) - \nabla F(x^r)\|\big] \\
&\overset{(e)}{\le} \mathbb{E}\big[(1-\mu_F\beta^r)\|x^r - x^*\|^2 + 2(\beta^r)^2(\sigma_F^2 + B_F^2) - 2\beta^r[F(x^r) - F^*] + 2\beta^r D_{\mathcal{X}}\widehat L_F\delta\big],
\end{aligned}$$
where in (a) the non-expansive property of the projection operator is applied; in (b) we add and subtract the term $\widehat\nabla F(x^r)$, use the well-known inequality $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, and in the last term utilize the unbiasedness from Lemma 4; (c) results from the application of Lemmas 3 and 4; (d) uses the strong convexity (Assumption 6) of $F(\cdot)$ and the Cauchy–Schwarz inequality; and finally, (e) results from the application of Lemma 3 and Assumption 1(b). Rearranging the terms, we get
$$2\beta^r\,\mathbb{E}[F(x^r) - F^*] \le \mathbb{E}\big[(1-\mu_F\beta^r)\|x^r - x^*\|^2 - \|x^{r+1} - x^*\|^2\big] + 2(\beta^r)^2(\sigma_F^2 + B_F^2) + 2\beta^r D_{\mathcal{X}}\widehat L_F\delta,$$
$$\mathbb{E}[F(x^r) - F^*] \le \mathbb{E}\bigg[\Big(\frac{1}{2\beta^r} - \frac{\mu_F}{2}\Big)\|x^r - x^*\|^2 - \frac{1}{2\beta^r}\|x^{r+1} - x^*\|^2\bigg] + \beta^r(\sigma_F^2 + B_F^2) + D_{\mathcal{X}}\widehat L_F\delta.$$
Summing over $r \in \{0, \dots, T-1\}$, multiplying by $1/T$, and using the fact that $\beta^r = \frac{1}{\mu_F(r+1)}$, we get
$$\frac{1}{T}\sum_{r=0}^{T-1}\mathbb{E}[F(x^r) - F^*] \le \frac{\mu_F}{2T}\sum_{r=0}^{T-1}\mathbb{E}\big[r\|x^r - x^*\|^2 - (r+1)\|x^{r+1} - x^*\|^2\big] + \frac{1}{T}\sum_{r=0}^{T-1}\beta^r(\sigma_F^2 + B_F^2) + D_{\mathcal{X}}\widehat L_F\delta.$$
Telescoping the sum, we get the following:
$$\frac{1}{T}\sum_{r=0}^{T-1}\mathbb{E}[F(x^r) - F^*] \le -\frac{\mu_F}{2}\,\mathbb{E}\|x^T - x^*\|^2 + \frac{\sigma_F^2 + B_F^2}{\mu_F}\,\frac{\log(T)}{T} + D_{\mathcal{X}}\widehat L_F\delta \le \frac{\sigma_F^2 + B_F^2}{\mu_F}\,\frac{\log(T)}{T} + D_{\mathcal{X}}\widehat L_F\delta.$$
Therefore, we have the proof.
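The $\mathcal{O}(\log T / T)$ behavior of the averaged optimality gap can be observed on a one-dimensional toy problem; the objective, noise level, and feasible interval below are illustrative assumptions (not the paper's experiment), with $\delta = 0$ so that only the rate term remains.

```python
import numpy as np

# Projected stochastic gradient on F(x) = 0.5 * (x - 1)^2 (mu_F = 1) over
# X = [-2, 2], with step sizes beta_r = 1 / (mu_F * (r + 1)) as in Theorem 3.
rng = np.random.default_rng(2)
mu_F, x_opt, T, sigma = 1.0, 1.0, 20_000, 0.1
x, gaps = -2.0, []
for r in range(T):
    g = (x - x_opt) + sigma * rng.standard_normal()           # stochastic gradient
    x = float(np.clip(x - g / (mu_F * (r + 1)), -2.0, 2.0))   # projected step
    gaps.append(0.5 * (x - x_opt) ** 2)
avg_gap = float(np.mean(gaps))  # scales like O(log T / T) when delta = 0
```

With this step-size schedule the iterate is essentially a running average of the noisy targets, so both the final gap and the averaged gap are tiny compared to the initial gap of $4.5$.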



This is the LICQ condition (Bertsekas, 1998) of the LL problem. It is used to ensure that the optimal solutions satisfy the KKT conditions. Under the assumption that $\nabla_{y_i} g(x, y^*(x)) > 0$, $\forall i \in S(y^*(x))$.



Figure 1: Plot of y * (x).

the [S]SIGD algorithm converges to a stationary point at a rate of O(1/ √ T ) with an additive error determined by the accuracy of the LL problem's solution δ (see Assumption 2).

Figure 2: ∥∇F (x)∥ vs # of iterations.

Projected Gradient Descent (PGD)
1: Input: Initial parameter $y^0$, current iterate $x$, number of iterations $T$, step-sizes $\{\gamma^r\}_{r=0}^{T-1}$, constraints $A$, $b$
2: for $r = 0, 1, \dots, T-1$ do
3: $\quad \tilde y^{r+1} = y^r - \gamma^r \nabla_y g(x, y^r)$
4: $\quad$ Obtain $y^{r+1}$ by projecting $\tilde y^{r+1}$ onto $Y = \{y \in \mathbb{R}^{d_\ell} \mid Ay \le b\}$, i.e., by solving the following QP:
$$\min_{y \in \mathbb{R}^{d_\ell}} \|y - \tilde y^{r+1}\|^2 \quad \text{s.t.} \quad Ay \le b \tag{15}$$
5: end for

C ADDITIONAL EXPERIMENTS

In this section, we include additional experiments on quadratic bilevel optimization problems and adversarial training, along with the implementation details. First, we evaluate the performance of [D]SIGD and [S]SIGD on quadratic bilevel optimization problems.
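The routine above can be sketched in a few lines. The instance at the bottom (a quadratic LL objective with a single half-space constraint) is an assumed toy example, and the inner QP (15) is solved here with SciPy's SLSQP rather than a dedicated QP solver.

```python
import numpy as np
from scipy.optimize import minimize

def project_polyhedron(z, A, b):
    """Euclidean projection onto {y : A y <= b} -- the QP in (15), via SLSQP."""
    if np.all(A @ z <= b + 1e-12):
        return z  # already feasible, projection is the identity
    cons = {"type": "ineq", "fun": lambda y: b - A @ y, "jac": lambda y: -A}
    res = minimize(lambda y: 0.5 * np.sum((y - z) ** 2), z,
                   jac=lambda y: y - z, constraints=[cons], method="SLSQP")
    return res.x

def pgd(grad_y, y0, A, b, T, gammas):
    """Projected gradient descent on y -> g(x, y) over {y : A y <= b}."""
    y = np.asarray(y0, dtype=float)
    for r in range(T):
        y = project_polyhedron(y - gammas[r] * grad_y(y), A, b)
    return y

# toy LL problem (assumed): g(x, y) = 0.5 * ||y - c||^2 with c = (2, 2) and
# constraint y_1 + y_2 <= 1; the solution is the projection of c, i.e. (0.5, 0.5)
c = np.array([2.0, 2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
y_T = pgd(lambda y: y - c, np.zeros(2), A, b, T=50, gammas=[0.5] * 50)
```

In practice a warm-started QP solver would be used for step 4; SLSQP is chosen here only to keep the sketch dependency-light.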

Figure 3: Convergence curves w.r.t. number of iterations.

Figure 4: The influence of Gaussian variance on the RA and SA. The experiments are based on CIFAR-10 with ResNet-18 model.

Figure 5: The influence of Gaussian variance on the RA and SA.

In inequality (a) above we applied the triangle inequality; in (b) we added and subtracted the term $[\nabla y^*(x)]^T \nabla_y f(x, \hat y(x))$ and used the triangle inequality; in (c) and (d) we used the Lipschitz gradient property of $f$ (Assumption 2(b)); in (e) we applied Assumptions 3(a) and 2(a), and Lemmas 7 and 9.




D.2.6 CONVEX OBJECTIVE: PROOF OF THEOREM 4

Proof. From the update rule of Algorithm 2, we have, for $\beta^r = \beta$ for all $r \in \{0, 1, \dots, T-1\}$,
$$\begin{aligned}
\mathbb{E}[\|x^{r+1} - x^*\|^2] &= \mathbb{E}\big[\|\mathrm{proj}_{\mathcal{X}}(x^r - \beta\widehat\nabla F(x^r;\xi^r)) - x^*\|^2\big] \\
&\overset{(a)}{=} \mathbb{E}\big[\|\mathrm{proj}_{\mathcal{X}}(x^r - \beta\widehat\nabla F(x^r;\xi^r)) - \mathrm{proj}_{\mathcal{X}}(x^*)\|^2\big] \\
&\overset{(b)}{\le} \mathbb{E}\big[\|x^r - \beta\widehat\nabla F(x^r;\xi^r) - x^*\|^2\big] \\
&\overset{(c)}{=} \mathbb{E}\big[\|x^r - x^*\|^2 + \beta^2\|\widehat\nabla F(x^r;\xi^r)\|^2 - 2\beta\langle x^r - x^*,\ \widehat\nabla F(x^r;\xi^r)\rangle\big] \\
&\overset{(d)}{\le} \mathbb{E}\big[\|x^r - x^*\|^2 + 2\beta^2\|\widehat\nabla F(x^r;\xi^r) - \widehat\nabla F(x^r)\|^2 + 2\beta^2\|\widehat\nabla F(x^r)\|^2 - 2\beta\langle x^r - x^*,\ \nabla F(x^r)\rangle - 2\beta\langle x^r - x^*,\ \widehat\nabla F(x^r) - \nabla F(x^r)\rangle\big] \\
&\overset{(e)}{\le} \mathbb{E}\big[\|x^r - x^*\|^2 + 2\beta^2\sigma_F^2 + 2\beta^2 L_F^2 - 2\beta(F(x^r) - F^*) + 2\beta\|x^r - x^*\|\,\|\widehat\nabla F(x^r) - \nabla F(x^r)\|\big] \\
&\overset{(f)}{\le} \mathbb{E}\big[\|x^r - x^*\|^2 + 2\beta^2(\sigma_F^2 + L_F^2) - 2\beta(F(x^r) - F^*) + 2\beta D_{\mathcal{X}}\widehat L_F\delta\big],
\end{aligned}$$
where (a) follows from the fact that $x^* \in \mathcal{X}$; (b) results from the non-expansiveness of the projection operator; (c) uses $\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\langle a, b\rangle$; (d) utilizes the fact that $\|a + b\|^2 \le 2\|a\|^2 + 2\|b\|^2$ and the unbiased gradient from Lemma 4; (e) results from Lemmas 3 and 4, the convexity assumption on the implicit function (Assumption 6 with $\mu_F = 0$), and the Cauchy–Schwarz inequality; and (f) results from Assumption 1(b) and Lemma 3. Summing over $r \in \{0, 1, \dots, T-1\}$, multiplying by $1/T$, and rearranging the terms, we get
$$\frac{1}{T}\sum_{r=0}^{T-1}\mathbb{E}[F(x^r) - F^*] \le \frac{\|x^0 - x^*\|^2}{2\beta T} + \beta(\sigma_F^2 + L_F^2) + D_{\mathcal{X}}\widehat L_F\delta.$$
Therefore, we have the proof.

Performance overview of different methods on CIFAR-100 (Krizhevsky et al., 2009) with ResNet-18 (He et al., 2016). The result $a_{\pm b}$ represents the mean value $a$ with a standard deviation of $b$ over 5 random trials.
SA: 53.83±0.19, 53.33±0.18, 53.88±0.22, 54.01±0.24, 53.79±0.14, 54.44±0.18, 57.74±0.22
RA: 27.36±0.24, 28.44±0.17, 27.43±0.12, 28.22±0.10, 28.12±0.14, 27.14±0.21, 25.22±0.15

Performance
SA: ±0.11, 70.22±0.29, 71.43±0.14, 72.79±0.24, 73.50±0.09, 73.98±0.35, 75.31±0.33
RA: 32.12±0.18, 33.35±0.14, 32.72±0.25, 31.73±0.10, 29.97±0.14, 29.39±0.15, 27.67±0.07



