EFFICIENT METHOD FOR BI-LEVEL OPTIMIZATION WITH NON-SMOOTH LOWER-LEVEL PROBLEM

Abstract

Bi-level optimization plays a key role in many machine learning applications. Existing state-of-the-art bi-level optimization methods are limited to smooth or certain specific non-smooth lower-level problems. Achieving an efficient algorithm for bi-level problems with a generalized non-smooth lower-level objective is therefore still an open problem. To address this problem, in this paper, we propose a new bi-level optimization algorithm based on smoothing and penalty techniques. Using the theory of the generalized directional derivative, we derive new optimality conditions for the bi-level optimization problem with a non-smooth, possibly non-Lipschitz lower-level problem, and prove that our method converges to points satisfying these conditions. We also compare our method with existing state-of-the-art bi-level optimization methods and demonstrate that our method is superior to the others in terms of accuracy and efficiency.

1. INTRODUCTION

Bi-level optimization (BO) (Bard, 2013; Colson et al., 2007) plays a central role in various machine learning applications, including hyper-parameter optimization (Pedregosa, 2016; Bergstra et al., 2011; Bertsekas, 1976), meta-learning (Feurer et al., 2015; Franceschi et al., 2018; Rajeswaran et al., 2019), and reinforcement learning (Hong et al., 2020; Konda & Tsitsiklis, 2000). It involves a competition between two parties or two objectives: if one party makes its choice first, that choice affects the optimal choice for the other party. Several approaches, such as Bayesian optimization (Klein et al., 2017), random search (Bergstra & Bengio, 2012), evolution strategies (Sinha et al., 2017), and gradient-based methods (Pedregosa, 2016; Maclaurin et al., 2015; Swersky et al., 2014), have been proposed to solve BO problems, among which gradient-based methods have become the mainstream for large-scale BO problems. The key idea of gradient-based methods is to approximate the gradient with respect to the upper-level variables, called the hypergradient. For example, implicit differentiation methods (Pedregosa, 2016; Rajeswaran et al., 2019) set the first derivative of the lower-level problem to zero and differentiate this condition to derive the hypergradient. Explicit differentiation methods differentiate through the update rules of the lower-level problem via the chain rule (Maclaurin et al., 2015; Domke, 2012; Franceschi et al., 2017; Swersky et al., 2014) to approximate the hypergradient. Mehra & Hamm (2019) reformulate the bi-level problem as a single-level constrained problem by replacing the lower-level problem with its first-order necessary conditions, and then solve the new problem using the penalty method. Obviously, all these methods need the lower-level problem to be smooth.
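To make the implicit differentiation idea concrete, the following minimal sketch computes the hypergradient of a toy smooth bilevel problem where the lower-level solution is available in closed form. All names here (g, f, c, w_tgt) are illustrative assumptions, not the paper's notation: the lower level is g(w, λ) = ½‖w − c‖² + ½λ‖w‖², the upper level is f(w) = ½‖w − w_tgt‖², and the hypergradient df/dλ is obtained by differentiating the first-order optimality condition of g.

```python
import numpy as np

# Toy smooth bilevel problem (illustrative): lower level
#   g(w, lam) = 0.5*||w - c||^2 + 0.5*lam*||w||^2  =>  w*(lam) = c / (1 + lam),
# upper level f(w) = 0.5*||w - w_tgt||^2.
c = np.array([2.0, -1.0])
w_tgt = np.array([1.0, 0.0])

def w_star(lam):
    # closed-form lower-level minimizer
    return c / (1.0 + lam)

def hypergradient(lam):
    # implicit differentiation of the optimality condition (1 + lam) w = c gives
    #   dw*/dlam = -(d^2 g / dw^2)^{-1} (d^2 g / dw dlam) = -w* / (1 + lam)
    w = w_star(lam)
    dw = -w / (1.0 + lam)
    # chain rule: df/dlam = grad_w f(w*)^T dw*/dlam
    return (w - w_tgt) @ dw

# finite-difference check of the implicit-differentiation hypergradient
lam0, eps = 0.5, 1e-6
fd = (0.5 * np.sum((w_star(lam0 + eps) - w_tgt) ** 2)
      - 0.5 * np.sum((w_star(lam0 - eps) - w_tgt) ** 2)) / (2 * eps)
```

For smooth lower levels this recipe is exact; the difficulty addressed in this paper is precisely that it breaks down when the lower-level objective is non-smooth or non-Lipschitz.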
However, in many real-world applications, such as image restoration (Chen et al.; Nikolova et al., 2008), variable selection (Fan & Li, 2001; Huang et al., 2008; Zhang et al., 2010) and signal processing (Bruckstein et al., 2009), the objective may contain a complicated non-smooth, possibly non-Lipschitz term (Bian & Chen, 2017). Traditional methods cannot be directly used to solve bi-level problems with such a lower-level objective. To solve BO problems with certain specific non-smooth lower-level problems, researchers have proposed several algorithms based on the above-mentioned methods. Specifically, Bertrand et al. (2020) searched the regularization parameters of LASSO-type problems by approximating the hypergradient from the soft-thresholding function (Donoho, 1995; Bredies & Lorenz, 2008; Beck & Teboulle, 2009). Frecon et al. (2018) proposed a primal-dual FMD-based method, called FBBGLasso, to search the group structures of group-LASSO problems. Okuno et al. (2021) used a smoothing method together with constrained optimization to search the regularization parameter of the ℓ_q-norm (0 < q ≤ 1) and provided a convergence analysis of their method. We summarize several representative methods in Table 1. Obviously, all these methods and their theoretical analyses focus on specific problems and cannot be used to solve bi-level problems with a generalized non-smooth lower-level objective. Therefore, how to solve the BO problem with a generalized non-smooth lower-level objective, and how to establish its convergence analysis, are still open problems. To address this problem, in this paper, we propose a new algorithm, called SPNBO, based on smoothing (Nesterov, 2005; Chen et al., 2013) and penalty (Wright & Nocedal, 1999) techniques. Specifically, we use the smoothing technique to approximate the original non-Lipschitz lower-level problem and generate a sequence of smoothed bi-level problems.
Then, a single-level constrained problem is obtained by replacing the smoothed lower-level objective with its first-order necessary condition. For each given smoothing parameter, we propose a stochastic constraint optimization method to solve the single-level constrained problem, which avoids calculating the Hessian matrix of the lower-level problem. Theoretically, using the theory of the generalized directional derivative, we derive new optimality conditions for the bi-level optimization problem with a non-smooth, possibly non-Lipschitz lower-level problem, and prove that our method converges to points satisfying these conditions. We also compare our method with several state-of-the-art bi-level optimization methods, and the experimental results demonstrate that our method is superior to the others in terms of accuracy and efficiency.

Contributions. We summarize the main contributions of this paper as follows:
1. We propose a new method to solve the non-Lipschitz bi-level optimization problem based on the penalty method and the smoothing method. By using the stochastic constraint method, our method avoids calculating the Hessian matrix of the lower-level problem, which gives our method a lower time complexity.
2. Based on the Clarke generalized directional derivative, we propose new optimality conditions for the bi-level problem with a generalized non-smooth lower-level objective. We prove that our method converges to points satisfying these conditions.

2. PRELIMINARIES

2.1. FORMULATION OF NON-SMOOTH BI-LEVEL OPTIMIZATION PROBLEM

In this paper, we consider the following non-smooth bi-level optimization problem:

min_λ f(w*, λ)  s.t.  w* ∈ arg min_w g(w, λ) + exp(λ_1) φ(h(w)),   (1)

where λ := [λ_1, λ_2, ..., λ_m]^T ∈ R^m, λ̃ := [λ_2, ..., λ_m]^T, and w ∈ R^d. f : R^d × R^m → R is the upper-level objective, and h(w) := (h_1(D_1^T w), h_2(D_2^T w), ..., h_n(D_n^T w)), where D_i ∈ R^{d×r} and h_i : R^r → R (i = 1, 2, ..., n) is continuous. For a fixed point w, define the index set I_w = {i ∈ {1, 2, ..., n} : h_i is not Lipschitz continuous at D_i^T w}; for i ∉ I_w, h_i is twice continuously differentiable.

2.2. EXAMPLES OF NON-SMOOTH NON-LIPSCHITZ LOWER-LEVEL PROBLEMS

Non-smooth, non-Lipschitz optimization problems widely exist in image restoration (Chen et al.; Nikolova et al., 2008), variable selection (Fan & Li, 2001; Huang et al., 2008; Zhang et al., 2010) and signal processing (Bruckstein et al., 2009). Here, we give two examples.

1. ℓ_p-norm (Chen et al., 2013): min_w g(w, λ) + exp(λ_1) Σ_{i=1}^d |w_i|^p, where p ∈ (0, 1].
2. OSCAR penalty (Bondell & Reich, 2008): min_w g(w, λ) + exp(λ_1) ‖w‖_1 + exp(λ_2) Σ_{i<j} max{|w_{G_i}|, |w_{G_j}|}, where G_i denotes the group index.

Note that Okuno et al. (2021) only considered the bi-level problem with the lower-level objective given in Example 1. Their theoretical analysis is not suitable for the problem in Example 2 or for even more complicated formulations.
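The two penalties can be written down directly. The following minimal sketch evaluates both; the function names are illustrative, and for OSCAR we use the standard pairwise form max{|w_i|, |w_j|} over all coordinate pairs (an assumption, since the group structure is application-specific).

```python
import numpy as np

def lp_penalty(w, lam1, p=0.5):
    """l_p penalty exp(lam1) * sum_i |w_i|^p; non-Lipschitz at w_i = 0 when p < 1."""
    return np.exp(lam1) * np.sum(np.abs(w) ** p)

def oscar_penalty(w, lam1, lam2):
    """OSCAR penalty: exp(lam1)*||w||_1 + exp(lam2)*sum_{i<j} max(|w_i|, |w_j|)."""
    a = np.abs(w)
    pair_max = sum(max(a[i], a[j])
                   for i in range(len(w)) for j in range(i + 1, len(w)))
    return np.exp(lam1) * np.sum(a) + np.exp(lam2) * pair_max
```

For p < 1 the ℓ_p term has unbounded (one-sided) slope at zero, which is exactly why standard hypergradient methods, which assume a smooth lower level, do not apply here.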

3. PROPOSED METHOD

In this section, we give a brief review of the smoothing method and then propose our stochastic gradient algorithm, based on the penalty method and single-level reduction, to solve the bi-level problem.

3.1. SMOOTHING TECHNIQUE

Here, we give the definition of a smoothing function (Nesterov, 2005; Chen et al., 2013; Bian & Chen, 2017), which is widely used for non-smooth, non-Lipschitz problems.

Definition 1. Let ψ : R^d → R be a continuous non-smooth, non-Lipschitz function. We call ψ̃ : R^d × [0, +∞) → R a smoothing function of ψ if ψ̃(·, µ) is twice continuously differentiable for any fixed µ > 0 and lim_{ŵ→w, µ↓0} ψ̃(ŵ, µ) = ψ(w) holds for any w ∈ R^d.

Here, we give two examples of smoothing functions. The smoothing function of ψ_1(w) = Σ_{i=1}^d |w_i| is ψ̃_1(w, µ) = Σ_{i=1}^d (w_i^2 + µ^2)^{1/2}, and the smoothing function of ψ_2(w) = Σ_{i<j} max{|w_i|, |w_j|} is ψ̃_2(w, µ) = Σ_{i<j} (1/2)(((w_i + w_j)^2 + µ^2)^{1/2} + ((w_i − w_j)^2 + µ^2)^{1/2}).

According to Definition 1, the non-smooth lower-level problem in problem (1) can be approximated by a sequence of the following parameterized smoothed problems,

w* = arg min_w g(w, λ) + exp(λ_1) φ(h̃(w, µ_k)),   (2)

where µ_k > 0 is the smoothing parameter and h̃(w, µ_k) := (h̃_1(D_1^T w, µ_k), h̃_2(D_2^T w, µ_k), ..., h̃_n(D_n^T w, µ_k)). For each given smoothing parameter µ_k > 0, we can replace the smoothed lower-level objective with its first-order necessary condition and derive the following single-level problem:

min_{w,λ} f(w, λ)  s.t.  c(w, λ; µ_k) = 0,   (3)

where c(w, λ; µ_k) := ∇_w g(w, λ) + exp(λ_1) ∇_w φ(h̃(w, µ_k)) and ∇_w φ(h̃(w, µ_k)) = ∇_w h̃(w, µ_k) φ′(z)|_{z = h̃(w, µ_k)}.
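Definition 1 can be checked numerically. The sketch below (function names are illustrative) implements the smoothing function ψ̃_1 of the ℓ_1 penalty and its gradient: for any µ > 0 the smoothed function is differentiable everywhere, and as µ → 0 it approaches ψ_1.

```python
import numpy as np

def l1(w):
    """Non-smooth penalty psi_1(w) = sum_i |w_i|."""
    return np.sum(np.abs(w))

def l1_smooth(w, mu):
    """Smoothing function psi~_1(w, mu) = sum_i sqrt(w_i^2 + mu^2)."""
    return np.sum(np.sqrt(w ** 2 + mu ** 2))

def l1_smooth_grad(w, mu):
    """Gradient of psi~_1(., mu); well defined at w = 0 for any mu > 0,
    unlike the subdifferential of the original |.|."""
    return w / np.sqrt(w ** 2 + mu ** 2)
```

Driving µ toward zero along the outer iterations (µ_{k+1} = δ_µ µ_k in Algorithm 1) recovers the original non-smooth penalty in the limit, which is exactly the role of the sequence of smoothed problems above.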

3.2. STOCHASTIC CONSTRAINT GRADIENT METHOD

In this subsection, we discuss our method for solving subproblem (3). Obviously, one could solve the single-level constrained problem by applying a gradient method to its penalty function. However, calculating the gradient of the penalty function requires the Hessian matrix of the lower-level objective; when the dimension of w is large, forming this Hessian is very time-consuming. To address this issue, we introduce a stochastic layer into the constraint so that we only need the gradient of a sampled element of the constraint.

Algorithm 1 Smoothing and Penalty Method for Non-Lipschitz Bi-level Optimization (SPNBO)
Input: K, µ_1, β_1, ϵ_1, δ_µ, δ_ϵ ∈ (0, 1).
Output: w_{K+1} and λ_{K+1}.
1: for k = 1, ..., K do
2:   Find (w_{k+1}, λ_{k+1}, p_{k+1}) := min_{w,λ} max_{p∈∆_d} L(w, λ, p, µ_k) using the SCG method, starting from (w_k, λ_k, p_k).
3:   µ_{k+1} = δ_µ µ_k.
4:   ϵ_{k+1} = δ_ϵ ϵ_k.
5: end for

Specifically, we reformulate subproblem (3) as the following minimax problem,

min_{w,λ} max_{p∈∆_d} L(w, λ, p, µ_k) = f(w, λ) + β Σ_{i=1}^d p_i c_i^2(w, λ; µ_k) − (τ/2)‖p‖_2^2,   (4)

where β > 0, τ > 0, ∆_d := {p | Σ_{i=1}^d p_i = 1, 0 ≤ p_i ≤ 1}, and c_i(w, λ; µ_k) denotes the i-th element of c(w, λ; µ_k). The last term ensures that L is strongly concave in p. Such a reformulation is widely used in many methods (Cotter et al., 2016; Narasimhan et al., 2020; Shi et al., 2022) for solving constrained problems. In each iteration, we sample an element w_i of w according to the distribution p and calculate the corresponding value of c_i and its gradient w.r.t. w. Then, we obtain the stochastic gradient of L w.r.t. w as follows,

∇̃_w L(w_t, λ_t, p_t, µ_k; ξ_t) = ∇_w f(w_t, λ_t) + 2β c_i(w_t, λ_t; µ_k) ∇_w c_i(w_t, λ_t; µ_k).   (5)

Using the same method, we can obtain the stochastic gradient ∇̃_λ L(w_t, λ_t, p_t, µ_k; ξ_t).
Similarly, sampling another element w_j, we can calculate the value of c_j and obtain the stochastic gradient of L w.r.t. p as follows,

∇̃_p L(w_t, λ_t, p_t, µ_k; ξ_t) = d e_j (c_j^2(w_t, λ_t; µ_k) − τ p_j),   (6)

where e_j denotes the vector whose j-th element is 1 and all other elements are 0. Since p is related to the values of the constraints, sampling constraints according to p helps us find the most violated ones.

To achieve better performance, a momentum-based variance-reduction method and an adaptive method are also used. Specifically, we calculate the momentum-based gradient estimates w.r.t. w, λ and p as follows,

m_{t,1} = ∇̃_w L(w_t, λ_t, p_t, µ_k; ξ_t) + (1 − a_{t+1,1})[m_{t−1,1} − ∇̃_w L(w_{t−1}, λ_{t−1}, p_{t−1}, µ_k; ξ_t)],
m_{t,2} = ∇̃_λ L(w_t, λ_t, p_t, µ_k; ξ_t) + (1 − a_{t+1,1})[m_{t−1,2} − ∇̃_λ L(w_{t−1}, λ_{t−1}, p_{t−1}, µ_k; ξ_t)],
m_{t,3} = ∇̃_p L(w_t, λ_t, p_t, µ_k; ξ_t) + (1 − a_{t+1,2})[m_{t−1,3} − ∇̃_p L(w_{t−1}, λ_{t−1}, p_{t−1}, µ_k; ξ_t)].

Then, we calculate the adaptive matrices A_{t,1}, A_{t,2} and A_{t,3} for updating w, λ and p, respectively. Here, we present the calculation of A_{t,1} as an example. Specifically, we calculate a second momentum-based estimate m̂_{t,1} = â ∇̃_w L(w_t, λ_t, p_t, µ_k; ξ_t)^2 + (1 − â) m̂_{t−1,1}, and obtain A_{t,1} by clipping each element of m̂_{t,1} into the range [ρ, b], i.e., A_{t,1} = diag(clip(m̂_{t,1}, ρ, b)). Then, we can obtain the update rules as follows,

w̃_{t+1} = w_t − γ A_{t,1}^{-1} m_{t,1},  w_{t+1} = w_t + η_t(w̃_{t+1} − w_t),   (7)
λ̃_{t+1} = λ_t − γ A_{t,2}^{-1} m_{t,2},  λ_{t+1} = λ_t + η_t(λ̃_{t+1} − λ_t),   (8)
p̃_{t+1} = P_∆(p_t + σ A_{t,3}^{-1} m_{t,3}),  p_{t+1} = p_t + η_t(p̃_{t+1} − p_t),   (9)

where γ > 0, σ > 0, η_t > 0 and P_∆(·) denotes the projection onto ∆_d.

Algorithm 2 Stochastic Constraint Gradient (SCG)
Input: γ, σ, η_t, a_{t+1,1}, a_{t+1,2}.
Output: w and λ.
1: Initialize m_{1,1}, m_{1,2}, m_{1,3}, η_1.
2: while the conditions (10) are not satisfied do
3:   w̃_{t+1} = w_t − γ A_{t,1}^{-1} m_{t,1}.
4:   w_{t+1} = w_t + η_t(w̃_{t+1} − w_t).
5:   λ̃_{t+1} = λ_t − γ A_{t,2}^{-1} m_{t,2}.
6:   λ_{t+1} = λ_t + η_t(λ̃_{t+1} − λ_t).
7:   p̃_{t+1} = P_∆(p_t + σ A_{t,3}^{-1} m_{t,3}).
8:   p_{t+1} = p_t + η_t(p̃_{t+1} − p_t).
9: end while

Once the following conditions are satisfied,

‖∇_w L(w, λ, p, µ_k)‖_2^2 ≤ ϵ_k^2,  ‖∇_λ L(w, λ, p, µ_k)‖_2^2 ≤ ϵ_k^2,  ‖c(w, λ; µ_k)‖_2^2 ≤ ϵ_k^2,   (10)

where ∇_w L and ∇_λ L denote the full gradients of L w.r.t. w and λ, we enlarge the penalty parameter β and decrease the smoothing parameter µ_k. The whole algorithm is presented in Algorithms 1 and 2. Note that instead of checking the conditions in every iteration of SCG, we check them after every several iterations to save time.
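Two building blocks of Algorithm 2 can be sketched compactly: the Euclidean projection P_∆ onto the probability simplex (here via the standard sorting method, an assumed implementation choice) and the momentum-based variance-reduced gradient estimate of the STORM type used for m_{t,1}, m_{t,2} and m_{t,3}.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    {p : sum_i p_i = 1, p_i >= 0}, via the sorting method."""
    u = np.sort(v)[::-1]                      # sort entries in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - (css - 1.0) / k > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)    # shift that makes the sum equal 1
    return np.maximum(v - theta, 0.0)

def momentum_estimate(grad_t, grad_prev, m_prev, a):
    """STORM-style estimate m_t = g_t + (1 - a)(m_{t-1} - g_{t-1}),
    where g_t and g_{t-1} use the same stochastic sample xi_t."""
    return grad_t + (1.0 - a) * (m_prev - grad_prev)
```

With a = 1 the estimator reduces to the plain stochastic gradient; smaller a reuses past information to reduce variance, which is what enables the improved convergence rate in Theorem 1.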

4. THEORETICAL ANALYSIS

In this section, we discuss the convergence of our proposed method (detailed proofs are given in the appendix). We first state several assumptions that are widely used in convergence analysis.

Assumption 1. The stochastic gradients are unbiased with bounded variance: E[∇̃_z L(z_t, p_t, µ_k)] = ∇_z L(z_t, p_t, µ_k), E[∇̃_p L(z_t, p_t, µ_k)] = ∇_p L(z_t, p_t, µ_k), E[‖∇̃_z L(z_t, p_t, µ_k) − ∇_z L(z_t, p_t, µ_k)‖^2] ≤ V_z^2, and E[‖∇̃_p L(z_t, p_t, µ_k) − ∇_p L(z_t, p_t, µ_k)‖^2] ≤ V_p^2.

Assumption 2. The function L(w, λ, p, µ_k) is τ-strongly concave in p for any given µ_k > 0.

Assumption 3. The function L(w, λ, p, µ_k) has an L_1-Lipschitz gradient w.r.t. (w, λ, p) for any given µ_k > 0.

Assumption 4. The smoothing function h̃(w, µ_k) is twice continuously differentiable in w for any µ_k > 0.

Assumption 5. f is Lipschitz continuous, and g is twice Lipschitz continuously differentiable w.r.t. w and λ.

We prove that Algorithm 2 converges to points satisfying the conditions (10). We first give the definitions of ϵ-stationarity for the constrained problem (3) and the minimax problem (4), and then show the relations between these definitions.

Definition 2 (ϵ-stationary point of the constrained optimization problem). (w*, λ*, α*) is said to be an ϵ-stationary point of subproblem (3) if the following conditions hold: ‖∇_w f(w*, λ*) + Σ_{i=1}^d α_i* ∇_w c_i(w*, λ*; µ_k)‖_2^2 ≤ ϵ_1^2, ‖∇_λ f(w*, λ*) + Σ_{i=1}^d α_i* ∇_λ c_i(w*, λ*; µ_k)‖_2^2 ≤ ϵ_2^2, and Σ_{i=1}^d c_i^2(w*, λ*; µ_k) ≤ ϵ_3^2, where α* denotes the Lagrangian multipliers.

Remark 1. Let α_j* = 2β p_j* c_j(w*, λ*; µ_k). Then the conditions in Definition 2 are equivalent to the tolerance conditions (10).

Definition 3 (ϵ-stationary point of the minimax problem). (w*, λ*, p*) is said to be an ϵ-stationary point of the minimax problem (4) if it satisfies ‖∇_w L‖_2^2 ≤ ϵ^2, ‖∇_λ L‖_2^2 ≤ ϵ^2 and ‖∇_p L‖_2^2 ≤ ϵ^2.

Proposition 1. Suppose Assumptions 2 and 3 hold. If (w*, λ*, p*) is an ϵ-stationary point of problem (4), then (w*, λ*) is an ϵ-stationary point of the constrained problem (3).

According to Shi et al. (2022), the minimax problem (4) is equivalent to the following minimization problem:

min_{w,λ} H(w, λ) := max_{p∈∆_d} L(w, λ, p, µ_k) = L(w, λ, p*(w, λ), µ_k),   (11)

where p*(w, λ) = arg max_p L(w, λ, p, µ_k). We now define a stationary point of the minimization problem (11) and give its relationship with Definition 3.

Definition 4. (w*, λ*) is an ϵ-stationary point of the differentiable function H(w, λ) if ‖∇_w H(w*, λ*)‖_2 ≤ ϵ and ‖∇_λ H(w*, λ*)‖_2 ≤ ϵ.

Proposition 2. Under Assumptions 2 and 3, if (w′, λ′) is an ϵ-stationary point of H(w, λ), then an ϵ-stationary point (w′, λ′, p′) of min_{w,λ} max_{p∈∆_d} L(w, λ, p, µ_k) can be obtained. Conversely, if (w′, λ′, p′) is an ϵ-stationary point of min_{w,λ} max_{p∈∆_d} L(w, λ, p, µ_k), then (w′, λ′) is an ϵ-stationary point of H(w, λ).

Remark 2. According to Propositions 1 and 2, once we find an ϵ-stationary point in the sense of Definition 4, we can obtain an ϵ-stationary point in the sense of Definition 2. Therefore, we can obtain points satisfying the tolerance conditions (10).

Before giving the convergence result of our method, we present a lemma that is useful in our analysis.

Lemma 1. Under the above assumptions, let z = [w; λ]. We have

‖∇H(z_t) − m_{t,z}‖_2^2 ≤ 2L_1^2 ‖p*(z_t) − p_t‖_2^2 + 2‖∇_z L(z_t, p_t, µ_k) − m_{t,z}‖_2^2.   (12)

Then, we can define the following metric,

M_t = (b^2/γ^2)‖z̃_{t+1} − z_t‖_2^2 + 2L_1^2 ‖p*(z_t) − p_t‖_2^2 + 2‖∇_z L(z_t, p_t, µ_k) − m_{t,z}‖_2^2.   (13)

We have M_t ≥ (b^2/γ^2)‖z_t − γ A_{t,z}^{-1} m_{t,z} − z_t‖_2^2 + ‖∇H(z_t) − m_{t,z}‖_2^2 ≥ ‖m_{t,z}‖_2^2 + ‖∇H(z_t) − m_{t,z}‖_2^2 ≥ (1/2)‖∇H(z_t)‖_2^2. Hence, if M_t → 0, then ‖∇H(z_t)‖_2^2 → 0, so we can bound M_t to find a stationary point of problem (11). We give the convergence result in the following theorem.

Theorem 1.
Suppose Assumptions 1–5 hold. If a_{t+1,1} = c_1 η_t^2, a_{t+1,2} = c_2 η_t^2, c_1 > 5/2 + 2/(3e^3 η_t), 0 < γ ≤ √3 τσρ^2 / (12L_1^2 σ^2 κ^2 + 125L_1^2 κ^2 b^2), and 0 < σ ≤ min{15/(12µ), 1/(6L_1)}, then

(γ/(4ρ)) (1/T) Σ_{t=1}^T M_t ≤ (Θ_1 − Θ*)(m + T)^{1/3}/(Te) + γ(c_1^2 V_z^2 + c_2^2 V_p^2)(m + T)^{1/3} ln(m + T)/(ρT).   (14)

Remark 3. Theorem 1 demonstrates that, with suitable parameter settings, our method converges to points satisfying the conditions (10) at a rate of Õ(T^{−2/3}), omitting logarithmic factors.

Then, we discuss the convergence of the whole algorithm. Define a new function

h_i^w(D_i^T u) := { h_i(D_i^T u), i ∉ I_w;  h_i(D_i^T w), i ∈ I_w },

which is Lipschitz continuous at D_i^T w for i = 1, 2, ..., n. Then h^w(u) := (h_1^w(D_1^T u), h_2^w(D_2^T u), ..., h_n^w(D_n^T u)) takes the same value as h at w but, unlike h, is Lipschitz there. For convenience, we define ϕ^w(u) = φ(h^w(u)) and ϕ(u) = φ(h(u)). Besides, we define the vector set V_w = {v : D_i^T v = 0, i ∈ I_w}, which means that v is perpendicular to all column vectors of D_i for i ∈ I_w.

According to Bian & Chen (2017), the necessary condition of the non-Lipschitz lower-level problem is

∇_w g(w*, λ)^T v + exp(λ_1) ϕ°(w*; v) ≥ 0  for all v ∈ V_{w*},

where ϕ°(w*; v) = limsup_{w→w*, t↓0} (ϕ(w + tv) − ϕ(w))/t denotes the Clarke generalized directional derivative of φ(h(w)) at w* in the direction v. Replacing the lower-level problem with the above condition, we obtain the following single-level problem,

min_{w,λ} f(w, λ)  s.t.  c(w, λ) := ∇_w g(w, λ)^T v + exp(λ_1) ϕ°(w; v) ≥ 0  for all v ∈ V_w.   (18)

For this new problem, we have the following theorem.

Theorem 2. If (w*, λ*, ξ*) satisfy the following conditions, then (w*, λ*) is a stationary point of problem (18):

∇_w f(w*, λ*)^T v_2 − (v_2^T ∇^2_{ww} g(w*, λ*) v_1 + exp(λ_1*) ϕ°°(w*; v_1, v_2)) ξ* ≥ 0,   (19)
∇_λ f(w*, λ*)^T v_3 − (v_3^T ∇^2_{λw} g(w*, λ*) v_1 + v_3^1 exp(λ_1*) ϕ°(w*; v_1)) ξ* ≥ 0,   (20)
∇_w g(w*, λ*)^T v_1 + exp(λ_1*) ϕ°(w*; v_1) ≥ 0,   (21)
ξ* (∇_w g(w*, λ*)^T v_1 + exp(λ_1*) ϕ°(w*; v_1)) = 0,   (22)
ξ* ≥ 0,   (23)

for all v_1 ∈ V_{w*}, v_2 ∈ R^d, v_3 ∈ R^m, where v_3 = [v_3^1, ṽ_3^T]^T with ṽ_3 = [v_3^2, ..., v_3^m]^T, and ϕ°°(w*; v_1, v_2) = limsup_{w→w*, s↓0} (ϕ°(w + s v_2; v_1) − ϕ°(w; v_1))/s. Here, v_1, v_2 and v_3 are the direction vectors used in calculating the Clarke directional derivatives.

Then, we show that, as the smoothing and tolerance parameters decrease, our method converges to stationary points as defined in Theorem 2.

Theorem 3. Suppose {ϵ_k}_{k=1}^∞ and {µ_k}_{k=1}^∞ are positive convergent sequences with lim_{k→∞} ϵ_k = 0 and lim_{k→∞} µ_k = 0. Then any limit point of the sequence generated by SPNBO satisfies the conditions (19)–(23).

Finally, we show the relation between the conditions (19)–(23) and the original non-smooth bi-level problem (1).

Theorem 4. Assume the lower-level problem in problem (1) is strongly convex. If (w*, λ*) and ξ* ≥ 0 satisfy the conditions (19)–(23), then (w*, λ*) is a stationary point of the original non-smooth bi-level problem.

5. EXPERIMENTS

In this section, we conduct experiments to demonstrate the superiority of our method in terms of accuracy and efficiency.

5.1. EXPERIMENTAL SETUP

We summarize the baseline methods used in our experiments as follows.

1. Penalty. The method proposed in Mehra & Hamm (2019). It reformulates the bi-level optimization problem as a single-level problem and then uses the penalty method to solve the new problem.
2. Approx. The method proposed in Pedregosa (2016). It solves an additional linear system to approximate the hypergradient used to update the hyper-parameters.
3. RMD. The reverse-mode method proposed in Franceschi et al. (2017). An additional loop is used to approximate the hypergradient.
4. SMNBP. The method proposed in Okuno et al. (2021). It uses the smoothing method to produce a sequence of smoothed lower-level functions, replaces each with its necessary condition, and then uses the penalty method to solve each single-level problem.

We implement SMNBP, Penalty, Approx, RMD, and our method in Python. Since the original Penalty, Approx and RMD are designed for smooth problems, we use the smoothing function to approximate the lower-level problem and fix the smoothing parameter at µ = 0.0001 in these methods. In SMNBP, for each given smoothing parameter, we solve the constrained problem using the Penalty method. For all methods, we use ADAM to update w and λ and choose the initial step size from {0.1, 0.01, 0.001}. For our method, we set â = 0.9, and the other parameters are set according to Theorem 1. We choose the penalty parameter from {0.1, 1, 10, 100} for our method, SMNBP, and Penalty. We fix the inner iteration number T in Penalty, Approx, and RMD at 10, following Mehra & Hamm (2019). The datasets used in our experiments are summarized in Table 2; we divide each dataset into three parts, i.e., 40% for training, 40% for validation, and 20% for testing. All experiments are repeated 10 times on a PC with four 1080 Ti GPUs. The results are presented in Tables 3 and 4 and Figures 1 and 2. From Tables 3 and 4, we find that our method achieves results similar to those of the other methods.
From Figure 1 and Figure 2 , we can find that our method is faster than other methods in most cases. This is because Approx and RMD need to solve the lower-level objective first and then need an additional loop to approximate the hypergradient which makes these methods have higher time complexity. Penalty and SMNBP need to use all the constraints in each updating step which is also time-consuming, when we use complex models (e.g., DNNs), Penalty and SMNBP suffer from high time complexity. However, our method uses the stochastic gradient method which makes it scalable to complicated models and does not need any intermediate steps to approximate the hypergradient. From all these results, we can conclude that our SPNBO is superior to other methods in terms of accuracy and efficiency.

6. CONCLUSION

In this paper, we proposed a new method, SPNBO, to solve generalized non-smooth, non-Lipschitz bi-level optimization problems using the smoothing method and the penalty method. We also gave a convergence analysis of the proposed method. The experimental results demonstrate the superiority of our method in terms of training time and accuracy.



In the construction of the adaptive matrices, we clip each element of m̂_{t,1} into the range [ρ, b] and obtain the adaptive matrix A_{t,1} = diag(clip(m̂_{t,1}, ρ, b)). Note that other methods can be used to calculate the adaptive matrices, such as AdaGrad-Norm (Ward et al., 2020), AMSGrad (Reddi et al., 2019), and Adam+ (Liu et al., 2020).
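The clipped adaptive matrix is a one-liner; the sketch below (function names are illustrative) shows the second-moment update m̂_{t,1} = â g² + (1 − â) m̂_{t−1,1} and the resulting A_{t,1}. Clipping below by ρ > 0 keeps A_{t,1} invertible, and clipping above by b bounds it, which is what the factor b²/γ² in the metric M_t relies on.

```python
import numpy as np

def second_moment(grad, m_hat_prev, a_hat=0.9):
    """Second momentum-based estimate: a_hat * grad^2 + (1 - a_hat) * m_hat_prev."""
    return a_hat * grad ** 2 + (1.0 - a_hat) * m_hat_prev

def adaptive_matrix(m_hat, rho, b):
    """A_t = diag(clip(m_hat, rho, b)); rho > 0 ensures invertibility, b bounds A_t."""
    return np.diag(np.clip(m_hat, rho, b))
```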

Figure 1: Test MSE against training time of all the methods in data re-weight.

Figure 2: Validation loss versus training time of all the methods in training data poisoning.

Table 1: Representative gradient-based bi-level optimization methods.


Table 2: Datasets used in the experiments.

Training data poisoning: In this experiment, we evaluate the performance of all the methods in training data poisoning. Assume we have pure training data {x_i}_{i=1}^{N_tr} with several poisoned points

Table 3: The test MSE of all the methods in data re-weighting (lower is better), on MNIST, FashionMNIST, SVHN, and one further dataset.

Table 4: Test accuracy (%) of all the methods in training data poisoning (lower is better), on Cifar10, FashionMNIST, MNIST, and one further dataset.

5.3. RESULTS AND DISCUSSION

