EFFICIENT METHOD FOR BI-LEVEL OPTIMIZATION WITH NON-SMOOTH LOWER-LEVEL PROBLEM

Abstract

Bi-level optimization plays a key role in many machine learning applications. Existing state-of-the-art bi-level optimization methods are limited to smooth or to some specific non-smooth lower-level problems. Therefore, designing an efficient algorithm for bi-level problems with a generalized non-smooth lower-level objective is still an open problem. To address this problem, in this paper we propose a new bi-level optimization algorithm based on smoothing and penalty techniques. Using the theory of the generalized directional derivative, we derive new optimality conditions for the bi-level optimization problem with a non-smooth, perhaps non-Lipschitz lower-level problem, and prove that our method converges to points satisfying these conditions. We also compare our method with existing state-of-the-art bi-level optimization methods and demonstrate that our method is superior to the others in terms of accuracy and efficiency.

1. INTRODUCTION

Bi-level optimization (BO) (Bard, 2013; Colson et al., 2007) plays a central role in various machine learning applications, including hyper-parameter optimization (Pedregosa, 2016; Bergstra et al., 2011; Bertsekas, 1976), meta-learning (Feurer et al., 2015; Franceschi et al., 2018; Rajeswaran et al., 2019), and reinforcement learning (Hong et al., 2020; Konda & Tsitsiklis, 2000). It involves a competition between two parties or two objectives: if one party makes its choice first, that choice affects the optimal choice of the other party. Several approaches, such as Bayesian optimization (Klein et al., 2017), random search (Bergstra & Bengio, 2012), evolution strategies (Sinha et al., 2017), and gradient-based methods (Pedregosa, 2016; Maclaurin et al., 2015; Swersky et al., 2014), have been proposed to solve BO problems, among which gradient-based methods have become the mainstream for large-scale BO problems. The key idea of gradient-based methods is to approximate the gradient with respect to the upper-level variables, called the hypergradient. For example, implicit differentiation methods (Pedregosa, 2016; Rajeswaran et al., 2019) derive the hypergradient from the condition that the first derivative of the lower-level problem equals zero. Explicit differentiation methods approximate the hypergradient by applying the chain rule to the update rules of the lower-level problem (Maclaurin et al., 2015; Domke, 2012; Franceschi et al., 2017; Swersky et al., 2014). Mehra & Hamm (2019) reformulate the bi-level problem as a single-level constrained problem by replacing the lower-level problem with its first-order necessary conditions, and then solve the new problem using the penalty method. Obviously, all these methods require the lower-level problem to be smooth.
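To make the implicit-differentiation idea concrete, the following is a minimal sketch on a hypothetical ridge-regression example of our own (not a problem from this paper): the lower level is g(w, λ) = ½‖X_tr w − y_tr‖² + ½λ‖w‖², the upper level is the validation loss at the lower-level solution w*(λ), and differentiating the stationarity condition ∇_w g(w*, λ) = 0 gives dw*/dλ = −H⁻¹w* with H = X_trᵀX_tr + λI.

```python
import numpy as np

# Hypothetical ridge-regression sketch of implicit differentiation
# (our illustration, not the paper's algorithm). Stationarity
# grad_w g(w*, lam) = 0 yields dw*/dlam = -H^{-1} w*, so the
# hypergradient is df/dlam = grad_w f(w*)^T dw*/dlam.

def solve_lower(X_tr, y_tr, lam):
    """Closed-form solution of the smooth lower-level ridge problem."""
    d = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)

def hypergradient(X_tr, y_tr, X_val, y_val, lam):
    """df/dlam via implicit differentiation of the stationarity condition."""
    d = X_tr.shape[1]
    w = solve_lower(X_tr, y_tr, lam)
    H = X_tr.T @ X_tr + lam * np.eye(d)     # Hessian of g w.r.t. w
    grad_f = X_val.T @ (X_val @ w - y_val)  # gradient of the validation loss
    return -grad_f @ np.linalg.solve(H, w)  # -grad_f^T H^{-1} w*
```

A finite-difference check of f(λ) on random data confirms the formula; note the sketch relies on the lower level being smooth, which is exactly the assumption this paper removes.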
However, in many real-world applications, such as image restoration (Chen et al.; Nikolova et al., 2008), variable selection (Fan & Li, 2001; Huang et al., 2008; Zhang et al., 2010) and signal processing (Bruckstein et al., 2009), the objective may contain a complicated non-smooth, perhaps non-Lipschitz term (Bian & Chen, 2017). Traditional methods cannot be directly used to solve a bi-level problem with such a lower-level problem. To solve BO problems with some specific non-smooth lower-level problems, researchers have proposed several algorithms based on the above-mentioned methods. Specifically, Bertrand et al. (2020) searched the regularization parameters of LASSO-type problems by approximating the hypergradient through the soft-thresholding function (Donoho, 1995; Bredies & Lorenz, 2008; Beck & Teboulle, 2009). Frecon et al. (2018) proposed a primal-dual method, called FBBGLasso, to search for the group structure of group-LASSO problems. Okuno et al. (2021) used the smoothing method and a constrained optimization method to search for the regularization parameter of the ℓ_q-norm (0 < q ≤ 1) and provided a convergence analysis of their method. We summarize several representative methods in Table 1. Obviously, all these methods and their theoretical analyses focus on specific problems and cannot be used to solve a bi-level problem with a generalized non-smooth lower-level problem. Therefore, how to solve a BO problem with a generalized non-smooth lower-level objective, and how to establish a convergence analysis for it, are still open problems. To address this, in this paper we propose a new algorithm, called SPNBO, based on smoothing (Nesterov, 2005; Chen et al., 2013) and penalty (Wright & Nocedal, 1999) techniques. Specifically, we use the smoothing technique to approximate the original non-Lipschitz lower-level problem and generate a sequence of smoothed bi-level problems.
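Two ingredients mentioned above can be written down concretely (a sketch with our own function names, not the paper's code): the soft-thresholding operator that underlies LASSO-type hypergradient methods, and a classical smooth surrogate of the absolute value of the kind used by smoothing techniques.

```python
import numpy as np

# Standard tools sketched for illustration (names are ours):
# soft-thresholding is the proximal operator of tau*|.|, and
# sqrt(x^2 + mu^2) - mu is a classical smoothing of |x| with mu > 0.

def soft_threshold(x, tau):
    """prox of tau*|.|: shrinks x toward 0 by tau; exactly 0 on [-tau, tau]."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def smoothed_abs(x, mu):
    """Smooth approximation of |x|: twice differentiable everywhere,
    and it converges to |x| uniformly as mu -> 0."""
    return np.sqrt(x ** 2 + mu ** 2) - mu
```

Driving the smoothing parameter μ to zero yields a sequence of smooth problems approximating the non-smooth one, which is the mechanism SPNBO builds on.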
Then, a single-level constrained problem is obtained by replacing the smoothed lower-level objective with its first-order necessary condition. For each given smoothing parameter, we propose a stochastic constrained optimization method to solve the single-level constrained problem, which avoids calculating the Hessian matrix of the lower-level problem. Theoretically, using the theory of the generalized directional derivative, we derive new optimality conditions for the bi-level optimization problem with a non-smooth, perhaps non-Lipschitz lower-level problem, and prove that our method converges to points satisfying these conditions. We also compare our method with several state-of-the-art bi-level optimization methods, and the experimental results demonstrate that our method is superior to the others in terms of accuracy and efficiency.

Contributions. We summarize the main contributions of this paper as follows:

1. We propose a new method to solve the non-Lipschitz bi-level optimization problem based on the penalty method and the smoothing method. By using the stochastic constraint method, our method avoids calculating the Hessian matrix of the lower-level problem, which gives it a lower time complexity.

2. Based on the Clarke generalized directional derivative, we propose new optimality conditions for the bi-level problem with a generalized non-smooth lower-level problem. We prove that our method converges to points satisfying these conditions.
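The overall pipeline (replace the smoothed lower level by its stationarity condition, then penalize its violation) can be sketched on a toy problem; the objectives below are made up for illustration and are not the SPNBO algorithm. Take the lower level g(w, λ) = ½(w − λ)², so stationarity forces w* = λ, and the upper level f(w, λ) = (w − 1)² + λ², whose bi-level optimum is λ = 0.5. Minimizing the penalized single-level objective f + ρ(w − λ)² jointly in (w, λ) approaches that optimum as ρ grows.

```python
# Toy penalty-reformulation sketch (our example, not the paper's method).
# The lower-level stationarity condition grad_w g = w - lam = 0 becomes a
# quadratic penalty, and plain gradient descent minimizes
#   F(w, lam) = (w - 1)^2 + lam^2 + rho * (w - lam)^2.

def penalized_descent(rho, steps=20000, lr=1e-3):
    w, lam = 0.0, 0.0
    for _ in range(steps):
        r = w - lam                              # violation of grad_w g = 0
        gw = 2.0 * (w - 1.0) + 2.0 * rho * r     # dF/dw
        glam = 2.0 * lam - 2.0 * rho * r         # dF/dlam
        w -= lr * gw
        lam -= lr * glam
    return w, lam
```

For this quadratic the penalized minimizer is λ = ρ/(1 + 2ρ), which tends to the true bi-level solution 0.5 as ρ → ∞; e.g. ρ = 50 already gives λ ≈ 0.495.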

2. PRELIMINARIES

2.1 FORMULATION OF NON-SMOOTH BI-LEVEL OPTIMIZATION PROBLEM

In this paper, we consider the following non-smooth bi-level optimization problem:

$$\min_{\lambda} \; f(w^*, \lambda) \quad \text{s.t.} \quad w^* \in \arg\min_{w} \; g(w, \bar{\lambda}) + \exp(\lambda_1)\,\varphi(h(w)),$$

where $\lambda := [\lambda_1, \lambda_2, \cdots, \lambda_m]^T \in \mathbb{R}^m$, $\bar{\lambda} := [\lambda_2, \cdots, \lambda_m]^T$, and $w \in \mathbb{R}^d$. $f: \mathbb{R}^d \times \mathbb{R}^m \to \mathbb{R}$ and $g: \mathbb{R}^d \times \mathbb{R}^{m-1} \to \mathbb{R}$ are twice continuously differentiable in $w$ and $\lambda$. $\varphi: \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable. $h: \mathbb{R}^d \to \mathbb{R}^n$ is continuous, not necessarily convex, possibly non-differentiable, or even non-Lipschitz at some points. Assume $h(w) := (h_1(D_1^T w), h_2(D_2^T w), \cdots, h_n(D_n^T w))$, where $D_i \in \mathbb{R}^{d \times r}$ and each $h_i: \mathbb{R}^r \to \mathbb{R}$ ($i = 1, 2, \cdots, n$) is continuous. For a fixed point $w$, define the index set $I_w = \{i \in \{1, 2, \cdots, n\} : h_i \text{ is not Lipschitz continuous at } D_i^T w\}$, and assume that for $i \notin I_w$, $h_i$ is twice continuously differentiable.
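For intuition, the $\ell_q$-regularized problem mentioned in the introduction fits this template. The following instantiation is our own illustrative sketch (a least-squares $g$ is chosen only for concreteness):

```latex
% Lower level  g(w,\bar\lambda) + \exp(\lambda_1)\,\varphi(h(w))
% instantiated as \ell_q-regularized least squares, 0 < q \le 1:
%   g(w,\bar\lambda) = \tfrac12\|Xw - y\|_2^2,
%   D_i = e_i \ (\text{the } i\text{-th standard basis vector},\ r = 1,\ n = d),
%   h_i(D_i^T w) = |w_i|^q, \qquad \varphi(z) = \textstyle\sum_{i=1}^{n} z_i,
\min_{w}\; \tfrac12\|Xw - y\|_2^2 + e^{\lambda_1}\,\|w\|_q^q
```

For $q < 1$, $|t|^q$ is not Lipschitz at $t = 0$, so the index set becomes $I_w = \{i : w_i = 0\}$, matching the assumption above.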

Table 1: Representative gradient-based bi-level optimization methods.

