EFFICIENT METHOD FOR BI-LEVEL OPTIMIZATION WITH NON-SMOOTH LOWER-LEVEL PROBLEM

Abstract

Bi-level optimization plays a key role in many machine learning applications. Existing state-of-the-art bi-level optimization methods are limited to smooth or certain specific non-smooth lower-level problems; an efficient algorithm for bi-level problems with a general non-smooth lower-level objective therefore remains an open problem. To address this problem, in this paper we propose a new bi-level optimization algorithm based on smoothing and penalty techniques. Using the theory of the generalized directional derivative, we derive new optimality conditions for the bi-level optimization problem with a non-smooth, perhaps non-Lipschitz lower-level problem, and prove that our method converges to points satisfying these conditions. We also compare our method with existing state-of-the-art bi-level optimization methods and demonstrate that our method is superior to the others in terms of accuracy and efficiency.

1. INTRODUCTION

Bi-level optimization (BO) (Bard, 2013; Colson et al., 2007) plays a central role in various machine learning applications, including hyper-parameter optimization (Pedregosa, 2016; Bergstra et al., 2011; Bertsekas, 1976), meta-learning (Feurer et al., 2015; Franceschi et al., 2018; Rajeswaran et al., 2019), and reinforcement learning (Hong et al., 2020; Konda & Tsitsiklis, 2000). It involves a competition between two parties or two objectives: if one party makes its choice first, that choice affects the optimal choice of the other party. Several approaches have been proposed to solve BO problems, such as Bayesian optimization (Klein et al., 2017), random search (Bergstra & Bengio, 2012), evolution strategies (Sinha et al., 2017), and gradient-based methods (Pedregosa, 2016; Maclaurin et al., 2015; Swersky et al., 2014), among which gradient-based methods have become the mainstream for large-scale BO problems. The key idea of gradient-based methods is to approximate the gradient with respect to the upper-level variables, called the hypergradient. For example, implicit differentiation methods (Pedregosa, 2016; Rajeswaran et al., 2019) derive the hypergradient by setting the first derivative of the lower-level problem to zero. Explicit differentiation methods approximate the hypergradient by differentiating through the update rules of the lower-level problem via the chain rule (Maclaurin et al., 2015; Domke, 2012; Franceschi et al., 2017; Swersky et al., 2014). Mehra & Hamm (2019) reformulate the bi-level problem as a single-level constrained problem by replacing the lower-level problem with its first-order necessary conditions, and then solve the new problem using the penalty method. All of these methods require the lower-level problem to be smooth.
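To illustrate the implicit differentiation idea described above, the following is a minimal sketch (not the method proposed in this paper) for a toy hyper-parameter optimization problem: ridge regression as the smooth lower-level problem and validation loss as the upper-level objective. The data and variable names are hypothetical; the snippet only demonstrates how setting the lower-level gradient to zero yields the hypergradient via the implicit function theorem.

```python
import numpy as np

# Hypothetical training/validation data for illustration only.
rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(50, 5)), rng.normal(size=50)
X_val, y_val = rng.normal(size=(20, 5)), rng.normal(size=20)
lam = 0.5  # upper-level variable (regularization parameter)

# Lower level: w*(lam) = argmin_w ||X_tr w - y_tr||^2 + lam ||w||^2.
# Its first-order condition 2 X^T (X w - y) + 2 lam w = 0 gives a closed form.
A = X_tr.T @ X_tr + lam * np.eye(5)
w_star = np.linalg.solve(A, X_tr.T @ y_tr)

# Implicit differentiation: differentiating the first-order condition in lam
# yields dw*/dlam = -(X_tr^T X_tr + lam I)^{-1} w*.
dw_dlam = -np.linalg.solve(A, w_star)

# Hypergradient of the upper-level loss F(w) = ||X_val w - y_val||^2
# via the chain rule: dF/dlam = grad_w F(w*) . dw*/dlam.
grad_F_w = 2 * X_val.T @ (X_val @ w_star - y_val)
hypergrad = grad_F_w @ dw_dlam

# Finite-difference check of the hypergradient.
eps = 1e-6
w_eps = np.linalg.solve(X_tr.T @ X_tr + (lam + eps) * np.eye(5), X_tr.T @ y_tr)
fd = (np.sum((X_val @ w_eps - y_val) ** 2)
      - np.sum((X_val @ w_star - y_val) ** 2)) / eps
```

Note that the whole derivation hinges on the lower-level gradient existing and vanishing at the minimizer, which is exactly the smoothness assumption that the non-smooth, possibly non-Lipschitz problems considered in this paper violate.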
However, in many real-world applications, such as image restoration (Chen et al.; Nikolova et al., 2008), variable selection (Fan & Li, 2001; Huang et al., 2008; Zhang et al., 2010), and signal processing (Bruckstein et al., 2009), the lower-level objective may contain a complicated non-smooth, perhaps non-Lipschitz term (Bian & Chen, 2017). Traditional methods cannot be directly applied to bi-level problems with such lower-level objectives. To solve BO problems with certain specific non-smooth lower-level problems, researchers have proposed several algorithms based on the above-mentioned methods. Specifically, Bertrand et al. (2020) searched the regularization parameters of LASSO-type problems by approximating the hypergradient through the soft-thresholding function (Donoho, 1995; Bredies & Lorenz, 2008; Beck & Teboulle, 2009). Frecon et al. (2018) proposed a primal-dual FMD-based method, called FBBGLasso, to search the group structures of group-LASSO problems. Okuno et al. (2021) used the smoothing method and constrained optimization method to search the regularization

