PROVABLY FASTER ALGORITHMS FOR BILEVEL OPTIMIZATION AND APPLICATIONS TO META-LEARNING

Abstract

Bilevel optimization has arisen as a powerful tool for many machine learning problems such as meta-learning, hyperparameter optimization, and reinforcement learning. In this paper, we investigate the nonconvex-strongly-convex bilevel optimization problem. For deterministic bilevel optimization, we provide a comprehensive finite-time convergence analysis for two popular algorithms respectively based on approximate implicit differentiation (AID) and iterative differentiation (ITD). For the AID-based method, we improve the previous finite-time convergence analysis order-wise via a more practical parameter selection as well as a warm start strategy, and for the ITD-based method we establish the first theoretical convergence rate. Our analysis also provides a quantitative comparison between the ITD- and AID-based approaches. For stochastic bilevel optimization, we propose a novel algorithm named stocBiO, which features a sample-efficient hypergradient estimator using efficient Jacobian- and Hessian-vector product computations. We provide a finite-time convergence guarantee for stocBiO, and show that stocBiO outperforms the best known computational complexities order-wise with respect to the condition number κ and the target accuracy ε. We further validate our theoretical results and demonstrate the efficiency of bilevel optimization algorithms through experiments on meta-learning and hyperparameter optimization.

1. INTRODUCTION

Bilevel optimization has received significant attention recently and become an influential framework in various machine learning applications including meta-learning (Franceschi et al., 2018; Bertinetto et al., 2018; Rajeswaran et al., 2019; Ji et al., 2020a), hyperparameter optimization (Franceschi et al., 2018; Shaban et al., 2019; Feurer & Hutter, 2019), reinforcement learning (Konda & Tsitsiklis, 2000; Hong et al., 2020), and signal processing (Kunapuli et al., 2008; Flamary et al., 2014). A general bilevel optimization problem takes the following form:

$$\min_{x \in \mathbb{R}^p} \Phi(x) := f(x, y^*(x)) \quad \text{s.t.} \quad y^*(x) = \arg\min_{y \in \mathbb{R}^q} g(x, y), \tag{1}$$

where the upper- and lower-level functions f and g are both jointly continuously differentiable. The goal of eq. (1) is to minimize the objective function Φ(x) w.r.t. x, where y*(x) is obtained by solving the lower-level minimization problem. In this paper, we focus on the setting where the lower-level function g is strongly convex with respect to (w.r.t.) y, and the upper-level objective function Φ(x) is nonconvex w.r.t. x. Such geometries commonly arise in many applications including meta-learning and hyperparameter optimization, where g corresponds to an empirical loss with a strongly convex regularizer and x contains the parameters of neural networks.

A broad collection of algorithms have been proposed to solve such bilevel optimization problems. For example, Hansen et al. (1992); Shi et al. (2005); Moore (2010) reformulated the bilevel problem in eq. (1) into a single-level constrained problem based on the optimality conditions of the lower-level problem. However, such methods often involve a large number of constraints and are hard to implement in machine learning applications.
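For concreteness, here is a minimal numerical sketch of eq. (1) (our own toy instance, not an example from the paper): the lower-level objective is a strongly convex quadratic whose minimizer y*(x) has a closed form, so an iterative inner solver can be checked against it. All names (A, b, lam, inner_solve) are illustrative.

```python
import numpy as np

# Toy instance of eq. (1): the lower-level problem
#   g(x, y) = 0.5*||y - A x||^2 + 0.5*lam*||y||^2
# is strongly convex in y, with closed-form minimizer
#   y*(x) = A x / (1 + lam),
# and the upper-level objective is Phi(x) = f(x, y*(x)) = 0.5*||y*(x) - b||^2.

rng = np.random.default_rng(0)
p, q, lam = 3, 4, 0.5
A = rng.standard_normal((q, p))
b = rng.standard_normal(q)
x = rng.standard_normal(p)

def inner_solve(x, steps=200, eta=0.2):
    """Approximate y*(x) by gradient descent on g(x, .)."""
    y = np.zeros(q)
    for _ in range(steps):
        grad_y = (y - A @ x) + lam * y   # nabla_y g(x, y)
        y -= eta * grad_y
    return y

y_gd = inner_solve(x)
y_star = A @ x / (1.0 + lam)             # closed-form lower-level minimizer
phi = 0.5 * np.sum((y_gd - b) ** 2)      # Phi(x) = f(x, y*(x))

print(np.allclose(y_gd, y_star, atol=1e-6))  # inner GD matches the closed form
```

Because g is strongly convex, the inner gradient descent converges linearly, which is the structural property the analyses below rely on.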
Recently, more efficient gradient-based bilevel optimization algorithms have been proposed, which can be generally categorized into the approximate implicit differentiation (AID) based approach (Domke, 2012; Pedregosa, 2016; Gould et al., 2016; Liao et al., 2018; Ghadimi & Wang, 2018; Grazzi et al., 2020; Lorraine et al., 2020) and the iterative differentiation (ITD) based approach (Domke, 2012; Maclaurin et al., 2015; Franceschi et al., 2017; 2018; Shaban et al., 2019; Grazzi et al., 2020). However, most of these studies have focused on asymptotic convergence analysis, and finite-time analysis (which characterizes how fast an algorithm converges) has not been well explored except for a few recent attempts. Ghadimi & Wang (2018) provided a finite-time analysis for the AID-based approach. Grazzi et al. (2020) provided the iteration complexity for the hypergradient computation via ITD and AID, but did not characterize the finite-time convergence for the entire execution of the algorithms.

• Thus, the first focus of this paper is to develop a comprehensive and enhanced theory, which covers a broader class of bilevel optimizers via ITD- and AID-based techniques, and more importantly, to improve the existing analysis with a more practical parameter selection and an order-level lower computational complexity.

Stochastic bilevel optimization often occurs in applications where fresh data need to be sampled as the algorithm runs (e.g., reinforcement learning (Hong et al., 2020)) or the sample size of the training data is large (e.g., hyperparameter optimization (Franceschi et al., 2018), Stackelberg games (Roth et al., 2016)). Typically, the corresponding objective function is given by

$$\min_{x \in \mathbb{R}^p} \Phi(x) = f(x, y^*(x)) := \mathbb{E}_{\xi}\big[F(x, y^*(x); \xi)\big] \ \text{or} \ \frac{1}{n}\sum_{i=1}^{n} F(x, y^*(x); \xi_i)$$
$$\text{s.t.} \quad y^*(x) = \arg\min_{y \in \mathbb{R}^q} g(x, y) := \mathbb{E}_{\zeta}\big[G(x, y; \zeta)\big] \ \text{or} \ \frac{1}{m}\sum_{j=1}^{m} G(x, y; \zeta_j), \tag{2}$$

where f(x, y) and g(x, y) take either the expectation form w.r.t. the random variables ξ and ζ or the finite-sum form over given data D_{n,m} = {ξ_i, ζ_j, i = 1, ..., n; j = 1, ..., m}, often with large sizes n and m. During the optimization process, the algorithms sample data batches from the distributions of ξ and ζ or from the set D_{n,m}. For such a stochastic setting, Ghadimi & Wang (2018) proposed a bilevel stochastic approximation (BSA) method via single-sample gradient and Hessian estimates. Based on such a method, Hong et al. (2020) further proposed a two-timescale stochastic approximation (TTSA) algorithm, and showed that TTSA achieves a better trade-off between the complexities of the inner- and outer-loop optimization stages than BSA.
• The second focus of this paper is to design a more sample-efficient algorithm for stochastic bilevel optimization, which achieves an order-level lower computational complexity than BSA and TTSA.
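Both the AID-based methods and the stochastic estimators above build on the implicit-differentiation form of the hypergradient: since g(x, ·) is strongly convex, the implicit function theorem gives ∇Φ(x) = ∇_x f(x, y*(x)) − ∇²_{xy} g(x, y*(x)) [∇²_{yy} g(x, y*(x))]⁻¹ ∇_y f(x, y*(x)). The following sketch (our own illustration; the quadratic choices of f and g and all variable names are assumptions, not the paper's) evaluates this formula on a toy instance and checks it against finite differences of Φ:

```python
import numpy as np

# Implicit-differentiation (AID-style) hypergradient on a quadratic toy problem:
#   g(x, y) = 0.5*||y - A x||^2 + 0.5*lam*||y||^2,   f(x, y) = 0.5*||y - b||^2.
# Standard formula (evaluated at y = y*(x)):
#   grad Phi(x) = grad_x f - grad2_{xy} g @ [grad2_{yy} g]^{-1} @ grad_y f.

rng = np.random.default_rng(1)
p, q, lam = 3, 4, 0.5
A = rng.standard_normal((q, p))
b = rng.standard_normal(q)
x = rng.standard_normal(p)

def phi(x):
    y_star = A @ x / (1.0 + lam)           # closed-form lower-level solution
    return 0.5 * np.sum((y_star - b) ** 2)

def hypergrad(x):
    y_star = A @ x / (1.0 + lam)
    grad_y_f = y_star - b                   # nabla_y f(x, y*)
    H_yy = (1.0 + lam) * np.eye(q)          # nabla^2_{yy} g, constant here
    J_xy = -A.T                             # nabla^2_{xy} g, a (p x q) matrix
    v = np.linalg.solve(H_yy, grad_y_f)     # [nabla^2_{yy} g]^{-1} nabla_y f
    return -J_xy @ v                        # nabla_x f = 0 for this choice of f

# Sanity check against central finite differences of Phi
eps = 1e-6
fd = np.array([(phi(x + eps * e) - phi(x - eps * e)) / (2 * eps)
               for e in np.eye(p)])
print(np.allclose(hypergrad(x), fd, atol=1e-5))
```

In practice y*(x) is only available approximately and the linear system is solved iteratively; the algorithms discussed in this paper differ precisely in how they approximate these two quantities.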

1.1. MAIN CONTRIBUTIONS

Our main contributions lie in developing enhanced theory and provably faster algorithms for the nonconvex-strongly-convex bilevel deterministic and stochastic optimization problems, respectively. Our analysis involves several new developments, which can be of independent interest.

We first provide a unified finite-time convergence and complexity analysis for both the ITD- and AID-based bilevel optimizers, which we call ITD-BiO and AID-BiO. Compared to the existing analysis in Ghadimi & Wang (2018) for AID-BiO, which requires a continuously increasing number of inner-loop steps to achieve its guarantee, our analysis allows a constant number of inner-loop steps as often used in practice. In addition, we introduce a warm start initialization for the inner-loop updates and the outer-loop hypergradient estimation, which allows us to backpropagate the tracking errors to previous loops, and results in an improved computational complexity. As shown in Table 1, the gradient complexities Gc(f, ε) and Gc(g, ε), and the Jacobian- and Hessian-vector product complexities JV(g, ε) and HV(g, ε), of AID-BiO to attain an ε-accurate stationary point improve those of Ghadimi & Wang (2018) by an order of κ, κε^{-1/4}, κ, and κ, respectively, where κ is the condition number. In addition, our analysis shows that AID-BiO requires fewer computations of Jacobian- and Hessian-vector products than ITD-BiO, by an order of κ and κ^{1/2} respectively, which provides a justification for the observation in Grazzi et al. (2020) that ITD often has a larger memory cost than AID.

We then propose a stochastic bilevel optimizer (stocBiO) to solve the stochastic bilevel optimization problem in eq. (2). Our algorithm features a mini-batch hypergradient estimation via implicit differentiation, where the core design involves a sample-efficient hypergradient estimator via the Neumann series. As shown in Table 2, the gradient complexities of our proposed algorithm w.r.t. F
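To illustrate the Neumann-series idea behind such estimators (a simplified, deterministic sketch under our own assumptions; stocBiO itself uses mini-batch Jacobian- and Hessian-vector products with a particular sampling scheme, and H, neumann_hiv, and the step-size choice below are our own illustration): for a positive definite Hessian H with eigenvalues in [μ, L] and step size η ≤ 1/L, H⁻¹v = η Σ_{k≥0} (I − ηH)^k v, so truncating after K terms costs only K Hessian-vector products and never forms H⁻¹ explicitly.

```python
import numpy as np

# Neumann-series approximation of a Hessian-inverse-vector product:
#   H^{-1} v = eta * sum_{k>=0} (I - eta*H)^k v   for 0 < eta <= 1/L,
# truncated at K terms. Only matrix-vector products with H are needed.

rng = np.random.default_rng(2)
q = 5
M = rng.standard_normal((q, q))
H = M @ M.T + np.eye(q)                    # positive definite "Hessian"
v = rng.standard_normal(q)

def neumann_hiv(hvp, v, eta, K):
    """Approximate H^{-1} v using K Hessian-vector products."""
    term, total = v.copy(), v.copy()       # k = 0 term of the series
    for _ in range(K - 1):
        term = term - eta * hvp(term)      # apply (I - eta*H) once more
        total += term
    return eta * total

eta = 1.0 / np.linalg.norm(H, 2)           # eta = 1/L (spectral norm)
approx = neumann_hiv(lambda u: H @ u, v, eta, K=2000)
exact = np.linalg.solve(H, v)
print(np.allclose(approx, exact, atol=1e-6))
```

The truncation error contracts geometrically at rate (1 − μ/L) per term, which is why the number of required Hessian-vector products scales with the condition number κ = L/μ.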


