PROVABLY FASTER ALGORITHMS FOR BILEVEL OPTIMIZATION AND APPLICATIONS TO META-LEARNING

Abstract

Bilevel optimization has arisen as a powerful tool for many machine learning problems such as meta-learning, hyperparameter optimization, and reinforcement learning. In this paper, we investigate the nonconvex-strongly-convex bilevel optimization problem. For deterministic bilevel optimization, we provide a comprehensive finite-time convergence analysis for two popular algorithms respectively based on approximate implicit differentiation (AID) and iterative differentiation (ITD). For the AID-based method, we orderwisely improve the previous finite-time convergence analysis due to a more practical parameter selection as well as a warm start strategy, and for the ITD-based method we establish the first theoretical convergence rate. Our analysis also provides a quantitative comparison between ITD- and AID-based approaches. For stochastic bilevel optimization, we propose a novel algorithm named stocBiO, which features a sample-efficient hypergradient estimator using efficient Jacobian- and Hessian-vector product computations. We provide the finite-time convergence guarantee for stocBiO, and show that stocBiO outperforms the best known computational complexities orderwisely with respect to the condition number κ and the target accuracy ε. We further validate our theoretical results and demonstrate the efficiency of bilevel optimization algorithms by experiments on meta-learning and hyperparameter optimization.

1. INTRODUCTION

Bilevel optimization has received significant attention recently and become an influential framework in various machine learning applications including meta-learning (Franceschi et al., 2018; Bertinetto et al., 2018; Rajeswaran et al., 2019; Ji et al., 2020a), hyperparameter optimization (Franceschi et al., 2018; Shaban et al., 2019; Feurer & Hutter, 2019), reinforcement learning (Konda & Tsitsiklis, 2000; Hong et al., 2020), and signal processing (Kunapuli et al., 2008; Flamary et al., 2014). A general bilevel optimization problem takes the following formulation:

$$\min_{x\in\mathbb{R}^p} \Phi(x) := f(x, y^*(x)) \quad \text{s.t.} \quad y^*(x) = \arg\min_{y\in\mathbb{R}^q} g(x, y), \tag{1}$$

where the upper- and lower-level functions f and g are both jointly continuously differentiable. The goal of eq. (1) is to minimize the objective function Φ(x) w.r.t. x, where y*(x) is obtained by solving the lower-level minimization problem. In this paper, we focus on the setting where the lower-level function g is strongly convex with respect to (w.r.t.) y, and the upper-level objective function Φ(x) is nonconvex w.r.t. x. Such geometries commonly arise in many applications including meta-learning and hyperparameter optimization, where g corresponds to an empirical loss with a strongly-convex regularizer and x contains the parameters of neural networks.

A broad collection of algorithms have been proposed to solve such bilevel optimization problems. For example, Hansen et al. (1992); Shi et al. (2005); Moore (2010) reformulated the bilevel problem in eq. (1) into a single-level constrained problem based on the optimality conditions of the lower-level problem. However, such methods often involve a large number of constraints, and are hard to implement in machine learning applications.
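When the lower-level problem is strongly convex in y, the implicit function theorem gives the standard hypergradient formula ∇Φ(x) = ∇_x f(x, y*(x)) − ∇_x∇_y g(x, y*(x)) [∇_y² g(x, y*(x))]⁻¹ ∇_y f(x, y*(x)), which underlies the AID-based approach. As a minimal sketch, the following checks this formula on a hypothetical quadratic instance (the functions, dimensions, and regularization weight `mu` below are illustrative choices, not taken from the paper), where y*(x) and ∇Φ(x) have closed forms:

```python
import numpy as np

# Hypothetical quadratic bilevel instance:
#   g(x, y) = 0.5*||y - A x||^2 + 0.5*mu*||y||^2   (strongly convex in y)
#   f(x, y) = 0.5*||y - b||^2 + 0.5*||x||^2
rng = np.random.default_rng(0)
p, q, mu = 3, 4, 0.5
A = rng.standard_normal((q, p))
b = rng.standard_normal(q)
x = rng.standard_normal(p)

# Lower-level solution: grad_y g = (1+mu)*y - A x = 0
y_star = A @ x / (1.0 + mu)

# Implicit-differentiation hypergradient:
#   grad Phi = grad_x f - (grad_x grad_y g)^T [grad_y^2 g]^{-1} grad_y f
# Here grad_y^2 g = (1+mu)*I and grad_x grad_y g = -A.
grad_y_f = y_star - b
v = np.linalg.solve((1.0 + mu) * np.eye(q), grad_y_f)  # Hessian-inverse-vector product
hypergrad = x - (-A).T @ v

# Analytic gradient of Phi(x) = f(x, y*(x)) for this quadratic instance
analytic = x + A.T @ (A @ x / (1.0 + mu) - b) / (1.0 + mu)
assert np.allclose(hypergrad, analytic)
```

In practice the linear system is not solved exactly; AID-based methods approximate the Hessian-inverse-vector product `v` (e.g., by conjugate gradient or Neumann-series iterations), which is where the approximation error analyzed in the paper arises.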
Recently, more efficient gradient-based bilevel optimization algorithms have been proposed, which can be generally categorized into the approximate implicit differentiation (AID) based approach (Domke, 2012; Pedregosa, 2016; Gould et al., 2016; Liao et al., 2018; Ghadimi & Wang, 2018; Grazzi et al., 2020; Lorraine 

