BIADAM: FAST ADAPTIVE BILEVEL OPTIMIZATION METHODS

Abstract

Bilevel optimization has recently attracted increased interest in machine learning due to its many applications, such as hyper-parameter optimization and meta-learning. Although many bilevel optimization methods have recently been proposed, these methods do not consider using adaptive learning rates. It is well known that adaptive learning rates can accelerate many optimization algorithms, including (stochastic) gradient-based algorithms. To fill this gap, in this paper we propose a novel fast adaptive framework for solving bilevel optimization problems in which the outer problem is possibly nonconvex and the inner problem is strongly convex. Our framework uses unified adaptive matrices, which include many types of adaptive learning rates, and can flexibly use momentum and variance-reduced techniques. In particular, we provide a useful convergence analysis framework for bilevel optimization. Specifically, we propose a fast single-loop adaptive bilevel optimization (BiAdam) algorithm based on the basic momentum technique, which achieves a sample complexity of Õ(ϵ⁻⁴) for finding an ϵ-stationary point. Meanwhile, we propose an accelerated version of the BiAdam algorithm (VR-BiAdam) by using a variance-reduced technique, which reaches the best known sample complexity of Õ(ϵ⁻³) without relying on large batch sizes. To the best of our knowledge, ours is the first study of bilevel optimization methods with adaptive learning rates. Experimental results on data hyper-cleaning and hyper-representation learning tasks demonstrate the efficiency of the proposed algorithms.

1. INTRODUCTION

Bilevel optimization is a popular class of hierarchical optimization, which has been applied to a wide range of machine learning problems such as hyper-parameter optimization Shaban et al. (2019), meta-learning Ji et al. (2021); Liu et al. (2021a), and policy optimization Hong et al. (2020). In this paper, we consider solving the following stochastic bilevel optimization problem:

min_{x ∈ X} F(x) := E_{ξ∼D}[f(x, y*(x); ξ)]    (Outer)    (1)
s.t.  y*(x) ∈ arg min_{y ∈ Y} E_{ζ∼M}[g(x, y; ζ)],    (Inner)

where F(x) = f(x, y*(x)) = E_ξ[f(x, y*(x); ξ)]. Although the existing methods can effectively solve bilevel problems, they do not consider using adaptive learning rates and only consider the unconstrained setting. Since the inner and outer problems generally require different learning rates to ensure convergence, we consider using different adaptive learning rates for the inner and outer problems with a convergence guarantee. Clearly, this cannot simply follow the existing adaptive methods for single-level problems. Thus, there exists a natural question:

How can we design effective optimization methods with adaptive learning rates for bilevel problems?

In this paper, we provide an affirmative answer to this question and propose a class of fast single-loop adaptive bilevel optimization methods based on unified adaptive matrices, which include many types of adaptive learning rates. Moreover, our framework can flexibly use momentum and variance-reduced techniques. Our main contributions are summarized as follows:

1) We propose a fast single-loop adaptive bilevel optimization algorithm (BiAdam) based on the basic momentum technique, which achieves a sample complexity of Õ(ϵ⁻⁴) for finding an ϵ-stationary point.
2) Meanwhile, we propose a single-loop accelerated version of BiAdam (VR-BiAdam) by using the momentum-based variance-reduced technique, which reaches the best known sample complexity of Õ(ϵ⁻³). 3) Moreover, we provide a useful convergence analysis framework for both constrained and unconstrained bilevel programming under some mild conditions (please see Table 1). 4) Experimental results on hyper-parameter learning demonstrate the efficiency of the proposed algorithms.
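The single-loop scheme behind BiAdam can be illustrated on a toy deterministic problem: each iteration takes one gradient step on the inner variable, forms a momentum estimate of the hypergradient, and scales the outer step by an Adam-style diagonal adaptive term. The sketch below is only a minimal illustration under stated assumptions (scalar quadratics, full gradients, no bias correction); it is not the paper's exact algorithm, and all names and hyper-parameters are illustrative:

```python
import math

# Toy deterministic bilevel problem (illustrative, not from the paper):
#   outer: f(x, y) = 0.5 * (y - 1)**2 + 0.5 * x**2
#   inner: g(x, y) = 0.5 * (y - a * x)**2   (strongly convex in y, so y*(x) = a * x)
# Closed form: F(x) = 0.5 * (a*x - 1)**2 + 0.5 * x**2, minimized at x* = a / (a**2 + 1).

a = 2.0

def hypergrad(x, y):
    # Implicit-function-theorem hypergradient for this toy problem:
    # dF/dx = df/dx - d2g/dxdy * (d2g/dy2)^{-1} * df/dy = x + a * (y - 1)
    return x + a * (y - 1.0)

def biadam_sketch(steps=2000, eta_x=0.05, eta_y=0.5, beta=0.9, rho=0.999, eps=1e-8):
    """Single-loop sketch: one inner gradient step on y and one Adam-style
    momentum step on x per iteration (a simplification of BiAdam)."""
    x, y, m, v = 0.0, 0.0, 0.0, 0.0
    for _ in range(steps):
        # one inner gradient step on y (gradient of g w.r.t. y is y - a*x)
        y -= eta_y * (y - a * x)
        # momentum (first-moment) estimate of the hypergradient
        g_x = hypergrad(x, y)
        m = beta * m + (1 - beta) * g_x
        # Adam-style second moment: a diagonal "adaptive matrix" (scalar here)
        v = rho * v + (1 - rho) * g_x ** 2
        x -= eta_x * m / (math.sqrt(v) + eps)
    return x, y

x, y = biadam_sketch()
x_star = a / (a ** 2 + 1)
print(x, x_star)  # x approaches x* = 0.4
```

The key point of the single-loop structure is that the inner variable y is never solved to optimality; it merely tracks y*(x) while x moves, which is what makes the adaptive step on x delicate to analyze.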



Hong et al. (2020) proposed a class of single-loop methods to solve bilevel problems. Subsequently, Khanduri et al. (2021); Guo & Yang (2021); Yang et al. (2021); Chen et al. (2022) presented some accelerated single-loop methods by using the momentum-based variance-reduced technique of STORM Cutkosky & Orabona (2019). More recently, Dagréou et al. (2022) developed a novel framework for bilevel optimization based on a linear system, and proposed a fast SABA algorithm for finite-sum bilevel problems based on the variance-reduced technique of SAGA (Defazio et al., 2014).
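As a rough illustration of the STORM estimator that these accelerated methods build on, the sketch below compares its tracking error against a plain single-sample stochastic gradient on a toy quadratic. The objective, step sizes, and momentum parameter are illustrative assumptions, not values from any of the cited papers:

```python
import random

# Minimal sketch of the STORM momentum-based variance-reduced estimator
# (Cutkosky & Orabona, 2019) on f(x) = 0.5 * E[(x - s)^2], s ~ Uniform(-1, 1).
# The true gradient of f at x is simply x.

random.seed(0)

def stoch_grad(x):
    s = random.uniform(-1.0, 1.0)
    return x - s, s  # stochastic gradient of 0.5*(x - s)**2, plus the sample used

def run(steps=5000, lr=0.01, alpha=0.1):
    x, d, x_prev = 5.0, None, None
    err_storm, err_sgd = 0.0, 0.0
    for _ in range(steps):
        g, s = stoch_grad(x)
        if d is None:
            d = g
        else:
            # STORM recursion, reusing the SAME sample s at the previous iterate:
            # d_t = grad(x_t; s) + (1 - alpha) * (d_{t-1} - grad(x_{t-1}; s))
            d = g + (1 - alpha) * (d - (x_prev - s))
        err_storm += (d - x) ** 2  # squared error vs. true gradient x
        err_sgd += (g - x) ** 2
        x_prev = x
        x -= lr * d
    return err_storm / steps, err_sgd / steps, x

mse_storm, mse_sgd, x = run()
print(mse_storm, mse_sgd, x)  # STORM's mean squared error is far smaller
```

In this linear-gradient toy, the correction term cancels most of the sample noise, so the estimator error contracts geometrically; the same mechanism underlies the improved Õ(ϵ⁻³) complexity of the variance-reduced bilevel methods.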

Table 1: Sample complexity of representative bilevel optimization methods for finding an ϵ-stationary point of the bilevel problem (1), i.e., E∥∇F(x)∥ ≤ ϵ or its equivalent variants. BSize denotes mini-batch size; ALR denotes adaptive learning rate. C(x, y) denotes the constraint sets in x and y, where Y denotes that there exists a convex constraint on the variable, and N otherwise. DD denotes dimension dependence in the gradient estimators, and p denotes the dimension of variable y. Condition 1 denotes Lipschitz continuity of ∇…; … denotes the bounded true partial derivatives ∇_y f(x, y) and ∇²_{xy} g(x, y); 6 denotes Lipschitz continuity of the function f(x, y; ξ); 7 denotes that g(x, y; ζ) is L_g-smooth and µ-strongly convex w.r.t. y for all ζ; 8 denotes that g(x, y) is L_g-smooth and µ-strongly convex w.r.t. y.

Algorithm    Reference    Complexity    BSize    Loop(s)    C(x, y)     DD    ALR    Conditions
BiAdam       Ours         Õ(ϵ⁻⁴)       Õ(1)     Single     Y/N, Y/N    p²    √      2, 5, 7
VR-BiAdam    Ours         Õ(ϵ⁻³)       Õ(1)     Single     Y/N, Y/N    p²    √      1, …
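The stationarity measure ∥∇F(x)∥ above relies on the standard implicit-function-theorem hypergradient for a strongly convex inner problem, ∇F(x) = ∇_x f(x, y*) − ∇²_{xy} g(x, y*) [∇²_{yy} g(x, y*)]⁻¹ ∇_y f(x, y*). The sketch below evaluates this formula on toy quadratics (an illustrative choice, not from the paper) and verifies it against finite differences of F:

```python
import numpy as np

# Toy quadratics (illustrative assumptions):
#   f(x, y) = 0.5*||y - c||^2 + 0.5*||x||^2
#   g(x, y) = 0.5*||y - B x||^2   (strongly convex in y, so y*(x) = B x)

rng = np.random.default_rng(0)
p = 3
B = rng.standard_normal((p, p))
c = rng.standard_normal(p)

def y_star(x):
    return B @ x  # exact inner minimizer

def hypergrad(x):
    # grad F(x) = grad_x f - hess_xy(g) @ inv(hess_yy(g)) @ grad_y f
    y = y_star(x)
    grad_x_f = x
    grad_y_f = y - c
    hess_xy_g = -B.T        # mixed second derivative of g
    hess_yy_g = np.eye(p)   # inner Hessian (identity for this g)
    return grad_x_f - hess_xy_g @ np.linalg.solve(hess_yy_g, grad_y_f)

def F(x):
    y = y_star(x)
    return 0.5 * np.sum((y - c) ** 2) + 0.5 * np.sum(x ** 2)

# Finite-difference check of the hypergradient at a random point
x = rng.standard_normal(p)
g = hypergrad(x)
fd = np.array([(F(x + 1e-6 * e) - F(x - 1e-6 * e)) / 2e-6 for e in np.eye(p)])
print(np.max(np.abs(g - fd)))  # agreement up to finite-difference error
```

A point x would then count as ϵ-stationary when np.linalg.norm(hypergrad(x)) ≤ ϵ; in stochastic algorithms the inverse Hessian term is only estimated, which is where the p² dimension dependence in Table 1 enters.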

