BIADAM: FAST ADAPTIVE BILEVEL OPTIMIZATION METHODS

Abstract

Bilevel optimization has recently attracted increased interest in machine learning due to its many applications, such as hyper-parameter optimization and meta-learning. Although many bilevel optimization methods have recently been proposed, these methods do not use adaptive learning rates. It is well known that adaptive learning rates can accelerate many optimization algorithms, including (stochastic) gradient-based algorithms. To fill this gap, in this paper we propose a novel fast adaptive bilevel framework for solving bilevel optimization problems in which the outer problem is possibly nonconvex and the inner problem is strongly convex. Our framework uses unified adaptive matrices that subsume many types of adaptive learning rates, and can flexibly incorporate momentum and variance-reduction techniques. In particular, we provide a useful convergence analysis framework for bilevel optimization. Specifically, we propose a fast single-loop adaptive bilevel optimization (BiAdam) algorithm based on the basic momentum technique, which achieves a sample complexity of Õ(ϵ^{-4}) for finding an ϵ-stationary point. Meanwhile, we propose an accelerated version of the BiAdam algorithm (VR-BiAdam) that uses a variance-reduction technique and reaches the best-known sample complexity of Õ(ϵ^{-3}) without relying on large batch sizes. To the best of our knowledge, this is the first work to study bilevel optimization methods with adaptive learning rates. Experimental results on data hyper-cleaning and hyper-representation learning tasks demonstrate the efficiency of the proposed algorithms.

1. INTRODUCTION

Bilevel optimization is a popular class of hierarchical optimization, which has been applied to a wide range of machine learning problems such as hyper-parameter optimization Shaban et al. (2019), meta-learning Ji et al. (2021); Liu et al. (2021a) and policy optimization Hong et al. (2020). In this paper, we consider solving the following stochastic bilevel optimization problem:

    min_{x∈X} F(x) := E_{ξ∼D}[f(x, y*(x); ξ)]          (Outer)    (1)
    s.t. y*(x) ∈ argmin_{y∈Y} E_{ζ∼M}[g(x, y; ζ)],     (Inner)

where F(x) = f(x, y*(x)) = E_ξ[f(x, y*(x); ξ)] is a differentiable and possibly nonconvex function, g(x, y) = E_ζ[g(x, y; ζ)] is a differentiable function that is strongly convex in the variable y, and ξ and ζ are random variables following the unknown distributions D and M, respectively. Here X ⊆ R^d and Y ⊆ R^p are closed convex sets. Problem (1) covers many machine learning problems with a hierarchical structure, including hyper-parameter optimization Franceschi et al. (2018), meta-learning Franceschi et al. (2018), policy optimization Hong et al. (2020) and neural network architecture search Liu et al. (2018). Since bilevel optimization has been widely applied in machine learning, several works have recently begun to study it. For example, Ghadimi & Wang (2018); Ji et al. (2021) proposed a class of double-loop methods to solve problem (1). However, to obtain an accurate estimate, the BSA algorithm in Ghadimi & Wang (2018) needs to solve the inner problem to high accuracy, and the stocBiO algorithm in Ji et al. (2021) requires large batch sizes when solving the inner problem.
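To make the structure of problem (1) concrete, the following is a minimal, hypothetical instance: tuning a ridge penalty (the outer variable x) for a least-squares fit (the inner variable y), with the outer objective being validation loss at the inner minimizer. The example also sketches how the hypergradient dF/dx can be obtained via implicit differentiation of the inner optimality condition; this is only an illustration of the problem class, not the BiAdam algorithm itself, and all data and function names here are invented for the example.

```python
# Hypothetical bilevel instance of problem (1):
#   outer: min_x F(x) = 0.5 ||A_val y*(x) - b_val||^2   (validation loss)
#   inner: y*(x) = argmin_y 0.5 ||A y - b||^2 + 0.5 x ||y||^2
# The inner problem is strongly convex in y for x > 0, matching the
# assumption on g in (1).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))        # training features
b = rng.standard_normal(50)             # training targets
A_val = rng.standard_normal((20, 5))    # validation features
b_val = rng.standard_normal(20)

def y_star(x):
    """Closed-form inner solution: (A^T A + x I) y = A^T b."""
    return np.linalg.solve(A.T @ A + x * np.eye(5), A.T @ b)

def F(x):
    """Outer objective evaluated at the inner minimizer."""
    r = A_val @ y_star(x) - b_val
    return 0.5 * r @ r

def hypergradient(x):
    """dF/dx via the implicit function theorem: differentiating the
    inner optimality condition gives dy*/dx = -(A^T A + x I)^{-1} y*(x),
    then the chain rule yields dF/dx."""
    y = y_star(x)
    H = A.T @ A + x * np.eye(5)          # inner Hessian w.r.t. y
    dy_dx = -np.linalg.solve(H, y)       # implicit differentiation
    return (A_val @ y - b_val) @ (A_val @ dy_dx)

# Sanity check against a central finite difference.
x0, eps = 0.5, 1e-5
fd = (F(x0 + eps) - F(x0 - eps)) / (2 * eps)
print(abs(hypergradient(x0) - fd) < 1e-6)
```

In stochastic bilevel methods such as those studied here, the exact inner solve and Hessian inverse above are replaced by stochastic approximations, which is precisely what makes the analysis of adaptive learning rates nontrivial.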

