BIADAM: FAST ADAPTIVE BILEVEL OPTIMIZATION METHODS

Abstract

Bilevel optimization has recently attracted increasing interest in machine learning due to its many applications, such as hyper-parameter optimization and meta-learning. Although many bilevel optimization methods have recently been proposed, these methods do not use adaptive learning rates, even though it is well known that adaptive learning rates can accelerate many optimization algorithms, including (stochastic) gradient-based algorithms. To fill this gap, we propose a novel fast adaptive framework for solving bilevel optimization problems in which the outer problem is possibly nonconvex and the inner problem is strongly convex. Our framework uses unified adaptive matrices that cover many types of adaptive learning rates, and it can flexibly incorporate momentum and variance-reduction techniques. In particular, we provide a useful convergence analysis framework for bilevel optimization. Specifically, we propose a fast single-loop adaptive bilevel optimization (BiAdam) algorithm based on the basic momentum technique, which achieves a sample complexity of Õ(ϵ^{-4}) for finding an ϵ-stationary point. Meanwhile, we propose an accelerated version of the BiAdam algorithm (VR-BiAdam) based on a variance-reduction technique, which reaches the best known sample complexity of Õ(ϵ^{-3}) without relying on large batch sizes. To the best of our knowledge, we are the first to study bilevel optimization methods with adaptive learning rates. Experimental results on data hyper-cleaning and hyper-representation learning tasks demonstrate the efficiency of the proposed algorithms.

How can we design effective optimization methods with adaptive learning rates for bilevel problems? In this paper, we provide an affirmative answer to this question and propose a class of fast single-loop adaptive bilevel optimization methods based on unified adaptive matrices, which cover many types of adaptive learning rates.
Moreover, our framework can flexibly incorporate momentum and variance-reduction techniques. Our main contributions are summarized as follows: 1) We propose a fast single-loop adaptive bilevel optimization algorithm (BiAdam) based on the basic momentum technique, which achieves a sample complexity of Õ(ϵ^{-4}) for finding an ϵ-stationary point. 2) We propose a single-loop accelerated version of the BiAdam algorithm (VR-BiAdam) based on the momentum-based variance-reduction technique, which reaches the best known sample complexity of Õ(ϵ^{-3}). 3) We provide a useful convergence analysis framework for both constrained and unconstrained bilevel programming under some mild conditions (see Table 1). 4) Experimental results on hyper-parameter learning tasks demonstrate the efficiency of the proposed algorithms.

1. INTRODUCTION

Bilevel optimization is a popular class of hierarchical optimization that has been applied to a wide range of machine learning problems, such as hyper-parameter optimization Shaban et al. (2019), meta-learning Ji et al. (2021); Liu et al. (2021a) and policy optimization Hong et al. (2020). In this paper, we consider solving the following stochastic bilevel optimization problem:

$$\min_{x\in\mathcal{X}} F(x) = f(x, y^*(x)) = \mathbb{E}_{\xi}\big[f(x, y^*(x); \xi)\big], \quad \text{s.t.}\ \ y^*(x) = \arg\min_{y\in\mathcal{Y}} g(x, y) = \mathbb{E}_{\zeta}\big[g(x, y; \zeta)\big], \qquad (1)$$

where F(x) is differentiable and possibly nonconvex, g(x, y) is differentiable and strongly convex in the variable y, and ξ and ζ are random variables following unknown distributions D and M, respectively. Here X ⊆ R^d and Y ⊆ R^p are closed convex sets. Problem (1) covers many machine learning problems with a hierarchical structure, including hyper-parameter optimization Franceschi et al. (2018), meta-learning Franceschi et al. (2018), policy optimization Hong et al. (2020) and neural architecture search Liu et al. (2018). Since bilevel optimization has been widely applied in machine learning, some recent works have begun to study it. For example, Ghadimi & Wang (2018); Ji et al. (2021) proposed a class of double-loop methods to solve problem (1). However, to obtain an accurate estimate, BSA Ghadimi & Wang (2018) needs to solve the inner problem to high accuracy, and stocBiO Ji et al. (2021) requires large batch sizes when solving the inner problem.

Table 1: Sample complexity of representative bilevel optimization methods for finding an ϵ-stationary point of the bilevel problem (1), i.e., E∥∇F(x)∥ ≤ ϵ or its equivalent variants. BSize denotes mini-batch size; ALR denotes adaptive learning rate. C(x, y) denotes the constraint sets on x and y, where Y denotes that there exists a convex constraint on the variable, and N denotes that there is none.
DD denotes dimension dependence in the gradient estimators, and p denotes the dimension of the variable y. 1 denotes Lipschitz continuity of ∇x f(x, y; ξ), ∇y f(x, y; ξ), ∇y g(x, y; ζ), ∇²xy g(x, y; ζ) and ∇²yy g(x, y; ζ) for all ξ, ζ; 2 denotes Lipschitz continuity of ∇x f(x, y), ∇y f(x, y), ∇y g(x, y), ∇²xy g(x, y) and ∇²yy g(x, y); 3 denotes bounded stochastic partial derivatives ∇y f(x, y; ξ) and ∇²xy g(x, y; ζ); 4 denotes bounded stochastic partial derivatives ∇x f(x, y; ξ) and ∇²yy g(x, y; ζ); 5 denotes bounded true partial derivatives ∇y f(x, y) and ∇²xy g(x, y); 6 denotes Lipschitz continuity of the function f(x, y; ξ); 7 denotes that g(x, y; ζ) is L_g-smooth and µ-strongly convex w.r.t. y for all ζ; 8 denotes that g(x, y) is L_g-smooth and µ-strongly convex w.r.t. y.

Hong et al. (2020) proposed a class of single-loop methods to solve bilevel problems. Subsequently, Khanduri et al. (2021); Guo & Yang (2021); Yang et al. (2021); Chen et al. (2022) presented accelerated single-loop methods based on the momentum-based variance-reduction technique of STORM Cutkosky & Orabona (2019). More recently, Dagréou et al. (2022) developed a novel framework for bilevel optimization based on a linear system, and proposed a fast SABA algorithm for finite-sum bilevel problems based on the variance-reduction technique of SAGA (Defazio et al., 2014). Although these methods can effectively solve bilevel problems, they do not consider adaptive learning rates and only consider the unconstrained setting. Since the inner and outer problems generally require different learning rates to ensure convergence, we consider using different adaptive learning rates for the inner and outer problems with a convergence guarantee. Clearly, this cannot simply follow the existing adaptive methods for single-level problems. Thus, there exists a natural question:

2. PRELIMINARIES

2.1 NOTATIONS

U{1, 2, ..., K} denotes the uniform distribution over the discrete set {1, 2, ..., K}. ∥·∥ denotes the ℓ2 norm for vectors and the spectral norm for matrices. ⟨x, y⟩ denotes the inner product of two vectors x and y. For vectors x and y, x^r (r > 0) denotes the element-wise power operation, x/y denotes element-wise division and max(x, y) denotes the element-wise maximum. I_d denotes the d-dimensional identity matrix. A ≻ 0 denotes that the matrix A is positive definite. Given a function f(x, y), f(x, ·) denotes the function of the second variable with x fixed, and f(·, y) denotes the function of the first variable with y fixed. a = O(b) denotes that a ≤ Cb for some constant C > 0. The notation Õ(·) hides logarithmic terms. Given a closed convex set X, we define the projection onto X as P_X(z) = argmin_{x∈X} ½∥x − z∥².
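For concreteness, the projection P_X has a simple closed form for common constraint sets; the following minimal NumPy sketch is illustrative (the two set choices are examples, not taken from the paper):

```python
import numpy as np

def proj_box(z, lo, hi):
    # Euclidean projection onto the box X = [lo, hi]^d:
    # P_X(z) = argmin_{x in X} 0.5 * ||x - z||^2, which decouples coordinate-wise.
    return np.clip(z, lo, hi)

def proj_l2_ball(z, radius=1.0):
    # Euclidean projection onto the l2 ball {x : ||x|| <= radius}:
    # rescale z only when it lies outside the ball.
    norm = np.linalg.norm(z)
    return z if norm <= radius else z * (radius / norm)
```

Both projections are used only as black boxes in the algorithms below; any convex X with a computable projection works.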

2.2. SOME MILD ASSUMPTIONS

In this subsection, we give some mild assumptions on problem (1).

Assumption 1. For any x and ζ, g(x, y; ζ) is L_g-smooth and µ-strongly convex in y, i.e., L_g I_p ⪰ ∇²yy g(x, y; ζ) ⪰ µI_p.

Assumption 2. For the functions f(x, y) and g(x, y), for all x ∈ X and y ∈ Y, we assume the following conditions hold: ∇x f(x, y) and ∇y f(x, y) are L_f-Lipschitz continuous, ∇y g(x, y) is L_g-Lipschitz continuous, ∇²xy g(x, y) is L_gxy-Lipschitz continuous, and ∇²yy g(x, y) is L_gyy-Lipschitz continuous. For example, for all x, x1, x2 ∈ X and y, y1, y2 ∈ Y, we have ∥∇x f(x1, y) − ∇x f(x2, y)∥ ≤ L_f ∥x1 − x2∥ and ∥∇x f(x, y1) − ∇x f(x, y2)∥ ≤ L_f ∥y1 − y2∥.

Assumption 3. For the stochastic functions f(x, y; ξ) and g(x, y; ζ), for all x ∈ X, y ∈ Y, ξ and ζ, we assume the following conditions hold: ∇x f(x, y; ξ) and ∇y f(x, y; ξ) are L_f-Lipschitz continuous, ∇y g(x, y; ζ) is L_g-Lipschitz continuous, ∇²xy g(x, y; ζ) is L_gxy-Lipschitz continuous, and ∇²yy g(x, y; ζ) is L_gyy-Lipschitz continuous. For example, for all x, x1, x2 ∈ X and y, y1, y2 ∈ Y, we have ∥∇x f(x1, y; ξ) − ∇x f(x2, y; ξ)∥ ≤ L_f ∥x1 − x2∥ and ∥∇x f(x, y1; ξ) − ∇x f(x, y2; ξ)∥ ≤ L_f ∥y1 − y2∥.

Assumption 4. The partial derivatives ∇y f(x, y) and ∇²xy g(x, y) are bounded, i.e., ∥∇y f(x, y)∥² ≤ C²_fy and ∥∇²xy g(x, y)∥² ≤ C²_gxy.

Assumption 5. The stochastic functions f(x, y; ξ) and g(x, y; ζ) have unbiased stochastic partial derivatives with bounded variance, e.g., E[∇x f(x, y; ξ)] = ∇x f(x, y) and E∥∇x f(x, y; ξ) − ∇x f(x, y)∥² ≤ σ². The same conditions hold for ∇y f(x, y; ξ), ∇y g(x, y; ζ), ∇²xy g(x, y; ζ) and ∇²yy g(x, y; ζ).

Note that Assumption 3 is clearly stricter than Assumption 2: given Assumption 3, we have ∥∇x f(x1, y) − ∇x f(x2, y)∥ = ∥E[∇x f(x1, y; ξ) − ∇x f(x2, y; ξ)]∥ ≤ E∥∇x f(x1, y; ξ) − ∇x f(x2, y; ξ)∥ ≤ L_f ∥x1 − x2∥ for any x1, x2, y.
At the same time, based on Assumptions 4-5, we also have E∥∇y f(x, y; ξ)∥² = E∥∇y f(x, y; ξ) − ∇y f(x, y) + ∇y f(x, y)∥² ≤ 2E∥∇y f(x, y; ξ) − ∇y f(x, y)∥² + 2∥∇y f(x, y)∥² ≤ 2σ² + 2C²_fy, and similarly E∥∇²xy g(x, y; ζ)∥² ≤ 2σ² + 2C²_gxy. Thus, under Assumption 5, assuming bounded true partial derivatives ∇y f(x, y) and ∇²xy g(x, y) is no stronger than assuming bounded stochastic partial derivatives ∇y f(x, y; ξ) and ∇²xy g(x, y; ζ) for all ξ and ζ.

2.3. BILEVEL OPTIMIZATION

In this subsection, we review the basic first-order method for solving problem (1). A natural iteration to update the variables x, y at the t-th step is

$$y_{t+1} = \mathcal{P}_{\mathcal{Y}}\big(y_t - \lambda \nabla_y g(x_t, y_t)\big), \quad x_{t+1} = \mathcal{P}_{\mathcal{X}}\big(x_t - \gamma \nabla_x f(x_t, y^*(x_t))\big),$$

where λ > 0 and γ > 0 denote step sizes. Clearly, if the inner problem of (1) has no closed-form solution, i.e., y_{t+1} ≠ y*(x_t), we cannot easily obtain the gradient ∇F(x_t) = ∇f(x_t, y*(x_t)). Thus, one of the key points in solving problem (1) is to estimate the gradient ∇F(x_t).

Lemma 1. (Lemma 2.1 in Ghadimi & Wang (2018)) Under the above Assumption 2, for any x ∈ X we have

$$\nabla F(x) = \nabla_x f(x, y^*(x)) - \nabla^2_{xy} g(x, y^*(x)) \big[\nabla^2_{yy} g(x, y^*(x))\big]^{-1} \nabla_y f(x, y^*(x)).$$

From Lemma 1, it is natural to estimate ∇F(x) by

$$\bar{\nabla} f(x, y) = \nabla_x f(x, y) - \nabla^2_{xy} g(x, y) \big[\nabla^2_{yy} g(x, y)\big]^{-1} \nabla_y f(x, y), \quad \forall x \in \mathcal{X},\ y \in \mathcal{Y}.$$

Note that although the inner problem of (1) is constrained, we assume that the optimality condition of the inner problem is still ∇y g(x, y*(x)) = 0 with y*(x) ∈ Y.

Lemma 2. (Lemma 2.2 in Ghadimi & Wang (2018)) Under the above Assumptions (1, 2, 4), for all x, x1, x2 ∈ X and y ∈ Y, we have

$$\|\bar{\nabla} f(x, y) - \nabla F(x)\| \le L_y \|y^*(x) - y\|, \quad \|y^*(x_1) - y^*(x_2)\| \le \kappa \|x_1 - x_2\|, \quad \|\nabla F(x_1) - \nabla F(x_2)\| \le L \|x_1 - x_2\|,$$

where L_y = L_f + L_f C_gxy/µ + C_fy L_gxy/µ + L_gyy C_gxy/µ², κ = C_gxy/µ, and L = L_f + (L_f + L_y)C_gxy/µ + C_fy L_gxy/µ + L_gyy C_gxy/µ².

For stochastic bilevel optimization, Yang et al. (2021); Hong et al. (2020) provided a stochastic estimator of ∇F(x) as follows:

$$\bar{\nabla} f(x, y; S) = \nabla_x f(x, y; \xi) - \nabla^2_{xy} g(x, y; \zeta)\, \vartheta \sum_{q=-1}^{Q-1} \prod_{i=Q-q}^{Q} \big(I_p - \vartheta \nabla^2_{yy} g(x, y; \zeta_i)\big)\, \nabla_y f(x, y; \xi), \qquad (3)$$

where ϑ > 0 and Q ≥ 1. Here S = {ξ, ζ, ζ_1, ..., ζ_Q}, where ξ is drawn from the distribution D, and {ζ, ζ_1, ..., ζ_Q} are drawn from the distribution M.
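To make Lemma 1 concrete, the following self-contained NumPy sketch evaluates the hypergradient formula on a toy quadratic instance where y*(x) and ∇F(x) are available in closed form. The instance (f, g, and all dimensions) is our illustrative choice, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 3, 2
C = rng.standard_normal((p, d))
b = rng.standard_normal(p)
x = rng.standard_normal(d)

# Toy instance: f(x, y) = 0.5||x||^2 + b^T y and g(x, y) = 0.5||y - Cx||^2,
# so y*(x) = Cx, F(x) = 0.5||x||^2 + b^T Cx, and ∇F(x) = x + C^T b.
y_star = C @ x

grad_x_f = x                 # ∇x f(x, y*(x))
grad_y_f = b                 # ∇y f(x, y*(x))
g_xy = -C.T                  # ∇²xy g(x, y*(x)), shape (d, p)
g_yy = np.eye(p)             # ∇²yy g(x, y*(x)), here µ-strongly convex with µ = 1

# Lemma 1: ∇F(x) = ∇x f - ∇²xy g [∇²yy g]^{-1} ∇y f
hypergrad = grad_x_f - g_xy @ np.linalg.solve(g_yy, grad_y_f)
```

On this instance, `hypergrad` matches the analytic gradient x + C^T b, which is exactly what the implicit-differentiation formula promises.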

3. ADAPTIVE BILEVEL OPTIMIZATION METHODS

In this section, we propose a class of fast single-loop adaptive bilevel optimization methods to solve problem (1). Specifically, our methods adopt the unified adaptive learning rates as in Huang et al. (2021). Moreover, our methods can flexibly incorporate momentum and variance-reduction techniques.

3.1. BIADAM ALGORITHM

In this subsection, we propose a fast single-loop adaptive bilevel optimization method (BiAdam) based on the basic momentum technique. Algorithm 1 shows the algorithmic framework of our BiAdam algorithm. At line 4 of Algorithm 1, we generate the adaptive matrices A_t and B_t for updating the variables x and y, respectively. Specifically, we use a general adaptive matrix A_t ⪰ ρI_d (ρ > 0) for the variable x, and a global adaptive matrix B_t = b_t I_p (b_t > 0) for the variable y. For example, we can generate the matrix A_t as in Adam Kingma & Ba (2014), and the matrix B_t as in a novel version of AdaGrad-Norm Ward et al. (2019), defined as

$$\tilde{w}_t = \alpha \tilde{w}_{t-1} + (1-\alpha)\, \nabla_x f(x_t, y_t; \xi_t)^2, \quad \tilde{w}_0 = 0, \quad A_t = \mathrm{diag}\big(\sqrt{\tilde{w}_t} + \rho\big), \ t \ge 1, \qquad (4)$$
$$b_t = \beta b_{t-1} + (1-\beta)\, \|\nabla_y g(x_t, y_t; \zeta_t)\|, \quad b_0 > 0, \quad B_t = (b_t + \varepsilon) I_p, \ t \ge 1,$$

where α, β ∈ (0, 1), ρ > 0 and ε > 0. At lines 5-6 of Algorithm 1, we use the generalized projection gradient iteration with Bregman distance Censor & Zenios (1992); Huang et al. (2021) to update the variables x and y, respectively.

Algorithm 1 BiAdam Algorithm
1: Input: T, K ∈ N, parameters {γ, λ, η_t, α_t, β_t} and initial inputs x_1 ∈ X and y_1 ∈ Y;
2: Initialize: draw K + 2 independent samples ξ̄_1 = {ξ_1, ζ^0_1, ζ^1_1, ..., ζ^{K-1}_1} and ζ_1, and then compute v_1 = ∇_y g(x_1, y_1; ζ_1) and w_1 = ∇̂f(x_1, y_1; ξ̄_1) generated from (6);
3: for t = 1, 2, ..., T do
4:   Generate adaptive matrices A_t ∈ R^{d×d}, B_t ∈ R^{p×p};
5:   x̃_{t+1} = argmin_{x∈X} {⟨w_t, x⟩ + (1/2γ)(x − x_t)^T A_t (x − x_t)}, and x_{t+1} = x_t + η_t(x̃_{t+1} − x_t);
6:   ỹ_{t+1} = argmin_{y∈Y} {⟨v_t, y⟩ + (1/2λ)(y − y_t)^T B_t (y − y_t)}, and y_{t+1} = y_t + η_t(ỹ_{t+1} − y_t);
7:   Draw K + 2 independent samples ξ̄_{t+1} = {ξ_{t+1}, ζ^0_{t+1}, ..., ζ^{K-1}_{t+1}} and ζ_{t+1};
8:   v_{t+1} = α_{t+1} ∇_y g(x_{t+1}, y_{t+1}; ζ_{t+1}) + (1 − α_{t+1}) v_t;
9:   w_{t+1} = β_{t+1} ∇̂f(x_{t+1}, y_{t+1}; ξ̄_{t+1}) + (1 − β_{t+1}) w_t;
10: end for
11: Output: x and y chosen uniformly at random from {x_t, y_t}^T_{t=1}.
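As a concrete illustration of the update (4), the following NumPy sketch maintains A_t by its diagonal and B_t by its scalar; the default parameter values are illustrative, not the paper's tuned settings:

```python
import numpy as np

def update_adaptive_matrices(w_bar, b, grad_x, grad_y,
                             alpha=0.9, beta=0.9, rho=1e-2, eps=1e-2):
    # Adam-style coordinate-wise matrix for x, as in eq. (4):
    #   w_t = alpha * w_{t-1} + (1 - alpha) * grad_x^2,  A_t = diag(sqrt(w_t) + rho),
    # so A_t >= rho * I_d holds by construction.
    w_bar = alpha * w_bar + (1.0 - alpha) * grad_x ** 2
    A_diag = np.sqrt(w_bar) + rho
    # AdaGrad-Norm-style global matrix for y:
    #   b_t = beta * b_{t-1} + (1 - beta) * ||grad_y||,  B_t = (b_t + eps) * I_p.
    b = beta * b + (1.0 - beta) * np.linalg.norm(grad_y)
    return w_bar, A_diag, b, b + eps   # last entry is the scalar defining B_t
```

Storing only the diagonal of A_t and the scalar of B_t keeps the per-iteration cost linear in the dimension, which is what makes these adaptive matrices cheap inside a single-loop method.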
When X = R^d and Y = R^p, i.e., problem (1) is unconstrained, we have x_{t+1} = x_t − γη_t A_t^{-1} w_t and y_{t+1} = y_t − λη_t B_t^{-1} v_t. At line 7 of Algorithm 1, we draw K + 1 independent samples ξ̄ = {ξ, ζ_0, ζ_1, ..., ζ_{K-1}} from the distributions D and M, and define a stochastic gradient estimator as in Khanduri et al. (2021):

$$\hat{\nabla} f(x, y; \bar{\xi}) = \nabla_x f(x, y; \xi) - \nabla^2_{xy} g(x, y; \zeta_0)\, \frac{K}{L_g} \prod_{i=1}^{k} \Big(I_p - \frac{1}{L_g} \nabla^2_{yy} g(x, y; \zeta_i)\Big)\, \nabla_y f(x, y; \xi), \qquad (6)$$

where K ≥ 1 and k ∼ U{0, 1, ..., K − 1} is a uniform random variable independent of ξ̄. In fact, the estimator (6) is a special case of the above estimator (3). In practice, we can thus use a tuning parameter ϑ ∈ (0, 1/L_g] instead of 1/L_g in the estimator (6), as in Yang et al. (2021). Here the term (K/L_g) ∏_{i=1}^{k} (I_p − (1/L_g)∇²_{yy} g(x, y; ζ_i)) approximates the Hessian inverse [∇²_{yy} g(x, y; ζ)]^{-1}. Clearly, ∇̂f(x, y; ξ̄) is a biased estimator of ∇̄f(x, y), i.e., E_ξ̄[∇̂f(x, y; ξ̄)] ≠ ∇̄f(x, y). In the following Lemma 3, we show that the bias R(x, y) = ∇̄f(x, y) − E_ξ̄[∇̂f(x, y; ξ̄)] of the gradient estimator (6) decays exponentially fast in K.

Lemma 3. (Lemma 2.1 in Khanduri et al. (2021) and Lemma 11 in Hong et al. (2020)) Under the above Assumptions (1, 4), for any K ≥ 1, the gradient estimator (6) satisfies

$$\|R(x, y)\| \le \frac{C_{gxy} C_{fy}}{\mu} \Big(1 - \frac{\mu}{L_g}\Big)^K, \quad \text{where } R(x, y) = \bar{\nabla} f(x, y) - \mathbb{E}_{\bar{\xi}}\big[\hat{\nabla} f(x, y; \bar{\xi})\big].$$

From Lemma 3, choosing K = (L_g/µ) log(C_gxy C_fy T/µ) in Algorithm 1 gives ∥R(x, y)∥ ≤ 1/T for all t ≥ 1. Thus, this result guarantees the convergence of our algorithms while requiring only a small mini-batch of samples. For notational simplicity, let R_t = R(x_t, y_t) for all t ≥ 1.

Lemma 4. (Lemma 3.1 in Khanduri et al.
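The following NumPy sketch implements the randomized Neumann-series idea behind estimator (6) and checks numerically, on a toy one-dimensional instance, that averaging over the random truncation index k approximates the Hessian inverse. The instance, constants, and sample count are all illustrative choices:

```python
import numpy as np

def neumann_hypergrad_estimate(grad_x_f, grad_y_f, g_xy, g_yy_samples, K, L_g, rng):
    # Estimator in the spirit of eq. (6): draw k ~ U{0, ..., K-1} and apply
    # (K / L_g) * prod_{i=1}^{k} (I - g_yy_i / L_g) to grad_y_f as a randomized
    # approximation of (Hessian)^{-1} grad_y_f.
    k = int(rng.integers(0, K))
    v = grad_y_f.copy()
    for i in range(k):
        v = v - (g_yy_samples[i] @ v) / L_g
    return grad_x_f - (K / L_g) * (g_xy @ v)

# Bias check on a deterministic 1-d instance: with Hessian H = 0.5 and L_g = 1,
# averaging the estimator over the randomness of k should approximate
# H^{-1} * grad_y_f = 2 (here g_xy = -I, so the estimate itself converges to +2).
rng = np.random.default_rng(0)
K = 30
H = np.array([[0.5]])
samples = [H] * K                       # deterministic "stochastic" Hessians
vals = [neumann_hypergrad_estimate(np.zeros(1), np.ones(1), -np.eye(1),
                                   samples, K, 1.0, rng)[0]
        for _ in range(20000)]
approx_inverse = float(np.mean(vals))
```

A single draw of the estimator is cheap (at most K Hessian-vector products), and by Lemma 3 its bias vanishes geometrically in K, which is why K = Õ(1) suffices in the analysis.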
(2021)) Under the above Assumptions (1, 3, 4), the stochastic gradient estimator ∇̂f(x, y; ξ̄) is Lipschitz continuous, such that for x, x1, x2 ∈ X and y, y1, y2 ∈ Y,

$$\mathbb{E}_{\bar{\xi}}\|\hat{\nabla} f(x_1, y; \bar{\xi}) - \hat{\nabla} f(x_2, y; \bar{\xi})\|^2 \le L_K^2 \|x_1 - x_2\|^2, \quad \mathbb{E}_{\bar{\xi}}\|\hat{\nabla} f(x, y_1; \bar{\xi}) - \hat{\nabla} f(x, y_2; \bar{\xi})\|^2 \le L_K^2 \|y_1 - y_2\|^2,$$

where $L_K^2 = 2L_f^2 + \frac{6C_{gxy}^2 L_f^2 K}{2\mu L_g - \mu^2} + \frac{6C_{fy}^2 L_{gxy}^2 K}{2\mu L_g - \mu^2} + \frac{6C_{gxy}^2 L_f^2 K^3 L_{gyy}^2}{(L_g-\mu)^2(2\mu L_g - \mu^2)}$.

3.2. VR-BIADAM ALGORITHM

In this subsection, we propose an accelerated version of the BiAdam method (VR-BiAdam) by using the momentum-based variance-reduction technique. Algorithm 2 shows the algorithmic framework of our VR-BiAdam algorithm; it differs from Algorithm 1 only in the variance-reduced estimator updates at lines 8-9.

Algorithm 2 VR-BiAdam Algorithm
1: Input: T, K ∈ N, parameters {γ, λ, η_t, α_t, β_t} and initial inputs x_1 ∈ X and y_1 ∈ Y;
2: Initialize: draw K + 2 independent samples ξ̄_1 = {ξ_1, ζ^0_1, ζ^1_1, ..., ζ^{K-1}_1} and ζ_1, and then compute v_1 = ∇_y g(x_1, y_1; ζ_1) and w_1 = ∇̂f(x_1, y_1; ξ̄_1) generated from (6);
3: for t = 1, 2, ..., T do
4:   Generate adaptive matrices A_t ∈ R^{d×d}, B_t ∈ R^{p×p};
5:   x̃_{t+1} = argmin_{x∈X} {⟨w_t, x⟩ + (1/2γ)(x − x_t)^T A_t (x − x_t)}, and x_{t+1} = x_t + η_t(x̃_{t+1} − x_t);
6:   ỹ_{t+1} = argmin_{y∈Y} {⟨v_t, y⟩ + (1/2λ)(y − y_t)^T B_t (y − y_t)}, and y_{t+1} = y_t + η_t(ỹ_{t+1} − y_t);
7:   Draw K + 2 independent samples ξ̄_{t+1} = {ξ_{t+1}, ζ^0_{t+1}, ..., ζ^{K-1}_{t+1}} and ζ_{t+1};
8:   v_{t+1} = ∇_y g(x_{t+1}, y_{t+1}; ζ_{t+1}) + (1 − α_{t+1})[v_t − ∇_y g(x_t, y_t; ζ_{t+1})];
9:   w_{t+1} = ∇̂f(x_{t+1}, y_{t+1}; ξ̄_{t+1}) + (1 − β_{t+1})[w_t − ∇̂f(x_t, y_t; ξ̄_{t+1})];
10: end for
11: Output: x and y chosen uniformly at random from {x_t, y_t}^T_{t=1}.
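The key change relative to Algorithm 1 is the STORM-style recursion named above; the helper below is a generic illustrative reading of that recursion (it is a sketch of the technique, not code from the paper):

```python
def storm_update(est_prev, grad_new_at_new, grad_old_at_new, momentum):
    # Momentum-based variance-reduced (STORM-style) estimator update:
    #   v_{t+1} = grad(z_{t+1}; sample_{t+1})
    #             + (1 - a_{t+1}) * [v_t - grad(z_t; sample_{t+1})],
    # where both gradient evaluations share the SAME fresh sample, so the
    # correction term cancels the stale noise carried by the old estimate.
    return grad_new_at_new + (1.0 - momentum) * (est_prev - grad_old_at_new)
```

Setting momentum = 1 recovers a plain stochastic gradient step, while small momentum values reuse past information and shrink the estimator variance, which is what yields the improved Õ(ϵ^{-3}) complexity.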

4. THEORETICAL ANALYSIS

In this section, we study the convergence properties of our algorithms (BiAdam and VR-BiAdam) under some mild conditions. All proofs are provided in the Appendix A.

4.1. ADDITIONAL MILD ASSUMPTIONS

Assumption 6. The estimated stochastic partial derivative ∇̂f(x, y; ξ̄) satisfies E_ξ̄[∇̂f(x, y; ξ̄)] = ∇̄f(x, y) + R(x, y) and E_ξ̄∥∇̂f(x, y; ξ̄) − ∇̄f(x, y) − R(x, y)∥² ≤ σ². The stochastic partial derivative ∇_y g(x, y; ζ) satisfies E[∇_y g(x, y; ζ)] = ∇_y g(x, y) and E∥∇_y g(x, y; ζ) − ∇_y g(x, y)∥² ≤ σ².

Assumption 7. In our algorithms, the adaptive matrices A_t and B_t for all t ≥ 1 satisfy A_t ⪰ ρI_d (ρ > 0) and B_t = b_t I_p (b_u ≥ b_t ≥ b_l > 0), respectively, where ρ, b_u and b_l are appropriate positive constants.

Assumption 6 is commonly used in stochastic bilevel optimization methods Ji et al. (2021); Yang et al. (2021); Khanduri et al. (2021). In this paper, we consider general adaptive learning rates (including coordinate-wise and global learning rates) for the variable x and a global learning rate for the variable y. Assumption 7 ensures that the adaptive matrices A_t for all t ≥ 1 are positive definite, as in Huang et al. (2021), and also guarantees that the global adaptive matrices B_t for all t ≥ 1 are positive definite and bounded. In fact, Assumption 7 is mild. For example, for the problem min_{x∈R^p} E[f(x; ξ)], Ward et al. (2019) apply a global adaptive learning rate in the update

$$x_t = x_{t-1} - \eta \frac{\nabla f(x_{t-1}; \xi_{t-1})}{b_t}, \quad b_t^2 = b_{t-1}^2 + \|\nabla f(x_{t-1}; \xi_{t-1})\|^2, \quad b_0 > 0,\ \eta > 0,$$

for all t ≥ 1, which is equivalent to the form x_t = x_{t-1} − ηB_t^{-1}∇f(x_{t-1}; ξ_{t-1}) with B_t = b_t I_p and b_t ≥ ⋯ ≥ b_0 > 0. Similarly, Li & Orabona (2019); Cutkosky & Orabona (2019) use a global adaptive learning rate in the update x_{t+1} = x_t − ηg_t/b_t, where g_t is a stochastic gradient and b_t = (ω + Σ_{i=1}^t ∥∇f(x_i; ξ_i)∥²)^α / k with k > 0, ω > 0 and α ∈ (0, 1), which is equivalent to x_{t+1} = x_t − ηB_t^{-1}g_t with B_t = b_t I_p and b_t ≥ ⋯ ≥ b_0 = ω^α/k > 0. Meanwhile, the iterates approach a stationary point of the problem min_{x∈R^p} f(x) = E[f(x; ξ)], i.e., ∇f(x) = 0 or even ∇f(x; ξ) = 0 for all ξ.
Thus, these global adaptive learning rates are generally bounded, i.e., b u ≥ b t ≥ b l > 0 for all t ≥ 1.
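For instance, the Ward et al. (2019) recursion discussed above can be sketched as follows; this is a minimal NumPy illustration of the boundedness argument, with illustrative constants:

```python
import numpy as np

def adagrad_norm_step(x, b_sq, grad, eta=0.1):
    # AdaGrad-Norm-style step with global adaptive matrix B_t = b_t * I:
    #   b_t^2 = b_{t-1}^2 + ||grad||^2,  x_t = x_{t-1} - eta * grad / b_t.
    # The sequence b_t is nondecreasing, so b_t >= b_0 > 0 always holds;
    # it stays bounded above whenever the gradients shrink fast enough.
    b_sq = b_sq + float(grad @ grad)
    return x - eta * grad / np.sqrt(b_sq), b_sq
```

The monotonicity of b_t is what gives the lower bound b_l = b_0 for free; the upper bound b_u is the part that relies on the gradients vanishing near stationary points.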

4.2. USEFUL CONVERGENCE METRIC

In this subsection, we define a useful convergence metric for our algorithms and give some useful lemmas.

Lemma 5. For the gradient estimator w_t generated by Algorithm 1 or 2, for all t ≥ 1, we have

$$\|w_t - \nabla F(x_t)\|^2 \le L_0^2 \|y^*(x_t) - y_t\|^2 + 2\|w_t - \bar{\nabla} f(x_t, y_t)\|^2, \quad \text{where } L_0^2 = 8\Big(L_f^2 + \frac{L_{gxy}^2 C_{fy}^2}{\mu^2} + \frac{L_{gyy}^2 C_{gxy}^2 C_{fy}^2}{\mu^4} + \frac{L_f^2 C_{gxy}^2}{\mu^2}\Big).$$

Based on Lemma 5, for Algorithms 1 and 2 we provide a convergence metric E[G_t], defined as

$$G_t = \frac{1}{\gamma}\|\tilde{x}_{t+1} - x_t\| + \frac{1}{\rho}\Big(\sqrt{2}\,\|w_t - \bar{\nabla} f(x_t, y_t)\| + L_0\|y^*(x_t) - y_t\|\Big),$$

where the first two terms of G_t measure the convergence of the iterates {x_t}^T_{t=1}, and the last term measures the convergence of the iterates {y_t}^T_{t=1}. Let ϕ_t(x) = ½x^T A_t x, and define the Bregman distance Ghadimi et al. (2016) associated with ϕ_t(x) as

$$V_t(x, x_t) = \phi_t(x) - \phi_t(x_t) - \langle \nabla \phi_t(x_t), x - x_t \rangle = \frac{1}{2}(x - x_t)^T A_t (x - x_t).$$

Line 5 of Algorithm 1 or 2 is then equivalent to the generalized projection problem x̃_{t+1} = argmin_{x∈X} {⟨w_t, x⟩ + (1/γ)V_t(x, x_t)}. As in Ghadimi et al. (2016), we define the generalized projected gradient G_X(x_t, w_t, γ) = (1/γ)(x_t − x̃_{t+1}). At the same time, we define the gradient mapping G_X(x_t, ∇F(x_t), γ) = (1/γ)(x_t − x^+_{t+1}) with x^+_{t+1} = argmin_{x∈X} {⟨∇F(x_t), x⟩ + (1/γ)V_t(x, x_t)}. According to Proposition 1 of Ghadimi et al. (2016), we have ∥G_X(x_t, w_t, γ) − G_X(x_t, ∇F(x_t), γ)∥ ≤ (1/ρ)∥∇F(x_t) − w_t∥. Since ∥G_X(x_t, ∇F(x_t), γ)∥ ≤ ∥G_X(x_t, w_t, γ)∥ + ∥G_X(x_t, w_t, γ) − G_X(x_t, ∇F(x_t), γ)∥, we obtain

$$\|G_{\mathcal{X}}(x_t, \nabla F(x_t), \gamma)\| \le \|G_{\mathcal{X}}(x_t, w_t, \gamma)\| + \frac{1}{\rho}\|\nabla F(x_t) - w_t\| \le \frac{1}{\gamma}\|\tilde{x}_{t+1} - x_t\| + \frac{1}{\rho}\Big(\sqrt{2}\,\|w_t - \bar{\nabla} f(x_t, y_t)\| + L_0\|y^*(x_t) - y_t\|\Big) = G_t, \qquad (11)$$

where the last inequality holds by Lemma 5. Thus, our new convergence measure E[G_t] is tighter than the standard gradient mapping E∥G_X(x_t, ∇F(x_t), γ)∥ used in Hong et al. (2020).
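For a diagonal A_t and a box constraint, the generalized projection step and the projected gradient G_X above have closed forms; a minimal NumPy sketch (the constraint set is chosen for illustration):

```python
import numpy as np

def generalized_proj_grad(x, w, gamma, A_diag, lo=-1.0, hi=1.0):
    # Generalized projection step with diagonal matrix A_t and box constraint:
    #   x_tilde = argmin_{z in X} <w, z> + (1 / (2*gamma)) (z - x)^T A (z - x),
    # which separates coordinate-wise into a preconditioned gradient step
    # followed by clipping onto the box.
    # The generalized projected gradient is G_X(x, w, gamma) = (x - x_tilde) / gamma.
    x_tilde = np.clip(x - gamma * w / A_diag, lo, hi)
    return (x - x_tilde) / gamma, x_tilde
```

When no constraint is active, G_X reduces to the preconditioned gradient A_t^{-1}w scaled by the coordinate-wise weights; when a constraint clips the step, G_X measures only the feasible portion of the movement, which is exactly the stationarity notion used in the theorems below.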
When G_t → 0, we have ∥G_X(x_t, ∇F(x_t), γ)∥ → 0, i.e., x_t approaches a stationary point or local minimum of the bilevel problem (1) Ghadimi et al. (2016); Hong et al. (2020).

4.3. CONVERGENCE ANALYSIS OF BIADAM ALGORITHM

In this subsection, we study the convergence properties of our BiAdam algorithm. The detailed proofs are provided in Appendix A.5.

Theorem 1. Under the above Assumptions (1, 2, 4, 6, 7), in Algorithm 1, let X ⊂ R^d, η_t = k/(m+t)^{1/2} for all t ≥ 0, α_{t+1} = c_1η_t, β_{t+1} = c_2η_t, m ≥ max{k², (c_1k)², (c_2k)²}, k > 0, 125L_0²/(6µ²) ≤ c_1 ≤ m^{1/2}/k, 9/2 ≤ c_2 ≤ m^{1/2}/k,

$$0 < \lambda \le \min\Big\{\frac{15 b_l L_0^2}{4 L_1^2 \mu}, \frac{b_l}{6 L_g}\Big\}, \quad 0 < \gamma \le \min\Big\{\frac{\sqrt{6}\,\lambda\mu\rho}{\sqrt{6 L_1^2 \lambda^2 \mu^2 + 125 b_u^2 L_0^2 \kappa^2}}, \frac{m^{1/2}\rho}{4 L k}\Big\},$$

and K = (L_g/µ) log(C_gxy C_fy T/µ). Then we have

$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|G_{\mathcal{X}}(x_t, \nabla F(x_t), \gamma)\| \le \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[G_t] \le \frac{2\sqrt{3G}\, m^{1/4}}{T^{1/2}} + \frac{2\sqrt{3G}}{T^{1/4}} + \frac{\sqrt{2}}{T} = \tilde{O}\Big(\frac{1}{T^{1/4}}\Big),$$

where $G = \frac{F(x_1) - F^*}{\rho k \gamma} + \frac{5 b_1 L_0^2 \Delta_0}{\rho^2 k \lambda \mu} + \frac{2\sigma^2}{\rho^2 k} + \frac{2 m \sigma^2}{\rho^2 k}\ln(m+T) + \frac{4(m+T)}{9\rho^2 k T^2} + \frac{8k}{\rho^2 T}$, Δ_0 = ∥y_1 − y*(x_1)∥² and $L_1^2 = \frac{12 L_g^2 \mu^2}{125 L_0^2} + \frac{2 L_0^2}{3}$.

Remark 1. Without loss of generality, let k = O(1) and m = O(1); then G = O(ln(m+T)) = Õ(1), so our BiAdam algorithm has a convergence rate of Õ(1/T^{1/4}). Setting (1/T)Σ_{t=1}^T E[G_t] = Õ(1/T^{1/4}) ≤ ϵ gives T = Õ(ϵ^{-4}). Our BiAdam algorithm only requires K + 2 = (L_g/µ) log(C_gxy C_fy T/µ) + 2 = Õ(1) samples to estimate the stochastic partial derivatives in each iteration, and needs T iterations in total. Thus, our BiAdam algorithm requires a sample complexity of (K + 2)T = Õ(ϵ^{-4}) for finding an ϵ-stationary point of problem (1). Note that the convergence properties of our BiAdam algorithm for unconstrained bilevel optimization are provided in Appendix A.3.

4.4. CONVERGENCE ANALYSIS OF VR-BIADAM ALGORITHM

In this subsection, we study the convergence properties of our VR-BiAdam algorithm. The detailed proofs are provided in Appendix A.6.

Theorem 2. Under the above Assumptions (1, 3, 4, 6, 7), in Algorithm 2, let X ⊂ R^d, η_t = k/(m+t)^{1/3} for all t ≥ 0, α_{t+1} = c_1η_t², β_{t+1} = c_2η_t², m ≥ max{2, k³, (c_1k)³, (c_2k)³}, k > 0, c_1 ≥ 2/(3k³) + 125L_0²/(6µ²), c_2 ≥ 2/(3k³) + 9/2,

$$0 < \lambda \le \min\Big\{\frac{15 b_l L_0^2}{16 L_2^2 \mu}, \frac{b_l}{6 L_g}\Big\}, \quad 0 < \gamma \le \min\Big\{\frac{\sqrt{6}\,\lambda\mu\rho}{2\sqrt{24 L_2^2 \lambda^2 \mu^2 + 125 b_u^2 L_0^2 \kappa^2}}, \frac{m^{1/3}\rho}{4 L k}\Big\},$$

and K = (L_g/µ) log(C_gxy C_fy T/µ). Then we have

$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|G_{\mathcal{X}}(x_t, \nabla F(x_t), \gamma)\| \le \frac{1}{T}\sum_{t=1}^{T} \mathbb{E}[G_t] \le \frac{2\sqrt{3M}\, m^{1/6}}{T^{1/2}} + \frac{2\sqrt{3M}}{T^{1/3}} + \frac{\sqrt{2}}{T} = \tilde{O}\Big(\frac{1}{T^{1/3}}\Big),$$

where $M = \frac{F(x_1) - F^*}{\rho k \gamma} + \frac{5 b_1 L_0^2 \Delta_0}{\rho^2 k \lambda \mu} + \frac{2 m^{1/3} \sigma^2}{\rho^2 k^2} + \frac{2 k^2 (c_1^2 + c_2^2) \sigma^2 \ln(m+T)}{\rho^2} + \frac{6 k (m+T)^{1/3}}{\rho^2 T}$, Δ_0 = ∥y_1 − y*(x_1)∥² and L_2² = L_g² + L_K².

Remark 2. Without loss of generality, let k = O(1) and m = O(1); then M = O(ln(m+T)) = Õ(1), so our VR-BiAdam algorithm has a convergence rate of Õ(1/T^{1/3}). Setting (1/T)Σ_{t=1}^T E[G_t] = Õ(1/T^{1/3}) ≤ ϵ gives T = Õ(ϵ^{-3}). Our VR-BiAdam algorithm requires K + 2 = (L_g/µ) log(C_gxy C_fy T/µ) + 2 = Õ(1) samples to estimate the stochastic partial derivatives in each iteration, and needs T iterations in total. Thus, our VR-BiAdam algorithm requires a sample complexity of (K + 2)T = Õ(ϵ^{-3}) for finding an ϵ-stationary point of problem (1). Note that the convergence properties of our VR-BiAdam algorithm for unconstrained bilevel optimization are provided in Appendix A.3.

5. NUMERICAL EXPERIMENTS

In this section, we perform two hyper-parameter optimization tasks to demonstrate the efficiency of our algorithms: 1) a data hyper-cleaning task over the MNIST dataset; 2) a hyper-representation learning task over the Omniglot dataset. In all experiments, we use a server with an AMD EPYC 7763 64-core CPU and one NVIDIA RTX A5000 GPU.

5.1. DATA HYPER-CLEANING

In the hyper-cleaning task, we clean a noisy dataset through a bilevel formulation. The precise formulation of the problem is included in Appendix A.1.1. In particular, we use a training set and a validation set, each containing 5000 images in our experiments. A portion of the training data is corrupted by randomly changing their labels, and we denote the portion of corrupted images by ρ. We compare our algorithms (i.e., BiAdam and VR-BiAdam) with various baselines; see Appendix A.1 for a brief introduction of the baselines. For all methods, we perform a grid search over hyper-parameters and choose the best setting. The detailed experimental setup is described in Appendix A.1.1. The experimental results are summarized in Figure 1. As shown by the figure, our BiAdam algorithm outperforms its non-adaptive counterparts such as stocBiO, MRBO and SUSTAIN. Furthermore, our VR-BiAdam achieves the best performance, outperforming VRBO, which requires large batch sizes every few iterations.

5.2. HYPER-REPRESENTATION LEARNING

In the hyper-representation learning task, we learn a hyper-representation of the data such that a linear classifier can be learned quickly with a small number of data samples. The precise formulation of the problem is included in Appendix A.1.2. We compare our algorithms (i.e., BiAdam and VR-BiAdam) with various baselines; see Appendix A.1 for a brief introduction of the baselines. For all methods, we perform a grid search over hyper-parameters and choose the best setting. The detailed experimental setup is described in Appendix A.1.2. The experimental results are summarized in

6. CONCLUSIONS

In this paper, we proposed a class of novel adaptive bilevel optimization methods for nonconvex-strongly-convex bilevel optimization problems. Our methods use unified adaptive matrices that cover many types of adaptive learning rates, and can flexibly incorporate momentum and variance-reduction techniques. Moreover, we provided a useful convergence analysis framework for both constrained and unconstrained bilevel optimization. Our VR-BiAdam algorithm reaches the best known sample complexity of Õ(ϵ^{-3}) for finding an ϵ-stationary point.

A APPENDIX

In this section, we provide the additional experiment results, related works and additional theoretical results. We also provide the detailed convergence analysis.

A.1 ADDITIONAL EXPERIMENTAL DETAILS AND RESULTS

In this subsection, we provide more details of our experiments. We compare our BiAdam and VR-BiAdam algorithms with the following bilevel optimization algorithms: reverse Franceschi et al. (2018), AID-CG Grazzi et al. (2020), AID-FP Grazzi et al. (2020), stocBiO Ji et al. (2021), MRBO Ji & Liang (2021), VRBO Ji & Liang (2021), FSLA Li et al. (2021), MSTSA/SUSTAIN Khanduri et al. (2021), SMB Guo et al. (2021b) and SVRB Guo & Yang (2021).

A.1.1 DATA HYPER-CLEANING

In this task, we perform data hyper-cleaning over the MNIST dataset LeCun et al. (1998). The formulation of this problem is as follows:

$$\min_{\tau}\ f_{val}(\tau, w^*(\tau)) := \frac{1}{|\mathcal{D}_V|} \sum_{(x_i, y_i)\in\mathcal{D}_V} f\big(x_i^T w^*(\tau), y_i\big) \quad \text{s.t.}\ \ w^*(\tau) = \arg\min_{w} f_{tr}(\tau, w) := \frac{1}{|\mathcal{D}_T|} \sum_{(x_i, y_i)\in\mathcal{D}_T} \sigma(\tau_i)\, f\big(x_i^T w, y_i\big) + C\|w\|^2,$$

where f(·) denotes the cross-entropy loss, D_T and D_V are the training and validation datasets, respectively, τ = {τ_i}_{i∈D_T} are the hyper-parameters, C ≥ 0 is a tuning parameter, and σ(·) denotes the sigmoid function. In the experiments, we set C = 0.001. For the training/validation batch size, we use a batch size of 32, while for VRBO, we choose a larger batch size of 5000 with a sampling interval of 3. For AID-FP, AID-CG and reverse, we use the same warm-start trick as our algorithms, i.e., the inner variable starts from the state of the last iteration. We fine-tune the number of inner-loop iterations and set it to 50 for these algorithms. For MRBO, VRBO, SUSTAIN and our BiAdam/VR-BiAdam, we set K = 3 to evaluate the hyper-gradient. For FSLA, K = 1, as the hyper-gradient is evaluated recursively. As for the learning rates, we set the outer learning rate to 1000 for all algorithms except ours, which use 0.5, as we change the learning rate adaptively. As for the inner learning rates, we set the step size to 0.05 for reverse, AID-CG, stocBiO/AID-FP, MRBO/SUSTAIN and FSLA; we set the step size to 0.2 for VRBO; we set the step size to 1 for SUSTAIN; and we set the step size to 0.00025 for BiAdam and 0.0005 for VR-BiAdam.
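A minimal NumPy sketch of the two objectives above, using a binary logistic loss with ±1 labels as a stand-in for the multiclass cross-entropy (the binary simplification and all shapes are illustrative choices, not the experimental setup):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def inner_loss(tau, w, X_tr, y_tr, C=1e-3):
    # f_tr(tau, w): sigmoid(tau_i)-weighted logistic loss on the training set
    # plus an l2 term that keeps the inner problem strongly convex in w.
    # Labels are +/-1 here (a binary stand-in for the cross-entropy loss).
    margins = y_tr * (X_tr @ w)
    losses = np.log1p(np.exp(-margins))
    return float(np.mean(sigmoid(tau) * losses) + C * (w @ w))

def outer_loss(w, X_val, y_val):
    # f_val(tau, w*(tau)): unweighted loss on the clean validation set.
    margins = y_val * (X_val @ w)
    return float(np.mean(np.log1p(np.exp(-margins))))
```

The per-example weights σ(τ_i) ∈ (0, 1) are the learnable "cleaning" variables: driving σ(τ_i) toward 0 down-weights a training example the validation loss identifies as corrupted.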

A.1.2 HYPER-REPRESENTATION LEARNING

In this task, we perform hyper-representation learning over the Omniglot dataset Lake et al. (2015). The formulation of this problem is as follows:

$$\min_{\tau}\ f_{val}(\tau, w^*(\tau)) := \mathbb{E}\big[f_{val}(\tau, w^*(\tau); \xi)\big] := \frac{1}{|\mathcal{D}_{V,\xi}|} \sum_{(x_i, y_i)\in\mathcal{D}_{V,\xi}} f\big((w^*(\tau))^T \phi(x_i; \tau), y_i\big) \quad \text{s.t.}\ \ w^*(\tau) = \arg\min_{w} f_{tr}(\tau, w) := \frac{1}{|\mathcal{D}_{T,\xi}|} \sum_{(x_i, y_i)\in\mathcal{D}_{T,\xi}} f\big(w^T \phi(x_i; \tau), y_i\big) + C\|w\|^2,$$

where f(·) denotes the cross-entropy loss, and D_{T,ξ} and D_{V,ξ} are the training and validation datasets of a randomly sampled meta task. Here τ are the hyper-representation parameters and C ≥ 0 is a tuning parameter that guarantees the inner problem is strongly convex. In the experiments, we set C = 0.01. In every hyper-iteration, we choose 4 meta tasks, while for VRBO, we choose a larger batch size of 16 with a sampling interval of 3. For stocBiO/AID-FP, AID-CG and reverse, we use the same warm-start trick as our algorithms, i.e., the inner variable starts from the state of the last iteration. We fine-tune the number of inner-loop iterations and set it to 16 for these algorithms. For MRBO, VRBO, SUSTAIN and our algorithms, we set K = 5 to evaluate the hyper-gradient. For FSLA, K = 1, as the hyper-gradient is evaluated recursively. As for the learning rates, we set the outer learning rate to 1000 for all algorithms except ours, which use 0.001, as we change the learning rate adaptively. As for the inner learning rates, we set the step size to 0.4 for all algorithms.

A.2 RELATED WORKS

In this subsection, we overview the existing bilevel optimization methods and adaptive methods for single-level optimization, respectively. 

A.3 ADDITIONAL THEORETICAL RESULTS

Under review as a conference paper at ICLR 2023

In this subsection, we further give the convergence properties of our BiAdam algorithm for unconstrained bilevel optimization.

Theorem 3. Under the above Assumptions (1, 2, 4, 6, 7), in Algorithm 1, let X = R^d, η_t = k/(m+t)^{1/2} for all t ≥ 0, α_{t+1} = c_1η_t, β_{t+1} = c_2η_t, m ≥ max{k², (c_1k)², (c_2k)²}, k > 0, 125L_0²/(6µ²) ≤ c_1 ≤ m^{1/2}/k, 9/2 ≤ c_2 ≤ m^{1/2}/k,

$$0 < \lambda \le \min\Big\{\frac{15 b_l L_0^2}{4 L_1^2 \mu}, \frac{b_l}{6 L_g}\Big\}, \quad 0 < \gamma \le \min\Big\{\frac{\sqrt{6}\,\lambda\mu\rho}{\sqrt{6 L_1^2 \lambda^2 \mu^2 + 125 b_u^2 L_0^2 \kappa^2}}, \frac{m^{1/2}\rho}{4 L k}\Big\},$$

and K = (L_g/µ) log(C_gxy C_fy T/µ). Then we have

$$\frac{1}{T}\sum_{t=1}^{T} \mathbb{E}\|\nabla F(x_t)\| \le \frac{\sqrt{\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|A_t\|^2}}{\rho} \Big(\frac{2\sqrt{6G'}\, m^{1/4}}{T^{1/2}} + \frac{2\sqrt{6G'}}{T^{1/4}} + \frac{2\sqrt{3}}{T}\Big) = \tilde{O}\Big(\frac{1}{T^{1/4}}\Big), \qquad (14)$$

where $G' = \frac{\rho(F(x_1) - F^*)}{k\gamma} + \frac{5 b_1 L_0^2 \Delta_0}{k\lambda\mu} + \frac{2\sigma^2}{k} + \frac{2 m \sigma^2}{k}\ln(m+T) + \frac{4(m+T)}{9 k T^2} + \frac{8k}{T}$.

Remark 3. Under the same conditions as in Theorem 1, based on the metric (1/T)Σ_{t=1}^T E∥∇F(x_t)∥, our BiAdam algorithm still has a gradient complexity of Õ(ϵ^{-4}) without relying on large mini-batches. Interestingly, the right-hand side of the above inequality (14) includes the term √((1/T)Σ_{t=1}^T E∥A_t∥²)/ρ, which can be seen as an upper bound on the expected condition number of the adaptive matrices {A_t}^T_{t=1}. When A_t is given by (4), we have √((1/T)Σ_{t=1}^T E∥A_t∥²)/ρ ≤ (G_1 + ρ)/ρ, as in existing adaptive gradient methods, under the standard bounded stochastic gradient assumption ∥∇f(x; ξ)∥ ≤ G_1.

Next, we further give the convergence properties of our VR-BiAdam algorithm for unconstrained bilevel optimization.

Theorem 4.
Under the above Assumptions (1, 3, 4, 6, 7), in Algorithm 2, given $X=\mathbb{R}^d$, $\eta_t = \frac{k}{(m+t)^{1/3}}$ for all $t\geq0$, $\alpha_{t+1}=c_1\eta_t^2$, $\beta_{t+1}=c_2\eta_t^2$, $m\geq\max\{2, k^3,(c_1k)^3,(c_2k)^3\}$, $k>0$, $c_1\geq\frac{2}{3k^3}+\frac{125L_0^2}{6\mu^2}$, $c_2\geq\frac{2}{3k^3}+\frac{9}{2}$, $0<\lambda\leq\min\big\{\frac{15b_lL_0^2}{16L_2^2\mu},\frac{b_l}{6L_g}\big\}$, $0<\gamma\leq\min\big\{\frac{\sqrt{6}\lambda\mu\rho}{2\sqrt{24L_2^2\lambda^2\mu^2+125b_u^2L_0^2\kappa^2}},\frac{m^{1/3}\rho}{4Lk}\big\}$ and $K=\frac{L_g}{\mu}\log(C_{gxy}C_{fy}T/\mu)$, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|\nabla F(x_t)\| \leq \sqrt{\frac{1}{T}\sum_{t=1}^T\frac{\mathbb{E}\|A_t\|^2}{\rho^2}}\Big(\frac{2\sqrt{6M'}m^{1/6}}{T^{1/2}} + \frac{2\sqrt{6M'}}{T^{1/3}} + \frac{2\sqrt{3}}{T}\Big) = \tilde{O}\Big(\frac{1}{T^{1/3}}\Big),$$
where $M' = \frac{\rho(F(x_1)-F^*)}{k\gamma} + \frac{5b_1L_0^2\Delta_0}{k\lambda\mu} + \frac{2m^{1/3}\sigma^2}{k^2} + 2k^2(c_1^2+c_2^2)\sigma^2\ln(m+T) + \frac{6k(m+T)^{1/3}}{T}$.

Remark 4. Clearly, when the adaptive matrix $A_t$ is generated from the above (4), we have $\sqrt{\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|A_t\|^2}/\rho \leq \frac{G_1+\lambda}{\lambda}$, as in existing adaptive gradient methods, assuming the standard bounded stochastic gradient $\|\nabla f(x;\xi)\|\leq G_1$. Under the same conditions as in Theorem 2, based on the metric $\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|\nabla F(x_t)\|$, our VR-BiAdam algorithm still achieves a convergence rate of $\tilde{O}(\frac{1}{T^{1/3}})$ and a sample complexity of $\tilde{O}(\epsilon^{-3})$ without relying on large mini-batches.
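The step-size schedules in Theorems 3 and 4 can be sketched directly; the constants `k`, `c1`, `c2` below are arbitrary illustrative choices (any values satisfying the stated constraints work), and `m` is set to the smallest value the theorems allow, which is what guarantees $\eta_t \leq 1$ and $\alpha_{t+1}, \beta_{t+1} \in (0,1]$.

```python
import math

# Sketch of the schedules: eta_t = k/(m+t)^{1/2} with alpha,beta ~ eta_t
# (Theorem 3), and eta_t = k/(m+t)^{1/3} with alpha,beta ~ eta_t^2 (Theorem 4).

def biadam_schedule(t, k=1.0, m=None, c1=2.0, c2=4.0):
    """Theorem 3: eta_t = k/(m+t)^{1/2}, alpha = c1*eta_t, beta = c2*eta_t."""
    if m is None:
        m = max(k**2, (c1 * k) ** 2, (c2 * k) ** 2)  # m >= max{k^2,(c1 k)^2,(c2 k)^2}
    eta = k / math.sqrt(m + t)
    return eta, c1 * eta, c2 * eta

def vr_biadam_schedule(t, k=1.0, m=None, c1=2.0, c2=4.0):
    """Theorem 4: eta_t = k/(m+t)^{1/3}, alpha = c1*eta_t^2, beta = c2*eta_t^2."""
    if m is None:
        m = max(2.0, k**3, (c1 * k) ** 3, (c2 * k) ** 3)
    eta = k / (m + t) ** (1 / 3)
    return eta, c1 * eta**2, c2 * eta**2

# The constraints on m keep every coefficient a valid momentum weight:
for t in range(1000):
    for sched in (biadam_schedule, vr_biadam_schedule):
        eta, alpha, beta = sched(t)
        assert 0 < eta <= 1 and 0 < alpha <= 1 and 0 < beta <= 1
```

Note the decaying (rather than tuned-constant) step sizes: the $T^{-1/2}$ vs. $T^{-1/3}$ decay is exactly what separates the $\tilde{O}(\epsilon^{-4})$ and $\tilde{O}(\epsilon^{-3})$ complexities.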

A.4 DETAILED CONVERGENCE ANALYSIS

In this subsection, we provide the detailed convergence analysis of our algorithms. We first review and provide some useful lemmas. Given a $\rho$-strongly convex function $\phi(x)$, we define a prox-function (Bregman distance) Censor & Lent (1981); Censor & Zenios (1992) associated with $\phi(x)$ as follows:
$$V(z,x) = \phi(z) - \phi(x) - \langle\nabla\phi(x), z-x\rangle.$$
Then we define a generalized projection problem as in Ghadimi et al. (2016):
$$x^* = \arg\min_{z\in X}\Big\{\langle z, w\rangle + \frac{1}{\gamma}V(z,x) + h(z)\Big\}, \qquad (17)$$
where $X\subseteq\mathbb{R}^d$, $w\in\mathbb{R}^d$ and $\gamma>0$. Here $h(x)$ is a convex and possibly nonsmooth function. At the same time, we define a generalized gradient as follows:
$$\mathcal{G}_X(x,w,\gamma) = \frac{1}{\gamma}(x-x^*).$$
Lemma 6. (Lemma 1 in Ghadimi et al. (2016)) Let $x^*$ be given in (17). Then, for any $x\in X$, $w\in\mathbb{R}^d$ and $\gamma>0$, we have
$$\langle w, \mathcal{G}_X(x,w,\gamma)\rangle \geq \rho\|\mathcal{G}_X(x,w,\gamma)\|^2 + \frac{1}{\gamma}\big(h(x^*)-h(x)\big),$$
where $\rho>0$ is the strong-convexity parameter of $\phi(x)$. When $h(x)=0$, the above Lemma 6 gives
$$\langle w, \mathcal{G}_X(x,w,\gamma)\rangle \geq \rho\|\mathcal{G}_X(x,w,\gamma)\|^2.$$
Lemma 7. (Restatement of Lemma 5) When the gradient estimator $w_t$ is generated from Algorithm 1 or 2, for all $t\geq1$, we have
$$\|w_t-\nabla F(x_t)\|^2 \leq L_0^2\|y^*(x_t)-y_t\|^2 + 2\|w_t-\nabla f(x_t,y_t)\|^2,$$
where $L_0^2 = 8\big(L_f^2 + \frac{L_{gxy}^2C_{fy}^2}{\mu^2} + \frac{L_{gyy}^2C_{gxy}^2C_{fy}^2}{\mu^4} + \frac{L_f^2C_{gxy}^2}{\mu^2}\big)$.

Proof. We first consider the term $\|\nabla F(x_t)-\nabla f(x_t,y_t)\|^2$.
Since ∇f (x t , y * (x t )) = ∇F (x t ), we have ∥∇f (x t , y * (x t )) -∇f (x t , y t )∥ 2 = ∥∇ x f (x t , y * (x t )) -∇ 2 xy g(x t , y * (x t )) ∇ 2 yy g(x t , y * (x t )) -1 ∇ y f (x t , y * (x)) -∇ x f (x t , y t ) + ∇ 2 xy g(x t , y t ) ∇ 2 yy g(x t , y t ) ∇ y f (x t , y t )∥ 2 = ∥∇ x f (x t , y * (x t )) -∇ x f (x t , y t ) -∇ 2 xy g(x t , y * (x t )) ∇ 2 yy g(x t , y * (x t )) -1 ∇ y f (x t , y * (x t )) + ∇ 2 xy g(x t , y t ) ∇ 2 yy g(x t , y * (x t )) -1 ∇ y f (x t , y * (x t )) -∇ 2 xy g(x t , y t ) ∇ 2 yy g(x t , y * (x t )) -1 ∇ y f (x t , y * (x t )) + ∇ 2 xy g(x t , y t ) ∇ 2 yy g(x t , y t ) -1 ∇ y f (x t , y * (x t )) -∇ 2 xy g(x t , y t ) ∇ 2 yy g(x t , y t ) -1 ∇ y f (x t , y * (x t )) + ∇ 2 xy g(x t , y t ) ∇ 2 yy g(x t , y t ) -1 ∇ y f (x t , y t )∥ 2 ≤ 4∥∇ x f (x t , y * (x t )) -∇ x f (x t , y t )∥ 2 + 4C 2 f y µ 2 ∥∇ 2 xy g(x t , y * (x t )) -∇ 2 xy g(x t , y t )∥ 2 + 4C 2 gxy C 2 f y µ 4 ∥∇ 2 yy g(x t , y * (x t )) -∇ 2 yy g(x t , y t )∥ 2 + 4C 2 gxy µ 2 ∥∇ y f (x t , y * (x t )) -∇ y f (x t , y t )∥ 2 ≤ 4 L 2 f + L 2 gxy C 2 f y µ 2 + L 2 gyy C 2 gxy C 2 f y µ 4 + L 2 f C 2 gxy µ 2 ∥y * (x t ) -y t ∥ 2 = 4 L2 ∥y * (x t ) -y t ∥ 2 , ( ) where the second last inequality is due to Assumptions 1, 2 and 4; the last equality holds by L2 = L 2 f + L 2 gxy C 2 f y µ 2 + L 2 gyy C 2 gxy C 2 f y µ 4 + L 2 f C 2 gxy µ 2 . Then we have ∥w t -∇F (x t )∥ 2 = ∥w t -∇f (x t , y t ) + ∇f (x t , y t ) -∇F (x t )∥ 2 ≤ 2∥w t -∇f (x t , y t )∥ 2 + 2∥ ∇f (x t , y t ) -∇F (x t )∥ 2 ≤ 2∥w t -∇f (x t , y t )∥ 2 + 8 L2 ∥y * (x t ) -y t ∥ 2 . ( ) Lemma 8. Under the Assumptions 1, 2, 4, we have ∥ ∇f (x t+1 , y t+1 ) -∇f (x t , y t )∥ 2 ≤ L 2 0 ∥x t+1 -x t ∥ 2 + ∥y t+1 -y t ∥ 2 , ( ) where L 2 0 = 8 L 2 f + L 2 gxy C 2 f y µ 2 + L 2 gyy C 2 gxy C 2 f y µ 4 + L 2 f C 2 gxy µ 2 . Proof. 
∥ ∇f (x t+1 , y t+1 ) -∇f (x t , y t )∥ 2 = ∥∇ x f (x t+1 , y t+1 ) -∇ 2 xy g(x t+1 , y t+1 ) ∇ 2 yy g(x t+1 , y t+1 ) -1 ∇ y f (x t+1 , y t+1 ) -∇ x f (x t , y t ) + ∇ 2 xy g(x t , y t ) ∇ 2 yy g(x t , y t ) -1 ∇ y f (x t , y t )∥ 2 = ∥∇ x f (x t+1 , y t+1 ) -∇ x f (x t , y t ) -∇ 2 xy g(x t+1 , y t+1 ) ∇ 2 yy g(x t+1 , y t+1 ) -1 ∇ y f (x t+1 , y t+1 ) + ∇ 2 xy g(x t , y t ) ∇ 2 yy g(x t+1 , y t+1 ) -1 ∇ y f (x t+1 , y t+1 ) -∇ 2 xy g(x t , y t ) ∇ 2 yy g(x t+1 , y t+1 ) -1 ∇ y f (x t+1 , y t+1 ) + ∇ 2 xy g(x t , y t ) ∇ 2 yy g(x t , y t ) -1 ∇ y f (x t+1 , y t+1 ) -∇ 2 xy g(x t , y t ) ∇ 2 yy g(x t , y t ) -1 ∇ y f (x t+1 , y t+1 ) + ∇ 2 xy g(x t , y t ) ∇ 2 yy g(x t , y t ) -1 ∇ y f (x t , y t )∥ 2 ≤ 4∥∇ x f (x t+1 , y t+1 ) -∇ x f (x t , y t )∥ 2 + 4C 2 f y µ 2 ∥∇ 2 xy g(x t+1 , y t+1 ) -∇ 2 xy g(x t , y t )∥ 2 + 4C 2 gxy C 2 f y µ 4 ∥∇ 2 yy g(x t+1 , y t+1 ) -∇ 2 yy g(x t , y t )∥ 2 + 4C 2 gxy µ 2 ∥∇ y f (x t+1 , y t+1 ) -∇ y f (x t , y t )∥ 2 ≤ 8L 2 f ∥x t+1 -x t ∥ 2 + ∥y t+1 -y t ∥ 2 + 8L 2 gxy C 2 f y µ 2 ∥x t+1 -x t ∥ 2 + ∥y t+1 -y t ∥ 2 + 8L 2 gyy C 2 gxy C 2 f y µ 4 ∥x t+1 -x t ∥ 2 + ∥y t+1 -y t ∥ 2 + 8L 2 f C 2 gxy µ 2 ∥x t+1 -x t ∥ 2 + ∥y t+1 -y t ∥ 2 = L 2 0 ∥x t+1 -x t ∥ 2 + ∥y t+1 -y t ∥ 2 , where the first inequality holds by the Assumptions 1 and 4, and the second inequality holds by the Assumption 2. Lemma 9. Suppose that the sequence {x t , y t } T t=1 be generated from Algorithm 1 or 2. Let 0 < η t ≤ 1 and 0 < γ ≤ ρ 2Lηt , then we have F (x t+1 ) ≤ F (x t ) + η t γ ρ ∥∇F (x t ) -w t ∥ 2 - ρη t 2γ ∥x t+1 -x t ∥ 2 . ( ) Proof. According to Lemma 2, the function F (x) is L-smooth. Thus we have F (x t+1 ) ≤ F (x t ) + ⟨∇F (x t ), x t+1 -x t ⟩ + L 2 ∥x t+1 -x t ∥ 2 (27) = F (x t ) + ⟨∇F (x t ), η t (x t+1 -x t )⟩ + L 2 ∥η t (x t+1 -x t )∥ 2 = F (x t ) + η t ⟨w t , xt+1 -x t ⟩ =T1 +η t ⟨∇F (x t ) -w t , xt+1 -x t ⟩ =T2 + Lη 2 t 2 ∥x t+1 -x t ∥ 2 , where the second equality is due to x t+1 = x t + η t (x t+1 -x t ). 
According to Assumption 7, i.e., $A_t\succ\rho I_d$ for any $t\geq1$, the function $\phi_t(x)=\frac{1}{2}x^TA_tx$ is $\rho$-strongly convex; then we define a prox-function (a.k.a. Bregman distance) associated with $\phi_t(x)$, as in Censor & Zenios (1992); Ghadimi et al. (2016),
$$V_t(x,x_t) = \phi_t(x) - \phi_t(x_t) - \langle\nabla\phi_t(x_t), x-x_t\rangle = \frac{1}{2}(x-x_t)^TA_t(x-x_t).$$
Applying the above Lemma 6 to the problem $\tilde{x}_{t+1} = \arg\min_{x\in X}\big\{\langle w_t,x\rangle + \frac{1}{2\gamma}(x-x_t)^TA_t(x-x_t)\big\}$ at line 5 of Algorithm 1 or 2, we have
$$\Big\langle w_t, \frac{1}{\gamma}(x_t-\tilde{x}_{t+1})\Big\rangle \geq \rho\Big\|\frac{1}{\gamma}(x_t-\tilde{x}_{t+1})\Big\|^2.$$
Thus we obtain
$$T_1 = \langle w_t, \tilde{x}_{t+1}-x_t\rangle \leq -\frac{\rho}{\gamma}\|\tilde{x}_{t+1}-x_t\|^2. \qquad (30)$$
Next, considering the bound of the term $T_2$, we have
$$T_2 = \langle\nabla F(x_t)-w_t, \tilde{x}_{t+1}-x_t\rangle \leq \|\nabla F(x_t)-w_t\|\cdot\|\tilde{x}_{t+1}-x_t\| \leq \frac{\gamma}{\rho}\|\nabla F(x_t)-w_t\|^2 + \frac{\rho}{4\gamma}\|\tilde{x}_{t+1}-x_t\|^2, \qquad (31)$$
where the first inequality is due to the Cauchy-Schwarz inequality and the last is due to Young's inequality. By combining the above inequalities (27), (30) with (31), we obtain
$$F(x_{t+1}) \leq F(x_t) + \eta_t\langle\nabla F(x_t)-w_t, \tilde{x}_{t+1}-x_t\rangle + \eta_t\langle w_t, \tilde{x}_{t+1}-x_t\rangle + \frac{L\eta_t^2}{2}\|\tilde{x}_{t+1}-x_t\|^2$$
$$\leq F(x_t) + \frac{\eta_t\gamma}{\rho}\|\nabla F(x_t)-w_t\|^2 + \frac{\rho\eta_t}{4\gamma}\|\tilde{x}_{t+1}-x_t\|^2 - \frac{\rho\eta_t}{\gamma}\|\tilde{x}_{t+1}-x_t\|^2 + \frac{L\eta_t^2}{2}\|\tilde{x}_{t+1}-x_t\|^2$$
$$= F(x_t) + \frac{\eta_t\gamma}{\rho}\|\nabla F(x_t)-w_t\|^2 - \frac{\rho\eta_t}{2\gamma}\|\tilde{x}_{t+1}-x_t\|^2 - \Big(\frac{\rho\eta_t}{4\gamma}-\frac{L\eta_t^2}{2}\Big)\|\tilde{x}_{t+1}-x_t\|^2$$
$$\leq F(x_t) + \frac{\eta_t\gamma}{\rho}\|\nabla F(x_t)-w_t\|^2 - \frac{\rho\eta_t}{2\gamma}\|\tilde{x}_{t+1}-x_t\|^2,$$
where the last inequality is due to $0<\gamma\leq\frac{\rho}{2L\eta_t}$.

Lemma 10. Suppose the sequence $\{x_t,y_t\}_{t=1}^T$ is generated from Algorithm 1 or 2. Under the above assumptions, given $0<\eta_t\leq1$, $B_t=b_tI_p$ ($b_u\geq b_t\geq b_l>0$) for all $t\geq1$, and $0<\lambda\leq\frac{b_l}{6L_g}$, we have
$$\|y_{t+1}-y^*(x_{t+1})\|^2 \leq \Big(1-\frac{\eta_t\mu\lambda}{4b_t}\Big)\|y_t-y^*(x_t)\|^2 - \frac{3\eta_t}{4}\|\tilde{y}_{t+1}-y_t\|^2 + \frac{25\eta_t\lambda}{6\mu b_t}\|\nabla_yg(x_t,y_t)-v_t\|^2 + \frac{25\kappa^2\eta_tb_t}{6\mu\lambda}\|\tilde{x}_{t+1}-x_t\|^2,$$
where $\kappa = L_g/\mu$.

Proof.
According to Assumption 1, i.e., the function $g(x,y)$ is $\mu$-strongly convex w.r.t. $y$, we have
$$g(x_t,y) \geq g(x_t,y_t) + \langle\nabla_yg(x_t,y_t), y-y_t\rangle + \frac{\mu}{2}\|y-y_t\|^2$$
$$= g(x_t,y_t) + \langle v_t, y-\tilde{y}_{t+1}\rangle + \langle\nabla_yg(x_t,y_t)-v_t, y-\tilde{y}_{t+1}\rangle + \langle\nabla_yg(x_t,y_t), \tilde{y}_{t+1}-y_t\rangle + \frac{\mu}{2}\|y-y_t\|^2. \qquad (34)$$
According to Assumption 2, i.e., the function $g(x,y)$ is $L_g$-smooth, we have
$$g(x_t,\tilde{y}_{t+1}) \leq g(x_t,y_t) + \langle\nabla_yg(x_t,y_t), \tilde{y}_{t+1}-y_t\rangle + \frac{L_g}{2}\|\tilde{y}_{t+1}-y_t\|^2. \qquad (35)$$
Combining the above inequalities (34) with (35), we have
$$g(x_t,y) \geq g(x_t,\tilde{y}_{t+1}) + \langle v_t, y-\tilde{y}_{t+1}\rangle + \langle\nabla_yg(x_t,y_t)-v_t, y-\tilde{y}_{t+1}\rangle + \frac{\mu}{2}\|y-y_t\|^2 - \frac{L_g}{2}\|\tilde{y}_{t+1}-y_t\|^2. \qquad (36)$$
By the optimality condition of the problem $\tilde{y}_{t+1} = \arg\min_{y\in Y}\big\{\langle v_t,y\rangle + \frac{1}{2\lambda}(y-y_t)^TB_t(y-y_t)\big\}$ at line 6 of Algorithm 1 or 2, given $B_t=b_tI_p$, we have
$$\Big\langle v_t + \frac{b_t}{\lambda}(\tilde{y}_{t+1}-y_t), y-\tilde{y}_{t+1}\Big\rangle \geq 0, \quad \forall y\in Y. \qquad (37)$$
Thus we obtain
$$\langle v_t, y-\tilde{y}_{t+1}\rangle \geq \frac{b_t}{\lambda}\langle\tilde{y}_{t+1}-y_t, \tilde{y}_{t+1}-y\rangle = \frac{b_t}{\lambda}\|\tilde{y}_{t+1}-y_t\|^2 + \frac{b_t}{\lambda}\langle\tilde{y}_{t+1}-y_t, y_t-y\rangle. \qquad (38)$$
By plugging the inequality (38) into (36), we have
$$g(x_t,y) \geq g(x_t,\tilde{y}_{t+1}) + \frac{b_t}{\lambda}\langle\tilde{y}_{t+1}-y_t, y_t-y\rangle + \frac{b_t}{\lambda}\|\tilde{y}_{t+1}-y_t\|^2 + \langle\nabla_yg(x_t,y_t)-v_t, y-\tilde{y}_{t+1}\rangle + \frac{\mu}{2}\|y-y_t\|^2 - \frac{L_g}{2}\|\tilde{y}_{t+1}-y_t\|^2.$$
Let $y = y^*(x_t)$; then we have
$$g(x_t,y^*(x_t)) \geq g(x_t,\tilde{y}_{t+1}) + \frac{b_t}{\lambda}\langle\tilde{y}_{t+1}-y_t, y_t-y^*(x_t)\rangle + \Big(\frac{b_t}{\lambda}-\frac{L_g}{2}\Big)\|\tilde{y}_{t+1}-y_t\|^2 + \langle\nabla_yg(x_t,y_t)-v_t, y^*(x_t)-\tilde{y}_{t+1}\rangle + \frac{\mu}{2}\|y^*(x_t)-y_t\|^2.$$
Due to the strong convexity of $g(x_t,\cdot)$ and $y^*(x_t)=\arg\min_{y\in Y}g(x_t,y)$, we have $g(x_t,y^*(x_t))\leq g(x_t,\tilde{y}_{t+1})$. Thus, we obtain
$$0 \geq \frac{b_t}{\lambda}\langle\tilde{y}_{t+1}-y_t, y_t-y^*(x_t)\rangle + \langle\nabla_yg(x_t,y_t)-v_t, y^*(x_t)-\tilde{y}_{t+1}\rangle + \Big(\frac{b_t}{\lambda}-\frac{L_g}{2}\Big)\|\tilde{y}_{t+1}-y_t\|^2 + \frac{\mu}{2}\|y^*(x_t)-y_t\|^2.$$
By $y_{t+1} = y_t + \eta_t(\tilde{y}_{t+1}-y_t)$, we have
$$\|y_{t+1}-y^*(x_t)\|^2 = \|y_t+\eta_t(\tilde{y}_{t+1}-y_t)-y^*(x_t)\|^2 = \|y_t-y^*(x_t)\|^2 + 2\eta_t\langle\tilde{y}_{t+1}-y_t, y_t-y^*(x_t)\rangle + \eta_t^2\|\tilde{y}_{t+1}-y_t\|^2.$$
( ) Then we obtain ⟨ỹ t+1 -y t , y t -y * (x t )⟩ = 1 2η t ∥y t+1 -y * (x t )∥ 2 - 1 2η t ∥y t -y * (x t )∥ 2 - η t 2 ∥ỹ t+1 -y t ∥ 2 . ( ) where the second inequality holds by L g ≥ µ and 0 < η t ≤ 1, and the last inequality is due to 0 < λ ≤ b l 6Lg ≤ bt 6Lg . It implies that ∥y t+1 -y * (x t )∥ 2 ≤ (1 - η t µλ 2b t )∥y t -y * (x t )∥ 2 - 3η t 4 ∥ỹ t+1 -y t ∥ 2 + 4η t λ µb t ∥∇ y g(x t , y t ) -v t ∥ 2 . ( ) Next, we decompose the term ∥y t+1 -y * (x t+1 )∥ 2 as follows: ∥y t+1 -y * (x t+1 )∥ 2 = ∥y t+1 -y * (x t ) + y * (x t ) -y * (x t+1 )∥ 2 = ∥y t+1 -y * (x t )∥ 2 + 2⟨y t+1 -y * (x t ), y * (x t ) -y * (x t+1 )⟩ + ∥y * (x t ) -y * (x t+1 )∥ 2 ≤ (1 + η t µλ 4b t )∥y t+1 -y * (x t )∥ 2 + (1 + 4b t η t µλ )∥y * (x t ) -y * (x t+1 )∥ 2 ≤ (1 + η t µλ 4b t )∥y t+1 -y * (x t )∥ 2 + (1 + 4b t η t µλ )κ 2 ∥x t -x t+1 ∥ 2 , ( ) where the first inequality holds by Cauchy-Schwarz inequality and Young's inequality, and the second inequality is due to Lemma 3, and the last equality holds by x t+1 = x t + η t (x t+1 -x t ). By combining the above inequalities ( 46) and ( 47), we have ∥y t+1 -y * (x t+1 )∥ 2 ≤ (1 + η t µλ 4b t )(1 - η t µλ 2b t )∥y t -y * (x t )∥ 2 -(1 + η t µλ 4b t ) 3η t 4 ∥ỹ t+1 -y t ∥ 2 + (1 + η t µλ 4b t ) 4η t λ µb t ∥∇ y g(x t , y t ) -v t ∥ 2 + (1 + 4b η t µλ )κ 2 ∥x t -x t+1 ∥ 2 . Since 0 < η t ≤ 1, 0 < λ ≤ b l 6Lg ≤ bt 6Lg and L g ≥ µ, we have λ ≤ bt 6Lg ≤ bt 6µ and η t ≤ 1 ≤ bt 6µλ . Then we have (1 + η t µλ 4b t )(1 - η t µλ 2b t ) = 1 - η t µλ 2b t + η t µλ 4b t - η 2 t µ 2 λ 2 8b 2 t ≤ 1 - η t µλ 4b t , -(1 + η t µλ 4b t ) 3η t 4 ≤ - 3η t 4 , (1 + η t µλ 4b t ) 4η t λ µb t ≤ (1 + 1 24 ) 4η t λ µb t = 25η t λ 6µb t , (1 + 4b t η t µλ )κ 2 ≤ b t κ 2 6η t µλ + 4b t κ 2 η t µλ = 25b t κ 2 6η t µλ , where the second last inequality is due to ηtµλ bt ≤ 1 6 and the last inequality holds by bt 6µληt ≥ 1. 
By using $x_{t+1} = x_t + \eta_t(\tilde{x}_{t+1}-x_t)$, we then have
$$\|y_{t+1}-y^*(x_{t+1})\|^2 \leq \Big(1-\frac{\eta_t\mu\lambda}{4b_t}\Big)\|y_t-y^*(x_t)\|^2 - \frac{3\eta_t}{4}\|\tilde{y}_{t+1}-y_t\|^2 + \frac{25\eta_t\lambda}{6\mu b_t}\|\nabla_yg(x_t,y_t)-v_t\|^2 + \frac{25\kappa^2\eta_tb_t}{6\mu\lambda}\|\tilde{x}_{t+1}-x_t\|^2.$$

A.5 CONVERGENCE ANALYSIS OF BIADAM ALGORITHM

In this subsection, we provide the detailed convergence analysis of the BiAdam algorithm. For notational simplicity, let $R_t = R(x_t,y_t)$ for all $t\geq1$.

Lemma 11. Assume that the stochastic partial derivatives $v_{t+1}$ and $w_{t+1}$ are generated from Algorithm 1; then we have
$$\mathbb{E}\|w_{t+1}-\nabla f(x_{t+1},y_{t+1})-R_{t+1}\|^2 \leq (1-\beta_{t+1})\mathbb{E}\|w_t-\nabla f(x_t,y_t)-R_t\|^2 + \beta_{t+1}^2\sigma^2 \qquad (49)$$
$$+ \frac{3L_0^2\eta_t^2}{\beta_{t+1}}\big(\|\tilde{x}_{t+1}-x_t\|^2 + \|\tilde{y}_{t+1}-y_t\|^2\big) + \frac{3}{\beta_{t+1}}\big(\|R_t\|^2 + \|R_{t+1}\|^2\big),$$
$$\mathbb{E}\|v_{t+1}-\nabla_yg(x_{t+1},y_{t+1})\|^2 \leq (1-\alpha_{t+1})\mathbb{E}\|v_t-\nabla_yg(x_t,y_t)\|^2 + \alpha_{t+1}^2\sigma^2 \qquad (50)$$
$$+ \frac{2L_g^2\eta_t^2}{\alpha_{t+1}}\big(\mathbb{E}\|\tilde{x}_{t+1}-x_t\|^2 + \mathbb{E}\|\tilde{y}_{t+1}-y_t\|^2\big),$$
where $L_0^2 = 8\big(L_f^2 + \frac{L_{gxy}^2C_{fy}^2}{\mu^2} + \frac{L_{gyy}^2C_{gxy}^2C_{fy}^2}{\mu^4} + \frac{L_f^2C_{gxy}^2}{\mu^2}\big)$ and $R_t = \nabla f(x_t,y_t) - \mathbb{E}_\xi[\nabla f(x_t,y_t;\xi)]$ for all $t\geq1$.

Proof. Without loss of generality, we only prove the bound on the term $\mathbb{E}\|w_{t+1}-\nabla f(x_{t+1},y_{t+1})-R_{t+1}\|^2$; the other term is similar.
Since w t+1 = β t+1 ∇f (x t+1 , y t+1 ; ξt+1 ) + (1 -β t+1 )w t , we have E∥w t+1 -∇f (x t+1 , y t+1 ) -R t+1 ∥ 2 (51) = E∥β t+1 ∇f (x t+1 , y t+1 ; ξt+1 ) + (1 -β t+1 )w t -∇f (x t+1 , y t+1 ) -R t+1 ∥ 2 = E∥(1 -β t+1 )(w t -∇f (x t , y t ) -R t ) + β t+1 ∇f (x t+1 , y t+1 ; ξt+1 ) -∇f (x t+1 , y t+1 ) -R t+1 + (1 -β t+1 ) ∇f (x t , y t ) + R t -( ∇f (x t+1 , y t+1 ) + R t+1 ) ∥ 2 = β 2 t+1 E∥ ∇f (x t+1 , y t+1 ; ξt+1 ) -∇f (x t+1 , y t+1 ) -R t+1 ∥ 2 + E∥(1 -β t+1 ) w t -∇f (x t , y t ) -R t + ∇f (x t , y t ) + R t -( ∇f (x t+1 , y t+1 ) + R t+1 ) ∥ 2 ≤ β 2 t+1 E∥ ∇f (x t+1 , y t+1 ; ξt+1 ) -∇f (x t+1 , y t+1 ) -R t+1 ∥ 2 + (1 -β t+1 ) 2 (1 + β t+1 )E∥w t -∇f (x t , y t ) -R t ∥ 2 + (1 -β t+1 ) 2 (1 + 1 β t+1 )E∥ ∇f (x t , y t ) + R t -( ∇f (x t+1 , y t+1 ) + R t+1 )∥ 2 ≤ (1 -β t+1 )E∥w t -∇f (x t , y t ) -R t ∥ 2 + β 2 t+1 σ 2 + 1 β t+1 ∥ ∇f (x t , y t ) + R t -( ∇f (x t+1 , y t+1 ) + R t+1 )∥ 2 ≤ (1 -β t+1 )E∥w t -∇f (x t , y t ) -R t ∥ 2 + β 2 t+1 σ 2 + 3 β t+1 ∥ ∇f (x t+1 , y t+1 ) -∇f (x t , y t )∥ 2 + 3 β t+1 ∥R t ∥ 2 + ∥R t+1 )∥ 2 ≤ (1 -β t+1 )E∥w t -∇f (x t , y t ) -R t ∥ 2 + β 2 t+1 σ 2 + 3 β t+1 ∥R t ∥ 2 + ∥R t+1 ∥ 2 + 3L 2 0 η 2 t β t+1 ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 , where the third equality is due to Eξ t+1 [ ∇f (x t+1 , y t+1 ; ξt+1 )] = ∇f (x t+1 , y t+1 ) + R t+1 ; the second last inequality holds by 0 ≤ β t+1 ≤ 1 such that (1 -β t+1 ) 2 (1 + β t+1 ) = 1 -β t+1 -β 2 t+1 + β 3 t+1 ≤ 1 -β t+1 and (1 -β t+1 ) 2 (1 + 1 βt+1 ) ≤ (1 -β t+1 )(1 + 1 βt+1 ) = -β t+1 + 1 βt+1 ≤ 1 βt+1 , and the last inequality holds by the above Lemma 8 and x t+1 = x t + η t (x t+1 -x t ), y t+1 = y t + η t (ỹ t+1 -y t ). Theorem 5. 
(Restatement of Theorem 1) Under the above Assumptions (1, 2, 4, 6, 7), in the Algorithm 1, given X ⊂ R d , η t = k (m+t) 1/2 for all t ≥ 0, α t+1 = c 1 η t , β t+1 = c 2 η t , m ≥ max k 2 , (c 1 k) 2 , (c 2 k) 2 , k > 0, 125L 2 0 6µ 2 ≤ c 1 ≤ m 1/2 k , 9 2 ≤ c 2 ≤ m 1/2 k , 0 < λ ≤ min 15b l L 2 0 4L 2 1 µ , b l 6Lg , 0 < γ ≤ min √ 6λµρ √ 6L 2 1 λ 2 µ 2 +125b 2 u L 2 0 κ 2 , m 1/2 ρ 4Lk and K = Lg µ log(C gxy C f y T /µ), we have 1 T T t=1 E||G X (x t , ∇F (x t ), γ)|| ≤ 1 T T t=1 E[G t ] ≤ 2 √ 3Gm 1/4 T 1/2 + 2 √ 3G T 1/4 + √ 2 T , where G = F (x1)-F * ρkγ + 5b1L 2 0 ∆0 ρ 2 kλµ + 2σ 2 ρ 2 k + 2mσ 2 ρ 2 k ln(m + T ) + 4(m+T ) 9ρ 2 kT 2 + 8k ρ 2 T , ∆ 0 = ∥y 1 -y * (x 1 )∥ 2 and L 2 1 = 12L 2 g µ 2 125L 2 0 + 2L 2 0 3 . Proof. Since η t = k (m+t) 1/2 on t is decreasing and m ≥ k 2 , we have η t ≤ η 0 = k m 1/2 ≤ 1 and γ ≤ m 1/2 ρ 4Lk ≤ ρ 2Lη0 ≤ ρ 2Lηt for any t ≥ 0. Due to 0 < η t ≤ 1 and m ≥ (c 1 k) 2 , we have α t+1 = c 1 η t ≤ c1k m 1/2 ≤ 1. Similarly, due to m ≥ (c 2 k) 2 , we have β t+1 ≤ 1. At the same time, we have c 1 , c 2 ≤ m 1/2 k . According to Lemma 11, we have E∥v t+1 -∇ y g(x t+1 , y t+1 )∥ 2 -E∥v t -∇ y g(x t , y t )∥ 2 (54) ≤ -c 1 η t E∥∇ y g(x t , y t ) -v t ∥ 2 + 2L 2 g η t /c 1 ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 + c 2 1 η 2 t σ 2 ≤ - 125L 2 0 6µ 2 E∥∇ y g(x t , y t ) -v t ∥ 2 + 12L 2 g µ 2 η t 125L 2 0 ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 + mη 2 t σ 2 k 2 , where the above equality holds by α t+1 = c 1 η t , and the last inequality is due to 125L 2 0 6µ 2 ≤ c 1 ≤ m 1/2 k . Similarly, we have E∥w t+1 -∇f (x t+1 , y t+1 ) -R t+1 ∥ 2 -E∥w t -∇f (x t , y t ) -R t ∥ 2 (55) ≤ -β t+1 E∥w t -∇f (x t , y t ) -R t ∥ 2 + 3L 2 0 η 2 t β t+1 ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 + 3 β t+1 ∥R t ∥ 2 + ∥R t+1 ∥ 2 + β 2 t+1 σ 2 ≤ - 9η t 2 E∥w t -∇f (x t , y t ) -R t ∥ 2 + 2L 2 0 η t 3 ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 + 2 3η t ∥R t ∥ 2 + ∥R t+1 ∥ 2 + mη 2 t σ 2 k 2 , where the last inequality holds by β t+1 = c 2 η t and 9 2 ≤ c 2 ≤ m 1/2 k . 
According to Lemmas 7 and 9, we have F (x t+1 ) -F (x t ) ≤ 2η t γ ρ ∥w t -∇f (x t , y t )∥ 2 + L 2 0 η t γ ρ ∥y * (x t ) -y t ∥ 2 - ρη t 2γ ∥x t+1 -x t ∥ 2 ≤ 4η t γ ρ ∥w t -∇f (x t , y t ) -R t ∥ 2 + 4η t γ ρ ∥R t ∥ 2 + L 2 0 η t γ ρ ∥y * (x t ) -y t ∥ 2 - ρη t 2γ ∥x t+1 -x t ∥ 2 . According to Lemma 10, we have ∥y t+1 -y * (x t+1 )∥ 2 -∥y t -y * (x t )∥ 2 (57) ≤ - η t µλ 4b t ∥y t -y * (x t )∥ 2 - 3η t 4 ∥ỹ t+1 -y t ∥ 2 + 25η t λ 6µb t ∥∇ y g(x t , y t ) -v t ∥ 2 + 25κ 2 η t b t 6µλ ∥x t+1 -x t ∥ 2 . Next, we define a Lyapunov function (i.e., potential function), for any t ≥ 1, Γ t = E F (x t ) + 5b t L 2 0 γ λµρ ∥y t -y * (x t )∥ 2 + γ ρ ∥v t -∇ y g(x t , y t )∥ 2 + ∥w t -∇f (x t , y t ) -R t ∥ 2 . For notational simplicity, let L 2 1 = 12L 2 g µ 2 125L 2 0 + 2L 2 0 3 . Then we have Γ t+1 -Γ t = F (x t+1 ) -F (x t ) + 5b t L 2 0 γ λµρ ∥y t+1 -y * (x t+1 )∥ 2 -∥y t -y * (x t )∥ 2 + γ ρ ∥v t+1 -∇ y g(x t+1 , y t+1 )∥ 2 -∥v t -∇ y g(x t , y t )∥ 2 + ∥w t+1 -∇f (x t+1 , y t+1 ) -R t+1 ∥ 2 -∥w t -∇f (x t , y t ) -R t ∥ 2 ≤ L 2 0 η t γ ρ ∥y * (x t ) -y t ∥ 2 + 4η t γ ρ ∥w t -∇f (x t , y t ) -R t ∥ 2 + 4η t γ ρ ∥R t ∥ 2 - ρη t 2γ ∥x t+1 -x t ∥ 2 + 5b t L 2 0 γ λµρ - η t µλ 4b t ∥y t -y * (x t )∥ 2 - 3η t 4 ∥ỹ t+1 -y t ∥ 2 + 25η t λ 6µb t ∥∇ y g(x t , y t ) -v t ∥ 2 + 25κ 2 η t b t 6µλ ∥x t+1 -x t ∥ 2 + γ ρ - 125L 2 0 6µ 2 E∥∇ y g(x t , y t ) -v t ∥ 2 + 12L 2 g µ 2 η t 125L 2 0 ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 + mη 2 t σ 2 k 2 - 9η t 2 E∥w t -∇f (x t , y t ) -R t ∥ 2 + 2L 2 0 η t 3 ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 + 2 3η t ∥R t ∥ 2 + ∥R t+1 ∥ 2 + mη 2 t σ 2 k 2 = - γη t 4ρ L 2 0 ∥y t -y * (x t )∥ 2 + 2E∥w t -∇f (x t , y t ) -R t ∥ 2 - ρ 2γ - L 2 1 γ ρ - 125b 2 t L 2 0 κ 2 γ 6µ 2 λ 2 ρ η t ∥x t+1 -x t ∥ 2 - 15b t L 2 0 γ 4λµρ - L 2 1 γ ρ η t ∥ỹ t+1 -y t ∥ 2 + + 2mγσ 2 k 2 ρ η 2 t + 2γ 3ρη t ∥R t ∥ 2 + ∥R t+1 ∥ 2 + 4η t γ ρ ∥R t ∥ 2 ≤ - γη t 4ρ L 2 0 ∥y t -y * (x t )∥ 2 + 2E∥w t -∇f (x t , y t ) -R t ∥ 2 - ρη t 4γ ∥x t+1 -x t ∥ 2 + 2mγσ 2 k 2 ρ η 2 t + 2γ 3ρη t ∥R t ∥ 2 
+ ∥R t+1 ∥ 2 + 4η t γ ρ ∥R t ∥ 2 , where the first inequality holds by the above inequalities (54), ( 55), ( 56) and (57); the last inequality is due to 0 < γ ≤ √ 6λµρ √ 6L 2 1 λ 2 µ 2 +125b 2 u L 2 0 κ 2 ≤ √ 6λµρ √ 6L 2 1 λ 2 µ 2 +125b 2 t L 2 0 κ 2 , 0 < λ ≤ 15b l L 2 0 4L 2 1 µ ≤ 15btL 2 0 4L 2 1 µ for all t ≥ 1. Let Φ t = L 2 0 ∥y t -y * (x t )∥ 2 + 2∥w t -∇f (x t , y t ) -R t ∥ 2 , we have γη t 4ρ Φ t + ρη t 4γ ∥x t+1 -x t ∥ 2 ≤ Γ t -Γ t+1 + 2mγσ 2 k 2 ρ η 2 t + 2γ 3ρη t ∥R t ∥ 2 + ∥R t+1 ∥ 2 + 4γη t ρ ∥R t ∥ 2 . Taking average over t = 1, 2, • • • , T on both sides of (59), we have 1 T T t=1 E η t 4 Φ t + ρ 2 η t 4γ 2 ∥x t+1 -x t ∥ 2 ≤ T t=1 ρ(Γ t -Γ t+1 ) T γ + 1 T T t=1 2mσ 2 k 2 η 2 t + 1 T T t=1 2 3η t ∥R t ∥ 2 + ∥R t+1 ∥ 2 + 4η t ∥R t ∥ 2 . Given x 1 ∈ X and y 1 ∈ Y, let ∆ 0 = ∥y 1 -y * (x 1 )∥ 2 , we have Γ 1 = E F (x t ) + 5b 1 L 2 0 γ λµρ ∥y 1 -y * (x 1 )∥ 2 + γ ρ ∥v 1 -∇ y g(x 1 , y 1 )∥ 2 + ∥w 1 -∇f (x 1 , y 1 ) -R 1 ∥ 2 ≤ F (x 1 ) + 5b 1 L 2 0 γ∆ 0 λµρ + 2γσ 2 ρ , where the last inequality holds by Assumption 2. Since η t is decreasing on t, i.e., η -1 T ≥ η -1 t for any 0 ≤ t ≤ T , we have 1 T T t=1 E Φ t 4 + ρ 2 4γ 2 ∥x t+1 -x t ∥ 2 ≤ ρ T γη T T t=1 Γ t -Γ t+1 + 1 T η T T t=1 2mσ 2 k 2 η 2 t + 1 T T t=1 2 3η t ∥R t ∥ 2 + ∥R t+1 ∥ 2 + 4η t ∥R t ∥ 2 ≤ ρ T γη T F (x 1 ) + 5b 1 L 2 0 γ∆ 0 λµρ + 2γσ 2 ρ -F * + 1 T η T T t=1 2mσ 2 k 2 η 2 t + 2 3T 3 T t=1 1 η t + 4 T 2 T t=1 η t ≤ ρ(F (x 1 ) -F * ) T γη T + 5b 1 L 2 0 ∆ 0 T η T λµ + 2σ 2 T η T + 2mσ 2 T η T k 2 T 1 k 2 m + t dt + 2 3T 3 T 1 (m + t) 1/2 k dt + 4 T 2 T 1 k (m + t) 1/2 dt ≤ ρ(F (x 1 ) -F * ) T γη T + 5b 1 L 2 0 ∆ 0 T η T λµ + 2σ 2 T η T + 2mσ 2 T η T ln(m + T ) + 4 9kT 3 (m + T ) 3/2 + 8k T 2 (m + T ) 1/2 = ρ(F (x 1 ) -F * ) kγ + 5b 1 L 2 0 ∆ 0 kλµ + 2σ 2 k + 2mσ 2 k ln(m + T ) + 4(m + T ) 9kT 2 + 8k T (m + T ) 1/2 T , where the second inequality holds by the above inequality (60) and ∥R t ∥ ≤ 1 T for all t ≥ 1 by choosing K = Lg µ log(C gxy C f y T /µ) in Algorithm 1. 
Let G = F (x1)-F * ρkγ + 5b1L 2 0 ∆0 ρ 2 kλµ + 2σ 2 ρ 2 k + 2mσ 2 ρ 2 k ln(m + T ) + 4(m+T ) 9ρ 2 kT 2 + 8k ρ 2 T , we have 1 T T t=1 E Φ t 4ρ 2 + 1 4γ 2 ∥x t+1 -x t ∥ 2 ≤ G T (m + T ) 1/2 . ( ) According to the Jensen's inequality, we have 1 T T t=1 E 1 2γ ∥x t+1 -x t ∥ + 1 2ρ L 0 ∥y t -y * (x t )∥ + √ 2∥w t -∇f (x t , y t ) -R t ∥ ≤ 3 T T t=1 1 4γ 2 ∥x t+1 -x t ∥ 2 + L 2 0 4ρ 2 ∥y t -y * (x t )∥ 2 + 2 4ρ 2 E∥w t -∇f (x t , y t ) -R t ∥ 2 1/2 = 3 T T t=1 Φ t 4ρ 2 + 1 4γ 2 ∥x t+1 -x t ∥ 2 1/2 ≤ √ 3G T 1/2 (m + T ) 1/4 ≤ √ 3Gm 1/4 T 1/2 + √ 3G T 1/4 , where the last inequality is due to (a + b) 1/4 ≤ a 1/4 + b 1/4 for all a, b > 0. Thus we have 1 T T t=1 E 1 γ ∥x t+1 -x t ∥ + 1 ρ L 0 ∥y t -y * (x t )∥ + √ 2∥w t -∇f (x t , y t ) -R t ∥ ≤ 2 √ 3Gm 1/4 T 1/2 + 2 √ 3G T 1/4 . ( ) Since ∥w t -∇f (x t , y t ) -R t ∥ ≥ ∥w t -∇f (x t , y t )∥ -∥R t ∥, by the above inequality (63), we can obtain 1 T T t=1 E[G t ] = 1 T T t=1 E 1 γ ∥x t+1 -x t ∥ + 1 ρ L 0 ∥y t -y * (x t )∥ + √ 2∥w t -∇f (x t , y t )∥ ≤ 2 √ 3Gm 1/4 T 1/2 + 2 √ 3G T 1/4 + √ 2 T T t=1 E∥R t ∥ = 2 √ 3Gm 1/4 T 1/2 + 2 √ 3G T 1/4 + √ 2 T , where the last inequality is due to ∥R t ∥ ≤ 1 T for all t ≥ 1 by choosing K = Lg µ log(C gxy C f y T /µ) in Algorithm 1. According to the above inequality (11), we have 1 T T t=1 E||G X (x t , ∇F (x t ), γ)|| ≤ 1 T T t=1 E[G t ] ≤ 2 √ 3Gm 1/4 T 1/2 + 2 √ 3G T 1/4 + √ 2 T . ( ) Theorem 6. 
(Restatement of Theorem 3) Under the above Assumptions (1, 2, 4, 6, 7), in the Algorithm 1, given X = R d , η t = k (m+t) 1/2 for all t ≥ 0, α t+1 = c 1 η t , β t+1 = c 2 η t , m ≥ max k 2 , (c 1 k) 2 , (c 2 k) 2 , k > 0, 125L 2 0 6µ 2 ≤ c 1 ≤ m 1/2 k , 9 2 ≤ c 2 ≤ m 1/2 k , 0 < λ ≤ min 15b l L 2 0 4L 2 1 µ , b l 6Lg , 0 < γ ≤ min √ 6λµρ √ 6L 2 1 λ 2 µ 2 +125b 2 u L 2 0 κ 2 , m 1/2 ρ 4Lk and K = Lg µ log(C gxy C f y T /µ), we have 1 T T t=1 E∥∇F (x t )∥ ≤ 1 T T t=1 E∥A t ∥ 2 ρ 2 √ 6G ′ m T 1/2 + 2 √ 6G ′ T 1/4 + 2 √ 3 T , ( ) where G ′ = ρ(F (x1)-F * ) kγ + 5buL 2 0 ∆0 kλµ + 2σ 2 k + 2mσ 2 k ln(m + T ) + 4(m+T ) 9kT 2 + 8k T . Proof. According to Lemma 7, we have G t = 1 γ ∥x t -xt+1 ∥ + 1 ρ L 0 ∥y * (x t ) -y t ∥ + √ 2∥ ∇f (x t , y t ) -w t ∥ ≥ 1 γ ∥x t -xt+1 ∥ + 1 ρ ∥∇F (x t ) -w t ∥ (i) = ∥A -1 t w t ∥ + 1 ρ ∥∇F (x t ) -w t ∥ = 1 ∥A t ∥ ∥A t ∥∥A -1 t w t ∥ + 1 ρ ∥∇F (x t ) -w t ∥ ≥ 1 ∥A t ∥ ∥w t ∥ + 1 ρ ∥∇F (x t ) -w t ∥ (ii) ≥ 1 ∥A t ∥ ∥w t ∥ + 1 ∥A t ∥ ∥∇F (x t ) -w t ∥ ≥ 1 ∥A t ∥ ∥∇F (x t )∥, where the equality (i) holds by xt+1 = x t -γA -1 t w t that can be easily obtained from the step 5 of Algorithm 1 when X = R d , and the inequality (ii) holds by ∥A t ∥ ≥ ρ for all t ≥ 1 due to Assumption 7. Then we have ∥∇F (x t )∥ ≤ ∥A t ∥G t . According to Cauchy-Schwarz inequality, we have 1 T T t=1 E∥∇F (x t )∥ ≤ 1 T T t=1 E G t ∥A t ∥ ≤ 1 T T t=1 E[G 2 t ] 1 T T t=1 E∥A t ∥ 2 . ( ) Then we have 1 T T t=1 E[G 2 t ] ≤ 1 T T t=1 E 3L 2 0 ∥y t -y * (x t )∥ 2 ρ 2 + 6∥w t -∇f (x t , y t )∥ 2 ρ 2 + 3 γ 2 ∥x t+1 -x t ∥ 2 ≤ 1 T T t=1 E 3L 2 0 ∥y t -y * (x t )∥ 2 ρ 2 + 12∥w t -∇f (x t , y t ) -R t ∥ 2 ρ 2 + 12∥R t ∥ 2 ρ 2 + 3 γ 2 ∥x t+1 -x t ∥ 2 ≤ 24G T (m + T ) 1/2 + 1 T T t=1 12∥R t ∥ 2 ρ 2 ≤ 24G T (m + T ) 1/2 + 12 ρ 2 T 2 , ( ) where the third inequality holds by the above inequality (61), and the last inequality holds by ∥R t ∥ ≤ 1 T for all t ≥ 1 by choosing K = Lg µ log(C gxy C f y T /µ). 
By combining the inequalities ( 69) and ( 70), we have 1 T T t=1 E∥∇F (x t )∥ ≤ 1 T T t=1 E[G 2 t ] 1 T T t=1 E∥A t ∥ 2 ≤ 1 T T t=1 E∥A t ∥ 2 ρ 2 √ 6G ′ m T 1/2 + 2 √ 6G ′ T 1/4 + 2 √ 3 T , where G ′ = ρ 2 G.
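To make the single-loop scheme analyzed above concrete, here is a minimal, hypothetical sketch of a BiAdam-style iteration (in the spirit of Algorithm 1) on a toy quadratic bilevel problem, using exact gradients in place of the stochastic estimates and a diagonal Adam-style matrix for $A_t$; the objectives and every constant below are our own illustrative choices, not the paper's experiments.

```python
import numpy as np

# Toy bilevel problem: f(x, y) = 0.5*||x||^2 + <x, y>, g(x, y) = 0.5*||y - x||^2,
# so y*(x) = x and F(x) = f(x, y*(x)) = 1.5*||x||^2 is minimized at x = 0.

def hyper_grad(x, y):
    # nabla-bar f = nabla_x f - nabla^2_xy g (nabla^2_yy g)^{-1} nabla_y f.
    # Here nabla^2_xy g = -I and nabla^2_yy g = I, giving (x + y) + x.
    return (x + y) + x

def biadam_step(x, y, w, v, s, eta=0.1, gamma=0.05, lam=0.5,
                alpha=0.5, beta=0.5, rho=1e-3):
    v = (1 - alpha) * v + alpha * (y - x)          # momentum estimate of nabla_y g
    w = (1 - beta) * w + beta * hyper_grad(x, y)   # momentum estimate of nabla-bar f
    s = 0.9 * s + 0.1 * w ** 2                     # second-moment accumulator
    a = np.sqrt(s) + rho                           # diagonal adaptive matrix A_t >= rho*I
    x_tilde = x - gamma * w / a                    # adaptive outer step (X = R^d)
    y_tilde = y - lam * v                          # inner step with B_t = I
    return x + eta * (x_tilde - x), y + eta * (y_tilde - y), w, v, s

x, y = np.ones(3), np.zeros(3)
w, v, s = np.zeros(3), np.zeros(3), np.zeros(3)
for _ in range(500):
    x, y, w, v, s = biadam_step(x, y, w, v, s)
# x drifts toward the stationary point 0 while y tracks y*(x) = x.
```

Note the defining features of the analysis above: both variables move by convex combinations $x_{t+1} = x_t + \eta_t(\tilde{x}_{t+1}-x_t)$, and the lower bound $A_t \succeq \rho I_d$ (here via the additive `rho`) is what Assumption 7 requires.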

A.6 CONVERGENCE ANALYSIS OF VR-BIADAM ALGORITHM

In this subsection, we provide the detailed convergence analysis of the VR-BiAdam algorithm.

Lemma 12. Under the above Assumptions (1, 3, 4), assume the stochastic gradient estimators $v_t$ and $w_t$ are generated from Algorithm 2; then we have
$$\mathbb{E}\|\nabla_yg(x_{t+1},y_{t+1})-v_{t+1}\|^2 \leq (1-\alpha_{t+1})\mathbb{E}\|\nabla_yg(x_t,y_t)-v_t\|^2 + 2\alpha_{t+1}^2\sigma^2 + 4L_g^2\eta_t^2\big(\mathbb{E}\|\tilde{x}_{t+1}-x_t\|^2 + \mathbb{E}\|\tilde{y}_{t+1}-y_t\|^2\big), \qquad (72)$$
$$\mathbb{E}\|w_{t+1}-\nabla f(x_{t+1},y_{t+1})-R_{t+1}\|^2 \leq (1-\beta_{t+1})\mathbb{E}\|w_t-\nabla f(x_t,y_t)-R_t\|^2 + 2\beta_{t+1}^2\sigma^2 + 4L_K^2\eta_t^2\big(\|\tilde{x}_{t+1}-x_t\|^2 + \|\tilde{y}_{t+1}-y_t\|^2\big), \qquad (73)$$
where $L_K^2 = 2L_f^2 + \frac{6C_{gxy}^2L_f^2K}{2\mu L_g-\mu^2} + \frac{6C_{fy}^2L_{gxy}^2K}{2\mu L_g-\mu^2} + \frac{6C_{gxy}^2L_f^2K^3L_{gyy}^2}{(L_g-\mu)^2(2\mu L_g-\mu^2)}$.

Proof. Without loss of generality, we only prove the bound on the term $\mathbb{E}\|w_{t+1}-\nabla f(x_{t+1},y_{t+1})-R_{t+1}\|^2$; the other term is similar. Since $w_{t+1} = \nabla f(x_{t+1},y_{t+1};\xi_{t+1}) + (1-\beta_{t+1})\big(w_t-\nabla f(x_t,y_t;\xi_{t+1})\big)$, we have
$$\mathbb{E}\|w_{t+1}-\nabla f(x_{t+1},y_{t+1})-R_{t+1}\|^2 \qquad (74)$$
$$= \mathbb{E}\|\nabla f(x_{t+1},y_{t+1};\xi_{t+1}) + (1-\beta_{t+1})\big(w_t-\nabla f(x_t,y_t;\xi_{t+1})\big) - \nabla f(x_{t+1},y_{t+1}) - R_{t+1}\|^2$$
$$= \mathbb{E}\big\|(1-\beta_{t+1})\big(w_t-\nabla f(x_t,y_t)-R_t\big) + \beta_{t+1}\big(\nabla f(x_{t+1},y_{t+1};\xi_{t+1})-\nabla f(x_{t+1},y_{t+1})-R_{t+1}\big)$$
$$\quad + (1-\beta_{t+1})\big(\nabla f(x_{t+1},y_{t+1};\xi_{t+1})-\nabla f(x_{t+1},y_{t+1})-R_{t+1} - (\nabla f(x_t,y_t;\xi_{t+1})-\nabla f(x_t,y_t)-R_t)\big)\big\|^2$$
$$= (1-\beta_{t+1})^2\mathbb{E}\|w_t-\nabla f(x_t,y_t)-R_t\|^2 + \mathbb{E}\big\|\beta_{t+1}\big(\nabla f(x_{t+1},y_{t+1};\xi_{t+1})-\nabla f(x_{t+1},y_{t+1})-R_{t+1}\big)$$
$$\quad + (1-\beta_{t+1})\big(\nabla f(x_{t+1},y_{t+1};\xi_{t+1})-\nabla f(x_{t+1},y_{t+1})-R_{t+1} - (\nabla f(x_t,y_t;\xi_{t+1})-\nabla f(x_t,y_t)-R_t)\big)\big\|^2$$
$$\leq (1-\beta_{t+1})^2\mathbb{E}\|w_t-\nabla f(x_t,y_t)-R_t\|^2 + 2\beta_{t+1}^2\mathbb{E}\|\nabla f(x_{t+1},y_{t+1};\xi_{t+1})-\nabla f(x_{t+1},y_{t+1})-R_{t+1}\|^2$$
$$\quad + 2(1-\beta_{t+1})^2\mathbb{E}\|\nabla f(x_{t+1},y_{t+1};\xi_{t+1})-\nabla f(x_{t+1},y_{t+1})-R_{t+1} - \big(\nabla f(x_t,y_t;\xi_{t+1})-\nabla f(x_t,y_t)-R_t\big)\|^2$$
$$\leq (1-\beta_{t+1})^2\mathbb{E}\|w_t-\nabla f(x_t,y_t)-R_t\|^2 + 2\beta_{t+1}^2\sigma^2 + 2(1-\beta_{t+1})^2\mathbb{E}\|\nabla f(x_{t+1},y_{t+1};\xi_{t+1})-\nabla f(x_t,y_t;\xi_{t+1})\|^2$$
$$\leq (1-\beta_{t+1})^2\mathbb{E}\|w_t-\nabla f(x_t,y_t)-R_t\|^2 + 2\beta_{t+1}^2\sigma^2 + 4(1-\beta_{t+1})^2L_K^2\big(\|x_{t+1}-x_t\|^2+\|y_{t+1}-y_t\|^2\big)$$
$$\leq (1-\beta_{t+1})\mathbb{E}\|w_t-\nabla f(x_t,y_t)-R_t\|^2 + 2\beta_{t+1}^2\sigma^2 + 4L_K^2\eta_t^2\big(\|\tilde{x}_{t+1}-x_t\|^2+\|\tilde{y}_{t+1}-y_t\|^2\big),$$
where the third equality holds by $\mathbb{E}_{\xi_{t+1}}[\nabla f(x_{t+1},y_{t+1};\xi_{t+1})] = \nabla f(x_{t+1},y_{t+1})+R_{t+1}$ and $\mathbb{E}_{\xi_{t+1}}[\nabla f(x_t,y_t;\xi_{t+1})] = \nabla f(x_t,y_t)+R_t$; the third-last inequality holds by the inequality $\mathbb{E}\|\zeta-\mathbb{E}[\zeta]\|^2\leq\mathbb{E}\|\zeta\|^2$; the second-last inequality is due to Lemma 4; and the last inequality holds by $0<\beta_{t+1}\leq1$ and $x_{t+1}=x_t+\eta_t(\tilde{x}_{t+1}-x_t)$, $y_{t+1}=y_t+\eta_t(\tilde{y}_{t+1}-y_t)$.

Theorem 7. (Restatement of Theorem 2) Under the above Assumptions (1, 3, 4, 6, 7), in Algorithm 2, given $X\subset\mathbb{R}^d$, $\eta_t=\frac{k}{(m+t)^{1/3}}$ for all $t\geq0$, $\alpha_{t+1}=c_1\eta_t^2$, $\beta_{t+1}=c_2\eta_t^2$, $m\geq\max\{2,k^3,(c_1k)^3,(c_2k)^3\}$, $k>0$, $c_1\geq\frac{2}{3k^3}+\frac{125L_0^2}{6\mu^2}$, $c_2\geq\frac{2}{3k^3}+\frac{9}{2}$, $0<\lambda\leq\min\big\{\frac{15b_lL_0^2}{16L_2^2\mu},\frac{b_l}{6L_g}\big\}$, $0<\gamma\leq\min\big\{\frac{\sqrt{6}\lambda\mu\rho}{2\sqrt{24L_2^2\lambda^2\mu^2+125b_u^2L_0^2\kappa^2}},\frac{m^{1/3}\rho}{4Lk}\big\}$ and $K=\frac{L_g}{\mu}\log(C_{gxy}C_{fy}T/\mu)$, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|\mathcal{G}_X(x_t,\nabla F(x_t),\gamma)\| \leq \frac{1}{T}\sum_{t=1}^T\mathbb{E}[G_t] \leq \frac{2\sqrt{3M}m^{1/6}}{T^{1/2}} + \frac{2\sqrt{3M}}{T^{1/3}} + \frac{\sqrt{2}}{T},$$
where $M = \frac{F(x_1)-F^*}{\rho k\gamma} + \frac{5b_1L_0^2\Delta_0}{\rho^2k\lambda\mu} + \frac{2m^{1/3}\sigma^2}{\rho^2k^2} + \frac{2k^2(c_1^2+c_2^2)\sigma^2\ln(m+T)}{\rho^2} + \frac{6k(m+T)^{1/3}}{\rho^2T}$, $\Delta_0=\|y_1-y^*(x_1)\|^2$ and $L_2^2 = L_g^2 + L_K^2$.

Proof. Since $\eta_t=\frac{k}{(m+t)^{1/3}}$ is decreasing in $t$ and $m\geq k^3$, we have $\eta_t\leq\eta_0=\frac{k}{m^{1/3}}\leq1$ and $\gamma\leq\frac{m^{1/3}\rho}{4Lk}\leq\frac{\rho}{2L\eta_0}\leq\frac{\rho}{2L\eta_t}$ for any $t\geq0$. Due to $0<\eta_t\leq1$ and $m\geq(c_1k)^3$, we have $\alpha_{t+1}=c_1\eta_t^2\leq c_1\eta_t\leq\frac{c_1k}{m^{1/3}}\leq1$. Similarly, due to $m\geq(c_2k)^3$, we have $\beta_{t+1}\leq1$. According to Lemma 12, we have
$$\frac{1}{\eta_t}\mathbb{E}\|\nabla_yg(x_{t+1},y_{t+1})-v_{t+1}\|^2 - \frac{1}{\eta_{t-1}}\mathbb{E}\|\nabla_yg(x_t,y_t)-v_t\|^2 \qquad (76)$$
$$\leq \Big(\frac{1-\alpha_{t+1}}{\eta_t}-\frac{1}{\eta_{t-1}}\Big)\mathbb{E}\|\nabla_yg(x_t,y_t)-v_t\|^2 + 4L_g^2\eta_t\big(\|\tilde{x}_{t+1}-x_t\|^2+\|\tilde{y}_{t+1}-y_t\|^2\big) + \frac{2\alpha_{t+1}^2\sigma^2}{\eta_t}$$
$$= \Big(\frac{1}{\eta_t}-\frac{1}{\eta_{t-1}}-c_1\eta_t\Big)\mathbb{E}\|\nabla_yg(x_t,y_t)-v_t\|^2 + 4L_g^2\eta_t\big(\|\tilde{x}_{t+1}-x_t\|^2+\|\tilde{y}_{t+1}-y_t\|^2\big) + 2c_1^2\eta_t^3\sigma^2,$$
where the last equality is due to $\alpha_{t+1}=c_1\eta_t^2$.
Similarly, we have 1 η t E∥w t+1 -∇f (x t+1 , y t+1 ) -R t+1 ∥ 2 - 1 η t-1 E∥w t -∇f (x t , y t ) -R t ∥ 2 (77) ≤ 1 -β t+1 η t - 1 η t-1 E∥w t -∇f (x t , y t ) -R t ∥ 2 + 4L 2 K η t ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 + 2β 2 t+1 σ 2 η t = 1 η t - 1 η t-1 -c 2 η t E∥w t -∇f (x t , y t ) -R t ∥ 2 + 4L 2 K η t ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 + 2c 2 2 η 3 t σ 2 . By η t = k (m+t) 1/3 , we have 1 η t - 1 η t-1 = 1 k (m + t) 1 3 -(m + t -1) 1 3 ≤ 1 3k(m + t -1) 2/3 ≤ 1 3k m/2 + t 2/3 ≤ 2 2/3 3k(m + t) 2/3 = 2 2/3 3k 3 k 2 (m + t) 2/3 = 2 2/3 3k 3 η 2 t ≤ 2 3k 3 η t , where the first inequality holds by the concavity of function f (x) = x 1/3 , i.e., (x + y) 1/3 ≤ x 1/3 + y 3x 2/3 ; the second inequality is due to m ≥ 2, and the last inequality is due to 0 < η t ≤ 1. Let c 1 ≥ 2 3k 3 + 125L 2 0 6µ 2 , we have 1 η t E∥∇ y g(x t+1 , y t+1 ) -v t+1 ∥ 2 - 1 η t-1 E∥∇ y g(x t , y t ) -v t ∥ 2 (79) ≤ - 125L 2 0 η t 6µ 2 E∥∇ y g(x t , y t ) -v t ∥ 2 + 4L 2 g η t ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 + 2c 2 1 η 3 t σ 2 . Let c 2 ≥ 2 3k 3 + 9 2 , we have 1 η t E∥w t+1 -∇f (x t+1 , y t+1 ) -R t+1 ∥ 2 - 1 η t-1 E∥w t -∇f (x t , y t ) -R t ∥ 2 (80) ≤ - 9η t 2 E∥w t -∇f (x t , y t ) -R t ∥ 2 + 4L 2 K η t ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 + 2c 2 2 η 3 t σ 2 . According to Lemmas 7 and 9, we have F (x t+1 ) -F (x t ) ≤ 2η t γ ρ ∥w t -∇f (x t , y t )∥ 2 + L 2 0 η t γ ρ ∥y * (x t ) -y t ∥ 2 - ρη t 2γ ∥x t+1 -x t ∥ 2 ≤ 4η t γ ρ ∥w t -∇f (x t , y t ) -R t ∥ 2 + 4η t γ ρ ∥R t ∥ 2 + L 2 0 η t γ ρ ∥y * (x t ) -y t ∥ 2 - ρη t 2γ ∥x t+1 -x t ∥ 2 . According to Lemma 10, we have ∥y t+1 -y * (x t+1 )∥ 2 -∥y t -y * (x t )∥ 2 (82) ≤ - η t µλ 4b t ∥y t -y * (x t )∥ 2 - 3η t 4 ∥ỹ t+1 -y t ∥ 2 + 25η t λ 6µb t ∥∇ y g(x t , y t ) -v t ∥ 2 + 25κ 2 η t b t 6µλ ∥x t+1 -x t ∥ 2 . Next, we define a Lyapunov function, for any t ≥ 1 Θ t = E F (x t ) + 5b t L 2 0 γ λµρ ∥y t -y * (x t )∥ 2 + γ ρη t-1 ∥v t -∇ y g(x t , y t )∥ 2 + ∥w t -∇f (x t , y t ) -R t ∥ 2 . For notational simplicity, let L 2 2 = L 2 g + L 2 K . 
Then we have Θ t+1 -Θ t = F (x t+1 ) -F (x t ) + 5b t L 2 0 γ λµρ ∥y t+1 -y * (x t+1 )∥ 2 -∥y t -y * (x t )∥ 2 + γ ρ 1 η t E∥v t+1 -∇ y g(x t+1 , y t+1 )∥ 2 - 1 η t-1 E∥v t -∇ y g(x t , y t )∥ 2 + 1 η t E∥w t+1 -∇f (x t+1 , y t+1 ) -R t+1 ∥ 2 - 1 η t-1 E∥w t -∇f (x t , y t ) -R t ∥ 2 ≤ L 2 0 η t γ ρ ∥y * (x t ) -y t ∥ 2 + 4η t γ ρ ∥w t -∇f (x t , y t ) -R t ∥ 2 + 4η t γ ρ ∥R t ∥ 2 - ρη t 2γ ∥x t+1 -x t ∥ 2 + 5b t L 2 0 γ λµρ - η t µλ 4b t ∥y t -y * (x t )∥ 2 - 3η t 4 ∥ỹ t+1 -y t ∥ 2 + 25η t λ 6µb t ∥∇ y g(x t , y t ) -v t ∥ 2 + 25κ 2 η t b t 6µλ ∥x t+1 -x t ∥ 2 + γ ρ - 125L 2 0 η t 6µ 2 E∥∇ y g(x t , y t ) -v t ∥ 2 + 4L 2 g η t ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 + 2c 2 1 η 3 t σ 2 - 9η t 2 E∥w t -∇f (x t , y t ) -R t ∥ 2 + 4L 2 K η t ∥x t+1 -x t ∥ 2 + ∥ỹ t+1 -y t ∥ 2 + 2c 2 2 η 3 t σ 2 = - γη t 4ρ L 2 0 ∥y t -y * (x t )∥ 2 + 2E∥w t -∇f (x t , y t ) -R t ∥ 2 - ρ 2γ - 4L 2 2 γ ρ - 125b 2 t L 2 0 κ 2 γ 6µ 2 λ 2 ρ η t ∥x t+1 -x t ∥ 2 - 15b t L 2 0 γ 4λµρ - 4L 2 2 γ ρ η t ∥ỹ t+1 -y t ∥ 2 + 4η t γ ρ ∥R t ∥ 2 + 2(c 2 1 + c 2 2 )γσ 2 ρ η 3 t ≤ - γη t 4ρ L 2 0 ∥y t -y * (x t )∥ 2 + 2E∥w t -∇f (x t , y t ) -R t ∥ 2 - ρη t 4γ ∥x t+1 -x t ∥ 2 + 4η t γ ρ ∥R t ∥ 2 + 2(c 2 1 + c 2 2 )γσ 2 ρ η 3 t , where the first inequality holds by the above inequalities (79), ( 80), ( 81) and (82); the last inequality is due to 0 < γ ≤ √ 6λµρ 2 √ 24L 2 2 λ 2 µ 2 +125b 2 u L 2 0 κ 2 ≤ √ 6λµρ 2 √ 24L 2 2 λ 2 µ 2 +125b 2 t L 2 0 κ 2 , 0 < λ ≤ 15b l L 2 0 16L 2 2 µ ≤ 15btL 2 0 16L 2 µ for all t ≥ 1. Let Φ t = L 2 0 ∥y t -y * (x t )∥ 2 + 2∥w t -∇f (x t , y t ) -R t ∥ 2 , then we have γη t 4ρ E Φ t + ρη t 4γ ∥x t+1 -x t ∥ 2 ≤ Θ t -Θ t+1 + 4η t γ ρ ∥R t ∥ 2 + 2(c 2 1 + c 2 2 )γσ 2 ρ η 3 t . Taking average over t = 1, 2, • • • , T on both sides of (84), we have 1 T T t=1 E η t 4 Φ t + ρ 2 η t 4γ 2 ∥x t+1 -x t ∥ 2 ≤ T t=1 ρ(Θ t -Θ t+1 ) T γ + 4 T T t=1 η t ∥R t ∥ 2 + 2(c 2 1 + c 2 2 )σ 2 T T t=1 η 3 t . 
Given $x_1\in\mathcal{X}$ and $y_1\in\mathcal{Y}$, let $\Delta_0=\|y_1-y^*(x_1)\|^2$; we have
$$\Theta_1=\mathbb{E}\Big[F(x_1)+\frac{5b_1L_0^2\gamma}{\lambda\mu\rho}\|y_1-y^*(x_1)\|^2+\frac{\gamma}{\rho\eta_0}\big(\|v_1-\nabla_y g(x_1,y_1)\|^2+\|w_1-\nabla f(x_1,y_1)-R_1\|^2\big)\Big]\le F(x_1)+\frac{5b_1L_0^2\gamma\Delta_0}{\lambda\mu\rho}+\frac{2\gamma\sigma^2}{\rho\eta_0},$$
where the last inequality holds by Assumption 2. Since $\eta_t$ is decreasing, i.e., $\eta_T^{-1}\ge\eta_t^{-1}$ for any $0\le t\le T$, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\Big[\frac{\Phi_t}{4}+\frac{\rho^2}{4\gamma^2}\|x_{t+1}-x_t\|^2\Big]\le\frac{\rho}{T\gamma\eta_T}\sum_{t=1}^T(\Theta_t-\Theta_{t+1})+\frac{2(c_1^2+c_2^2)\sigma^2}{T\eta_T}\sum_{t=1}^T\eta_t^3+\frac{4}{T}\sum_{t=1}^T\eta_t\|R_t\|^2$$
$$\le\frac{\rho}{T\gamma\eta_T}\Big(F(x_1)+\frac{5b_1L_0^2\gamma\Delta_0}{\lambda\mu\rho}+\frac{2\gamma\sigma^2}{\rho\eta_0}-F^*\Big)+\frac{2(c_1^2+c_2^2)\sigma^2}{T\eta_T}\sum_{t=1}^T\eta_t^3+\frac{4}{T^2}\sum_{t=1}^T\eta_t$$
$$\le\frac{\rho(F(x_1)-F^*)}{T\gamma\eta_T}+\frac{5b_1L_0^2\Delta_0}{\lambda\mu\eta_T T}+\frac{2\sigma^2}{\eta_0\eta_T T}+\frac{2(c_1^2+c_2^2)\sigma^2}{T\eta_T}\int_1^T\frac{k^3}{m+t}\,dt+\frac{4}{T^2}\int_1^T\frac{k}{(m+t)^{1/3}}\,dt$$
$$\le\frac{\rho(F(x_1)-F^*)}{T\gamma\eta_T}+\frac{5b_1L_0^2\Delta_0}{\lambda\mu\eta_T T}+\frac{2\sigma^2}{\eta_0\eta_T T}+\frac{2k^3(c_1^2+c_2^2)\sigma^2}{T\eta_T}\ln(m+T)+\frac{6k(m+T)^{2/3}}{T^2}$$
$$=\Big[\frac{\rho(F(x_1)-F^*)}{k\gamma}+\frac{5b_1L_0^2\Delta_0}{k\lambda\mu}+\frac{2m^{1/3}\sigma^2}{k^2}+2k^2(c_1^2+c_2^2)\sigma^2\ln(m+T)+\frac{6k(m+T)^{1/3}}{T}\Big]\frac{(m+T)^{1/3}}{T}, \quad (86)$$
where the second inequality holds by the above inequality (85). Let
$$M=\frac{\rho(F(x_1)-F^*)}{k\gamma}+\frac{5b_1L_0^2\Delta_0}{k\lambda\mu}+\frac{2m^{1/3}\sigma^2}{k^2}+2k^2(c_1^2+c_2^2)\sigma^2\ln(m+T)+\frac{6k(m+T)^{1/3}}{T};$$
then we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\Big[\frac{\Phi_t}{4}+\frac{\rho^2}{4\gamma^2}\|x_{t+1}-x_t\|^2\Big]\le\frac{M(m+T)^{1/3}}{T}. \quad (87)$$
It follows that
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\Big[\frac{\Phi_t}{4\rho^2}+\frac{1}{4\gamma^2}\|x_{t+1}-x_t\|^2\Big]^{1/2}\le\frac{\sqrt{3M}\,(m+T)^{1/6}}{T^{1/2}}\le\frac{\sqrt{3M}\,m^{1/6}}{T^{1/2}}+\frac{\sqrt{3M}}{T^{1/3}}, \quad (88)$$
where the last inequality is due to $(a+b)^{1/6}\le a^{1/6}+b^{1/6}$ for all $a,b>0$. Thus we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\Big[\frac{1}{\gamma}\|x_{t+1}-x_t\|+\frac{1}{\rho}\big(L_0\|y_t-y^*(x_t)\|+\sqrt{2}\|w_t-\nabla f(x_t,y_t)-R_t\|\big)\Big]\le\frac{2\sqrt{3M}\,m^{1/6}}{T^{1/2}}+\frac{2\sqrt{3M}}{T^{1/3}}. \quad (89)$$
Since $\|w_t-\nabla f(x_t,y_t)-R_t\|\ge\|w_t-\nabla f(x_t,y_t)\|-\|R_t\|$, by the above inequality (89), we can obtain
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\Big[\frac{1}{\gamma}\|x_{t+1}-x_t\|+\frac{1}{\rho}\big(L_0\|y_t-y^*(x_t)\|+\sqrt{2}\|w_t-\nabla f(x_t,y_t)\|\big)\Big]\le\frac{2\sqrt{3M}\,m^{1/6}}{T^{1/2}}+\frac{2\sqrt{3M}}{T^{1/3}}+\frac{\sqrt{2}}{T}\sum_{t=1}^T\frac{\|R_t\|}{1}\cdot\frac{1}{1}\le\frac{2\sqrt{3M}\,m^{1/6}}{T^{1/2}}+\frac{2\sqrt{3M}}{T^{1/3}}+\frac{\sqrt{2}}{T}, \quad (90)$$
where the last inequality is due to $\|R_t\|\le\frac{1}{T}$ for all $t\ge 1$ by choosing $K=\frac{L_g}{\mu}\log(C_{gxy}C_{fy}T/\mu)$ in Algorithm 2.
According to the above inequality (11), we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|\mathcal{G}_{\mathcal{X}}(x_t,\nabla F(x_t),\gamma)\|\le\frac{1}{T}\sum_{t=1}^T\mathbb{E}[G_t]\le\frac{2\sqrt{3M}\,m^{1/6}}{T^{1/2}}+\frac{2\sqrt{3M}}{T^{1/3}}+\frac{\sqrt{2}}{T}. \quad (91)$$
Theorem 8. (Restatement of Theorem 4) Under the above Assumptions (1, 3, 4, 6, 7), in Algorithm 2, given $\mathcal{X}=\mathbb{R}^d$, $\eta_t=\frac{k}{(m+t)^{1/3}}$ for all $t\ge 0$, $\alpha_{t+1}=c_1\eta_t^2$, $\beta_{t+1}=c_2\eta_t^2$, $m\ge\max\{2,k^3,(c_1k)^3,(c_2k)^3\}$, $k>0$, $c_1\ge\frac{2}{3k^3}+\frac{125L_0^2}{6\mu^2}$, $c_2\ge\frac{2}{3k^3}+\frac{9}{2}$, $0<\lambda\le\min\big\{\frac{15b_lL_0^2}{16L_2^2\mu},\frac{b_l}{6L_g}\big\}$, $0<\gamma\le\min\big\{\frac{\sqrt{6}\lambda\mu\rho}{2\sqrt{24L_2^2\lambda^2\mu^2+125b_u^2L_0^2\kappa^2}},\frac{m^{1/3}\rho}{4Lk}\big\}$ and $K=\frac{L_g}{\mu}\log(C_{gxy}C_{fy}T/\mu)$, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|\nabla F(x_t)\|\le\sqrt{\frac{1}{T}\sum_{t=1}^T\mathbb{E}\frac{\|A_t\|^2}{\rho^2}}\Big(\frac{2\sqrt{6M'}\,m^{1/6}}{T^{1/2}}+\frac{2\sqrt{6M'}}{T^{1/3}}+\frac{2\sqrt{3}}{T}\Big), \quad (92)$$
where $M'=\rho^2M$ and
$$M=\frac{\rho(F(x_1)-F^*)}{k\gamma}+\frac{5b_1L_0^2\Delta_0}{k\lambda\mu}+\frac{2m^{1/3}\sigma^2}{k^2}+2k^2(c_1^2+c_2^2)\sigma^2\ln(m+T)+\frac{6k(m+T)^{1/3}}{T},$$
where the equality (i) holds by $x_{t+1}=x_t-\gamma A_t^{-1}w_t$, which can easily be obtained from step 5 of Algorithm 2 when $\mathcal{X}=\mathbb{R}^d$, and the inequality (ii) holds by $\|A_t\|\ge\rho$ for all $t\ge 1$ due to Assumption 7. Then we have
$$\|\nabla F(x_t)\|\le\|A_t\|G_t. \quad (94)$$
According to the Cauchy-Schwarz inequality, we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|\nabla F(x_t)\|\le\frac{1}{T}\sum_{t=1}^T\mathbb{E}\big[G_t\|A_t\|\big]\le\sqrt{\frac{1}{T}\sum_{t=1}^T\mathbb{E}[G_t^2]}\sqrt{\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|A_t\|^2}. \quad (95)$$
Then we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}[G_t^2]\le\frac{1}{T}\sum_{t=1}^T\mathbb{E}\Big[\frac{3L_0^2\|y_t-y^*(x_t)\|^2}{\rho^2}+\frac{6\|w_t-\nabla f(x_t,y_t)\|^2}{\rho^2}+\frac{3}{\gamma^2}\|x_{t+1}-x_t\|^2\Big]\le\frac{1}{T}\sum_{t=1}^T\mathbb{E}\Big[\frac{3L_0^2\|y_t-y^*(x_t)\|^2}{\rho^2}+\frac{12\|w_t-\nabla f(x_t,y_t)-R_t\|^2}{\rho^2}+\frac{3}{\gamma^2}\|x_{t+1}-x_t\|^2\Big]+\frac{1}{T}\sum_{t=1}^T\frac{12\|R_t\|^2}{\rho^2}\le\frac{24M(m+T)^{1/3}}{T}+\frac{1}{T}\sum_{t=1}^T\frac{12\|R_t\|^2}{\rho^2}\le\frac{24M(m+T)^{1/3}}{T}+\frac{12}{\rho^2T^2}, \quad (96)$$
where the third inequality holds by the above inequality (87), and the last inequality holds by $\|R_t\|\le\frac{1}{T}$ for all $t\ge 1$ by choosing $K=\frac{L_g}{\mu}\log(C_{gxy}C_{fy}T/\mu)$. By combining the above inequalities (95) and (96), we have
$$\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|\nabla F(x_t)\|\le\sqrt{\frac{1}{T}\sum_{t=1}^T\mathbb{E}[G_t^2]}\sqrt{\frac{1}{T}\sum_{t=1}^T\mathbb{E}\|A_t\|^2}\le\sqrt{\frac{1}{T}\sum_{t=1}^T\mathbb{E}\frac{\|A_t\|^2}{\rho^2}}\Big(\frac{2\sqrt{6M'}\,m^{1/6}}{T^{1/2}}+\frac{2\sqrt{6M'}}{T^{1/3}}+\frac{2\sqrt{3}}{T}\Big),$$
where $M'=\rho^2M$.
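For intuition, the unconstrained update $x_{t+1}=x_t-\gamma A_t^{-1}w_t$ used above can be sketched with an Adam-style diagonal adaptive matrix. This is an illustrative sketch (the function names and the specific diagonal choice of $A_t$ are ours), not the paper's implementation:

```python
# Sketch of the unconstrained adaptive step x_{t+1} = x_t - gamma * A_t^{-1} w_t,
# where A_t is an Adam-style diagonal matrix A_t = diag(sqrt(a_t)) + rho * I.
# Adding rho to each diagonal entry ensures A_t >= rho * I (cf. Assumption 7).

def adam_style_diagonal(a, rho):
    """Diagonal entries of A_t from a second-moment estimate a (elementwise)."""
    return [rho + v ** 0.5 for v in a]

def adaptive_step(x, w, a, gamma, rho):
    """One update x - gamma * A^{-1} w, with A the diagonal matrix above."""
    diag = adam_style_diagonal(a, rho)
    return [xi - gamma * wi / di for xi, wi, di in zip(x, w, diag)]

print(adaptive_step([1.0, 1.0], [2.0, 2.0], [0.0, 0.0], 0.5, 1.0))  # [0.0, 0.0]
```

With $a_t=0$ the step reduces to plain gradient descent with rate $\gamma/\rho$, which is why the analysis only needs the spectral lower bound $A_t\succeq\rho I$ rather than a specific adaptive rule.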



$F(x):=\mathbb{E}_{\xi\sim\mathcal{D}}\big[f\big(x,y^*(x);\xi\big)\big]$

Assumptions 1-5 are commonly used in stochastic bilevel optimization problems Ghadimi & Wang (2018); Hong et al. (2020); Ji et al. (2021); Chen et al. (2022); Khanduri et al. (2021).

we define a prox-function (a.k.a., Bregman distance) Censor & Lent (1981); Censor & Zenios (1992); Ghadimi et al.
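For completeness, the standard Bregman-distance form of such a prox-function (this is the textbook definition from the cited works; the distance-generating function $w$ below is generic rather than a specific choice from this paper) is:

```latex
% Bregman distance generated by a differentiable convex function w:
V(z, x) \;=\; w(z) - \big[\, w(x) + \langle \nabla w(x),\, z - x \rangle \,\big].
% With w(x) = \tfrac{1}{2}\|x\|^2 it reduces to V(z, x) = \tfrac{1}{2}\|z - x\|^2.
```

Choosing the Euclidean generator recovers the usual squared distance, which is the case most relevant to the quadratic prox steps used in Algorithms 1 and 2.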

Without loss of generality, let k = O(1) and m = O(1); then we have M = O(ln(m + T)) = Õ(1). Thus our VR-BiAdam algorithm has a convergence rate of Õ(1/T^{1/3}).
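As a back-of-the-envelope check (our own arithmetic, using only the stated rate), a convergence rate of Õ(1/T^{1/3}) means that reaching an ϵ-stationary point needs T = Õ(ϵ^{-3}) iterations; with O(1) samples per iteration this matches the Õ(ϵ^{-3}) sample complexity claimed for VR-BiAdam:

```python
import math

# Smallest number of iterations T such that c / T^(1/3) <= eps,
# i.e., T = ceil((c / eps)^3); c stands in for the hidden (log-factor) constant.

def iterations_needed(eps, c=1.0):
    return math.ceil((c / eps) ** 3)

print(iterations_needed(0.1))  # 1000
```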

Figure 1: Validation Loss vs. Running Time. We test three values of ρ: 0.8, 0.6, 0.2 from left to right. A larger value of ρ represents a noisier setting.

As shown in Figure 2, both our BiAdam and VR-BiAdam algorithms outperform the other baselines.

Figure 2: Test Accuracy vs. Running Time. The plots correspond to 5-way-1-shot, 5-way-5-shot, 20-way-1-shot and 20-way-5-shot from left to right.

A.2.1 BILEVEL OPTIMIZATION METHODS

Bilevel optimization has shown success in many machine learning problems with hierarchical structures, such as policy optimization Hong et al. (2020), model-agnostic meta-learning Liu et al. (2021a); Ji et al. (2021); Lu et al. (2022) and adversarial training Zhang et al. (2021). Thus, research on bilevel optimization has become active in the machine learning community, and a number of bilevel optimization methods have recently been proposed. For example, one class of successful methods Colson et al. (2007); Kunapuli et al. (2008) reformulates the bilevel problem as a single-level problem by replacing the inner problem with its optimality conditions. Another class of successful methods Ghadimi & Wang (2018); Hong et al. (2020); Ji et al. (2021); Chen et al. (2022); Khanduri et al. (2021); Guo & Yang (2021); Liu et al. (2021b; 2022); Li et al. (2021) iteratively approximates the (stochastic) gradient of the outer problem in either forward or reverse mode. Specifically, Liu et al. (2022) proposed a general gradient-based descent aggregation framework for bilevel optimization. Moreover, the non-asymptotic analysis of these gradient-based bilevel optimization methods has recently been studied. For example, Ghadimi & Wang (2018) first established a sample complexity of O(ϵ^{-6}) for their double-loop algorithm for the bilevel problem (1) (please see Table 1). Subsequently, Ji et al. (2021) proposed an accelerated double-loop algorithm that reaches a sample complexity of O(ϵ^{-4}), relying on large batches. At the same time, Hong et al. (2020) studied a single-loop algorithm that reaches a sample complexity of O(ϵ^{-5}) without relying on large batches. Moreover, Khanduri et al. (2021); Guo & Yang (2021); Yang et al. (2021) proposed a class of accelerated single-loop methods for the bilevel problem (1) by using the momentum-based variance-reduced technique, which achieve the best known sample complexity of O(ϵ^{-3}). More recently, Dagréou et al.
(2022) proposed a novel framework for bilevel optimization based on solving a linear system. Meanwhile, Lu et al. (2022); Li et al. (2022); Tarzanagh et al. (2022) studied distributed bilevel optimization.

A.2.2 ADAPTIVE GRADIENT METHODS

Adaptive gradient methods have recently shown great success in modern machine learning problems such as training Deep Neural Networks (DNNs). Thus, many adaptive gradient methods Duchi et al. (2011); Kingma & Ba (2014); Loshchilov & Hutter (2018); Zhuang et al. (2020) have been developed and studied. For example, AdaGrad Duchi et al. (2011) is the first adaptive gradient method, and it shows good performance in the sparse-gradient setting. One variant of AdaGrad, Adam Kingma & Ba (2014), is a very popular adaptive gradient method and is basically the default method of choice for training DNNs. Subsequently, several variants of the Adam algorithm Reddi et al. (2019); Chen et al. (2019) have been developed and studied; in particular, they have convergence guarantees in the nonconvex setting. At the same time, some adaptive gradient methods Loshchilov & Hutter (2018); Chen et al. (2018); Zhuang et al. (2020) have been presented to improve the generalization performance of the Adam algorithm. The norm version of AdaGrad (i.e., AdaGrad-Norm) Ward et al. (2019) has been presented to accelerate AdaGrad without sacrificing generalization. Moreover, accelerated adaptive gradient methods such as STORM Cutkosky & Orabona (2019) and SUPER-ADAM Huang et al. (2021) have been proposed by using the variance-reduced technique. Meanwhile, Huang et al. (2021); Guo et al. (2021a) studied convergence analysis frameworks for adaptive gradient methods.

$$\tilde{x}_{t+1}=\arg\min_{x\in\mathcal{X}}\Big\{\langle w_t,x\rangle+\frac{1}{2\gamma}(x-x_t)^TA_t(x-x_t)\Big\},\quad\text{and}\quad x_{t+1}=x_t+\eta_t(\tilde{x}_{t+1}-x_t);$$
$$\tilde{y}_{t+1}=\arg\min_{y\in\mathcal{Y}}\Big\{\langle v_t,y\rangle+\frac{1}{2\lambda}(y-y_t)^TB_t(y-y_t)\Big\},\quad\text{and}\quad y_{t+1}=y_t+\eta_t(\tilde{y}_{t+1}-y_t).$$
Compared with the estimator $w_{t+1}$ in Algorithm 1, the estimator $w_{t+1}$ in Algorithm 2 adds the correction term $\nabla f(x_{t+1},y_{t+1};\xi_{t+1})-\nabla f(x_t,y_t;\xi_{t+1})$, evaluated on the same sample, to control the variance of the estimator.
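The variance-reduced estimator update can be sketched as the STORM-style recursion $w_{t+1}=\nabla f(x_{t+1},y_{t+1};\xi_{t+1})+(1-\beta_{t+1})\big(w_t-\nabla f(x_t,y_t;\xi_{t+1})\big)$, which contains exactly the correction term described above. The following is an illustrative sketch (the function name is ours), not the paper's implementation:

```python
# One step of a momentum-based variance-reduced (STORM-style) estimator:
#   w_{t+1} = grad_new + (1 - beta) * (w_t - grad_old),
# where grad_new and grad_old are stochastic gradients at the new and old
# iterates evaluated on the SAME sample, so their difference is the
# correction term that controls the estimator's variance.

def vr_update(w_t, grad_new, grad_old, beta):
    """beta in (0, 1]; beta = 1 recovers the plain stochastic gradient."""
    return grad_new + (1.0 - beta) * (w_t - grad_old)

print(vr_update(5.0, 2.0, 3.0, 1.0))  # 2.0
```

Note that when the running estimate already matches the old gradient, the update returns the new gradient exactly, which is the mechanism behind the reduced variance.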



$$G_t^2\le\frac{3L_0^2\|y_t-y^*(x_t)\|^2}{\rho^2}+\frac{6\|w_t-\nabla f(x_t,y_t)\|^2}{\rho^2}+\frac{3}{\gamma^2}\|x_{t+1}-x_t\|^2$$


Consider the upper bound of the term $\langle\nabla_y g(x_t,y_t)-v_t,\,y^*(x_t)-\tilde{y}_{t+1}\rangle$. By plugging the inequalities (43) and (44) into (41), we obtain

