ON STABILITY AND GENERALIZATION OF BILEVEL OPTIMIZATION PROBLEM

Abstract

(Stochastic) bilevel optimization is a frequently encountered problem in machine learning with a wide range of applications such as meta-learning, hyper-parameter optimization, and reinforcement learning. Most existing studies on this problem focus only on analyzing the convergence or improving the convergence rate, while little effort has been devoted to understanding its generalization behavior. In this paper, we conduct a thorough analysis of the generalization of first-order (gradient-based) methods for the bilevel optimization problem. We first establish a fundamental connection between algorithmic stability and the generalization gap in different forms and give a high-probability generalization bound which improves the previous best one from $O(\sqrt{n})$ to $O(\log n)$, where $n$ is the sample size. We then provide the first stability bounds for the general case where both inner- and outer-level parameters are subject to continuous update, while existing work allows only the outer-level parameter to be updated. Our analysis can be applied in various standard settings such as strongly-convex-strongly-convex (SC-SC), convex-convex (C-C), and nonconvex-nonconvex (NC-NC). Our analysis for the NC-NC setting can also be extended to a particular nonconvex-strongly-convex (NC-SC) setting that is commonly encountered in practice. Finally, we corroborate our theoretical analysis and demonstrate how the number of iterations can affect the generalization gap through experiments on meta-learning and hyper-parameter optimization.

1. INTRODUCTION

(Stochastic) bilevel optimization is a widely encountered problem in machine learning with various applications such as meta-learning (Finn et al., 2017; Bertinetto et al., 2018; Rajeswaran et al., 2019), hyper-parameter optimization (Franceschi et al., 2018; Shaban et al., 2019; Baydin et al., 2017; Bergstra et al., 2011; Luketina et al., 2016), reinforcement learning (Hong et al., 2020), and few-shot learning (Koch et al., 2015; Santoro et al., 2016; Vinyals et al., 2016). The basic form of this problem can be defined as follows:

$$\min_{x \in \mathbb{R}^{d_1}} R(x) = F(x, y^*(x)) := \mathbb{E}_{\xi}\left[f(x, y^*(x); \xi)\right] \quad \text{s.t.} \quad y^*(x) = \arg\min_{y \in \mathbb{R}^{d_2}} \left\{G(x, y) := \mathbb{E}_{\zeta}\left[g(x, y; \zeta)\right]\right\}, \qquad (1)$$

where $f : \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}$ and $g : \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}$ are two continuously differentiable loss functions with respect to $x$ and $y$. Problem (1) has an optimization hierarchy of two levels, where the outer-level objective function $f$ depends on the minimizer of the inner-level objective function $g$. Due to its importance, the above bilevel optimization problem has received considerable attention in recent years. A natural way to solve Problem (1) is to apply alternating stochastic gradient updates that approximate $\nabla_y g(x, y)$ and $\nabla f(x, y)$, respectively. Briefly speaking, previous efforts mainly examined two types of methods for obtaining an approximate solution that is close to the optimum $y^*(x)$. One is the single-timescale strategy (Chen et al., 2021; Guo et al., 2021; Khanduri et al., 2021; Hu et al., 2022), where the updates for $y$ and $x$ are carried out simultaneously. The other is the two-timescale strategy (Ghadimi & Wang, 2018; Ji et al., 2021; Hong et al., 2020; Pedregosa, 2016), where the update of $y$ is repeated multiple times to achieve a more accurate approximation before conducting the update of $x$.

$O((\kappa_4)^K/m_1)$   $O((\kappa_4)^K/m_1)$   $O(T^{1-\kappa_5}K^{\kappa_5}/m_1)$   $O(T^{1-\kappa_6}K^{\kappa_6}/m_1)$
Table 1: Summary of main results. $\kappa_i$: a constant for all $i$ above; $T$: inner iterations; $K$: outer iterations; $m_1$: size of outer dataset. SSGD and TSGD stand for Algorithm 1 and Algorithm 2, the single-timescale and two-timescale methods, via stochastic gradient descent.

While there is a long list of work on bilevel optimization, most of the existing work focuses only on either analyzing its convergence behavior (Ghadimi & Wang, 2018; Hong et al., 2020; Ji et al., 2021) or improving its convergence rate, based on the convexity and the smoothness properties of $f(\cdot, \cdot)$ and/or $g(\cdot, \cdot)$ (Liu et al., 2020; Li et al., 2020). In contrast, little effort has been devoted to understanding the generalization behavior of the problem. To the best of our knowledge, there is only one recent work on the generalization analysis for bilevel problems (Bao et al., 2021), which presents the first expected uniform stability bound. However, there are still several undesirable issues in this work: (1) Their result is only for uniform stability (which can be deduced from argument stability under certain conditions; see Definition 4 for details), leaving the analysis of other, stronger definitions of algorithmic stability open. (2) The unrolled differentiation (UD) algorithm they analyze allows the outer-level parameters to be updated continuously but needs to reinitialize the inner-level parameters before each iteration in the inner loop, which is not commonly done in practice because it is inefficient (see line 4 in Algorithm 3). (3) The proof of Theorem 2 in their work does not make clear whether the update of the outer-level parameters depends on the inner-level parameters, so there may be a gap in the analysis of the UD algorithm (see Appendix E for detailed discussions). (4) Their experiments take only hyper-parameter optimization into consideration and neglect other applications of bilevel optimization.
To address all the aforementioned issues, we give in this paper a thorough analysis of the generalization behavior of first-order (gradient-based) methods for the general bilevel optimization problem. We employ recent advances in algorithmic stability to investigate the generalization behavior in different settings. Specifically, our main contributions can be summarized as follows:
• Firstly, we establish a fundamental connection between the generalization gap and different notions of algorithmic stability (argument stability and uniform stability) for any randomized bilevel optimization algorithm, in both expectation and high-probability forms. Specifically, we show that the high-probability form of the generalization gap bound can be improved from $O(\sqrt{n})$ to $O(\log n)$ compared with the result in Bao et al. (2021).
• Next, we present stability bounds for gradient-based methods with either the single-timescale or the two-timescale update strategy under different standard settings. To the best of our knowledge, this work provides the first stability bounds for two-timescale (double-loop) algorithms, which allow the accumulation of the sub-sampled gradients in the inner level. In detail, we consider the settings of strongly-convex-strongly-convex (SC-SC), convex-convex (C-C), and nonconvex-nonconvex (NC-NC), and further extend our analysis to a particular nonconvex-strongly-convex (NC-SC) setting that appears widely in practice. Table 1 summarizes our main results.
• Thirdly, we provide the first generalization bounds for the case where both the outer- and inner-level parameters are subject to continuous (iterative) changes. Compared to the previous work (Bao et al., 2021), our work does not need the reinitialization step before each iteration in the inner level, and hence our algorithm can carry over the last updated inner-level parameters, which is more general and practical.
• Finally, we conduct empirical studies to corroborate our theories via meta-learning and hyperparameter optimization, which are two applications of bilevel optimization.
Due to space limitations, all the proofs and additional experiments are included in the Appendix.

1.1. RELATED WORK

Research at the interface between generalization and the bilevel problem can be roughly classified into two categories. The first one includes all the research on bilevel optimization. In recent decades, extensive studies have been done on this topic, which suggests that bilevel optimization has a wide range of applications in machine learning such as hyper-parameter optimization (Franceschi et al., 2018; Lorraine & Duvenaud, 2018; Okuno et al., 2021), meta-learning (Bertinetto et al., 2018; Rajeswaran et al., 2019; Soh et al., 2020), and reinforcement learning (Yang et al., 2018; Tschiatschek et al., 2019). Most of the existing work studies the problem from an optimization perspective. For example, Ghadimi & Wang (2018) and Ji et al. (2021) provide a convergence rate analysis based on the nonconvex-strongly-convex assumption for the two functions $f(\cdot, \cdot)$ and $g(\cdot, \cdot)$. Grazzi et al. (2020) consider the iteration complexity of hypergradient computation. Liu et al. (2020) and Li et al. (2020) present an asymptotic analysis for the convex-strongly-convex setting. Perhaps the work most related to ours from the generalization standpoint (i.e., the expected difference between population risk and empirical risk) is Bao et al. (2021), although there may be a gap in its analysis of the UD algorithm. In this work, we employ a novel approach to examine the stability bounds of bilevel optimization problems. Firstly, our work analyzes the generalization behavior by observing directly how different settings affect the stability bounds. Secondly, our work adopts a stronger version of stability called argument stability, which implies the previously used uniform stability if the function is sufficiently smooth. Furthermore, our work does not need to reinitialize the inner-level parameters and allows them to carry over their last updated values each time the inner level is updated. This means that $y$ in the inner level is updated iteratively and depends on the current value of $x$, which is more common and efficient in practice.
The second category includes all the work on stability analysis. There is a long list of research on stability and generalization (Bousquet & Elisseeff, 2002; Mukherjee et al., 2006; Shalev-Shwartz et al., 2010). Bousquet & Elisseeff (2002) first introduce the notion of uniform stability and establish the first framework of stability analysis. Hardt et al. (2016) later extend the stability analysis to iterative algorithms based on stochastic gradient methods for vanilla stochastic optimization. After that, there are subsequent studies on generalization analysis for various problems via algorithmic stability, such as minmax problems (Lei et al., 2021; Farnia & Ozdaglar, 2021; Zhang et al., 2021) and pairwise learning (Yang et al., 2021; Lei et al., 2020; Xue et al., 2021; Huai et al., 2020). However, because of the additional stochastic function in the constraint of the bilevel problem, none of the previous techniques and results can be applied directly to our problem. Although the generalization analysis of minmax optimization is somewhat similar to ours, typical minmax optimization problems involve only one objective function $f$ and a single level in the algorithm, while bilevel optimization algorithms have an inner level and an outer level, which is considerably more challenging.

2.1. DEFINITIONS AND ASSUMPTIONS

In the following, we give some necessary definitions and assumptions that are widely used in bilevel optimization (Ghadimi & Wang, 2018; Ji et al., 2021; Khanduri et al., 2021) and generalization analysis (Hardt et al., 2016; Lei et al., 2021).
Definition 1 (Joint Lipschitz Continuity). A function $f(x, y)$ is jointly $L$-Lipschitz over $\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$ if for all $x, x' \in \mathbb{R}^{d_1}$ and $y, y' \in \mathbb{R}^{d_2}$ the following holds: $|f(x, y) - f(x', y')| \le L\sqrt{\|x - x'\|_2^2 + \|y - y'\|_2^2}$.
Definition 2 (Smoothness). A function $f$ is $l$-smooth over a set $S$ if for all $u, w \in S$ the following holds: $\|\nabla f(u) - \nabla f(w)\| \le l\|u - w\|$.
Definition 3 (Strong Convexity). A function $f$ is $\mu$-strongly-convex over a set $S$ if for all $u, w \in S$ the following holds: $f(u) + \langle \nabla f(u), w - u \rangle + \frac{\mu}{2}\|w - u\|^2 \le f(w)$.
Assumption 1 (Inner-level Function Assumption). We assume the inner stochastic function $g(x, y)$ in (1) satisfies the following: (i) $g(x, y)$ is jointly $L_g$-Lipschitz for any $x \in \mathbb{R}^{d_1}$ and $y \in \mathbb{R}^{d_2}$. (ii) $g(x, y)$ is continuously differentiable and $l_g$-smooth for any $(x, y) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$.
Assumption 2 (Outer-level Function Assumption). We assume the outer stochastic function $f(x, y)$ in (1) satisfies the following: (iii) $f(x, y)$ is jointly $L_f$-Lipschitz for any $x \in \mathbb{R}^{d_1}$ and $y \in \mathbb{R}^{d_2}$. (iv) $f(x, y)$ is continuously differentiable and $l_f$-smooth for any $(x, y) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$.

2.2. PROBLEM FORMULATION

Given two distributions $\mathcal{D}_1$ and $\mathcal{D}_2$, in the (stochastic) optimization problem we aim to find the minimizer of Problem (1). However, since the distributions are often unknown, in practice we only have two finite-size datasets $D_{m_1} = \{\xi_i \mid i = 1, \dots, m_1\} \sim \mathcal{D}_1^{m_1}$ and $D_{m_2} = \{\zeta_i \mid i = 1, \dots, m_2\} \sim \mathcal{D}_2^{m_2}$, where the $\xi_i$ and $\zeta_i$ are sampled i.i.d. from $\mathcal{D}_1$ and $\mathcal{D}_2$, respectively. Based on these datasets, we design a (randomized) algorithm $A$ with output $A(D_{m_1}, D_{m_2}) = (x, y) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$. Our goal is to investigate the generalization behavior of this output. Note that although there are two stochastic functions in the bilevel optimization problem, we only care about the generalization of the outer-level one, since it is the one we ultimately aim to minimize. Below we define the generalization gap to measure the generalization behavior.
Given a distribution $\mathcal{D}_1$ and a finite dataset $D_{m_1} \sim \mathcal{D}_1^{m_1}$, the population risk function of $x, y$ on $\mathcal{D}_1$ is defined as $R(x, y, \mathcal{D}_1) := \mathbb{E}_{\xi \sim \mathcal{D}_1}[f(x, y(x); \xi)]$, and its empirical risk function on $D_{m_1}$ is $R_s(x, y, D_{m_1}) = \frac{1}{m_1}\sum_{i=1}^{m_1} f(x, y(x); \xi_i)$. Moreover, for a fixed hyperparameter $x \in \mathbb{R}^{d_1}$ and $y(x) \in \mathbb{R}^{d_2}$ (note that $y(x)$ might depend on $x$), we define the difference between the population risk and the empirical risk at $(x, y(x))$ as the bilevel generalization gap of $(x, y(x))$: $\mathbb{E}_s[R(x, y) - R_s(x, y)]$, where $\mathbb{E}_s$ denotes the expectation over $D_{m_1} \sim \mathcal{D}_1^{m_1}$. When there is no ambiguity, we simplify the notation as follows: $R(x, y, \mathcal{D}_1) = R(x, y)$ and $R_s(x, y, D_{m_1}) = R_s(x, y)$. Our goal is thus to analyze the bilevel generalization gap of the output of algorithm $A(D_{m_1}, D_{m_2})$ based on $D_{m_1}$ and $D_{m_2}$. Since the generalization error depends on the algorithm itself, in the following we introduce the algorithms considered in this paper.
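To make the two risk functions concrete, the following is a minimal sketch that evaluates the empirical risk, the population risk, and their difference for a fixed pair (x, y). The quadratic loss `f`, the standard normal choice for the outer distribution, and all function names are assumptions made purely for illustration; they are not taken from the paper.

```python
import random


def f(x, y, xi):
    # Hypothetical outer loss, chosen only so that the population risk has a closed form.
    return 0.5 * (x + y - xi) ** 2


def empirical_risk(x, y, sample):
    # R_s(x, y): average loss over the finite outer dataset.
    return sum(f(x, y, xi) for xi in sample) / len(sample)


def population_risk(x, y):
    # R(x, y) when xi ~ N(0, 1): E[0.5 * (s - xi)^2] = 0.5 * (s^2 + 1) with s = x + y.
    return 0.5 * ((x + y) ** 2 + 1.0)


def generalization_gap(x, y, sample):
    # Population risk minus empirical risk at a fixed (x, y).
    return population_risk(x, y) - empirical_risk(x, y, sample)


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(20000)]
gap = generalization_gap(0.3, 0.2, sample)
print(f"gap at a fixed point: {gap:.4f}")
```

For a pair (x, y) chosen independently of the sample, this difference is pure sampling noise. The quantity studied in the paper is harder because (x, y) is the output of an algorithm that has seen the sample, which is where stability enters.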
Most of the existing algorithms adopt the following idea: first approximate $y^*$ on $D_{m_2}$ for a given parameter $x$ in the inner level, and then seek the hyperparameter $x^*(D_{m_1}, D_{m_2})$ with corresponding hypothesis $y^*(x^*(D_{m_1}, D_{m_2}), D_{m_2})$ through the following estimation:

$$x(D_{m_1}, D_{m_2}) \approx \arg\min_x R_s(x, \hat{y}(x, D_{m_2}), D_{m_1}), \quad \text{where } \hat{y}(x, D_{m_2}) \approx \arg\min_y G_s(x, y, D_{m_2}), \qquad (2)$$

where $G_s(x, y, D_{m_2})$ is the empirical risk of $G(x, y)$ over $D_{m_2}$, i.e., $G_s(x, y, D_{m_2}) = \frac{1}{m_2}\sum_{i=1}^{m_2} g(x, y(x); \zeta_i)$. Most of the current gradient-based (first-order) algorithms for approximating (2) can be categorized into two classes: single-timescale methods and two-timescale methods. The single-timescale method performs the updates for $y$ and $x$ simultaneously via stochastic gradient descent (SGD), while the two-timescale method updates $y$ multiple times before updating $x$ (also via SGD). As there are numerous approaches in both classes (see the Related Work section for details), in this paper we analyze the generalization behavior of the most classical and standard one in each class, i.e., single-timescale SGD (SSGD; Algorithm 1) and two-timescale SGD (TSGD; Algorithm 2). A long list of work builds on either SSGD (Chen et al., 2021) or TSGD (Ghadimi & Wang, 2018; Ji et al., 2021).
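The difference between the two update schedules can be sketched on a toy scalar problem. The losses below, the use of the partial derivative with respect to x as the outer gradient, and every function name are assumptions made for illustration only; the sketch mirrors the single-loop and double-loop structure described above rather than any specific implementation.

```python
import random


def grad_y_g(x, y, zeta):
    # Hypothetical inner loss g(x, y; zeta) = 0.5 * (y - x * zeta)^2.
    return y - x * zeta


def grad_x_f(x, y, xi):
    # Hypothetical outer loss f(x, y; xi) = 0.5 * (x + y - xi)^2.
    # Assumption: the outer step uses the partial derivative in x only.
    return x + y - xi


def ssgd(outer_data, inner_data, K, alpha_x, alpha_y, x0=0.0, y0=0.0, seed=0):
    """Single-timescale SGD: one y step and one x step per iteration."""
    rng = random.Random(seed)
    x, y = x0, y0
    for _ in range(K):
        zeta = rng.choice(inner_data)
        xi = rng.choice(outer_data)
        # Both updates are computed from the same iterate (x_k, y_k).
        y_next = y - alpha_y * grad_y_g(x, y, zeta)
        x_next = x - alpha_x * grad_x_f(x, y, xi)
        x, y = x_next, y_next
    return x, y


def tsgd(outer_data, inner_data, K, T, alpha_x, alpha_y, x0=0.0, y0=0.0, seed=0):
    """Two-timescale SGD: T inner y steps, carried over between outer steps."""
    rng = random.Random(seed)
    x, y = x0, y0
    for _ in range(K):
        for _ in range(T):
            zeta = rng.choice(inner_data)
            y = y - alpha_y * grad_y_g(x, y, zeta)
        xi = rng.choice(outer_data)
        x = x - alpha_x * grad_x_f(x, y, xi)
    return x, y


# With constant data the toy problem has the fixed point x = y = 1.
xs, ys = ssgd([2.0], [1.0], K=500, alpha_x=0.1, alpha_y=0.1)
xt, yt = tsgd([2.0], [1.0], K=500, T=5, alpha_x=0.1, alpha_y=0.1)
print(xs, ys, xt, yt)
```

Note that `tsgd` never resets `y` inside the outer loop, which is the carry-over behavior that distinguishes it from a scheme that reinitializes the inner variable before every inner loop.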

3. GENERALIZATION AND STABILITY FOR BILEVEL OPTIMIZATION

Algorithmic stability is one of the classical approaches to analyzing the generalization bound of algorithms. Roughly speaking, the algorithmic stability of a (randomized) algorithm $A$ measures how the output of $A$ changes if we change one data sample in the input dataset. While there are various notions of stability, most of the existing work on analyzing the stability of stochastic optimization, pairwise learning, and minimax optimization focuses on uniform stability (Bousquet & Elisseeff, 2002) and argument stability (Liu et al., 2017; Lei & Ying, 2020). Thus, we also adopt these two notions of stability for the bilevel optimization problem. Briefly speaking, uniform stability focuses on the resulting change in the population risk function, while argument stability considers the resulting change in the arguments, i.e., the output of the algorithm.

Algorithm 1 Single-timescale SGD (SSGD)
1: Input: number of iterations $K$, step sizes $\alpha_x, \alpha_y$, initialization $x_0, y_0$, datasets $D_{m_1}$ and $D_{m_2}$
2: Output: $x_K, y_K$
3: for $k = 0$ to $K - 1$ do
4:   Uniformly sample $i \in [m_2]$, $j \in [m_1]$
5:   $y_{k+1} = y_k - \alpha_y \nabla_y g(x_k, y_k(x_k); \zeta_i)$
6:   $x_{k+1} = x_k - \alpha_x \nabla f(x_k, y_k(x_k); \xi_j)$
7: end for
8: return $x_K$ and $y_K$

Algorithm 2 Two-timescale SGD (TSGD)
1: Input: number of iterations $K$, step sizes $\alpha_x, \alpha_y$, initialization $x_0, y_0$
2: Output: $x_K, y_K$
3: for $k = 0$ to $K - 1$ do
4:   $y_k^0 \leftarrow y_{k-1}^T$
5:   for $t = 0$ to $T - 1$ do
6:     Uniformly sample $i \in [m_2]$
7:     $y_k^{t+1} = y_k^t - \alpha_y \nabla_y g(x_k, y_k^t(x_k); \zeta_i)$
8:   end for
9:   Uniformly sample $j \in [m_1]$
10:  $x_{k+1} = x_k - \alpha_x \nabla f(x_k, y_k^T(x_k); \xi_j)$
11: end for
12: return $x_K, y_K^T$

Definition 4 (Algorithmic Stability). Let $A : \mathcal{D}_1^{m_1} \times \mathcal{D}_2^{m_2} \to \mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$ be a randomized algorithm.
(a) $A$ is $\beta$-uniformly-stable if for all datasets $D_{m_1}, D'_{m_1} \sim \mathcal{D}_1^{m_1}$ and $D_{m_2} \sim \mathcal{D}_2^{m_2}$ such that $D_{m_1}$ and $D'_{m_1}$ differ in at most one sample, we have the following for any $\xi \sim \mathcal{D}_1$: $\mathbb{E}_A[|f(A(D_{m_1}, D_{m_2}), \xi) - f(A(D'_{m_1}, D_{m_2}), \xi)|] \le \beta$. $A$ is $\beta$-uniformly-stable with probability at least $1 - \delta$ if we have the following for any $\xi \sim \mathcal{D}_1$ with probability at least $1 - \delta$: $f(A(D_{m_1}, D_{m_2}), \xi) - f(A(D'_{m_1}, D_{m_2}), \xi) \le \beta$.
(b) $A$ is $\beta$-argument-stable in expectation if for all datasets $D_{m_1}, D'_{m_1} \sim \mathcal{D}_1^{m_1}$ and $D_{m_2} \sim \mathcal{D}_2^{m_2}$ such that $D_{m_1}$ and $D'_{m_1}$ differ in at most one sample, we have: $\mathbb{E}_A[\|A(D_{m_1}, D_{m_2}) - A(D'_{m_1}, D_{m_2})\|_2] \le \beta$.

Note that the definition of uniform stability in expectation is the same as the definition in Bao et al. (2021). Thus, our other definitions can be considered as extensions of the previous notion of stability for bilevel optimization. In the following, we present Theorem 1 as our first result, which shows a crucial relationship between the generalization gap and the algorithmic stability of an algorithm $A$.

Theorem 1. Let $A : \xi^{m_1} \times \zeta^{m_2} \to \mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$ be a randomized bilevel optimization (BO) algorithm.
(a) If $A$ is $\beta$-uniformly-stable in expectation, then the following holds for $D_{m_1} \sim \mathcal{D}_1^{m_1}$, $D_{m_2} \sim \mathcal{D}_2^{m_2}$: $\mathbb{E}_{A, D_{m_1}}[R(A(D_{m_1}, D_{m_2})) - R_s(A(D_{m_1}, D_{m_2}))] \le \beta$.
(b) If $A$ is $\beta$-argument-stable in expectation and Assumption 2 holds, then the following holds for $D_{m_1} \sim \mathcal{D}_1^{m_1}$, $D_{m_2} \sim \mathcal{D}_2^{m_2}$: $\mathbb{E}_{A, D_{m_1}}[R(A(D_{m_1}, D_{m_2})) - R_s(A(D_{m_1}, D_{m_2}))] \le L_f \beta$.
(c) Assume that $|f(x, y; \xi)| \le M$ for some $M \ge 0$. If $A$ is $\beta$-uniformly-stable almost surely, then for $D_{m_1} \sim \mathcal{D}_1^{m_1}$, $D_{m_2} \sim \mathcal{D}_2^{m_2}$, the following holds with probability $1 - \delta$:

$$|R(A(D_{m_1}, D_{m_2})) - R_s(A(D_{m_1}, D_{m_2}))| \le 2\beta + e\left(\frac{4M}{\sqrt{m_1}}\sqrt{\log\frac{e}{\delta}} + 12\sqrt{2}\,\beta\,\lceil \log_2 m_1 \rceil \log\frac{e}{\delta}\right),$$

where $e$ is the base of the natural logarithm.

Remark 1. The above theorem suggests that the generalization gap can be controlled by several notions of algorithmic stability. Parts (a) and (b) show that the expected generalization gap can be bounded by uniform stability and by argument stability together with the Lipschitz constant, respectively. Part (c) indicates that the generalization gap of the algorithm is no more than $O(\beta \log(m_1) + 1/\sqrt{m_1})$ with probability $1 - \delta$. Compared with the existing work (Bao et al., 2021), Theorem 1 additionally considers argument stability, which is a stronger notion of stability than uniform stability (since uniform stability can be deduced from argument stability under the condition that the function is sufficiently smooth). Moreover, we use McDiarmid's inequality and the equivalence of tails and moments for random variables with a mixture of sub-gaussian and sub-exponential tails (Lemma 1 in Bousquet et al. (2020)), which yields a significantly improved high-probability bound in Part (c) (i.e., improving from $O(\beta\sqrt{m_1})$ in Bao et al. (2021) to $O(\beta \log m_1)$).
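Argument stability can also be probed empirically: run the same randomized algorithm, with the same random seed, on two outer datasets that differ in one sample and measure the Euclidean distance between the outputs. The sketch below does this for a toy double-loop SGD; the scalar losses, the helper names `toy_tsgd`, `argument_distance`, and `mean_argument_distance`, and the Gaussian data are all assumptions for illustration and not part of the paper's experiments.

```python
import math
import random


def toy_tsgd(outer_data, inner_data, K, T, alpha, seed):
    # Toy double-loop SGD on hypothetical scalar losses:
    #   g(x, y; zeta) = 0.5 * (y - x * zeta)^2,  f(x, y; xi) = 0.5 * (x + y - xi)^2.
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for _ in range(K):
        for _ in range(T):
            zeta = rng.choice(inner_data)
            y -= alpha * (y - x * zeta)
        xi = rng.choice(outer_data)
        x -= alpha * (x + y - xi)
    return x, y


def argument_distance(outer, outer_prime, inner, K, T, alpha, seed):
    # Same seed on both runs, so the sampled indices coincide and only the data differ.
    x, y = toy_tsgd(outer, inner, K, T, alpha, seed)
    xp, yp = toy_tsgd(outer_prime, inner, K, T, alpha, seed)
    return math.hypot(x - xp, y - yp)


def mean_argument_distance(m1, K, T, alpha, trials=50):
    # Monte Carlo estimate of E_A ||A(D) - A(D')||_2 over random neighbouring datasets.
    data_rng = random.Random(123)
    total = 0.0
    for seed in range(trials):
        outer = [data_rng.gauss(0.0, 1.0) for _ in range(m1)]
        inner = [data_rng.gauss(1.0, 0.1) for _ in range(m1)]
        outer_prime = list(outer)
        outer_prime[0] = data_rng.gauss(0.0, 1.0)  # replace a single outer sample
        total += argument_distance(outer, outer_prime, inner, K, T, alpha, seed)
    return total / trials


for m1 in (10, 100, 1000):
    print(m1, mean_argument_distance(m1, K=50, T=2, alpha=0.05))
```

Such an estimate is only a sanity check for a particular problem instance; the definition requires the bound to hold for every pair of neighbouring datasets.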

4. STABILITY ANALYSIS FOR BILEVEL OPTIMIZATION ALGORITHMS

Motivated by Theorem 1, we can see that to analyze the generalization behaviors for any algorithm, it is sufficient to analyze its stability. As mentioned in the previous Section 2.2, we will consider the stability of SSGD and TSGD. For simplicity we let SC-SC denote the case where f and g both are strongly convex functions. C-C, NC-NC, and NC-SC are also denoted in a similar manner with "C" representing convex function and "NC" representing nonconvex function.

4.1. STABILITY BOUNDS FOR SINGLE-TIMESCALE SGD

As we can see from Algorithm 1, SSGD updates $y$ and $x$ simultaneously. In the following we develop stability bounds for this algorithm in different settings.
Theorem 2. Suppose that Assumptions 1 and 2 hold and Algorithm $A$ is SSGD with $K$ iterations.
(a) Assume that Problem (1) is SC-SC with strong convexity parameters $\mu_f$ and $\mu_g$. Let $\alpha_x = \alpha_y$ (see Lemma 9 for details) be the step sizes. Denote $l = \max\{l_f, l_g\}$. Then, $A$ is $\beta$-argument-stable in expectation, where $\beta \le O\left(\frac{(L_f^2 + L_g^2)^{1/2}}{m_1}\left[\mu_f + \mu_g - (\alpha_x l)^2/2 + 0.25\right]^{-1}\right)$.
(b) Assume that Problem (1) is C-C. Let $\alpha_x, \alpha_y$ be the step sizes. Then, $A$ is $\beta$-argument-stable in expectation, where $\beta \le O\left(m_1^{-1}\left((\alpha_x L_f)^2 + (\alpha_y L_g)^2\right)\left(2 + 2\max\{(\alpha_x l_f)^2, (\alpha_y l_g)^2\}\right)^{K/2}\right)$.
(c) Assume that Problem (1) is NC-NC. Let the step sizes satisfy $\max\{\alpha_x, \alpha_y\} \le c/k$ for some constant $c \ge 0$ and let $l = \max\{l_f, l_g\}$. Then, $A$ is $\beta$-argument-stable in expectation, where $\beta \le O\left((m_1 c l)^{-1}\left(2cL_f(l_f^2 + l_g^2)\right)^{\frac{1}{cl+1}} \cdot K^{\frac{cl}{cl+1}}\right)$.
Here $l_f, l_g$ and $L_f, L_g$ are the smoothness constants and Lipschitz constants of $f$ and $g$, respectively.
Remark 2. Note that the above stability bounds are independent of the specific form of the objective function $f(\cdot, \cdot)$ and of the exact form of the sample distribution $\mathcal{D}_1$; they depend instead on the properties of the loss functions and on the sample size $m_1$, and in the C-C and NC-NC cases additionally on the number of iterations. Specifically, Part (a) establishes a stability bound of $O(1/m_1)$ in the SC-SC setting, and Part (b) considers the C-C case with a stability bound of $O(\kappa_1^{K/2}/m_1)$, which depends on the number of iterations and the data size, where $\kappa_1$ is a constant. The NC-NC case is discussed in Part (c), which provides a stability bound of $O(K^{\frac{cl}{cl+1}}/m_1)$, where $c$ is a constant that controls the step size and $l$ is the larger of the smoothness constants $l_f$ and $l_g$.
The conclusions here match the existing results in minmax problems (Lei et al., 2021; Farnia & Ozdaglar, 2021) .
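To get a feel for how the rates quoted in Remark 2 scale, the sketch below evaluates their leading-order terms with all hidden constants dropped. The function names and the parameter values are hypothetical, and the numbers carry no meaning beyond showing the growth in $K$ and the decay in $m_1$.

```python
def cc_stability_rate(K, m1, kappa1):
    # Leading-order C-C rate kappa1^(K/2) / m1 (constants dropped).
    return kappa1 ** (K / 2.0) / m1


def ncnc_stability_rate(K, m1, c, l):
    # Leading-order NC-NC rate K^(cl/(cl+1)) / m1 (constants dropped).
    exponent = (c * l) / (c * l + 1.0)
    return K ** exponent / m1


# Illustrative values only: c = l = 1 gives a square-root dependence on K.
for K in (10, 100, 1000):
    print(K, ncnc_stability_rate(K, m1=1000, c=1.0, l=1.0),
          cc_stability_rate(K, m1=1000, kappa1=1.01))
```

The NC-NC expression grows sublinearly in $K$ because its exponent $cl/(cl+1)$ is below one, whereas the C-C expression grows geometrically whenever $\kappa_1 > 1$.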

4.2. STABILITY BOUNDS FOR TWO-TIMESCALE SGD

Compared with SSGD above, two-timescale SGD (TSGD; Algorithm 2) typically achieves more accurate approximate solutions by updating $y$ multiple times before updating $x$. In this section, we extend our analysis from SSGD to TSGD. In particular, compared with the results in Bao et al. (2021), we provide stability bounds in Theorem 3 for the case where the inner-level parameter $y$ is updated iteratively (i.e., consistently). We further explore in Theorem 4 a particular NC-SC setting, which commonly appears in bilevel optimization applications such as meta-learning and hyperparameter optimization.
Theorem 3. Suppose that Assumptions 1 and 2 hold and $|g(\cdot, \cdot)| \le 1$. Let $A$ be the TSGD algorithm with $K$ outer iterations and $T$ inner iterations.
(a) Assume that Problem (1) is SC-SC. Let $l = \max\left\{l_f, \frac{1 + (\alpha_y l_g)^2}{(1 - \alpha_y l_g)\alpha_y}\right\}$ and let $\alpha = \alpha_x = \alpha_y \le \min\{1/l_g, 1/(\mu_f + \mu_g)\}$ be the step sizes. Then, $A$ is $\beta$-argument-stable in expectation, where $\beta \le O\left(m_1^{-1}\left[L_f^2\alpha_x^2 + 2T\alpha_y(2 - \alpha_y l_g)^2\right](1 + \alpha l)^K\right)$.
(b) Assume that Problem (1) is C-C. Let $\alpha l = \max\left\{\alpha_x l_f, \frac{1 + (\alpha_y l_g)^2}{1 - \alpha_y l_g}\right\}$ and let $\alpha_x, \alpha_y \le \frac{1}{l_g}$ be the step sizes. Then, $A$ is $\beta$-argument-stable in expectation, where $\beta \le O\left(m_1^{-1}\left[L_f^2\alpha_x^2 + 2T\alpha_y(2 - \alpha_y l_g)^2\right](1 + \alpha l)^K\right)$.
(c) Assume that Problem (1) is NC-NC. Let the step sizes satisfy $\max\{\alpha_x, \alpha_y\} \le c/k$ for some constant $c \ge 0$ and let $l = \max\{l_f, l_g\}$. Then, $A$ is $\beta$-argument-stable in expectation, where $\beta \le O\left((m_1 T c l)^{-1}\left(2cL_f(l_f^2 + T^2 l_g^2)\right)^{\frac{1}{Tcl+1}} \cdot K^{\frac{Tcl}{Tcl+1}}\right)$.
Remark 3. Compared with the previous results for SSGD, the stability bounds of TSGD depend on the number of iterations in the outer-level loop, the number of iterations in the inner-level loop, and the data size in the outer-level loop. If the step sizes are sufficiently small, the bounds in Theorem 3 are asymptotically the same as the bounds for SSGD in Theorem 2. Thus, Theorem 3 can be considered a generalization of the previous one.
The dependence on $T$ also distinguishes our results from existing stability analyses for other problems, such as simple SGD and minmax problems. To the best of our knowledge, this work provides the first stability bounds for two-timescale (double-loop) algorithms, which allow the accumulation of the sub-sampled gradients in the inner level.
Remark 4. Comparing our results with the ones in Bao et al. (2021), we have the following observations. 1) They only established a uniform stability bound for the Unrolled Differentiation (UD) algorithm (Algorithm 3), where the inner variable is reinitialized each time the inner-level loop is entered. This means that their analysis takes into account the changes to only one parameter, in the outer-level loop, while our algorithm considers the updates of both parameters. 2) Their proof needs to assume that the update of $y$ in the inner level after the reinitialization is not affected by the value specified for $x$. However, this assumption is quite uncommon and is probably the reason that they do not need to make any assumption on the inner-level objective function (see Appendix E for details). In contrast, our work allows the inner-level parameters to be updated consistently (i.e., carrying over the value from the last update), instead of being reinitialized each time the inner-level loop is entered. Specifically, we allow $y_k^T$ to be employed at the beginning of the $(k+1)$-th outer-level iteration, rather than $y_0$. This enables us to obtain different stability bounds for different inner-level objective functions from a novel perspective.
In the following, we extend our analysis to a particular NC-SC setting that is frequently encountered in real-world applications and optimization analysis.
Theorem 4. Suppose that Assumptions 1 and 2 hold, $0 \le f(\cdot, \cdot) \le 1$, and Problem (1) is NC-SC. Let $A$ be the TSGD algorithm with $K$ outer iterations and $T$ inner iterations, with $\max\{\alpha_x, \alpha_y\} \le c/k$ for a constant $c \ge 0$. Denote $l = \max\{l_f, l_g\}$.
Then, $A$ is $\beta$-uniformly-stable in expectation, where

$$\beta \le O\left(\frac{\left(2cL_f(l_f^2 + l_g^2T^2)\right)^{\frac{1}{c(Tl + l - \mu_g) + 1}} \cdot K^{\frac{c(Tl + l - \mu_g)}{c(Tl + l - \mu_g) + 1}}\,(Tl + l - \mu_g + 2/c)}{m_1(Tl + l - \mu_g)}\right).$$

Remark 5. We now sketch the technical differences from our previous analysis. Here we bound the term $(\delta_{x,k}, \delta_{y,k})^\top = (\|x_k - x'_k\|, \|y_k - y'_k\|)^\top$, while we employ $\delta_k = \sqrt{\|x_k - x'_k\|_2^2 + \|y_k - y'_k\|_2^2}$ in the previous analysis, where $(x_k, y_k)$ and $(x'_k, y'_k)$ are the outputs of TSGD after $k$ iterations for $D_{m_1}$ and $D'_{m_1}$, respectively, with $D_{m_1}$ and $D'_{m_1}$ differing in one sample. In the NC-SC setting, we show that $(\delta_{x,k+1}, \delta_{y,k+1})^\top \le ((1 + \alpha_x l)\delta_{x,k}, (1 + \alpha_x T l)\delta_{y,k})^\top$ (where $\le$ denotes the entry-wise inequality), which means that this term can be controlled. Then, we take its expectation to derive our uniform stability bound. To bound the generalization gap over continuously changing parameters, it is imperative to take into account the growth of $(\delta_{x,k}, \delta_{y,k})$ instead of only $\delta_{x,k}$ as in Bao et al. (2021). Appendix C.3 provides more details.
Based on our previous results, we now provide the first generalization bounds in the NC-NC setting for both SSGD and TSGD.
Corollary 5. Assume that the problem is NC-NC, $|f(\cdot, \cdot; \xi)| \le 1$ for all $\xi$, and Assumptions 1 and 2 hold. Denote $l = \max\{l_f, l_g\}$, with $\max\{\alpha_x, \alpha_y\} \le c/k$ for a constant $c \ge 0$. Then, the generalization gap of SSGD (Algorithm 1) with $K$ iterations is bounded by $O(K^{\frac{cl}{cl+1}}/m_1)$.
Corollary 6. Assume that the problem is NC-NC, $|f(\cdot, \cdot; \xi)| \le 1$ for all $\xi$, and Assumptions 1 and 2 hold. Let $l = \max\{l_f, l_g\}$, with $\max\{\alpha_x, \alpha_y\} \le c/k$. Then, the generalization gap of TSGD (Algorithm 2) with $K$ outer iterations and $T$ inner iterations is bounded by $O\left(T^{\frac{1}{Tcl+1}} K^{1 - \frac{1}{Tcl+1}}/m_1\right)$.
Remark 6. The above corollaries on the generalization gap follow from the stability bounds by combining Theorems 1, 2, and 3.
Corollary 5 and Corollary 6 show that an extremely large number of iterations ($K$ for SSGD; $K$ and $T$ for TSGD) drastically reduces the stability of these algorithms and increases the generalization gap, and hence the risk of overfitting. We also verify this in the experiments below.
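The joint effect of $K$ and $T$ in Corollary 6 can be explored numerically. The sketch below evaluates the leading-order expression $T^{1/(Tcl+1)} K^{1 - 1/(Tcl+1)}/m_1$ with hidden constants dropped; the function name and the parameter values are hypothetical and serve only to illustrate the trend.

```python
def tsgd_gap_rate(K, T, m1, c, l):
    # Leading-order TSGD rate T^e * K^(1 - e) / m1 with e = 1 / (T*c*l + 1).
    # Hidden constants are dropped, so only relative comparisons are meaningful.
    e = 1.0 / (T * c * l + 1.0)
    return T ** e * K ** (1.0 - e) / m1


# Illustrative values only: as T grows, the exponent of K approaches 1.
for T in (1, 2, 4, 8):
    row = [round(tsgd_gap_rate(K, T, m1=1000, c=1.0, l=1.0), 4) for K in (10, 100, 1000)]
    print(T, row)
```

With $T = 1$ the expression reduces to the SSGD rate $K^{cl/(cl+1)}/m_1$ of Corollary 5, which matches the observation that a single inner step recovers the single-timescale schedule at the level of rates.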

5. EXPERIMENTS

In this section, we empirically validate our previous theoretical results on real-world datasets. Two experiments, meta-learning and hyperparameter optimization, are conducted via TSGD (Algorithm 2); note that when $T = 1$, TSGD is just SSGD. Due to space limitations, we present only the meta-learning experiment here and defer the hyperparameter optimization experiment and other details to Appendix D.

5.1. META LEARNING

Consider the few-shot meta-learning problem with $M$ tasks $\{\mathcal{T}_i, i = 1, \dots, M\}$ sampled from a distribution $P_{\mathcal{T}}$. We aim to learn a model that can rapidly adapt to different tasks. Firstly, the embedding model $\phi$ is shared by all tasks to learn embedded features. Secondly, the task-specific parameter $w_i$ adapts the shared embedding to its own sub-problem. Thus, the overall problem of meta-learning can be formulated as follows:

$$\min_{\phi} L_D(\phi, w^*) = \mathbb{E}_{\xi \in D_i^{te}, \mathcal{T}_i}\left[L(\phi, w_i^*; \xi)\right], \qquad (3a)$$
$$\text{s.t.} \quad w^* = \arg\min_{w} L_{D^{tr}}(\phi, w) = \mathbb{E}_{\mathcal{T}_i}\left[L_{D_i^{tr}}(\phi, w_i)\right], \qquad (3b)$$

where $D_i^{tr}$ and $D_i^{te}$ are the training and testing datasets for task $\mathcal{T}_i$. Each $w_i$ is computed from one or more gradient descent updates from $w$ on the corresponding task (rapid adaptation), i.e., $w_i = w - \alpha\nabla_w L_{D^{tr}}(\phi, w_i)$. In the inner level, the base learner optimizes the series of $w_i$ for each task (Equation 3b). In the outer level, the meta-learner optimizes the embedding model $\phi$ using the minimizers $w_i^*$ learned from the inner level and computes the loss on the testing dataset (Equation 3a).
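The rapid-adaptation step and the resulting outer objective can be sketched with a deliberately tiny model. In the sketch below the embedding `phi` and the task-specific parameter `w` are scalars, the prediction is `w * phi * x`, and the loss is a mean squared error; these modelling choices and all function names are assumptions for illustration and do not reproduce the network used in the experiments.

```python
def task_loss(phi, w, data):
    # Mean squared error of the scalar model x -> w * (phi * x) on (input, target) pairs.
    return sum(0.5 * (w * phi * x - t) ** 2 for x, t in data) / len(data)


def grad_w(phi, w, data):
    # Derivative of task_loss with respect to the task-specific parameter w.
    return sum((w * phi * x - t) * phi * x for x, t in data) / len(data)


def adapt(phi, w, train, alpha, steps=1):
    # Inner level: one or more gradient steps from the shared initialization w.
    w_i = w
    for _ in range(steps):
        w_i = w_i - alpha * grad_w(phi, w_i, train)
    return w_i


def meta_objective(phi, w, tasks, alpha, steps=1):
    # Outer level: average test loss after adapting on each task's training split.
    total = 0.0
    for train, test in tasks:
        total += task_loss(phi, adapt(phi, w, train, alpha, steps), test)
    return total / len(tasks)


# Two hypothetical tasks, each given as (training split, testing split).
tasks = [
    ([(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]),
    ([(1.0, -1.0), (2.0, -2.0)], [(3.0, -3.0)]),
]
print(meta_objective(phi=1.0, w=0.0, tasks=tasks, alpha=0.1, steps=3))
```

The meta-learner would then update `phi` to decrease `meta_objective`, while `steps` plays the role of the number of inner iterations.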

Settings and Implementation

We evaluate the behavior of the 5-way-1-shot task on the Omniglot dataset (Lake et al., 2015), i.e., the task is to classify 5 unseen classes from only 1 labeled sample per class. The dataset contains 1623 different handwritten characters from 50 different alphabets. Each image is greyscale with a size of 28 × 28. We follow settings similar to those in Ji et al. (2021). A five-layer fully-connected network is constructed, where the task-specific parameter $w_i$ corresponds to the last layer of the network.

Results Evaluation

Figure 1 presents the learning curves on the training set and the testing set, together with the generalization gap, for different values of the inner iterations $T$ and outer iterations $K$. The generalization gap is estimated by the difference between the training and testing losses. On the one hand, the testing loss shows that the model easily overfits as $K$ increases drastically (Figure 1b), while the effect of $T$ is very limited. On the other hand, for an appropriate value of $K$, a smaller $T$ results in underfitting ($T = 1$ in Figure 1c causes the highest generalization gap because of the underfitting training process). The trend of the generalization gap in terms of $K$ and $T$ indicates that large iteration numbers increase the risk of overfitting, which matches our analysis in Theorem 4 that the stability of TSGD (Algorithm 2) decreases drastically.

6. CONCLUSION

We give a thorough analysis of the generalization of first-order (gradient-based) methods for the bilevel optimization framework. In particular, we establish a quantitative connection between generalization and algorithmic stability, and provide the first generalization bounds for the case where inner and outer parameters are both updated continuously, in multiple settings. Our experiments suggest that an inappropriate number of iterations easily causes underfitting or overfitting. The trend of the generalization gap also validates our theoretical results. We only discussed first-order methods in the previous sections, while there exist a number of approaches that estimate second-order information or use momentum to solve the bilevel optimization problem. Dealing with the approximation of the hypergradient in the generalization analysis is another direction for future work.

A COMPARISON BETWEEN UD AND TSGD

Algorithm 3 Unrolled differentiation (UD)
Input: number of iterations $K$, step sizes $\alpha_x$, $\alpha_y$, initialization $x_0$, $y_0$
Output: $x_K$, $y_K$
for $k = 0$ to $K - 1$ do
    $y_k^0 \leftarrow y_0$
    for $t = 0$ to $T - 1$ do
        $y_k^{t+1} = y_k^t - \alpha_y \nabla_y g(x_k, y_k^t(x_k); D_{m_2})$
    end for
    $x_{k+1} = x_k - \alpha_x \nabla f(x_k, y_k^T(x_k); D_{m_1})$
end for
return $x_K$, $y_K^T$

Algorithm 4 Two-timescale SGD (TSGD)
Input: number of iterations $K$, step sizes $\alpha_x$, $\alpha_y$, initialization $x_0$, $y_0$, datasets $D_{m_1}$, $D_{m_2}$
Output: $x_K$, $y_K$
for $k = 0$ to $K - 1$ do
    $y_k^0 \leftarrow y_{k-1}^T$
    for $t = 0$ to $T - 1$ do
        $y_k^{t+1} = y_k^t - \alpha_y \nabla_y g(x_k, y_k^t(x_k); D_{m_2})$
    end for
    $x_{k+1} = x_k - \alpha_x \nabla f(x_k, y_k^T(x_k); D_{m_1})$
end for
return $x_K$, $y_K^T$

B PROOF OF PRELIMINARIES

B.1 THE PROOF OF THEOREM 1

Proof of Part (a). Since $\xi$ and $\xi_i$ are drawn from the same distribution, we know

$$\begin{aligned}
&\mathbb{E}_A\left[R(A(D_{m_1}, D_{m_2}), D_1) - R_s(A(D_{m_1}, D_{m_2}), D_{m_1})\right] \\
&= \mathbb{E}_{A,\, \xi_i \in D_{m_1},\, \xi \sim D_1}\left[f(A(D_{m_1}, D_{m_2}), \xi) - f(A(D_{m_1}, D_{m_2}), \xi_i)\right] \\
&= \mathbb{E}_{A,\, \xi_i \in D_{m_1},\, \xi \sim D_1}\left[f(A(\xi, \xi_2, \dots, \xi_{i-1}, \xi_{i+1}, \dots, \xi_{m_1}, D_{m_2}), \xi_i) - f(A(D_{m_1}, D_{m_2}), \xi_i)\right] \\
&= \mathbb{E}_{A,\, \xi_i \in D_{m_1},\, \xi \sim D_1}\left[f(A(D'_{m_1}, D_{m_2}), \xi_i) - f(A(D_{m_1}, D_{m_2}), \xi_i)\right] \le \beta,
\end{aligned}$$

where $D'_{m_1}$ and $D_{m_1}$ differ in at most one sample $\xi_i$.

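Proof of Part (b). Similarly, we have

$$\begin{aligned}
&\mathbb{E}_A\left[f(A(D_{m_1}, D_{m_2}), D_1) - f(A(D_{m_1}, D_{m_2}), D_{m_1})\right] \\
&= \mathbb{E}_{A,\, \xi_i \in D_{m_1},\, \xi \sim D_1}\left[f(A(D_{m_1}, D_{m_2}), \xi) - f(A(D_{m_1}, D_{m_2}), \xi_i)\right] \\
&= \mathbb{E}_{A,\, \xi_i \in D_{m_1},\, \xi \sim D_1}\left[f(A(\xi, \xi_2, \dots, \xi_{i-1}, \xi_{i+1}, \dots, \xi_{m_1}, D_{m_2}), \xi_i) - f(A(D_{m_1}, D_{m_2}), \xi_i)\right] \\
&= \mathbb{E}_{A,\, \xi_i \in D_{m_1},\, \xi \sim D_1}\left[f(A(D'_{m_1}, D_{m_2}), \xi_i) - f(A(D_{m_1}, D_{m_2}), \xi_i)\right] \\
&\le \mathbb{E}_{A,\, \xi_i \in D_{m_1},\, \xi \sim D_1}\left[L_f \left\|A(D'_{m_1}, D_{m_2}) - A(D_{m_1}, D_{m_2})\right\|\right] \le L_f \beta.
\end{aligned}$$

To prove high probability bounds, we need the following lemma on the concentration behavior of sums of weakly dependent random variables.

Lemma 7 (Bousquet et al. 2020). Let $Z = (Z_1, \dots, Z_n)$ be a vector of independent random variables, each taking values in $\mathcal{Z}$, and let $g_1, \dots, g_n$ be functions $g_i : \mathcal{Z}^n \to \mathbb{R}$ such that the following holds for any $i \in [n]$:
• $\left|\mathbb{E}[g_i(Z) \mid Z_i]\right| \le M$ a.s.,
• $\mathbb{E}\left[g_i(Z) \mid Z_{[n] \setminus \{i\}}\right] = 0$ a.s.,
• $g_i$ has a bounded difference $\beta$ with respect to all variables except for the $i$-th variable.
Then, for any $p \ge 2$,

$$\left\|\sum_{i=1}^{n} g_i(Z)\right\|_p \le 12\sqrt{2}\, p\, n\, \beta \lceil \log_2 n \rceil + 4M\sqrt{pn},$$

where the $L_p$-norm of a random variable $Z$ is denoted by $\|Z\|_p := (\mathbb{E}[|Z|^p])^{1/p}$, $p \ge 1$.

Next, we state the following well-known relationship between tail bounds and moment bounds.

Lemma 8 (Bousquet et al. 2020; Vershynin 2018). Let $a, b \in \mathbb{R}_+$. Let $Z$ be a random variable with $\|Z\|_p \le \sqrt{p}\, a + p\, b$ for all $p \ge 2$. Then, for any $\delta \in (0, 1)$, we have, with probability at least $1 - \delta$,

$$|Z| \le e\left(a\sqrt{\log\left(\frac{e}{\delta}\right)} + b \log\left(\frac{e}{\delta}\right)\right).$$

Proof of Part (c). In order to make use of Lemma 7 to obtain the generalization bounds, we introduce

$$h_i = \mathbb{E}_{\xi'_i \sim D_1}\left[\mathbb{E}_{\xi \sim D_1}\left[f(A(D^i_{m_1}, D_{m_2}); \xi)\right] - f(A(D^i_{m_1}, D_{m_2}); \xi_i)\right],$$

where $D^i_{m_1} = \{\xi_1, \xi_2, \dots, \xi_{i-1}, \xi'_i, \xi_{i+1}, \dots, \xi_{m_1}\}$, and $\xi'_i$ follows the same distribution as $\xi_i$.

Hence, we have:

$$\begin{aligned}
&\left|R(A(D_{m_1}, D_{m_2}); D_1) - R_s(A(D_{m_1}, D_{m_2}); D_{m_1})\right| \\
&= \left|\frac{1}{m_1}\sum_{i=1}^{m_1}\left(\mathbb{E}_{\xi \sim D_1} f(A(D_{m_1}, D_{m_2}); \xi) - f(A(D_{m_1}, D_{m_2}); \xi_i)\right)\right| \\
&\le \left|\frac{1}{m_1}\sum_{i=1}^{m_1}\left(\mathbb{E}_{\xi \sim D_1} f(A(D_{m_1}, D_{m_2}); \xi) - \mathbb{E}_{\xi \sim D_1,\, \xi'_i \sim D_1} f(A(D^i_{m_1}, D_{m_2}); \xi)\right)\right| \\
&\quad + \left|\frac{1}{m_1}\sum_{i=1}^{m_1}\mathbb{E}_{\xi'_i \sim D_1}\left[\mathbb{E}_{\xi \sim D_1} f(A(D^i_{m_1}, D_{m_2}); \xi) - f(A(D^i_{m_1}, D_{m_2}); \xi_i)\right]\right| \\
&\quad + \left|\frac{1}{m_1}\sum_{i=1}^{m_1}\mathbb{E}_{\xi'_i \sim D_1}\left[f(A(D^i_{m_1}, D_{m_2}); \xi_i) - f(A(D_{m_1}, D_{m_2}); \xi_i)\right]\right|.
\end{aligned}$$

It then follows from the definition of uniform stability that

$$\begin{aligned}
\left|R(A(D_{m_1}, D_{m_2}); D_1) - R_s(A(D_{m_1}, D_{m_2}); D_{m_1})\right|
&\le 2\beta + \left|\frac{1}{m_1}\sum_{i=1}^{m_1}\mathbb{E}_{\xi'_i \sim D_1}\left[\mathbb{E}_{\xi \sim D_1} f(A(D^i_{m_1}, D_{m_2}); \xi) - f(A(D^i_{m_1}, D_{m_2}); \xi_i)\right]\right| \\
&= 2\beta + \left|\frac{1}{m_1}\sum_{i=1}^{m_1} h_i\right|.
\end{aligned}$$

Notice that all conditions of Lemma 7 hold. Thus, the following can be derived for any $p \ge 2$:

$$\left\|\sum_{i=1}^{m_1} h_i(\xi)\right\|_p \le 12\sqrt{2}\, p\, m_1 \beta \lceil \log_2 m_1 \rceil + 4M\sqrt{p\, m_1}.$$

Combining Lemma 7 and Lemma 8 with $h_i$ defined above, i.e., taking $a = 4M\sqrt{m_1}$ and $b = 12\sqrt{2}\, m_1 \beta \lceil \log_2 m_1 \rceil$, we have the following inequality with probability at least $1 - \delta$:

$$\left|\sum_{i=1}^{m_1} h_i(\xi)\right| \le e\left(4M\sqrt{m_1}\sqrt{\log\frac{e}{\delta}} + 12\sqrt{2}\, m_1 \beta \lceil \log_2 m_1 \rceil \log\frac{e}{\delta}\right).$$

Dividing by $m_1$, the deviation bound follows immediately:

$$\left|R(A(D_{m_1}, D_{m_2}); D_1) - R_s(A(D_{m_1}, D_{m_2}); D_{m_1})\right| \le 2\beta + e\left(\frac{4M}{\sqrt{m_1}}\sqrt{\log\frac{e}{\delta}} + 12\sqrt{2}\, \beta \lceil \log_2 m_1 \rceil \log\frac{e}{\delta}\right).$$

The proof is completed.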
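As a numerical illustration of the deviation bound in Part (c), the sketch below evaluates its right-hand side for given values of $\beta$, $M$, $m_1$, and $\delta$. The reading of the bound used here, $2\beta + e\left(\frac{4M}{\sqrt{m_1}}\sqrt{\log(e/\delta)} + 12\sqrt{2}\,\beta\lceil\log_2 m_1\rceil\log(e/\delta)\right)$, is an assumption obtained by applying Lemma 8 to the moment bound of Lemma 7 and dividing by $m_1$; the function name `high_prob_gap_bound` is hypothetical.

```python
import math


def high_prob_gap_bound(beta, M, m1, delta):
    """Right-hand side of the high-probability generalization bound (assumed form).

    beta  : uniform stability parameter
    M     : bound on |E[h_i | xi_i]| from Lemma 7
    m1    : size of the outer dataset
    delta : failure probability, in (0, 1)
    """
    log_term = math.log(math.e / delta)
    sampling_term = 4.0 * M / math.sqrt(m1) * math.sqrt(log_term)
    stability_term = 12.0 * math.sqrt(2.0) * beta * math.ceil(math.log2(m1)) * log_term
    return 2.0 * beta + math.e * (sampling_term + stability_term)


# With beta of order 1/m1, the bound shrinks as the outer dataset grows.
small = high_prob_gap_bound(beta=1.0 / 100, M=1.0, m1=100, delta=0.05)
large = high_prob_gap_bound(beta=1.0 / 10000, M=1.0, m1=10000, delta=0.05)
```

The stability term carries the $\log m_1$ factor, so for $\beta = O(1/m_1)$ the bound is dominated by the $1/\sqrt{m_1}$ sampling term.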

C MAIN PROOF

C.1 APPROXIMATE EXPANSIVITY OF UPDATE RULES

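With step sizes $\alpha_x$ and $\alpha_y$, the update rule for the single-timescale method can be written as

$$G_s\begin{pmatrix} x \\ y \end{pmatrix} := \begin{pmatrix} x - \alpha_x \nabla f(x, y) \\ y - \alpha_y \nabla_y g(x, y) \end{pmatrix}.$$

Definition 5 (expansivity). An update rule $G$ is $\eta$-expansive if for every $x, x' \in \mathbb{R}^{d_1}$, $y, y' \in \mathbb{R}^{d_2}$:

$$\|G(x, y) - G(x', y')\|_2 \le \eta \sqrt{\|x - x'\|_2^2 + \|y - y'\|_2^2}.$$

Lemma 9. Suppose that Assumptions 1 and 2 hold for Problem (1). Then:
1. If $f$ and $g$ are non-convex functions, then $G_s$ is $(1 + \max\{l_f \alpha_x, l_g \alpha_y\})$-expansive with step sizes $\alpha_x$, $\alpha_y$.
2. If $f$ and $g$ are convex functions, then $G_s$ is $\sqrt{2 + 2\max\{(l_f \alpha_x)^2, (l_g \alpha_y)^2\}}$-expansive with step sizes $\alpha_x$, $\alpha_y$.
3. If $f$ and $g$ are strongly convex with parameters $\mu_f$ and $\mu_g$ respectively, then $G_s$ is $\sqrt{2\left(1 - 2\alpha_x(\mu_f + \mu_g) + \alpha_x^2 l^2\right)}$-expansive with step size

$$\frac{(\mu_f + \mu_g) - \sqrt{(\mu_f + \mu_g)^2 - 0.5\, l^2}}{l^2} \le \alpha_x = \alpha_y \le \min\left\{\frac{1}{\mu_f + \mu_g},\; \frac{(\mu_f + \mu_g) + \sqrt{(\mu_f + \mu_g)^2 - 0.5\, l^2}}{l^2}\right\}.$$

Proof. In Case 1, with the NC-NC objectives and the smoothness of the objectives from Assumptions 1 and 2, we have

$$\begin{aligned}
\left\|G_s\begin{pmatrix} x \\ y \end{pmatrix} - G_s\begin{pmatrix} x' \\ y' \end{pmatrix}\right\|
&= \left\|\begin{pmatrix} x - x' - \alpha_x(\nabla f(x, y) - \nabla f(x', y')) \\ y - y' - \alpha_y(\nabla_y g(x, y) - \nabla_y g(x', y')) \end{pmatrix}\right\| \\
&\le \left\|\begin{pmatrix} x - x' \\ y - y' \end{pmatrix}\right\| + \left\|\begin{pmatrix} \alpha_x(\nabla f(x, y) - \nabla f(x', y')) \\ \alpha_y(\nabla_y g(x, y) - \nabla_y g(x', y')) \end{pmatrix}\right\| \\
&\le (1 + \max\{l_f \alpha_x, l_g \alpha_y\}) \left\|\begin{pmatrix} x - x' \\ y - y' \end{pmatrix}\right\|.
\end{aligned}$$

In Case 2, by the monotonicity of the gradient of a convex objective, we have

$$\langle x - x', \alpha_x(\nabla f(x, y) - \nabla f(x', y))\rangle \ge 0, \qquad \langle y - y', \alpha_y(\nabla_y g(x', y) - \nabla_y g(x', y'))\rangle \ge 0.$$

Thus, the stated result follows from

$$\begin{aligned}
\left\|G_s\begin{pmatrix} x \\ y \end{pmatrix} - G_s\begin{pmatrix} x' \\ y \end{pmatrix}\right\|^2
&= \left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2 - 2\begin{pmatrix} x - x' \\ y - y \end{pmatrix}^{T}\begin{pmatrix} \alpha_x(\nabla f(x, y) - \nabla f(x', y)) \\ \alpha_y(\nabla_y g(x, y) - \nabla_y g(x', y)) \end{pmatrix} + \left\|\begin{pmatrix} \alpha_x(\nabla f(x, y) - \nabla f(x', y)) \\ \alpha_y(\nabla_y g(x, y) - \nabla_y g(x', y)) \end{pmatrix}\right\|^2 \\
&\le \max\{(l_f \alpha_x)^2, (l_g \alpha_y)^2\}\left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2 + \|x - x'\|^2,
\end{aligned} \tag{4}$$

and

$$\begin{aligned}
\left\|G_s\begin{pmatrix} x' \\ y \end{pmatrix} - G_s\begin{pmatrix} x' \\ y' \end{pmatrix}\right\|^2
&= \left\|\begin{pmatrix} x' - x' \\ y - y' \end{pmatrix}\right\|^2 - 2\begin{pmatrix} x' - x' \\ y - y' \end{pmatrix}^{T}\begin{pmatrix} \alpha_x(\nabla f(x', y) - \nabla f(x', y')) \\ \alpha_y(\nabla_y g(x', y) - \nabla_y g(x', y')) \end{pmatrix} + \left\|\begin{pmatrix} \alpha_x(\nabla f(x', y) - \nabla f(x', y')) \\ \alpha_y(\nabla_y g(x', y) - \nabla_y g(x', y')) \end{pmatrix}\right\|^2 \\
&\le \max\{(l_f \alpha_x)^2, (l_g \alpha_y)^2\}\left\|\begin{pmatrix} x' - x' \\ y - y' \end{pmatrix}\right\|^2 + \|y - y'\|^2.
\end{aligned} \tag{5}$$

Combining inequalities (4) and (5) with the inequality $\left(\sum_{i=1}^{k} a_i\right)^2 \le k \sum_{i=1}^{k} a_i^2$, we can derive the expansivity of the update rule $G_s$ under the convexity condition:

$$\left\|G_s\begin{pmatrix} x \\ y \end{pmatrix} - G_s\begin{pmatrix} x' \\ y' \end{pmatrix}\right\|^2 \le \left(2 + 2\max\{(l_f \alpha_x)^2, (l_g \alpha_y)^2\}\right)\left\|\begin{pmatrix} x - x' \\ y - y' \end{pmatrix}\right\|^2.$$

If $f$ and $g$ are strongly convex, then $\tilde f(x, y) = f(x, y) - \frac{\mu_f}{2}(\|x\|^2 + \|y\|^2)$ and $\tilde g(x, y) = g(x, y) - \frac{\mu_g}{2}(\|x\|^2 + \|y\|^2)$ are convex. With the above conclusions, we can derive the following:

$$\begin{aligned}
\left\|G_s\begin{pmatrix} x \\ y \end{pmatrix} - G_s\begin{pmatrix} x' \\ y \end{pmatrix}\right\|^2
&= \left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2 - 2\alpha_x\begin{pmatrix} x - x' \\ y - y \end{pmatrix}^{T}\begin{pmatrix} \nabla f(x, y) - \nabla f(x', y) \\ \nabla_y g(x, y) - \nabla_y g(x', y) \end{pmatrix} + \alpha_x^2\left\|\begin{pmatrix} \nabla f(x, y) - \nabla f(x', y) \\ \nabla_y g(x, y) - \nabla_y g(x', y) \end{pmatrix}\right\|^2 \\
&= \left(1 - (\alpha_x \mu_f + \alpha_x \mu_g)\right)^2\left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2 + \alpha_x^2\left\|\begin{pmatrix} \nabla \tilde f(x, y) - \nabla \tilde f(x', y) \\ \nabla_y \tilde g(x, y) - \nabla_y \tilde g(x', y) \end{pmatrix}\right\|^2 \\
&\quad - 2\left(1 - \alpha_x \mu_f - \alpha_x \mu_g\right)\alpha_x\begin{pmatrix} x - x' \\ y - y \end{pmatrix}^{T}\begin{pmatrix} \nabla \tilde f(x, y) - \nabla \tilde f(x', y) \\ \nabla_y \tilde g(x, y) - \nabla_y \tilde g(x', y) \end{pmatrix} \\
&\le \left(1 - 2\alpha_x(\mu_f + \mu_g) + \alpha_x^2 l^2\right)\|x - x'\|^2.
\end{aligned}$$

The last inequality follows from the smoothness of $f$ and $g$, where we assume $l = \max\{l_f, l_g\}$ for simplicity. The details are as follows:

$$\begin{aligned}
l^2\left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2
&\ge \left\|\begin{pmatrix} \nabla f(x, y) - \nabla f(x', y) \\ \nabla_y g(x, y) - \nabla_y g(x', y) \end{pmatrix}\right\|^2 \\
&= \left\|\begin{pmatrix} \nabla \tilde f(x, y) - \nabla \tilde f(x', y) \\ \nabla_y \tilde g(x, y) - \nabla_y \tilde g(x', y) \end{pmatrix}\right\|^2 + (\mu_f + \mu_g)^2\left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2 + 2(\mu_f + \mu_g)\begin{pmatrix} x - x' \\ y - y \end{pmatrix}^{T}\begin{pmatrix} \nabla \tilde f(x, y) - \nabla \tilde f(x', y) \\ \nabla_y \tilde g(x, y) - \nabla_y \tilde g(x', y) \end{pmatrix} \\
&\ge \left\|\begin{pmatrix} \nabla \tilde f(x, y) - \nabla \tilde f(x', y) \\ \nabla_y \tilde g(x, y) - \nabla_y \tilde g(x', y) \end{pmatrix}\right\|^2 + (\mu_f + \mu_g)^2\left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2.
\end{aligned}$$

Similar to the convex case, we obtain:

$$\left\|G_s\begin{pmatrix} x \\ y \end{pmatrix} - G_s\begin{pmatrix} x' \\ y' \end{pmatrix}\right\|^2 \le 2\left(1 - 2\alpha_x(\mu_f + \mu_g) + \alpha_x^2 l^2\right)\left\|\begin{pmatrix} x - x' \\ y - y' \end{pmatrix}\right\|^2.$$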
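The step-size interval in Case 3 of Lemma 9 can be checked numerically. The sketch below assumes the reading in which the squared expansivity factor is $2(1 - 2\alpha(\mu_f + \mu_g) + \alpha^2 l^2)$ and the interval endpoints are $\frac{(\mu_f + \mu_g) \mp \sqrt{(\mu_f + \mu_g)^2 - 0.5\,l^2}}{l^2}$, capped above by $\frac{1}{\mu_f + \mu_g}$. The function names and the values $\mu = 1.0$, $l = 1.2$ are hypothetical; `mu` stands for the sum $\mu_f + \mu_g$.

```python
import math


def sc_step_size_interval(mu, l):
    """Admissible step sizes for the strongly convex case (assumed reading of Lemma 9).

    mu : mu_f + mu_g, the sum of the strong convexity parameters
    l  : smoothness constant, l = max(l_f, l_g)
    """
    disc = mu * mu - 0.5 * l * l
    if disc < 0:
        raise ValueError("interval is empty: need (mu_f + mu_g)^2 >= 0.5 * l^2")
    root = math.sqrt(disc)
    lower = (mu - root) / (l * l)
    upper = min(1.0 / mu, (mu + root) / (l * l))
    return lower, upper


def squared_factor(alpha, mu, l):
    """Squared expansivity factor 2 * (1 - 2 * alpha * mu + alpha^2 * l^2)."""
    return 2.0 * (1.0 - 2.0 * alpha * mu + (alpha * l) ** 2)


lower, upper = sc_step_size_interval(mu=1.0, l=1.2)
# Sample the interval and evaluate the factor at each step size.
grid = [lower + (upper - lower) * i / 50 for i in range(51)]
factors = [squared_factor(a, mu=1.0, l=1.2) for a in grid]
```

The endpoints are exactly the roots of $2(1 - 2\alpha\mu + \alpha^2 l^2) = 1$, so the factor does not exceed 1 anywhere inside the interval, which is what makes the geometric series in the proof of Theorem 2(a) summable.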

C.2 SINGLE TIMESCALE

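We first introduce the following lemma before providing the proof of Theorem 2.

Lemma 10 (Hardt et al. (2016)). Consider two sequences of updates $G_s^1, \dots, G_s^K$ and $(G_s^1)', \dots, (G_s^K)'$ with initial points $x_0 = x'_0$, $y_0 = y'_0$. Define $\delta_k = \sqrt{\|x_k - x'_k\|^2 + \|y_k - y'_k\|^2}$. Then, we have:

$$\delta_{k+1} \le \begin{cases} \eta\, \delta_k & \text{if } G_s^k = (G_s^k)' \text{ is } \eta\text{-expansive}, \\ \min(\eta, 1)\, \delta_k + 2\sigma & \text{if } \sup_{x, y}\left\|\begin{pmatrix} x \\ y \end{pmatrix} - G\begin{pmatrix} x \\ y \end{pmatrix}\right\| \le \sigma \text{ and } G_s^k \text{ is } \eta\text{-expansive}. \end{cases}$$

Proof. The first part of the inequality follows directly from the definition of expansivity and the assumption $G_s^k = (G_s^k)'$. For the second bound, note that:

$$\begin{aligned}
\delta_{k+1} &= \left\|G_s\begin{pmatrix} x_k \\ y_k \end{pmatrix} - G'_s\begin{pmatrix} x'_k \\ y'_k \end{pmatrix}\right\| \\
&\le \left\|G_s\begin{pmatrix} x_k \\ y_k \end{pmatrix} - \begin{pmatrix} x_k \\ y_k \end{pmatrix} + \begin{pmatrix} x'_k \\ y'_k \end{pmatrix} - G'_s\begin{pmatrix} x'_k \\ y'_k \end{pmatrix}\right\| + \left\|\begin{pmatrix} x_k - x'_k \\ y_k - y'_k \end{pmatrix}\right\| \\
&\le \delta_k + \left\|G_s\begin{pmatrix} x_k \\ y_k \end{pmatrix} - \begin{pmatrix} x_k \\ y_k \end{pmatrix}\right\| + \left\|G'_s\begin{pmatrix} x'_k \\ y'_k \end{pmatrix} - \begin{pmatrix} x'_k \\ y'_k \end{pmatrix}\right\| \le \delta_k + 2\sigma.
\end{aligned}$$

Also, $\delta_{k+1}$ can be bounded as:

$$\begin{aligned}
\delta_{k+1} &= \left\|G_s\begin{pmatrix} x_k \\ y_k \end{pmatrix} - G'_s\begin{pmatrix} x'_k \\ y'_k \end{pmatrix}\right\| \\
&\le \left\|G_s\begin{pmatrix} x_k \\ y_k \end{pmatrix} - G_s\begin{pmatrix} x'_k \\ y'_k \end{pmatrix}\right\| + \left\|G_s\begin{pmatrix} x'_k \\ y'_k \end{pmatrix} - G'_s\begin{pmatrix} x'_k \\ y'_k \end{pmatrix}\right\| \\
&\le \left\|G_s\begin{pmatrix} x_k \\ y_k \end{pmatrix} - G_s\begin{pmatrix} x'_k \\ y'_k \end{pmatrix}\right\| + \left\|\begin{pmatrix} x'_k \\ y'_k \end{pmatrix} - G_s\begin{pmatrix} x'_k \\ y'_k \end{pmatrix}\right\| + \left\|\begin{pmatrix} x'_k \\ y'_k \end{pmatrix} - G'_s\begin{pmatrix} x'_k \\ y'_k \end{pmatrix}\right\| \le \eta\, \delta_k + 2\sigma.
\end{aligned}$$

Combining the above completes the proof of Lemma 10.

Now, we are ready to prove Theorem 2.

Proof of Part (a). Suppose that $D_{m_1}$ and $D'_{m_1}$ are two neighboring sets differing in only one sample. Consider the updates $G_s^1, \dots, G_s^K$ and $(G_s^1)', \dots, (G_s^K)'$. The example chosen by the algorithm at step $k$ is the same in $D_{m_1}$ and $D'_{m_1}$ with probability $1 - 1/m_1$ and different with probability $1/m_1$. In the former case the update rules are identical and the expansivity of Lemma 9 (Case 3) applies; in the latter case we use the second bound of Lemma 10.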
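$$\begin{aligned}
\mathbb{E}[\delta_{k+1}] &\le \left(1 - \frac{1}{m_1}\right)\left(2\left(1 - 2\alpha_x(\mu_f + \mu_g) + \alpha_x^2 l^2\right)\right)^{1/2}\mathbb{E}[\delta_k] + \frac{1}{m_1}\mathbb{E}[\delta_k] + \frac{1}{m_1}\, 2\sqrt{(\alpha_x L_f)^2 + (\alpha_x L_g)^2} \\
&\le \left(2\left(1 - 2\alpha_x(\mu_f + \mu_g) + \alpha_x^2 l^2\right)\right)^{1/2}\mathbb{E}[\delta_k] + \frac{2}{m_1}\sqrt{(\alpha_x L_f)^2 + (\alpha_x L_g)^2} \\
&\le \frac{2\sqrt{(\alpha_x L_f)^2 + (\alpha_x L_g)^2}}{m_1}\sum_{i=0}^{k}\left(2\left(1 - 2\alpha_x(\mu_f + \mu_g) + \alpha_x^2 l^2\right)\right)^{i/2} \\
&\le \frac{2\sqrt{(\alpha_x L_f)^2 + (\alpha_x L_g)^2}}{m_1}\sum_{i=0}^{\infty}\left(2\left(1 - 2\alpha_x(\mu_f + \mu_g) + \alpha_x^2 l^2\right)\right)^{i/2} \\
&\overset{(1)}{\le} \frac{2\sqrt{(\alpha_x L_f)^2 + (\alpha_x L_g)^2}}{m_1}\sum_{i=0}^{\infty}\left(1 - 2\alpha_x(\mu_f + \mu_g) + \alpha_x^2 l^2 + 0.5\right)^{i} \\
&= \frac{\sqrt{(\alpha_x L_f)^2 + (\alpha_x L_g)^2}}{m_1\left(\alpha_x(\mu_f + \mu_g) - \frac{\alpha_x^2 l^2}{2} + 0.25\right)} \\
&= \frac{\sqrt{L_f^2 + L_g^2}}{m_1\left(\mu_f + \mu_g - (\alpha_x l)^2/2 + 0.25\right)}.
\end{aligned}$$

Here (1) comes from the arithmetic-geometric mean inequality $\sqrt{ab} \le (a + b)/2$ for any $a, b \ge 0$ and the assumption

$$\frac{(\mu_f + \mu_g) - \sqrt{(\mu_f + \mu_g)^2 - 0.5\, l^2}}{l^2} \le \alpha_x \le \frac{(\mu_f + \mu_g) + \sqrt{(\mu_f + \mu_g)^2 - 0.5\, l^2}}{l^2},$$

which finishes the proof.

Proof of Part (b). The proof of Part (b) is analogous to the above, so we use the same notation.

$$\begin{aligned}
\mathbb{E}[\delta_{k+1}] &\le \left(1 - \frac{1}{m_1}\right)\left(2 + 2\max\{l_f^2\alpha_x^2, l_g^2\alpha_y^2\}\right)^{1/2}\mathbb{E}[\delta_k] + \frac{1}{m_1}\mathbb{E}[\delta_k] + \frac{2}{m_1}\sqrt{L_f^2\alpha_x^2 + L_g^2\alpha_y^2} \\
&\le \left(2 + 2\max\{l_f^2\alpha_x^2, l_g^2\alpha_y^2\}\right)^{1/2}\mathbb{E}[\delta_k] + \frac{2\sqrt{L_f^2\alpha_x^2 + L_g^2\alpha_y^2}}{m_1},
\end{aligned}$$

so that

$$\mathbb{E}[\delta_k] \le \frac{2\sqrt{L_f^2\alpha_x^2 + L_g^2\alpha_y^2}}{m_1} \cdot \frac{\left(2 + 2\max\{l_f^2\alpha_x^2, l_g^2\alpha_y^2\}\right)^{\frac{k+1}{2}} - 1}{\sqrt{2 + 2\max\{l_f^2\alpha_x^2, l_g^2\alpha_y^2\}} - 1},$$

$$\mathbb{E}[\delta_k] \le O\left(\frac{\sqrt{L_f^2\alpha_x^2 + L_g^2\alpha_y^2}\left(2 + 2\max\{l_f^2\alpha_x^2, l_g^2\alpha_y^2\}\right)^{\frac{k+1}{2}}}{m_1}\right).$$

To prove stability in the NC-NC case, we introduce the following lemma.

Lemma 11 (Hardt et al. (2016)). Assume that $f(x, y; \xi)$ is $L_f$-Lipschitz continuous and $0 \le f(x, y; \xi) \le 1$. Let $D_{m_1}$ and $D'_{m_1}$ be two datasets differing in only one sample. Denote by $(x_K, y_K)$ and $(x'_K, y'_K)$ the outputs of $K$ steps of SSGD (the single-timescale algorithm) on $D_{m_1}$ and $D'_{m_1}$, respectively. Then, the following holds for every $k \in \{0, 1, \dots, K\}$ and every $k_0 \le k$, where $\delta_k = \sqrt{\|x_k - x'_k\|^2 + \|y_k - y'_k\|^2}$:

$$\mathbb{E}\left[|f(x_k, y_k; \xi) - f(x'_k, y'_k; \xi)|\right] \le \frac{k_0}{m_1} + L_f\, \mathbb{E}\left[\delta_k \mid \delta_{k_0} = 0\right].$$

Proof of Part (c). Applying Lemma 11, we are ready to prove the NC-NC case.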
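Analogous to the previous case, we have:

$$\begin{aligned}
\mathbb{E}[\delta_{k+1}] &\le \left(1 - \frac{1}{m_1}\right)\left(1 + \frac{cl}{k}\right)\mathbb{E}[\delta_k] + \frac{1}{m_1}\left(1 + \frac{cl}{k}\right)\mathbb{E}[\delta_k] + \frac{2c\sqrt{l_f^2 + l_g^2}}{m_1 k} \\
&= \left(1 + \frac{cl}{k}\right)\mathbb{E}[\delta_k] + \frac{2c\sqrt{l_f^2 + l_g^2}}{m_1 k}.
\end{aligned}$$

The following can be derived:

$$\begin{aligned}
\mathbb{E}[\delta_K \mid \delta_{k_0} = 0] &\le \sum_{k=k_0+1}^{K}\left(\prod_{t=k+1}^{K}\left(1 + \frac{cl}{t}\right)\right)\frac{2c\sqrt{l_f^2 + l_g^2}}{m_1 k} \\
&\le \sum_{k=k_0+1}^{K}\left(\prod_{t=k+1}^{K}\exp\left(\frac{cl}{t}\right)\right)\frac{2c\sqrt{l_f^2 + l_g^2}}{m_1 k} \\
&\le \sum_{k=k_0+1}^{K}\exp\left(\sum_{t=k+1}^{K}\frac{cl}{t}\right)\frac{2c\sqrt{l_f^2 + l_g^2}}{m_1 k} \\
&\le \sum_{k=k_0+1}^{K}\exp\left(cl \cdot \log(K/k)\right)\frac{2c\sqrt{l_f^2 + l_g^2}}{m_1 k} \\
&\le \frac{2c\sqrt{l_f^2 + l_g^2}}{m_1} K^{cl}\sum_{k=k_0+1}^{K} k^{-cl-1} \\
&\le \frac{2\sqrt{l_f^2 + l_g^2}}{m_1 l}\left(\frac{K}{k_0}\right)^{cl}.
\end{aligned}$$

Hence, Lemma 11 indicates:

$$\mathbb{E}\left[|f(x, y) - f(x', y')|\right] \le \frac{k_0}{m_1} + \frac{2L_f\sqrt{l_f^2 + l_g^2}}{m_1 l}\left(\frac{K}{k_0}\right)^{cl}.$$

The right-hand side is approximately minimized when

$$k_0 = \left(2cL_f\sqrt{l_f^2 + l_g^2}\right)^{\frac{1}{cl+1}} \cdot K^{\frac{cl}{cl+1}}.$$

Therefore, we have

$$\beta \le O\left(\frac{\left(2cL_f\sqrt{l_f^2 + l_g^2}\right)^{\frac{1}{cl+1}} \cdot K^{\frac{cl}{cl+1}}}{m_1\, cl}\right)$$

for argument stability.

C.3 TWO-TIMESCALE SGD (TSGD)

C.3.1 STANDARD SETTINGS

With step sizes $\alpha_x$ and $\alpha_y$, the update rule for the two-timescale method can be written as:

$$G_T\begin{pmatrix} x_k \\ y_k \end{pmatrix} := \begin{pmatrix} x_k - \alpha_x \nabla f(x_k, y_k^T) \\ y_k - \alpha_y \sum_{t=1}^{T}\nabla_y g(x_k, y_k^t) \end{pmatrix}.$$

Analogous to the single-timescale case, we first provide the expansivity of the update rules.

Lemma 12. Suppose that Assumptions 1 and 2 hold for Problem (1). Let $\alpha l = \max\left\{\alpha_x l_f, \frac{1 + (\alpha_y l_g)^2}{1 - \alpha_y l_g}\right\}$ for simplicity and assume $\alpha_y l_g \le 1$. Then:
1. If $f$ and $g$ are non-convex functions, $G_T$ is $(1 + \alpha l T)$-expansive.
2. If $f$ and $g$ are convex functions, $G_T$ is $(1 + \alpha l)$-expansive with step sizes $\alpha_x$, $\alpha_y$.
3. If $f$ and $g$ are strongly convex with parameters $\mu_f$ and $\mu_g$ respectively, $G_T$ is $(1 + \alpha l)$-expansive with step size $\alpha_x = \alpha_y \le \frac{1}{\mu_f + \mu_g}$.

Proof.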
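In Case 1, with the NC-NC objectives, the triangle inequality gives:

$$\left\|G_T\begin{pmatrix} x \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y' \end{pmatrix}\right\| \le \left\|G_T\begin{pmatrix} x \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y \end{pmatrix}\right\| + \left\|G_T\begin{pmatrix} x' \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y' \end{pmatrix}\right\|.$$

The first term can be bounded as follows:

$$\left\|G_T\begin{pmatrix} x \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y \end{pmatrix}\right\| = \left\|\begin{pmatrix} x - x' - \alpha_x(\nabla f(x, y) - \nabla f(x', y)) \\ y - y - \alpha_y\sum_{t=1}^{T}(\nabla_y g(x, y^t) - \nabla_y g(x', y^t)) \end{pmatrix}\right\| \le (1 + \alpha_y T l_g)\|x - x'\|.$$

The second term can be bounded as follows:

$$\begin{aligned}
\left\|G_T\begin{pmatrix} x' \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y' \end{pmatrix}\right\|
&= \left\|\begin{pmatrix} x' - x' - \alpha_x(\nabla f(x', y) - \nabla f(x', y')) \\ y - y' - \sum_{t=0}^{T-1}\alpha_y\left(\nabla_y g(x', y^t) - \nabla_y g(x', (y^t)')\right) \end{pmatrix}\right\| \\
&\le \left\|\begin{pmatrix} x' - x' \\ y - y' \end{pmatrix}\right\| + \left\|\begin{pmatrix} \alpha_x(\nabla f(x', y) - \nabla f(x', y')) \\ \sum_{t=0}^{T-1}\alpha_y\left(\nabla_y g(x', y^t) - \nabla_y g(x', (y^t)')\right) \end{pmatrix}\right\|.
\end{aligned}$$

From the Lipschitz continuity of the gradient, we have:

$$\left\|\sum_{t=0}^{T-1}\alpha_y\left(\nabla_y g(x, y^t) - \nabla_y g(x, (y^t)')\right)\right\| \le \sum_{t=0}^{T-1}\alpha_y l_g\left\|y^t - (y^t)'\right\|.$$

Now we consider the $t$-th update:

$$\begin{aligned}
\alpha_y l_g\left\|y^t - (y^t)'\right\| &= \alpha_y l_g\left\|y^{t-1} - \alpha_y\nabla_y g(x', y^{t-1}) - (y^{t-1})' + \alpha_y\nabla_y g(x', (y^{t-1})')\right\| \\
&\le \alpha_y l_g\left\|y^{t-1} - (y^{t-1})'\right\| + (\alpha_y l_g)^2\left\|y^{t-1} - (y^{t-1})'\right\| \\
&\;\;\vdots \\
&\le (\alpha_y l_g)^t\left\|y^0 - (y^0)'\right\| + (\alpha_y l_g)^{t+1}\left\|y^0 - (y^0)'\right\|.
\end{aligned}$$

Summing both sides over $t$, we have:

$$\begin{aligned}
\sum_{t=0}^{T-1}\alpha_y l_g\left\|y^t - (y^t)'\right\| &\le \alpha_y l_g\left\|y^0 - (y^0)'\right\| + \sum_{t=1}^{T-1}\left((\alpha_y l_g)^t\left\|y^0 - (y^0)'\right\| + (\alpha_y l_g)^{t+1}\left\|y^0 - (y^0)'\right\|\right) \\
&= \left(\frac{1 - (\alpha_y l_g)^T}{1 - \alpha_y l_g} + \frac{(\alpha_y l_g)^2 - (\alpha_y l_g)^{T+1}}{1 - \alpha_y l_g}\right)\left\|y^0 - (y^0)'\right\| \\
&= \frac{1 - (\alpha_y l_g)^T + (\alpha_y l_g)^2 - (\alpha_y l_g)^{T+1}}{1 - \alpha_y l_g}\left\|y^0 - (y^0)'\right\| \\
&\le \frac{1 + (\alpha_y l_g)^2}{1 - \alpha_y l_g}\left\|y - y'\right\|.
\end{aligned}$$

Let $\alpha l = \max\left\{\alpha_y l_g, \frac{1 + (\alpha_y l_g)^2}{1 - \alpha_y l_g}\right\}$. Then:

$$\left\|G_T\begin{pmatrix} x \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y' \end{pmatrix}\right\| \le (1 + T\alpha l)\left\|\begin{pmatrix} x - x' \\ y - y' \end{pmatrix}\right\|.$$

In Case 2, by the monotonicity of the gradient of a convex objective, we have:

$$\langle x - x', \alpha_x(\nabla f(x, y) - \nabla f(x', y))\rangle \ge 0, \qquad \langle y - y', \alpha_y(\nabla_y g(x', y) - \nabla_y g(x', y'))\rangle \ge 0.$$

Thus, the stated result follows from

$$\begin{aligned}
\left\|G_T\begin{pmatrix} x \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y \end{pmatrix}\right\|^2
&= \left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2 - 2\begin{pmatrix} x - x' \\ y - y \end{pmatrix}^{T}\begin{pmatrix} \alpha_x(\nabla f(x, y) - \nabla f(x', y)) \\ \sum_{t=0}^{T-1}\alpha_y\left(\nabla_y g(x, y^t) - \nabla_y g(x, (y^t)')\right) \end{pmatrix} \\
&\quad + \left\|\begin{pmatrix} \alpha_x(\nabla f(x, y) - \nabla f(x', y)) \\ \sum_{t=0}^{T-1}\alpha_y\left(\nabla_y g(x, y^t) - \nabla_y g(x, (y^t)')\right) \end{pmatrix}\right\|^2 \\
&\le \max\left\{(l_f\alpha_x)^2, \left(\frac{1 + (\alpha_y l_g)^2}{1 - \alpha_y l_g}\right)^2\right\}\left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2 + \|x - x'\|^2,
\end{aligned} \tag{6}$$

and the second decomposition can be obtained as in the NC-NC case:

$$\left\|G_T\begin{pmatrix} x' \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y' \end{pmatrix}\right\| \le \left(1 + \max\left\{l_f\alpha_x, \frac{1 + (\alpha_y l_g)^2}{1 - \alpha_y l_g}\right\}\right)\|y - y'\|. \tag{7}$$

Let $\alpha l = \max\left\{\alpha_x l_f, \frac{1 + (\alpha_y l_g)^2}{1 - \alpha_y l_g}\right\}$. Combining (6) and (7) with the inequality $1 + (\alpha l)^2 \le (1 + \alpha l)^2$, we can derive the expansivity of the update rule $G_T$ under the convexity condition:

$$\left\|G_T\begin{pmatrix} x \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y' \end{pmatrix}\right\| \le (1 + \alpha l)\left\|\begin{pmatrix} x - x' \\ y - y' \end{pmatrix}\right\|.$$

If $f$ and $g$ are strongly convex, then $\tilde f(x, y) = f(x, y) - \frac{\mu_f}{2}(\|x\|^2 + \|y\|^2)$ and $\tilde g(x, y) = g(x, y) - \frac{\mu_g}{2}(\|x\|^2 + \|y\|^2)$ are convex. Let $\alpha_x = \alpha_y = \alpha$ and denote $\alpha l = \max\left\{\alpha_x l_f, \frac{1 + (\alpha_y l_g)^2}{1 - \alpha_y l_g}\right\}$. With the conclusions from the convex case, we can derive the following:

$$\begin{aligned}
\left\|G_T\begin{pmatrix} x \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y \end{pmatrix}\right\|^2
&= \left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2 - 2\alpha_x\begin{pmatrix} x - x' \\ y - y \end{pmatrix}^{T}\begin{pmatrix} \nabla f(x, y) - \nabla f(x', y) \\ \sum_{t=0}^{T-1}\left(\nabla_y g(x, y^t) - \nabla_y g(x, (y^t)')\right) \end{pmatrix} \\
&\quad + \alpha_x^2\left\|\begin{pmatrix} \nabla f(x, y) - \nabla f(x', y) \\ \sum_{t=0}^{T-1}\left(\nabla_y g(x, y^t) - \nabla_y g(x, (y^t)')\right) \end{pmatrix}\right\|^2 \\
&= \left(1 - (\alpha_x\mu_f + \alpha_x\mu_g)\right)^2\left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2 + \alpha_x^2\left\|\begin{pmatrix} \nabla\tilde f(x, y) - \nabla\tilde f(x', y) \\ \sum_{t=0}^{T-1}\left(\nabla_y\tilde g(x, y^t) - \nabla_y\tilde g(x, (y^t)')\right) \end{pmatrix}\right\|^2 \\
&\quad - 2\left(1 - \alpha_x\mu_f - \alpha_x\mu_g\right)\alpha_x\begin{pmatrix} x - x' \\ y - y \end{pmatrix}^{T}\begin{pmatrix} \nabla\tilde f(x, y) - \nabla\tilde f(x', y) \\ \sum_{t=0}^{T-1}\left(\nabla_y\tilde g(x, y^t) - \nabla_y\tilde g(x, (y^t)')\right) \end{pmatrix} \\
&\le \left(1 - 2\alpha(\mu_f + \mu_g) + \alpha^2 l^2\right)\|x - x'\|^2.
\end{aligned}$$

The last inequality follows from the smoothness of $f$ and $g$, where we assume $l = \max\left\{l_f, \frac{1 + (\alpha_y l_g)^2}{(1 - \alpha_y l_g)\alpha_y}\right\}$ for simplicity. The details are as follows:

$$\begin{aligned}
l^2\left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2
&\ge \left\|\begin{pmatrix} \nabla f(x, y) - \nabla f(x', y) \\ \sum_{t=0}^{T-1}\left(\nabla_y g(x, y^t) - \nabla_y g(x, (y^t)')\right) \end{pmatrix}\right\|^2 \\
&= \left\|\begin{pmatrix} \nabla\tilde f(x, y) - \nabla\tilde f(x', y) \\ \sum_{t=0}^{T-1}\left(\nabla_y\tilde g(x, y^t) - \nabla_y\tilde g(x, (y^t)')\right) \end{pmatrix}\right\|^2 + (\mu_f + \mu_g)^2\left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2 \\
&\quad + 2(\mu_f + \mu_g)\begin{pmatrix} x - x' \\ y - y \end{pmatrix}^{T}\begin{pmatrix} \nabla\tilde f(x, y) - \nabla\tilde f(x', y) \\ \sum_{t=0}^{T-1}\left(\nabla_y\tilde g(x, y^t) - \nabla_y\tilde g(x, (y^t)')\right) \end{pmatrix} \\
&\ge \left\|\begin{pmatrix} \nabla\tilde f(x, y) - \nabla\tilde f(x', y) \\ \sum_{t=0}^{T-1}\left(\nabla_y\tilde g(x, y^t) - \nabla_y\tilde g(x, (y^t)')\right) \end{pmatrix}\right\|^2 + (\mu_f + \mu_g)^2\left\|\begin{pmatrix} x - x' \\ y - y \end{pmatrix}\right\|^2.
\end{aligned}$$

Similar to the convex case, we obtain:

$$\left\|G_T\begin{pmatrix} x \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y' \end{pmatrix}\right\| \le (1 + \alpha l)\left\|\begin{pmatrix} x - x' \\ y - y' \end{pmatrix}\right\|.$$

Because the remaining details of the proof of Lemma 12 are similar to those of Lemma 9, we omit them.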
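Next, we give a bound for the update rule $G_T$ in preparation for the proof of Theorem 3. Since $g(\cdot)$ is an $l_g$-smooth function, we have:

$$\begin{aligned}
g(x, y^{t+1}) &\le g(x, y^t) + \left\langle\nabla g(x, y^t), y^{t+1} - y^t\right\rangle + \frac{l_g}{2}\left\|y^{t+1} - y^t\right\|^2 \\
&\le g(x, y^t) - \left\langle\nabla g(x, y^t), \alpha_y\nabla g(x, y^t)\right\rangle + \frac{l_g}{2}\left\|\alpha_y\nabla g(x, y^t)\right\|^2 \\
&\le g(x, y^t) - \alpha_y\left(1 - \frac{\alpha_y l_g}{2}\right)\left\|\nabla g(x, y^t)\right\|^2.
\end{aligned}$$

Summing both sides from $t = 1$ to $t = T$ and applying the Cauchy-Schwarz inequality, we obtain:

$$\sum_{t=1}^{T}\left\|\nabla g(x, y^t)\right\|^2 \le \frac{g(x, y_1) - g(x, y_T)}{\alpha_y(2 - \alpha_y l_g)} \;\Rightarrow\; \left\|\sum_{t=1}^{T}\nabla g(x, y^t)\right\|^2 \le T\sum_{t=1}^{T}\left\|\nabla g(x, y^t)\right\|^2 \le \frac{T\left(g(x, y_1) - g(x, y_T)\right)}{\alpha_y(2 - \alpha_y l_g)}.$$

Hence, the update of $G_T$ is bounded by

$$\sqrt{L_f^2\alpha_x^2 + \left(\frac{T\left(g(x, y_1) - g(x, y_T)\right)}{\alpha_y(2 - \alpha_y l_g)}\right)^2}.$$

Now, we are ready to give the proof of Theorem 3.

Proof of Part (a). Suppose that $D_{m_1}$ and $D'_{m_1}$ are two neighboring sets differing in only one sample. Consider the updates $G_T^1, \dots, G_T^K$ and $(G_T^1)', \dots, (G_T^K)'$. The example chosen by the algorithm at step $k$ is the same in $D_{m_1}$ and $D'_{m_1}$ with probability $1 - 1/m_1$ and different with probability $1/m_1$. As in the single-timescale case, the update rules are identical in the former case and the $(1 + \alpha l)$-expansivity applies, while in the latter case we use the second bound of Lemma 10.

$$\begin{aligned}
\mathbb{E}[\delta_{k+1}] &\le \left(1 - \frac{1}{m_1}\right)(1 + \alpha l)\,\mathbb{E}[\delta_k] + \frac{1}{m_1}\mathbb{E}[\delta_k] + \frac{2}{m_1}\sqrt{L_f^2\alpha_x^2 + \left(\frac{T\left(g(x, y_1) - g(x, y_T)\right)}{\alpha_x(2 - \alpha_x l_g)}\right)^2} \\
&\le (1 + \alpha l)\,\mathbb{E}[\delta_k] + \frac{2}{m_1}\sqrt{L_f^2\alpha_x^2 + \left(\frac{T\left(g(x, y_1) - g(x, y_T)\right)}{\alpha_x(2 - \alpha_x l_g)}\right)^2},
\end{aligned}$$

so that

$$\mathbb{E}[\delta_k] \le \frac{2}{m_1}\sqrt{L_f^2\alpha_x^2 + \left(\frac{T\left(g(x, y_1) - g(x, y_T)\right)}{\alpha_x(2 - \alpha_x l_g)}\right)^2} \cdot \frac{(1 + \alpha l)^k - 1}{(1 + \alpha l) - 1},$$

$$\mathbb{E}[\delta_k] \le O\left(\frac{\sqrt{L_f^2\alpha_x^2 + \left(\frac{2T}{\alpha_y(2 - \alpha_y l_g)}\right)^2}\,(1 + \alpha l)^k}{m_1}\right).$$

The proof of Part (b) is the same as that of Part (a), and the proof of Part (c) is analogous to its counterpart in the single-timescale case. Thus, we omit them here.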
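The stability proofs for Theorems 2 and 3 unroll a linear recursion into a geometric sum. As a hedged illustration, the sketch below takes a recursion of the form $\delta_{k+1} = \eta\,\delta_k + 2\sigma/m_1$ with $\delta_0 = 0$ and compares the iterated value against the closed form $\frac{2\sigma}{m_1}\cdot\frac{\eta^k - 1}{\eta - 1}$. The function names and the numeric values are assumptions for illustration; $\eta$ plays the role of the expansivity factor and $\sigma$ the bound on a single update.

```python
def iterate_recursion(eta, sigma, m1, k):
    """Iterate delta_{j+1} = eta * delta_j + 2 * sigma / m1 for k steps from delta_0 = 0."""
    delta = 0.0
    for _ in range(k):
        delta = eta * delta + 2.0 * sigma / m1
    return delta


def closed_form(eta, sigma, m1, k):
    """Geometric-sum closed form of the same recursion."""
    if eta == 1.0:
        return 2.0 * sigma * k / m1  # the sum degenerates to k equal terms
    return 2.0 * sigma / m1 * (eta ** k - 1.0) / (eta - 1.0)


# Expansive regime (eta > 1): the bound grows geometrically with the iteration count.
growing = iterate_recursion(eta=1.5, sigma=0.5, m1=100, k=10)
# Contractive regime (eta < 1): the bound saturates at 2 * sigma / (m1 * (1 - eta)).
saturating = iterate_recursion(eta=0.8, sigma=0.5, m1=100, k=200)
```

The two regimes mirror the settings in the analysis: with $\eta > 1$ the stability bound deteriorates as the number of iterations grows, whereas with $\eta < 1$ it stays bounded independently of the iteration count.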

C.3.2 PARTICULAR SETTING

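We introduce the following lemma as an extension of expansivity for the particular NC-SC setting.

Lemma 13. Consider Problem (1) in the NC-SC setting and assume that Assumptions 1 and 2 hold. Suppose that $(x, y)$ and $(x', y')$ are produced by Algorithm 2 with step size $\alpha_x = \alpha_y$. Let $\alpha l = \max\{\alpha_x l_f, \alpha_y l_g\}$. Then, we have the following expansivity inequality:

$$\left\|G_T\begin{pmatrix} x \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y' \end{pmatrix}\right\| \le \begin{pmatrix} 1 + \alpha_x l T & 0 \\ 0 & 1 - \alpha_x\mu_g + \alpha_x l T \end{pmatrix}\begin{pmatrix} \|x - x'\| \\ \|y - y'\| \end{pmatrix}.$$

Proof. By the triangle inequality, we have:

$$\left\|G_T\begin{pmatrix} x \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y' \end{pmatrix}\right\| \le \left\|G_T\begin{pmatrix} x \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y \end{pmatrix}\right\| + \left\|G_T\begin{pmatrix} x' \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y' \end{pmatrix}\right\|.$$

The first term can be bounded as follows:

$$\left\|G_T\begin{pmatrix} x \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y \end{pmatrix}\right\| = \left\|\begin{pmatrix} x - x' - \alpha_x(\nabla f(x, y) - \nabla f(x', y)) \\ y - y - \alpha_y\sum_{t=1}^{T}(\nabla_y g(x, y^t) - \nabla_y g(x', y^t)) \end{pmatrix}\right\| \le (1 + \alpha T l)\|x - x'\|.$$

If $g$ is strongly convex, then $\tilde g(x, y) = g(x, y) - \frac{\mu_g}{2}(\|x\|^2 + \|y\|^2)$ is convex. By the monotonicity of the gradient of $\tilde g$, we can derive the following:

$$\begin{aligned}
\left\|G_T\begin{pmatrix} x' \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y' \end{pmatrix}\right\|^2
&= \left\|\begin{pmatrix} x' - x' \\ y - y' \end{pmatrix}\right\|^2 - 2\alpha_x\begin{pmatrix} x' - x' \\ y - y' \end{pmatrix}^{T}\begin{pmatrix} \nabla f(x', y) - \nabla f(x', y') \\ \nabla_y g(x', y) - \nabla_y g(x', y') \end{pmatrix} + \alpha_x^2\left\|\begin{pmatrix} \nabla f(x', y) - \nabla f(x', y') \\ \nabla_y g(x', y) - \nabla_y g(x', y') \end{pmatrix}\right\|^2 \\
&= (1 - \alpha_x\mu_g)^2\left\|\begin{pmatrix} x' - x' \\ y - y' \end{pmatrix}\right\|^2 + \alpha_x^2\left\|\begin{pmatrix} \nabla f(x', y) - \nabla f(x', y') \\ \nabla_y\tilde g(x', y) - \nabla_y\tilde g(x', y') \end{pmatrix}\right\|^2 \\
&\quad - 2(1 - \alpha_x\mu_g)\alpha_x\begin{pmatrix} x' - x' \\ y - y' \end{pmatrix}^{T}\begin{pmatrix} \nabla f(x', y) - \nabla f(x', y') \\ \nabla_y\tilde g(x', y) - \nabla_y\tilde g(x', y') \end{pmatrix} \\
&\le \left((1 - \alpha_x\mu_g)^2 + \alpha_x^2 l^2 T^2\right)\|y - y'\|^2.
\end{aligned}$$

Hence, the second term satisfies

$$\left\|G_T\begin{pmatrix} x' \\ y \end{pmatrix} - G_T\begin{pmatrix} x' \\ y' \end{pmatrix}\right\| \le (1 - \alpha_x\mu_g + \alpha_x l T)\|y - y'\|.$$

Next, we consider the extension of the growth lemma.

Lemma 14. Consider two sequences of updates $G_T^1, \dots, G_T^K$ and $(G_T^1)', \dots, (G_T^K)'$ with initial points $x_0 = x'_0$, $y_0 = y'_0$. Define $\delta_{x,k} = \|x_k - x'_k\|$ and $\delta_{y,k} = \|y_k - y'_k\|$. Suppose that $G_T^k$ is $\eta$-expansive and that, writing $(x_{G_T^k}, y_{G_T^k}) := G_T^k(x, y)$ and $(x_{(G_T^k)'}, y_{(G_T^k)'}) := (G_T^k)'(x, y)$,

$$\sup_{x,y}\left\|x_{G_T^k} - x\right\| \le \sigma_x, \quad \sup_{x,y}\left\|y_{G_T^k} - y\right\| \le \sigma_y, \quad \sup_{x,y}\left\|x_{(G_T^k)'} - x\right\| \le \sigma_x, \quad \sup_{x,y}\left\|y_{(G_T^k)'} - y\right\| \le \sigma_y.$$

Then, we have

$$\begin{pmatrix} \delta_{x,k+1} \\ \delta_{y,k+1} \end{pmatrix} \le \eta\begin{pmatrix} \delta_{x,k} \\ \delta_{y,k} \end{pmatrix} + 2\begin{pmatrix} \sigma_x \\ \sigma_y \end{pmatrix}.$$

Now, we are ready to give the proof of Theorem 4.

Proof.

$$\begin{aligned}
\begin{pmatrix} \mathbb{E}[\delta_{x,k+1}] \\ \mathbb{E}[\delta_{y,k+1}] \end{pmatrix}
&\le \left[\left(1 - \frac{1}{m_1}\right)\begin{pmatrix} 1 + \frac{cl}{k} & 0 \\ 0 & 1 + \frac{c}{k}(Tl - \mu_g) \end{pmatrix} + \frac{1}{m_1}\begin{pmatrix} 1 + \frac{cl}{k} & 0 \\ 0 & 1 + \frac{c}{k}(Tl - \mu_g) \end{pmatrix}\right]\begin{pmatrix} \mathbb{E}[\delta_{x,k}] \\ \mathbb{E}[\delta_{y,k}] \end{pmatrix} + \frac{1}{m_1}\begin{pmatrix} \frac{2cl_f}{k} \\ \frac{2cl_g T}{k} \end{pmatrix} \\
&\le \begin{pmatrix} 1 + \frac{cl}{k} & 0 \\ 0 & 1 + \frac{c}{k}(Tl - \mu_g) \end{pmatrix}\begin{pmatrix} \mathbb{E}[\delta_{x,k}] \\ \mathbb{E}[\delta_{y,k}] \end{pmatrix} + \begin{pmatrix} \frac{2cl_f}{m_1 k} \\ \frac{2cl_g T}{m_1 k} \end{pmatrix}.
\end{aligned}$$

Unrolling the recursion gives

$$\begin{aligned}
\begin{pmatrix} \mathbb{E}[\delta_{x,K}] \\ \mathbb{E}[\delta_{y,K}] \end{pmatrix}
&\le \sum_{k=k_0+1}^{K}\prod_{t=k+1}^{K}\begin{pmatrix} 1 + \frac{cl}{t} & 0 \\ 0 & 1 + \frac{c}{t}(Tl - \mu_g) \end{pmatrix}\begin{pmatrix} \frac{2cl_f}{m_1 k} \\ \frac{2cl_g T}{m_1 k} \end{pmatrix} \\
&\le \sum_{k=k_0+1}^{K}\begin{pmatrix} \exp\left(\sum_{t=k+1}^{K}\frac{cl}{t}\right) & 0 \\ 0 & \exp\left(\sum_{t=k+1}^{K}\frac{c}{t}(Tl - \mu_g)\right) \end{pmatrix}\begin{pmatrix} \frac{2cl_f}{m_1 k} \\ \frac{2cl_g T}{m_1 k} \end{pmatrix},
\end{aligned}$$

and therefore, taking norms,

$$\begin{aligned}
\left\|\begin{pmatrix} \mathbb{E}[\delta_{x,K}] \\ \mathbb{E}[\delta_{y,K}] \end{pmatrix}\right\|
&\le \frac{2c\sqrt{l_f^2 + T^2 l_g^2}}{m_1}\sum_{k=k_0+1}^{K}\frac{1}{k}\exp\left(\sum_{t=k+1}^{K}\frac{c}{t}(Tl + l - \mu_g)\right) \\
&\le \frac{2c\sqrt{l_f^2 + T^2 l_g^2}\, K^{c(Tl + l - \mu_g)}}{m_1}\sum_{k=k_0+1}^{K} k^{-c(Tl + l - \mu_g) - 1} \\
&\le \frac{2\sqrt{l_f^2 + T^2 l_g^2}}{m_1(Tl + l - \mu_g)} \cdot \left(\frac{K}{k_0}\right)^{c(Tl + l - \mu_g)}.
\end{aligned}$$

According to Lemma 11, we have:

$$\mathbb{E}\left[|f(x, y) - f(x', y')|\right] \le \frac{k_0}{m_1} + \frac{2L_f\sqrt{l_f^2 + T^2 l_g^2}}{m_1(Tl + l - \mu_g)}\left(\frac{K}{k_0}\right)^{c(Tl + l - \mu_g)}.$$

Let $p = Tl + l - \mu_g$. The right-hand side is approximately minimized when

$$k_0 = \left(2cL_f\sqrt{l_f^2 + T^2 l_g^2}\right)^{\frac{1}{cp+1}} \cdot K^{\frac{cp}{cp+1}}.$$

Therefore, we have

$$\beta \le O\left(\frac{\left(2cL_f\sqrt{l_f^2 + l_g^2 T^2}\right)^{\frac{1}{cp+1}} \cdot K^{\frac{cp}{cp+1}}\,(p + 2/c)}{m_1\, p}\right).$$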

C.4 THE PROOF OF COROLLARY

Proof of Corollary 5. Based on the result in C.2, we have:

$$\mathbb{E}\left[|f(x, y) - f(x', y')|\right] \le \frac{k_0}{m_1} + \frac{2L_f\sqrt{l_f^2 + l_g^2}}{m_1 l}\left(\frac{K}{k_0}\right)^{cl}.$$

The right-hand side is approximately minimized when

$$k_0 = \left(2cL_f\sqrt{l_f^2 + l_g^2}\right)^{\frac{1}{cl+1}} \cdot K^{\frac{cl}{cl+1}}.$$

Therefore, we have

$$\mathbb{E}\left[|f(x, y) - f(x', y')|\right] \le O\left(\frac{\left(2cL_f\sqrt{l_f^2 + l_g^2}\right)^{\frac{1}{cl+1}} \cdot K^{\frac{cl}{cl+1}}\,(l + 2/c)}{m_1 l}\right) = O\left(K^{\frac{cl}{cl+1}}/m_1\right).$$

Proof of Corollary 6. Based on the previous result, we have:

$$\mathbb{E}\left[|f(x, y) - f(x', y')|\right] \le \frac{k_0}{m_1} + \frac{2L_f\sqrt{l_f^2 + T^2 l_g^2}}{m_1 T l}\left(\frac{K}{k_0}\right)^{Tcl}.$$

The right-hand side is approximately minimized when

$$k_0 = \left(2cL_f\sqrt{l_f^2 + T^2 l_g^2}\right)^{\frac{1}{Tcl+1}} \cdot K^{\frac{Tcl}{Tcl+1}}.$$

Therefore, we have

$$\mathbb{E}\left[|f(x, y) - f(x', y')|\right] \le O\left(\frac{\left(2cL_f\sqrt{l_f^2 + T^2 l_g^2}\right)^{\frac{1}{Tcl+1}} \cdot K^{\frac{Tcl}{Tcl+1}}\,(Tl + c/2)}{m_1 T l}\right) = O\left(\frac{T^{\frac{1}{Tcl+1}} K^{1 - \frac{1}{Tcl+1}}}{m_1}\right).$$
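The choice of $k_0$ in the proof of Corollary 5 can be checked numerically. The sketch below assumes the reading of the bound as $\frac{k_0}{m_1} + \frac{2L_f\sqrt{l_f^2 + l_g^2}}{m_1 l}\left(\frac{K}{k_0}\right)^{cl}$ and treats $k_0$ as a real number; the function names and the constants are hypothetical. Setting the derivative in $k_0$ to zero gives $k_0^{cl+1} = 2cL_f\sqrt{l_f^2 + l_g^2}\,K^{cl}$, which is the stated minimizer.

```python
import math


def stability_bound(k0, K, m1, c, l, L_f, l_f, l_g):
    """k0 / m1 + (2 * L_f * sqrt(l_f^2 + l_g^2) / (m1 * l)) * (K / k0)^(c * l)."""
    coeff = 2.0 * L_f * math.sqrt(l_f ** 2 + l_g ** 2) / (m1 * l)
    return k0 / m1 + coeff * (K / k0) ** (c * l)


def k0_star(K, c, l, L_f, l_f, l_g):
    """Stationary point of the bound in k0 (k0 treated as a real number)."""
    base = 2.0 * c * L_f * math.sqrt(l_f ** 2 + l_g ** 2)
    exponent = c * l
    return base ** (1.0 / (exponent + 1.0)) * K ** (exponent / (exponent + 1.0))


# Hypothetical constants, chosen only to exercise the formulas.
params = dict(K=1000, c=0.5, l=2.0, L_f=1.0, l_f=1.0, l_g=1.0)
best_k0 = k0_star(**params)
best_value = stability_bound(best_k0, m1=500, **params)
# Perturbing k0 in either direction should not improve the bound.
perturbed = [stability_bound(best_k0 * s, m1=500, **params) for s in (0.5, 0.9, 1.1, 2.0)]
```

Because the bound is a sum of a linear term and a decreasing power of $k_0$, it is convex on $k_0 > 0$, so the stationary point is the minimizer over real $k_0$; in practice $k_0$ must also be an integer no larger than $K$.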

D ADDITIONAL EXPERIMENT

D.1 META LEARNING

We also conduct a further meta-learning experiment with the single-timescale method (Algorithm 1) to validate our theoretical findings. The model overfits for large values of $K$, which matches Theorem 2. An additional experiment on the MNIST dataset is conducted for Theorem 4, following the same settings as on the Omniglot dataset. Both Figure 1c and Figure 3c validate the influence of $K$ and $T$ on the generalization gap. When $K$ is relatively small, $T$ is dominant, since the gap with $T = 8$ is higher than that with $T = 2, 4$. When $K$ is large, the effect of $T$ fades and it contributes less to the trend of the gap.

where $D^{tr}$ and $D^{val}$ are the training and validation sets and $R_{w,\lambda}$ is the regularizer. In the inner level, the procedure optimizes $w$ using the training set (Equation 8b). In the outer level, it optimizes $\lambda$ using the validation set (Equation 8a).

Settings and Implementation

We adopt the task of data hyper-cleaning, which aims to reweight data samples with noisy labels. The hyperparameter $\lambda$ is therefore the weight of each sample in the training set. We follow a setting similar to that of Bao et al. (2021) on the MNIST dataset (LeCun, 1998), which consists of greyscale handwritten digits of size 28 × 28. The model $w$ corresponds to the classification network and the hyperparameter $\lambda$ corresponds to the weights of the individual training samples. We implement the experiment in PyTorch (Paszke et al., 2019). The model $w$ is a 2-layer fully connected network of size 784 → 256 → 10 for the 10-digit classification. The hyperparameter $\lambda$ is a weighting vector of length 2000, one entry per training sample. We randomly sample 2000, 1000 and 1000 images for the training, validation and testing sets. Training samples are corrupted with probability 50%, i.e., roughly half of the samples are labeled with random, wrong values instead of the correct ones. We train $w$ using the training set in the inner level and $\lambda$ using the validation set in the outer level. The batch size is 32, and the learning rates for $w$ and $\lambda$ are 0.01 and 10, respectively. Results are averaged over 5 trial runs with different random seeds.

Results Evaluation

Figure 4 shows the results of Algorithm 2 on the regularized HO problem, where the inner-level function is often strongly convex. Figure 4b clearly shows that $T = 8$ increases the testing loss, which indicates that the risk of overfitting increases as $T$ rises. Additionally, the generalization gap maintains a consistent relationship with both the inner and outer iterations, $T$ and $K$ respectively, which is consistent with Theorem 4. We also conduct the experiment with a smaller learning rate of 0.001 and a larger number of inner-level steps, {64, 128, 256}.
Compared to Figure 4, Figure 5 presents similar behavior in terms of the effect of $K$ and $T$, while Figure 5c exhibits a higher variance caused by the accumulation of inner-level updates. Furthermore, performance on the test dataset remains comparable to that obtained with fewer inner iterations even when the number of inner iterations $T$ increases substantially, suggesting the effectiveness of TSGD and validating Theorem 4.
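The label corruption used in the data hyper-cleaning setup can be sketched as follows. This is not the experimental code: it is a minimal standard-library illustration, assuming that each training label is replaced, independently with probability 0.5, by a class drawn uniformly from the 9 wrong classes. The function name `corrupt_labels`, the fixed seed, and the synthetic label list are hypothetical.

```python
import random


def corrupt_labels(labels, num_classes=10, prob=0.5, seed=0):
    """Replace each label, with probability `prob`, by a uniformly random wrong class.

    Returns the noisy labels and a list of flags marking which labels were corrupted.
    """
    rng = random.Random(seed)
    noisy, flipped = [], []
    for label in labels:
        if rng.random() < prob:
            # Draw from the num_classes - 1 wrong classes by skipping the true label.
            wrong = rng.randrange(num_classes - 1)
            if wrong >= label:
                wrong += 1
            noisy.append(wrong)
            flipped.append(True)
        else:
            noisy.append(label)
            flipped.append(False)
    return noisy, flipped


# 2000 synthetic training labels over 10 classes, matching the training-set size above.
clean = [i % 10 for i in range(2000)]
noisy, flipped = corrupt_labels(clean)
corrupted_fraction = sum(flipped) / len(flipped)
```

In the hyper-cleaning task, the outer level would then learn one weight per training sample so that corrupted samples, i.e., those flagged in `flipped`, receive low weight; the flags themselves are not available to the learner.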



Figure 1: Results of meta learning with various values of T and K

Figure 2: Results of Meta Learning with single timescale optimization

Figure 3: Results of Meta Learning with single timescale optimization

Figure 4: Results of hyperparameter optimization with various values of T and K

Figure 5: Results of hyperparameter optimization with large values of T

E EXISTING GAP IN THE ANALYSIS OF UD

The proofs in Bao et al. (2021) make us suspect that $\theta(\lambda)$ is more or less treated as an argument independent of $\lambda$, even though the description states that $\theta(\lambda)$ depends on $\lambda$. In the proof of Theorem 2 (Appendix A.2 of Bao et al. (2021), page 14), the equations given imply that $\theta(\lambda_t, S^{tr}) = \theta$ and $\theta(\lambda'_t, S^{tr}) = \theta$, which seems to suggest that $\theta(\lambda)$ is independent of $\lambda$ and may conflict with the stated dependence on $\lambda$. Even supposing that Bao et al. (2021) merely use improper notation here, a confusing point remains in the subsequent proof.

To understand the issue better, let $\delta_t = \|\lambda_t - \lambda'_t\|$ and suppose that $0 \le t_0 \le t$. The proof relies on an inequality whose left-hand side measures the expected difference of the function $f$ with arguments $(\lambda_t, \theta)$. Its first step decomposes the left-hand side according to the two possible cases of $\delta_{t_0}$ (i.e., $\delta_{t_0} = 0$ or $\delta_{t_0} > 0$). The first term of the inequality is derived from the Lipschitz continuity of $f$. However, to use the Lipschitz continuity of the multivariate function $f$ with respect to $\lambda$ alone, $\theta$ needs to be the same variable (i.e., to have the same value at all times) in both $f(\lambda_t, \theta)$ and $f(\lambda'_t, \theta)$. From the UD algorithm in Bao et al. (2021), it is clear that $\theta$ will not always have the same value in $f(\lambda_t, \theta)$ and $f(\lambda'_t, \theta)$ when $\lambda$ changes.

Specifically, when $t = t_0$, $\delta_{t_0} = 0$ and $\lambda_{t_0} = \lambda'_{t_0}$, we have $\theta_K^{t_0+1}(\lambda_{t_0}, S^{tr}) = \theta_K^{\prime\, t_0+1}(\lambda'_{t_0}, S^{tr})$. Since $S^{val}$ and $S'^{val}$ are assumed to differ by at most one point, without loss of generality, we suppose that SGD selects the differing point at timestep $t_s$. This means that $\theta_K^{t+1}(\lambda_{t}, S^{tr}) \ne \theta_K^{\prime\, t+1}(\lambda'_{t}, S^{tr})$ for all $t \ge t_s$, as the updates of $\theta$ and $\lambda$ use the value of $\theta$ from the previous iteration. Thus, we cannot use the Lipschitz property to derive the first term in the aforementioned inequality.

That is why we think there may exist a gap in the analysis of UD algorithms in Bao et al. (2021).

